[Paper] From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Published: April 15, 2026 at 01:57 PM EDT

Source: arXiv - 2604.14137v1

Overview

The paper “From Feelings to Metrics: Understanding and Formalizing How Users Vibe‑Test LLMs” tackles a gap many developers face: standard benchmark scores often fail to reflect how useful a large language model (LLM) is for their day‑to‑day tasks. Instead, engineers “vibe‑test” models: they try them out on personal workflows and judge the results subjectively. The authors study how this informal practice works and propose a systematic, reproducible way to capture it.

Key Contributions

  • Empirical grounding: Analyzed two real‑world data sources—a survey of LLM users and a curated set of public “model‑comparison” posts from blogs and social media.
  • Formal definition of vibe‑testing: Modeled it as a two‑step process—(1) personalized task selection (what to test) and (2) user‑aware evaluation criteria (how to judge).
  • Proof‑of‑concept pipeline: Built an end‑to‑end system that automatically generates user‑specific prompts and evaluates model outputs using the personalized criteria.
  • Empirical validation on coding tasks: Showed that personalized prompts and user‑aware scoring can flip the preferred model compared to raw benchmark numbers.
  • Open‑source artifacts: Released the survey data, the collection of in‑the‑wild comparison reports, and the evaluation code for community reuse.
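
The two-step definition above can be read as a composition of two user-dependent functions: one that picks the tasks and one that judges the outputs. The sketch below is only an illustration of that reading. Every name in it (UserProfile, TaskSelector, Judge, vibe_test) is hypothetical and is not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical types; the paper's own interfaces may look different.
@dataclass(frozen=True)
class UserProfile:
    language: str
    domain: str
    criteria_weights: dict[str, float]  # criterion name -> importance


# Step 1: personalized task selection (what to test).
TaskSelector = Callable[[UserProfile], list[str]]
# Step 2: user-aware evaluation (how to judge one output for one user).
Judge = Callable[[UserProfile, str, str], float]
# A model is anything that maps a prompt to an output string.
Model = Callable[[str], str]


def vibe_test(
    profile: UserProfile,
    models: dict[str, Model],
    select_tasks: TaskSelector,
    judge: Judge,
) -> dict[str, float]:
    """Return the mean judged score per model over the user's own tasks."""
    prompts = select_tasks(profile)
    if not prompts:
        raise ValueError("task selection produced no prompts")
    results: dict[str, float] = {}
    for name, model in models.items():
        scores = [judge(profile, prompt, model(prompt)) for prompt in prompts]
        results[name] = sum(scores) / len(scores)
    return results
```

Keeping the selector and the judge as separate callables mirrors the split between "what to test" and "how to judge": either one can be swapped for a generic version without touching the other.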

Methodology

  1. Data collection

    • Survey: 1,200+ practitioners answered questions about how they currently test LLMs (e.g., “Do you compare code suggestions?”).
    • In‑the‑wild reports: 300+ blog posts, tweets, and forum threads where developers publicly compared models on concrete tasks.
  2. Qualitative analysis

    • The authors coded the responses to identify recurring dimensions of “what to test” (e.g., language, domain, toolchain) and “how to judge” (e.g., readability, execution speed, debugging effort).
  3. Formal model

    • Personalized Prompt Generator: Takes a user’s profile (programming language, IDE, typical task) and produces a set of prompts that mimic their real workflow.
    • User‑Aware Scorer: Instead of a single accuracy metric, it aggregates multiple subjective criteria (e.g., “ease of integration”, “error‑handling style”) weighted per user preferences.
  4. Experimental setup

    • Ran the pipeline on two popular code‑generation models (Model A and Model B) across a standard coding benchmark (HumanEval).
    • Compared three evaluation regimes:
      (i) raw benchmark scores,
      (ii) generic prompt + generic scorer,
      (iii) personalized prompt + user‑aware scorer.
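
As a rough illustration of the two components in the formal model, the sketch below wraps a generic task in a user's working context and aggregates per-criterion scores with per-user weights. It assumes a simple string template and a weighted mean. The summary does not say how the paper's generator or scorer are implemented, so both are stand-ins, and the names Profile, personalize_prompt, and user_aware_score are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical user profile: language, IDE, typical task, preferences."""
    language: str
    ide: str
    typical_task: str
    libraries: list[str] = field(default_factory=list)
    weights: dict[str, float] = field(default_factory=dict)


def personalize_prompt(base_task: str, profile: Profile) -> str:
    """Wrap a generic benchmark task in the user's working context."""
    libs = ", ".join(profile.libraries) or "the standard library only"
    return (
        f"You are assisting a developer who writes {profile.language} "
        f"in {profile.ide} and mostly works on {profile.typical_task}.\n"
        f"Prefer {libs}.\n"
        f"Task: {base_task}"
    )


def user_aware_score(
    criterion_scores: dict[str, float], weights: dict[str, float]
) -> float:
    """Weighted mean of per-criterion scores; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    missing = set(weights) - set(criterion_scores)
    if missing:
        raise KeyError(f"no score for criteria: {sorted(missing)}")
    return sum(w * criterion_scores[c] for c, w in weights.items()) / total
```

Under these assumptions, regime (ii) corresponds to passing the base task through unchanged and using equal weights, while regime (iii) uses both functions with the user's own profile.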

Results & Findings

Evaluation Regime                        | Preferred Model (↑)
-----------------------------------------|---------------------
Raw benchmark                            | Model A (62% pass)
Generic prompt + generic scorer          | Model A (58% pass)
Personalized prompt + user‑aware scorer  | Model B (55% pass)

  • Personalization matters: When prompts reflected a user’s typical coding style (e.g., using specific libraries), Model B produced more “vibe‑friendly” code, even though it lagged on the generic benchmark.
  • Subjective criteria shift rankings: Users who prioritized “minimal edits after generation” favored Model B, while those who valued “strict type safety” still leaned toward Model A.
  • Reproducibility: The pipeline could replicate 78 % of the preferences expressed in the collected blog posts, demonstrating that vibe‑testing can be captured algorithmically.
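
Two of the findings above can be made concrete with toy numbers: the same per-criterion scores can yield a different preferred model under different user weights, and the replication figure can be read as a simple agreement rate. The per-criterion scores and weights below are invented for illustration and are not the paper's data. The agreement computation is one plausible reading of "replicate", not a confirmed definition.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores."""
    total = sum(weights.values())
    return sum(w * scores[c] for c, w in weights.items()) / total


def preferred(model_scores: dict[str, dict[str, float]],
              weights: dict[str, float]) -> str:
    """Name of the model with the highest weighted score for these weights."""
    return max(model_scores, key=lambda m: weighted_score(model_scores[m], weights))


def agreement_rate(predicted: list[str], reported: list[str]) -> float:
    """Fraction of reports where the predicted preference matches the reported one."""
    if not reported or len(predicted) != len(reported):
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == r for p, r in zip(predicted, reported)) / len(reported)


# Invented scores in [0, 1]; NOT measurements from the paper.
MODEL_SCORES = {
    "Model A": {"minimal_edits": 0.5, "type_safety": 0.9},
    "Model B": {"minimal_edits": 0.8, "type_safety": 0.6},
}

# Two hypothetical users with opposite priorities.
edits_first = {"minimal_edits": 0.8, "type_safety": 0.2}
types_first = {"minimal_edits": 0.2, "type_safety": 0.8}

print(preferred(MODEL_SCORES, edits_first))  # Model B (0.76 vs 0.58)
print(preferred(MODEL_SCORES, types_first))  # Model A (0.82 vs 0.64)
```

Neither model's underlying scores change between the two calls. Only the weights do, which is enough to reverse the ranking.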

Practical Implications

  • Tooling for developers: IDE plugins could automatically generate personalized test suites and score LLM suggestions based on a developer’s own preferences, turning vague “feelings” into actionable metrics.
  • Model selection pipelines: Companies can augment traditional benchmarks with vibe‑testing modules to pick the model that best fits their internal coding conventions and performance constraints.
  • Feedback loops for vendors: LLM providers can expose “vibe‑score” dashboards, helping them understand why a model that scores high on public benchmarks may still be rejected by certain user segments.
  • Better documentation & onboarding: By formalizing the evaluation criteria, teams can create reproducible “model‑comparison cheat sheets” for new hires, reducing the trial‑and‑error phase.

Limitations & Future Work

  • Scope of tasks: The study focused mainly on code generation; other domains (e.g., creative writing, data analysis) may exhibit different vibe‑testing patterns.
  • Subjectivity quantification: Translating nuanced human judgments into numeric weights remains an approximation; richer interaction data (e.g., eye‑tracking, keystroke dynamics) could improve fidelity.
  • Scalability: Generating truly personalized prompts for large user bases may require more efficient prompting strategies or meta‑learning approaches.
  • Long‑term user studies: The current validation is cross‑sectional; longitudinal studies would reveal how vibe‑preferences evolve as models improve.
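
To see why turning judgments into numeric weights is only an approximation, consider one simple scheme: ask a user to rate each criterion's importance on a 1 to 5 scale and normalize the ratings so they sum to 1. This scheme is an assumption made here for illustration, not the paper's method, and ratings_to_weights is a hypothetical helper.

```python
def ratings_to_weights(ratings: dict[str, int], scale_max: int = 5) -> dict[str, float]:
    """Normalize 1..scale_max importance ratings into weights that sum to 1."""
    if not ratings:
        raise ValueError("at least one criterion is required")
    for name, rating in ratings.items():
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating for {name!r} must be in 1..{scale_max}")
    total = sum(ratings.values())
    return {name: rating / total for name, rating in ratings.items()}


weights = ratings_to_weights(
    {"readability": 5, "execution_speed": 3, "debugging_effort": 2}
)
```

A scheme like this treats a rating of 4 as exactly twice as important as a rating of 2, which a real user may not intend. That is the kind of lost nuance the limitation refers to.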

Bottom line: By turning “I just feel this model works better for me” into a structured, reproducible process, the authors open a path for developers to make data‑driven LLM choices that align with real‑world workflows. The next wave of LLM tooling is likely to embed vibe‑testing at its core.

Authors

  • Itay Itzhak
  • Eliya Habba
  • Gabriel Stanovsky
  • Yonatan Belinkov

Paper Information

  • arXiv ID: 2604.14137v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: April 15, 2026