[Paper] From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Published: April 15, 2026 at 01:57 PM EDT

Source: arXiv - 2604.14137v1

Overview

The paper “From Feelings to Metrics: Understanding and Formalizing How Users Vibe‑Test LLMs” tackles a gap many developers face: standard benchmark scores often fail to reflect how useful a large language model (LLM) is for their day‑to‑day tasks. Instead, engineers “vibe‑test” models: they try them out on personal workflows and judge the results subjectively. The authors study how this informal practice actually works and propose a systematic, reproducible way to capture it.

Key Contributions

Empirical grounding: Analyzed two real‑world data sources—a survey of LLM users and a curated set of public “model‑comparison” posts from blogs and social media.
Formal definition of vibe‑testing: Modeled it as a two‑step process—(1) personalized task selection (what to test) and (2) user‑aware evaluation criteria (how to judge).
Proof‑of‑concept pipeline: Built an end‑to‑end system that automatically generates user‑specific prompts and evaluates model outputs using the personalized criteria.
Empirical validation on coding tasks: Showed that personalized prompts and user‑aware scoring can flip the preferred model compared to raw benchmark numbers.
Open‑source artifacts: Released the survey data, the collection of in‑the‑wild comparison reports, and the evaluation code for community reuse.

Methodology

Data collection
- Survey: 1,200+ practitioners answered questions about how they currently test LLMs (e.g., “Do you compare code suggestions?”).
- In‑the‑wild reports: 300+ blog posts, tweets, and forum threads where developers publicly compared models on concrete tasks.
Qualitative analysis
- The authors qualitatively coded (labeled) the responses to identify recurring dimensions of “what to test” (e.g., language, domain, toolchain) and “how to judge” (e.g., readability, execution speed, debugging effort).
Formal model
- Personalized Prompt Generator: Takes a user’s profile (programming language, IDE, typical task) and produces a set of prompts that mimic their real workflow.
- User‑Aware Scorer: Instead of relying on a single accuracy metric, it aggregates multiple subjective criteria (e.g., “ease of integration”, “error‑handling style”), each weighted by the user’s preferences.
Experimental setup
- Ran the pipeline on two popular code‑generation models (Model A and Model B) using a standard coding benchmark (HumanEval).
- Compared three evaluation regimes:
  (i) raw benchmark scores,
  (ii) generic prompt + generic scorer,
  (iii) personalized prompt + user‑aware scorer.
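
As a rough illustration of the two components described under "Formal model", the Python sketch below builds a personalized prompt from a user profile and aggregates per-criterion scores with user-specific weights. The names `UserProfile`, `personalize_prompt`, and `user_aware_score`, the profile fields, the prompt wording, and the weighted-mean aggregation are all assumptions made for this example; the paper's released code may structure these steps differently.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Hypothetical fields, loosely following the summary's
    # "programming language, IDE, typical task" description.
    language: str
    ide: str
    typical_task: str
    libraries: tuple = ()


def personalize_prompt(base_task: str, profile: UserProfile) -> str:
    """Rewrite a generic benchmark task so it reflects the user's workflow."""
    parts = [
        f"Task: {base_task}",
        f"Write the solution in {profile.language}, as part of {profile.typical_task}.",
    ]
    if profile.libraries:
        parts.append("Prefer these libraries: " + ", ".join(profile.libraries) + ".")
    parts.append(f"The code will be edited in {profile.ide}.")
    return "\n".join(parts)


def user_aware_score(criterion_scores: dict, weights: dict) -> float:
    """Weighted mean of per-criterion scores in [0, 1].

    Weights need not sum to 1; criteria missing from criterion_scores
    count as 0 (an assumption of this sketch).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(w * criterion_scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total


if __name__ == "__main__":
    profile = UserProfile("Python", "VS Code", "a data-cleaning script", ("pandas",))
    print(personalize_prompt("Reverse the words in a sentence", profile))
    print(user_aware_score(
        {"ease_of_integration": 1.0, "error_handling_style": 0.0},
        {"ease_of_integration": 3.0, "error_handling_style": 1.0},
    ))
```

Normalizing by the sum of the weights keeps scores comparable across users who express their preferences on different scales.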

Results & Findings

Evaluation regime	Preferred model (pass rate)
Raw benchmark	Model A (62 % pass)
Generic prompt + generic scorer	Model A (58 % pass)
Personalized prompt + user‑aware scorer	Model B (55 % pass)

Personalization matters: When prompts reflected a user’s typical coding style (e.g., using specific libraries), Model B produced more “vibe‑friendly” code, even though it lagged on the generic benchmark.
Subjective criteria shift rankings: Users who prioritized “minimal edits after generation” favored Model B, while those who valued “strict type safety” still leaned toward Model A.
Reproducibility: The pipeline could replicate 78 % of the preferences expressed in the collected blog posts, demonstrating that vibe‑testing can be captured algorithmically.
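
To make the last two findings concrete, the sketch below shows how different user weights can flip the preferred model, and how an agreement rate between pipeline predictions and reported preferences could be computed. All numbers are made-up toy values, not results from the paper, and the function names `weighted_score`, `preferred_model`, and `agreement_rate` are hypothetical.

```python
def weighted_score(criterion_scores: dict, weights: dict) -> float:
    """Weighted mean of per-criterion scores (missing criteria count as 0)."""
    total = sum(weights.values())
    return sum(w * criterion_scores.get(c, 0.0) for c, w in weights.items()) / total


def preferred_model(scores_by_model: dict, weights: dict) -> str:
    """Return the model with the highest weighted score for this user."""
    return max(scores_by_model, key=lambda m: weighted_score(scores_by_model[m], weights))


def agreement_rate(predicted: list, reported: list) -> float:
    """Fraction of cases where the predicted preference matches the reported one."""
    if len(predicted) != len(reported) or not reported:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reported)) / len(reported)


# Toy per-criterion scores (illustrative only, NOT taken from the paper).
toy_scores = {
    "Model A": {"minimal_edits": 0.4, "type_safety": 0.9},
    "Model B": {"minimal_edits": 0.8, "type_safety": 0.5},
}

edits_first = {"minimal_edits": 0.8, "type_safety": 0.2}
types_first = {"minimal_edits": 0.2, "type_safety": 0.8}

if __name__ == "__main__":
    print(preferred_model(toy_scores, edits_first))  # Model B
    print(preferred_model(toy_scores, types_first))  # Model A
    print(agreement_rate(["A", "B", "B", "A"], ["A", "B", "A", "A"]))  # 0.75
```

The same per-criterion scores yield opposite rankings under the two weight profiles, which is the kind of flip the findings describe.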

Practical Implications

Tooling for developers: IDE plugins could automatically generate personalized test suites and score LLM suggestions based on a developer’s own preferences, turning vague “feelings” into actionable metrics.
Model selection pipelines: Companies can augment traditional benchmarks with vibe‑testing modules to pick the model that best fits their internal coding conventions and performance constraints.
Feedback loops for vendors: LLM providers can expose “vibe‑score” dashboards, helping them understand why a model that scores high on public benchmarks may still be rejected by certain user segments.
Better documentation & onboarding: By formalizing the evaluation criteria, teams can create reproducible “model‑comparison cheat sheets” for new hires, reducing the trial‑and‑error phase.

Limitations & Future Work

Scope of tasks: The study focused mainly on code generation; other domains (e.g., creative writing, data analysis) may exhibit different vibe‑testing patterns.
Subjectivity quantification: Translating nuanced human judgments into numeric weights remains an approximation; richer interaction data (e.g., eye‑tracking, keystroke dynamics) could improve fidelity.
Scalability: Generating truly personalized prompts for large user bases may require more efficient prompting strategies or meta‑learning approaches.
Long‑term user studies: The current validation is cross‑sectional; longitudinal studies would reveal how vibe‑preferences evolve as models improve.

Bottom line: By turning “I just feel this model works better for me” into a structured, reproducible process, the authors open a path for developers to make data‑driven LLM choices that align with real‑world workflows. The next wave of LLM tooling is likely to embed vibe‑testing at its core.

Authors

Itay Itzhak
Eliya Habba
Gabriel Stanovsky
Yonatan Belinkov

Paper Information

arXiv ID: 2604.14137v1
Categories: cs.CL, cs.AI, cs.LG
Published: April 15, 2026

