[Paper] Measuring all the noises of LLM Evals
Source: arXiv - 2512.21326v1
Overview
The paper “Measuring all the noises of LLM Evals” tackles a surprisingly practical problem: how to tell whether differences you see between large language models (LLMs) are real or just random fluctuations. By rigorously defining and quantifying three distinct sources of “noise” in model evaluations, the authors give developers a statistical toolbox that can be applied out‑of‑the‑box to a wide range of benchmark tests.
Key Contributions
- Formal taxonomy of evaluation noise – separates prediction noise (variability in a model’s answer to the same prompt), data noise (variability from the sampled set of prompts), and the total noise that combines both via the law of total variance.
- All‑pairs paired analysis – a scalable method that simultaneously performs paired statistical tests for every pair of LLMs in a study, leveraging millions of individual predictions.
- Empirical noise atlas – measurements across dozens of popular LLMs, tasks (e.g., QA, summarization, code generation), and evaluation settings, revealing consistent patterns in noise magnitude.
- Practical guidelines – shows that prediction noise usually dominates data noise, so reducing it (e.g., by averaging repeated runs, majority voting, or deterministic temperature‑0 decoding) can dramatically boost statistical power.
- Open‑source tooling – the authors release code that automates noise estimation and significance testing, requiring no custom statistical expertise.
Methodology
- Define noise components
- Prediction noise: For a fixed prompt, run the model multiple times (different random seeds, temperature settings) and record the variance of the scores.
- Data noise: Sample many prompts from the benchmark and compute variance across prompts for a single deterministic model run.
- Total noise: Apply the law of total variance: Var(total) = E[Var(prediction | prompt)] + Var(E[prediction | prompt]) (see the sketch below).
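A minimal sketch of this decomposition, assuming per‑prompt scores for one model are arranged as a runs × prompts NumPy array; the function and variable names are illustrative and not from the paper's released code:

```python
import numpy as np

def decompose_noise(scores: np.ndarray) -> dict:
    """Split score variance into prediction, data, and total components.

    scores: array of shape (R runs, M prompts) for a single model, where
    each entry is the score of one run on one prompt (layout assumed here).
    """
    # Prediction noise: variance across repeated runs of the same prompt,
    # averaged over prompts -> E[Var(prediction | prompt)].
    prediction_noise = scores.var(axis=0, ddof=1).mean()
    # Data noise: variance of the per-prompt mean score across prompts
    # -> Var(E[prediction | prompt]).
    data_noise = scores.mean(axis=0).var(ddof=1)
    # Law of total variance: total = prediction + data.
    return {
        "prediction_noise": prediction_noise,
        "data_noise": data_noise,
        "total_noise": prediction_noise + data_noise,
    }

# Synthetic example: 5 runs of one model on 200 prompts.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.7, scale=0.1, size=(5, 200))
print(decompose_noise(scores))
```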
- All‑pairs paired framework
- For N models, generate predictions for the same set of M prompts, repeating each prompt R times per model.
- Construct paired differences for every model pair (i, j) on each prompt and each repeat, yielding a massive matrix of differences.
- Use standard paired‑t or Wilcoxon tests on this matrix, but because every pair shares the same underlying data, the variance estimates are pooled, giving far tighter confidence intervals (a simplified sketch follows).
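A simplified sketch of the all‑pairs comparison, using a plain paired t‑test from SciPy on per‑prompt score differences. The (models × prompts) layout, with repeats already averaged per prompt, and all names are assumptions for illustration, not the paper's released tooling:

```python
from itertools import combinations

import numpy as np
from scipy import stats

def all_pairs_paired_tests(scores: np.ndarray, model_names: list[str]) -> list[dict]:
    """scores: array of shape (N models, M prompts) of per-prompt mean scores
    on a shared prompt set (repeats assumed already averaged per prompt)."""
    results = []
    for i, j in combinations(range(len(model_names)), 2):
        diffs = scores[i] - scores[j]                 # paired per-prompt differences
        t_stat, p_value = stats.ttest_rel(scores[i], scores[j])
        results.append({
            "pair": (model_names[i], model_names[j]),
            "mean_diff": diffs.mean(),
            "paired_se": diffs.std(ddof=1) / np.sqrt(len(diffs)),
            "t": t_stat,
            "p": p_value,
        })
    return results

# Synthetic example: 3 models scored 0/1 on the same 500 prompts.
rng = np.random.default_rng(1)
scores = rng.binomial(1, [[0.70], [0.72], [0.71]], size=(3, 500)).astype(float)
for row in all_pairs_paired_tests(scores, ["model_a", "model_b", "model_c"]):
    print(row)
```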
- Large‑scale measurement
- The authors run the pipeline on 10+ public LLM families (GPT‑3.5, LLaMA, Claude, etc.) across 15 benchmark suites, totaling > 10 M prompt‑model‑run triples.
- Noise estimates are then aggregated to produce per‑benchmark “noise fingerprints”.
Results & Findings
| Finding | What the numbers say |
|---|---|
| Benchmark‑specific total noise is stable | Across model pairs, the total variance for a given benchmark varies by < 5 % – indicating a characteristic “noise floor” per task. |
| Prediction noise > data noise | On average, prediction noise accounts for ~60‑70 % of total variance, while data noise contributes ~30‑40 %. |
| Averaging reduces noise dramatically | Running a model 5 times and averaging scores cuts prediction noise by ~80 % (illustrated below), turning a previously non‑significant 2 % performance gap into a statistically robust 5 σ effect. |
| All‑pairs paired test outperforms naïve t‑tests | For the same data, the paired approach yields confidence intervals ~2× narrower, enabling detection of effect sizes as small as 0.5 % absolute accuracy improvement. |
These patterns hold across domains (text, code, reasoning) and model scales, suggesting the findings are not idiosyncratic to a single architecture.
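The ~80 % figure in the table follows directly from averaging: the variance of a mean of R independent runs is the single‑run prediction variance divided by R. A back‑of‑the‑envelope check, with a purely illustrative variance value:

```python
# Averaging R independent runs divides prediction variance by R,
# so R = 5 removes about 80 % of it.
R = 5
single_run_prediction_var = 0.010            # illustrative value, not from the paper
averaged_var = single_run_prediction_var / R
reduction = 1 - averaged_var / single_run_prediction_var
print(f"variance after averaging {R} runs: {averaged_var:.4f}")   # 0.0020
print(f"reduction in prediction noise: {reduction:.0%}")           # 80%
```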
Practical Implications
- Rapid significance checks – Developers can plug the released library into their CI pipelines to automatically flag whether a new model version truly outperforms the previous one, without writing custom statistical code (a generic stand‑in sketch appears after this list).
- Cost‑effective evaluation – Knowing that prediction noise dominates means you can invest compute in averaging a few runs per prompt (e.g., 3‑5) rather than expanding the benchmark size, saving API costs while gaining statistical power.
- Benchmark design – When creating a new test set, aim for prompts that minimize data noise (e.g., balanced difficulty) because the remaining variance will be mostly prediction‑driven and thus controllable.
- Model debugging – If a model’s prediction noise spikes on a particular task, it may indicate instability in the decoding strategy (temperature, top‑k) or a need for better prompt engineering.
- Research reproducibility – By reporting the three noise components alongside performance numbers, papers can give readers a clear picture of how “tight” the results are, reducing the risk of over‑claiming marginal gains.
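As a rough illustration of such a CI gate, the sketch below uses a plain paired t‑test from SciPy; it is a stand‑in under assumed names and thresholds, not the authors' released library or its API:

```python
import numpy as np
from scipy import stats

def regression_gate(candidate: np.ndarray, baseline: np.ndarray,
                    alpha: float = 0.05) -> bool:
    """Return True if the candidate model is significantly better than the
    baseline, pairing the two models' scores prompt by prompt."""
    t_stat, p_value = stats.ttest_rel(candidate, baseline)
    improved = candidate.mean() > baseline.mean()
    return improved and p_value < alpha

# Synthetic example: per-prompt scores on 1,000 shared prompts, each already
# averaged over a few runs per prompt to suppress prediction noise.
rng = np.random.default_rng(2)
baseline = rng.normal(0.70, 0.05, size=1000)
candidate = baseline + rng.normal(0.01, 0.05, size=1000)
print("ship it" if regression_gate(candidate, baseline) else "not significant")
```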
Limitations & Future Work
- Scope of benchmarks – The study focuses on standard academic and industry benchmarks; highly interactive or multimodal tasks (e.g., vision‑language) may exhibit different noise structures.
- Assumption of independence – The paired analysis treats each prompt‑run as independent; in practice, shared system caches or API throttling could introduce subtle correlations.
- Temperature‑0 baseline – While averaging reduces prediction noise, the paper does not explore the trade‑off between diversity (higher temperature) and statistical power for downstream user‑facing applications.
- Future directions – Extending the noise taxonomy to human‑in‑the‑loop evaluations, integrating Bayesian hierarchical models for even tighter uncertainty estimates, and building a public “noise leaderboard” for emerging LLMs.
Authors
- Sida Wang
Paper Information
- arXiv ID: 2512.21326v1
- Categories: cs.LG, cs.AI, cs.CL, stat.ML
- Published: December 24, 2025