[Paper] Measuring all the noises of LLM Evals
Source: arXiv - 2512.21326v1
Overview
The paper “Measuring all the noises of LLM Evals” tackles a surprisingly practical problem: how to tell whether differences you see between large language models (LLMs) are real or just random fluctuations. By rigorously defining and quantifying three distinct sources of “noise” in model evaluations, the authors give developers a statistical toolbox that can be applied out‑of‑the‑box to a wide range of benchmark tests.
Key Contributions
- Formal taxonomy of evaluation noise – separates prediction noise (variability in a model’s answer to the same prompt), data noise (variability from the sampled set of prompts), and the total noise that combines both via the law of total variance.
- All‑pairs paired analysis – a scalable method that simultaneously performs paired statistical tests for every pair of LLMs in a study, leveraging millions of individual predictions.
- Empirical noise atlas – measurements across dozens of popular LLMs, tasks (e.g., QA, summarization, code generation), and evaluation settings, revealing consistent patterns in noise magnitude.
- Practical guidelines – shows that prediction noise usually dominates data noise, so reducing it (e.g., by averaging repeated runs, majority voting, or deterministic temperature‑0 decoding) can dramatically boost statistical power.
- Open‑source tooling – the authors release code that automates noise estimation and significance testing, requiring no custom statistical expertise.
Methodology
- Define noise components
- Prediction noise: For a fixed prompt, run the model multiple times (different random seeds, temperature settings) and record the variance of the scores.
- Data noise: Sample many prompts from the benchmark and compute variance across prompts for a single deterministic model run.
- Total noise: Apply the law of total variance: Var(total) = E[Var(prediction | prompt)] + Var(E[prediction | prompt]) (see the sketch below).
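A minimal sketch of this decomposition, assuming per‑prompt scores for one model are arranged as a runs × prompts NumPy array; the function and variable names are illustrative and not from the paper's released code:

```python
import numpy as np

def decompose_noise(scores: np.ndarray) -> dict:
    """Split score variance into prediction, data, and total components.

    scores: array of shape (R runs, M prompts) for a single model, where
    each entry is the score of one run on one prompt (layout assumed here).
    """
    # Prediction noise: variance across repeated runs of the same prompt,
    # averaged over prompts -> E[Var(prediction | prompt)].
    prediction_noise = scores.var(axis=0, ddof=1).mean()
    # Data noise: variance of the per-prompt mean score across prompts
    # -> Var(E[prediction | prompt]).
    data_noise = scores.mean(axis=0).var(ddof=1)
    # Law of total variance: total = prediction + data.
    return {
        "prediction_noise": prediction_noise,
        "data_noise": data_noise,
        "total_noise": prediction_noise + data_noise,
    }

# Synthetic example: 5 runs of one model on 200 prompts.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.7, scale=0.1, size=(5, 200))
print(decompose_noise(scores))
```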
- All‑pairs paired framework
- For N models, generate predictions for the same set of M prompts, repeating each prompt R times per model.
- Construct paired differences for every model pair (i, j) on each prompt and each repeat, yielding a massive matrix of differences.
- Use standard paired‑t or Wilcoxon tests on this matrix, but because every pair shares the same underlying data, the variance estimates are pooled, giving far tighter confidence intervals (a simplified sketch follows).
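A simplified sketch of the all‑pairs comparison, using a plain paired t‑test from SciPy on per‑prompt score differences. The (models × prompts) layout, with repeats already averaged per prompt, and all names are assumptions for illustration, not the paper's released tooling:

```python
from itertools import combinations

import numpy as np
from scipy import stats

def all_pairs_paired_tests(scores: np.ndarray, model_names: list[str]) -> list[dict]:
    """scores: array of shape (N models, M prompts) of per-prompt mean scores
    on a shared prompt set (repeats assumed already averaged per prompt)."""
    results = []
    for i, j in combinations(range(len(model_names)), 2):
        diffs = scores[i] - scores[j]                 # paired per-prompt differences
        t_stat, p_value = stats.ttest_rel(scores[i], scores[j])
        results.append({
            "pair": (model_names[i], model_names[j]),
            "mean_diff": diffs.mean(),
            "paired_se": diffs.std(ddof=1) / np.sqrt(len(diffs)),
            "t": t_stat,
            "p": p_value,
        })
    return results

# Synthetic example: 3 models scored 0/1 on the same 500 prompts.
rng = np.random.default_rng(1)
scores = rng.binomial(1, [[0.70], [0.72], [0.71]], size=(3, 500)).astype(float)
for row in all_pairs_paired_tests(scores, ["model_a", "model_b", "model_c"]):
    print(row)
```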
- Large‑scale measurement
- The authors run the pipeline on 10+ public LLM families (GPT‑3.5, LLaMA, Claude, etc.) across 15 benchmark suites, totaling > 10 M prompt‑model‑run triples.
- Noise estimates are then aggregated to produce per‑benchmark “noise fingerprints”.
Results & Findings
| Finding | What the numbers say |
|---|---|
| Benchmark‑specific total noise is stable | Across model pairs, the total variance for a given benchmark varies by < 5 % – indicating a characteristic “noise floor” per task. |
| Prediction noise > data noise | On average, prediction noise accounts for ~60‑70 % of total variance, while data noise contributes ~30‑40 %. |
| Averaging reduces noise dramatically | Running a model 5 times and averaging scores cuts prediction noise by ~80 % (illustrated below), turning a previously non‑significant 2 % performance gap into a statistically robust 5 σ effect. |
| All‑pairs paired test outperforms naïve t‑tests | For the same data, the paired approach yields confidence intervals ~2× narrower, enabling detection of effect sizes as small as 0.5 % absolute accuracy improvement. |
These patterns hold across domains (text, code, reasoning) and model scales, suggesting the findings are not idiosyncratic to a single architecture.
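The ~80 % figure in the table follows directly from averaging: the variance of a mean of R independent runs is the single‑run prediction variance divided by R. A back‑of‑the‑envelope check, with a purely illustrative variance value:

```python
# Averaging R independent runs divides prediction variance by R,
# so R = 5 removes about 80 % of it.
R = 5
single_run_prediction_var = 0.010            # illustrative value, not from the paper
averaged_var = single_run_prediction_var / R
reduction = 1 - averaged_var / single_run_prediction_var
print(f"variance after averaging {R} runs: {averaged_var:.4f}")   # 0.0020
print(f"reduction in prediction noise: {reduction:.0%}")           # 80%
```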
Practical Implications
- Rapid significance checks – Developers can plug the released library into their CI pipelines to automatically flag whether a new model version truly outperforms the previous one, without writing custom statistical code (a generic stand‑in sketch appears after this list).
- Cost‑effective evaluation – Knowing that prediction noise dominates means you can invest compute in averaging a few runs per prompt (e.g., 3‑5) rather than expanding the benchmark size, saving API costs while gaining statistical power.
- Benchmark design – When creating a new test set, aim for prompts that minimize data noise (e.g., balanced difficulty) because the remaining variance will be mostly prediction‑driven and thus controllable.
- Model debugging – If a model’s prediction noise spikes on a particular task, it may indicate instability in the decoding strategy (temperature, top‑k) or a need for better prompt engineering.
- Research reproducibility – By reporting the three noise components alongside performance numbers, papers can give readers a clear picture of how “tight” the results are, reducing the risk of over‑claiming marginal gains.
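As a rough illustration of such a CI gate, the sketch below uses a plain paired t‑test from SciPy; it is a stand‑in under assumed names and thresholds, not the authors' released library or its API:

```python
import numpy as np
from scipy import stats

def regression_gate(candidate: np.ndarray, baseline: np.ndarray,
                    alpha: float = 0.05) -> bool:
    """Return True if the candidate model is significantly better than the
    baseline, pairing the two models' scores prompt by prompt."""
    t_stat, p_value = stats.ttest_rel(candidate, baseline)
    improved = candidate.mean() > baseline.mean()
    return improved and p_value < alpha

# Synthetic example: per-prompt scores on 1,000 shared prompts, each already
# averaged over a few runs per prompt to suppress prediction noise.
rng = np.random.default_rng(2)
baseline = rng.normal(0.70, 0.05, size=1000)
candidate = baseline + rng.normal(0.01, 0.05, size=1000)
print("ship it" if regression_gate(candidate, baseline) else "not significant")
```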
Limitations & Future Work
- Scope of benchmarks – The study focuses on standard academic and industry benchmarks; highly interactive or multimodal tasks (e.g., vision‑language) may exhibit different noise structures.
- Assumption of independence – The paired analysis treats each prompt‑run as independent; in practice, shared system caches or API throttling could introduce subtle correlations.
- Temperature‑0 baseline – While averaging reduces prediction noise, the paper does not explore the trade‑off between diversity (higher temperature) and statistical power for downstream user‑facing applications.
- Future directions – Extending the noise taxonomy to human‑in‑the‑loop evaluations, integrating Bayesian hierarchical models for even tighter uncertainty estimates, and building a public “noise leaderboard” for emerging LLMs.
Authors
- Sida Wang
Paper Information
- arXiv ID: 2512.21326v1
- Categories: cs.LG, cs.AI, cs.CL, stat.ML
- Published: December 24, 2025