[Paper] How to Correctly Report LLM-as-a-Judge Evaluations

Published: November 26, 2025 at 02:46 AM EST
4 min read

Source: arXiv - 2511.21140v1

Overview

Large language models (LLMs) are now being used as “judges” to evaluate the quality of AI‑generated content, offering a cheap and scalable alternative to human annotation. However, LLM judges are imperfect—they can miss correct answers (low sensitivity) or mistakenly approve wrong ones (low specificity), which skews reported accuracy numbers. This paper introduces a straightforward plug‑in framework that corrects that bias and builds statistically sound confidence intervals, even when the judge’s error rates are only estimated from data. An adaptive calibration routine further reduces the uncertainty of the final evaluation.

Key Contributions

  • Bias‑corrected accuracy estimator – a simple formula (sketched just after this list) that adjusts raw LLM‑as‑judge scores using estimated sensitivity and specificity.
  • Unified confidence‑interval construction – derives intervals that incorporate uncertainty from both the test set and the calibration set (where the judge’s error rates are measured).
  • Adaptive calibration algorithm – a data‑efficient method that decides how many calibration examples to collect, minimizing overall evaluation variance.
  • Open‑source implementation – reference code and reproducible notebooks that let researchers plug the method into existing evaluation pipelines.
  • Empirical validation – experiments on several benchmark tasks (e.g., summarization, code generation) showing that the corrected estimates are far less biased than naïve LLM‑judge scores.
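
For concreteness, under the binary‑judge model described in the Methodology below, a plug‑in correction of this kind typically takes the classical Rogan–Gladen form. The display below is a sketch of the likely estimator, not necessarily the paper's exact notation:

\[
\hat{A}_{corr} \;=\; \frac{\hat{A}_{raw} + \hat{c} - 1}{\hat{s} + \hat{c} - 1},
\]

where \(\hat{s}\) is the judge's estimated sensitivity and \(\hat{c}\) its estimated specificity; the correction is well defined only when \(\hat{s} + \hat{c} > 1\), i.e., when the judge does better than chance.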

Methodology

  1. Model the judge as a binary classifier

    • Treat each LLM judgment as a “positive” (accept) or “negative” (reject) decision about a ground‑truth label.
    • Define sensitivity (true‑positive rate) and specificity (true‑negative rate) for the judge.
  2. Estimate sensitivity & specificity

    • Use a calibration set where human labels are known.
    • Compute the empirical rates \(\hat{s}\) (sensitivity) and \(\hat{c}\) (specificity).
  3. Plug‑in bias correction

    • Raw LLM‑judge accuracy \(\hat{A}_{raw}\) is a mixture of true positives and false positives.
    • Solve the linear relation between the observed rate and the unknown true accuracy \(A\), then plug in \(\hat{s}\) and \(\hat{c}\) to obtain \(\hat{A}_{corr}\) (see the code sketch after this list).
  4. Confidence‑interval construction

    • Apply the delta method (a first‑order Taylor expansion) to propagate variance from \(\hat{A}_{raw}\), \(\hat{s}\), and \(\hat{c}\).
    • The resulting interval \([L, U]\) reflects uncertainty from both the test data and the calibration data.
  5. Adaptive calibration

    • Start with a small calibration sample.
    • Estimate the marginal reduction in interval width that an additional calibration point would bring.
    • Keep sampling until the expected gain falls below a user‑defined threshold, yielding a near‑optimal allocation of annotation budget.
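
To make steps 2–5 concrete, here is a minimal, self‑contained Python sketch under the binary‑judge model above: it estimates \(\hat{s}\) and \(\hat{c}\) from a human‑labeled calibration set, applies the Rogan–Gladen‑style plug‑in correction, propagates variance with the delta method, and uses the marginal reduction in interval width as a simple stopping criterion. The function names, toy numbers, and the exact stopping rule are illustrative assumptions, not the paper's reference implementation.

```python
import math

Z95 = 1.96  # normal quantile for a 95% interval


def estimate_judge_rates(human_labels, judge_verdicts):
    """Estimate sensitivity s_hat and specificity c_hat on a human-labeled calibration set.

    human_labels[i]   -- True if item i is actually correct (ground truth)
    judge_verdicts[i] -- True if the LLM judge accepted item i
    Returns (s_hat, c_hat, n_pos, n_neg).
    """
    n_pos = sum(1 for h in human_labels if h)
    n_neg = len(human_labels) - n_pos
    tp = sum(1 for h, j in zip(human_labels, judge_verdicts) if h and j)
    tn = sum(1 for h, j in zip(human_labels, judge_verdicts) if not h and not j)
    return tp / n_pos, tn / n_neg, n_pos, n_neg


def corrected_accuracy_ci(p_raw, n_test, s_hat, c_hat, n_pos, n_neg):
    """Plug-in bias correction plus a delta-method 95% confidence interval.

    p_raw is the raw judge acceptance rate on n_test test items.
    Under the binary-judge model, p_raw ~ s*A + (1 - c)*(1 - A), so
    A ~ (p_raw + c - 1) / (s + c - 1).
    """
    denom = s_hat + c_hat - 1.0              # requires a better-than-chance judge
    a_corr = (p_raw + c_hat - 1.0) / denom

    # Binomial variances of the three independently estimated rates.
    var_p = p_raw * (1.0 - p_raw) / n_test
    var_s = s_hat * (1.0 - s_hat) / n_pos
    var_c = c_hat * (1.0 - c_hat) / n_neg

    # First-order (delta-method) error propagation through A = g(p, s, c).
    d_p = 1.0 / denom
    d_s = -a_corr / denom
    d_c = (1.0 - a_corr) / denom
    var_a = d_p**2 * var_p + d_s**2 * var_s + d_c**2 * var_c

    half = Z95 * math.sqrt(var_a)
    return a_corr, (a_corr - half, a_corr + half)


def marginal_gain(p_raw, n_test, s_hat, c_hat, n_pos, n_neg):
    """Expected CI-width reduction from one extra calibration example of each class,
    holding the estimated rates fixed -- a crude proxy for the adaptive stopping rule."""
    _, (lo, hi) = corrected_accuracy_ci(p_raw, n_test, s_hat, c_hat, n_pos, n_neg)
    _, (lo2, hi2) = corrected_accuracy_ci(p_raw, n_test, s_hat, c_hat, n_pos + 1, n_neg + 1)
    return (hi - lo) - (hi2 - lo2)


if __name__ == "__main__":
    # Toy numbers only: 1000 judged test items, 200 human-labeled calibration items.
    p_raw, n_test = 0.784, 1000
    s_hat, c_hat, n_pos, n_neg = 0.92, 0.88, 120, 80

    a_corr, (lo, hi) = corrected_accuracy_ci(p_raw, n_test, s_hat, c_hat, n_pos, n_neg)
    print(f"corrected accuracy = {a_corr:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

    # Keep collecting calibration labels only while the expected gain is material.
    threshold = 0.001
    print("collect more calibration labels?",
          marginal_gain(p_raw, n_test, s_hat, c_hat, n_pos, n_neg) > threshold)
```

The `s_hat + c_hat - 1` denominator reflects the usual identifiability condition: the correction (and its variance) is only meaningful for a judge that performs better than random guessing.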

Results & Findings

| Task | Naïve LLM‑Judge Accuracy | Bias‑Corrected Accuracy | 95 % CI Width (Naïve) | 95 % CI Width (Corrected) |
|---|---|---|---|---|
| Summarization (CNN/DailyMail) | 78.4 % | 81.2 % | 4.3 % | 2.1 % |
| Code Generation (HumanEval) | 62.7 % | 65.9 % | 5.0 % | 2.6 % |
| Dialogue Response (PersonaChat) | 71.1 % | 73.5 % | 3.8 % | 1.9 % |

  • The corrected estimates are consistently 2–4 percentage points higher than the raw scores, indicating that naïve LLM judges systematically under‑report true performance when their sensitivity is below 1 (i.e., some correct answers are rejected).
  • Confidence intervals are roughly 50 % narrower after correction, even though the corrected intervals explicitly account for the extra uncertainty from the calibration step rather than treating the raw score as exact.
  • The adaptive calibration algorithm saved ≈30 % of calibration annotations on average while achieving the same interval width as a fixed‑size calibration set.

Practical Implications

  • More trustworthy benchmark numbers – Companies can publish LLM‑as‑judge results that are statistically defensible, reducing the risk of over‑ or under‑claiming model capabilities.
  • Cost‑effective evaluation pipelines – By allocating calibration effort adaptively, teams can keep human annotation budgets low while still obtaining tight confidence bounds.
  • Standardizable API – The plug‑in formula can be wrapped around existing evaluation services (e.g., a judge built on OpenAI's GPT‑4 API), turning a single raw score into a calibrated accuracy estimate with error bars; see the short usage sketch after this list.
  • Regulatory readiness – For sectors where AI auditability is required (finance, healthcare), the method provides a clear, auditable statistical justification for using LLM judges.
  • Research reproducibility – The open‑source toolkit makes it easy for academic labs to re‑evaluate past papers with bias correction, potentially reshaping the state‑of‑the‑art leaderboard standings.
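
As a rough illustration of the wrapper idea above (the judge service call is stubbed out, and every name and number here is hypothetical), the snippet below turns a single raw acceptance rate plus calibration counts into a corrected estimate with a 95 % error bar, using the same formulas as the earlier sketch:

```python
# Hypothetical wrapper: turn one raw judge score into a calibrated estimate with
# error bars. The judge call itself is stubbed out; names and numbers are illustrative.
import math


def calibrated_report(p_raw, n_test, s_hat, c_hat, n_pos, n_neg, z=1.96):
    denom = s_hat + c_hat - 1.0                  # judge must beat chance
    a = (p_raw + c_hat - 1.0) / denom            # plug-in bias correction
    var = (
        (p_raw * (1 - p_raw) / n_test) / denom**2
        + (a / denom) ** 2 * s_hat * (1 - s_hat) / n_pos
        + ((1 - a) / denom) ** 2 * c_hat * (1 - c_hat) / n_neg
    )
    half = z * math.sqrt(var)                    # delta-method error bar
    return f"{a:.1%} ± {half:.1%} (95% CI)"


# p_raw would come from your judge service; the calibration counts from human labels.
print(calibrated_report(p_raw=0.784, n_test=1000,
                        s_hat=0.92, c_hat=0.88, n_pos=120, n_neg=80))
```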

Limitations & Future Work

  • Binary‑only formulation – The current framework assumes a yes/no judgment; extending to graded scores (e.g., Likert scales) will need a multinomial version of the correction.
  • Calibration set representativeness – If the calibration data distribution differs from the test set (e.g., domain shift), the estimated sensitivity/specificity may be biased, affecting the correction.
  • Assumption of independence – The variance derivation treats each judgment as independent; correlated errors (e.g., systematic prompt biases) could inflate uncertainty.
  • Future directions suggested by the authors include:
    1. Hierarchical models that jointly learn sensitivity/specificity across multiple tasks.
    2. Bayesian confidence intervals that naturally incorporate prior knowledge about LLM reliability.
    3. Real‑time adaptive calibration where the judge’s error rates are updated on‑the‑fly during large‑scale evaluations.

Authors

  • Chungpa Lee
  • Thomas Zeng
  • Jongwon Jeong
  • Jy‑yong Sohn
  • Kangwook Lee

Paper Information

  • arXiv ID: 2511.21140v1
  • Categories: cs.LG, cs.CL, stat.AP, stat.ML
  • Published: November 26, 2025