[Paper] Mediocrity is the key for LLM as a Judge Anchor Selection
Source: arXiv - 2603.16848v1
Overview
The paper “Mediocrity is the key for LLM as a Judge Anchor Selection” investigates a hidden but crucial design choice in the popular “LLM‑as‑a‑judge” evaluation pipeline: which model should be used as the anchor when comparing many language models pairwise. By systematically testing 22 different anchors on the Arena‑Hard‑v2.0 benchmark, the authors show that the anchor can make or break the correlation with human judgments, and that the commonly‑used “best” or “worst” models are actually the worst choices.
Key Contributions
- Empirical audit of anchor impact – evaluated 22 distinct anchor models on a large‑scale pairwise benchmark (Arena‑Hard‑v2.0) and measured correlation with human rankings.
- Identification of “mediocre” anchors – demonstrated that anchors with mid‑range performance (neither top‑ nor bottom‑ranked) yield the most reliable relative rankings.
- Quantitative effect‑size analysis – showed that the variance introduced by anchor choice is on par with the variance caused by swapping the judge LLM itself.
- Power analysis for benchmark sizing – derived the minimum number of comparison pairs needed to distinguish competitive models with statistical confidence.
- Actionable guidelines – provided concrete recommendations for selecting anchors and sizing benchmarks in future LLM‑as‑a‑judge evaluations.
Methodology
- Dataset & Baselines – The authors used the Arena‑Hard‑v2.0 dataset, which contains human‑rated pairwise comparisons of responses from 21 LLMs across a variety of prompts.
- Anchor pool – 22 candidate anchors were assembled, ranging from the strongest model (e.g., GPT‑4‑style) to the weakest open‑source baseline, plus several “middle‑of‑the‑road” models.
- Pairwise evaluation pipeline – For each anchor, every target model’s output was compared against the anchor’s output using a fixed judge LLM (the “evaluator”). The judge’s decision was then aggregated into a ranking of the target models.
- Correlation measurement – The resulting rankings were compared to the human‑derived gold ranking using Kendall’s τ and Spearman’s ρ.
- Effect‑size & power analysis – Statistical techniques (ANOVA, bootstrap resampling) quantified how much anchor choice shifts the correlation and estimated the sample size needed to achieve a desired confidence level.
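The correlation step above can be illustrated concretely: Kendall's τ counts concordant versus discordant pairs between two rankings. The sketch below uses a pure-Python tau-a implementation with made-up rankings, not values from the paper:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items."""
    assert len(rank_a) == len(rank_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant if both rankings order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative: ranking of 8 target models induced by some anchor
# vs. the human-derived gold ranking (hypothetical numbers).
anchor_rank = [1, 3, 2, 5, 4, 6, 8, 7]
human_rank  = [1, 2, 3, 4, 5, 6, 7, 8]
print(round(kendall_tau(anchor_rank, human_rank), 2))  # → 0.79
```

In practice one would use `scipy.stats.kendalltau` (which also handles ties via tau-b), but the hand-rolled version makes the pair-counting logic explicit.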
Results & Findings
| Anchor type | Correlation with human ranking (τ) | Observations |
|---|---|---|
| Top‑performing (best) | ~0.30 | Consistently over‑estimates all other models, compressing the ranking signal. |
| Bottom‑performing (worst) | ~0.28 | Under‑estimates most models, leading to similar compression. |
| Mediocre (mid‑range) | ~0.55–0.60 | Preserves relative differences; highest alignment with human judgments. |
| Randomly selected | ~0.45 | Better than extremes but still variable. |
- Anchor effect size: Switching from a “best” to a “mediocre” anchor changes τ by ~0.25, comparable to swapping the judge LLM from GPT‑3.5 to GPT‑4.
- Benchmark size: With the standard 200‑pair sample used in many public benchmarks, the 95 % confidence interval for τ spans ±0.12, making it impossible to reliably separate models that differ by <0.1 in human ranking. The authors recommend at least 800–1,000 pairwise comparisons for robust discrimination.
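The sample-size effect can be sketched with a simple percentile bootstrap over per-pair judge decisions. The 0/1 win flags and win rates below are illustrative, not data from the paper:

```python
import random
import statistics

def bootstrap_ci(win_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a model's win rate
    against the anchor. win_flags holds one 0/1 judge decision per pair."""
    rng = random.Random(seed)
    n = len(win_flags)
    stats = sorted(
        statistics.mean(rng.choices(win_flags, k=n)) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Same 55% win rate, different budgets: the interval tightens with more pairs.
flags_small = [1] * 110 + [0] * 90      # 200 comparisons
flags_large = [1] * 550 + [0] * 450     # 1,000 comparisons
print(bootstrap_ci(flags_small))
print(bootstrap_ci(flags_large))
```

Running this shows the 1,000-pair interval is markedly narrower than the 200-pair one, mirroring the paper's argument for larger benchmarks.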
Practical Implications
- Evaluation pipelines: Teams building or benchmarking new LLMs should avoid using the strongest or weakest model as the anchor. Instead, pick a model that sits in the middle of the performance spectrum (e.g., a well‑tuned open‑source model that is neither state‑of‑the‑art nor a baseline).
- Resource budgeting: Since anchor choice can shift results as much as swapping the judge LLM itself, developers may get more value from careful anchor selection than from simply scaling up the number of judge calls.
- Benchmark design: Public leaderboards (e.g., LMSYS’s Chatbot Arena, Hugging Face’s model‑evaluation suites) can improve credibility by publishing the anchor model and the number of pairwise comparisons, and by adopting the recommended sample sizes.
- Automated tooling: The paper’s power‑analysis formulas can be baked into evaluation libraries (e.g., `lm-eval`, OpenAI Evals) to automatically suggest the minimum number of comparisons needed for a given confidence target.
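This summary does not reproduce the paper’s exact formulas, but a textbook two-proportion sample-size calculation gives the flavor of such tooling. The `min_pairs` helper and its hardcoded z-values are a standard normal approximation, not the authors’ derivation:

```python
import math

def min_pairs(p1, p2, alpha=0.05, power=0.8):
    """Approximate number of comparison pairs needed to distinguish two
    models with true win rates p1 and p2 against the same anchor.
    Standard normal-approximation formula for comparing two proportions
    (illustrative; not the paper's exact power analysis)."""
    z_a = 1.96   # two-sided alpha = 0.05
    z_b = 0.84   # power = 0.80
    pbar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pbar * (1 - pbar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Closely matched models (50% vs. 55% win rate) need far more pairs
# than clearly separated ones (50% vs. 60%).
print(min_pairs(0.50, 0.55))
print(min_pairs(0.50, 0.60))
```

The qualitative takeaway matches the paper: distinguishing competitive models reliably requires comparison budgets well beyond the ~200 pairs common in public benchmarks.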
Limitations & Future Work
- Judge model dependency – The study kept the judge LLM fixed; different judges may interact with anchors in non‑linear ways, which warrants a broader cross‑judge analysis.
- Domain coverage – Arena‑Hard‑v2.0 focuses on instruction‑following tasks; the findings may not transfer directly to code generation, reasoning‑heavy prompts, or multimodal outputs.
- Dynamic anchors – The authors suggest exploring adaptive anchor selection, where the anchor evolves during evaluation based on interim results—a promising direction for future research.
Authors
- Shachar Don-Yehiya
- Asaf Yehudai
- Leshem Choshen
- Omri Abend
Paper Information
- arXiv ID: 2603.16848v1
- Categories: cs.CL
- Published: March 17, 2026
- PDF: Download PDF