[Paper] Mediocrity is the key for LLM as a Judge Anchor Selection
Source: arXiv - 2603.16848v1
Overview
The paper “Mediocrity is the key for LLM as a Judge Anchor Selection” investigates a hidden but crucial design choice in the popular “LLM‑as‑a‑judge” evaluation pipeline: which model should be used as the anchor when comparing many language models pairwise. By systematically testing 22 different anchors on the Arena‑Hard‑v2.0 benchmark, the authors show that the anchor can make or break the correlation with human judgments, and that the commonly‑used “best” or “worst” models are actually the worst choices.
Key Contributions
- Empirical audit of anchor impact – evaluated 22 distinct anchor models on a large‑scale pairwise benchmark (Arena‑Hard‑v2.0) and measured correlation with human rankings.
- Identification of “mediocre” anchors – demonstrated that anchors with mid‑range performance (neither top‑ nor bottom‑ranked) yield the most reliable relative rankings.
- Quantitative effect‑size analysis – showed that the variance introduced by anchor choice is on par with the variance caused by swapping the judge LLM itself.
- Power analysis for benchmark sizing – derived the minimum number of comparison pairs needed to distinguish competitive models with statistical confidence.
- Actionable guidelines – provided concrete recommendations for selecting anchors and sizing benchmarks in future LLM‑as‑a‑judge evaluations.
Methodology
- Dataset & Baselines – The authors used the Arena‑Hard‑v2.0 dataset, which contains human‑rated pairwise comparisons of responses from 21 LLMs across a variety of prompts.
- Anchor pool – 22 candidate anchors were assembled, ranging from the strongest model (e.g., GPT‑4‑style) to the weakest open‑source baseline, plus several “middle‑of‑the‑road” models.
- Pairwise evaluation pipeline – For each anchor, every target model’s output was compared against the anchor’s output using a fixed judge LLM (the “evaluator”). The judge’s decision was then aggregated into a ranking of the target models.
- Correlation measurement – The resulting rankings were compared to the human‑derived gold ranking using Kendall’s τ and Spearman’s ρ.
- Effect‑size & power analysis – Statistical techniques (ANOVA, bootstrap resampling) quantified how much anchor choice shifts the correlation and estimated the sample size needed to achieve a desired confidence level.
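The correlation step above can be illustrated concretely: Kendall's τ counts concordant versus discordant pairs between two rankings. The sketch below uses a pure-Python tau-a implementation with made-up rankings, not values from the paper:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items."""
    assert len(rank_a) == len(rank_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant if both rankings order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative: ranking of 8 target models induced by some anchor
# vs. the human-derived gold ranking (hypothetical numbers).
anchor_rank = [1, 3, 2, 5, 4, 6, 8, 7]
human_rank  = [1, 2, 3, 4, 5, 6, 7, 8]
print(round(kendall_tau(anchor_rank, human_rank), 2))  # → 0.79
```

In practice one would use `scipy.stats.kendalltau` (which also handles ties via tau-b), but the hand-rolled version makes the pair-counting logic explicit.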
Results & Findings
| Anchor type | Correlation with human ranking (τ) | Observations |
|---|---|---|
| Top‑performing (best) | ~0.30 | Consistently over‑estimates all other models, compressing the ranking signal. |
| Bottom‑performing (worst) | ~0.28 | Under‑estimates most models, leading to similar compression. |
| Mediocre (mid‑range) | ~0.55–0.60 | Preserves relative differences; highest alignment with human judgments. |
| Randomly selected | ~0.45 | Better than extremes but still variable. |
- Anchor effect size: Switching from a “best” to a “mediocre” anchor changes τ by ~0.25, comparable to swapping the judge LLM from GPT‑3.5 to GPT‑4.
- Benchmark size: With the standard 200‑pair sample used in many public benchmarks, the 95 % confidence interval for τ spans ±0.12, making it impossible to reliably separate models that differ by <0.1 in human ranking. The authors recommend at least 800–1,000 pairwise comparisons for robust discrimination.
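The sample-size effect can be sketched with a simple percentile bootstrap over per-pair judge decisions. The 0/1 win flags and win rates below are illustrative, not data from the paper:

```python
import random
import statistics

def bootstrap_ci(win_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a model's win rate
    against the anchor. win_flags holds one 0/1 judge decision per pair."""
    rng = random.Random(seed)
    n = len(win_flags)
    stats = sorted(
        statistics.mean(rng.choices(win_flags, k=n)) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Same 55% win rate, different budgets: the interval tightens with more pairs.
flags_small = [1] * 110 + [0] * 90      # 200 comparisons
flags_large = [1] * 550 + [0] * 450     # 1,000 comparisons
print(bootstrap_ci(flags_small))
print(bootstrap_ci(flags_large))
```

Running this shows the 1,000-pair interval is markedly narrower than the 200-pair one, mirroring the paper's argument for larger benchmarks.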
Practical Implications
- Evaluation pipelines: Teams building or benchmarking new LLMs should avoid using the strongest or weakest model as the anchor. Instead, pick a model that sits in the middle of the performance spectrum (e.g., a well‑tuned open‑source model that is neither state‑of‑the‑art nor a baseline).
- Resource budgeting: Since anchor choice can shift results as much as swapping the judge LLM itself, developers may get more value from careful anchor selection than from simply scaling up the number of judge calls.
- Benchmark design: Public leaderboards (e.g., LMSYS’s Chatbot Arena, Hugging Face’s model‑evaluation suites) can improve credibility by publishing the anchor model and the number of pairwise comparisons, and by adopting the recommended sample sizes.
- Automated tooling: The paper’s power‑analysis formulas can be baked into evaluation libraries (e.g., `lm-eval`, OpenAI Evals) to automatically suggest the minimum number of comparisons needed for a given confidence target.
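This summary does not reproduce the paper’s exact formulas, but a textbook two-proportion sample-size calculation gives the flavor of such tooling. The `min_pairs` helper and its hardcoded z-values are a standard normal approximation, not the authors’ derivation:

```python
import math

def min_pairs(p1, p2, alpha=0.05, power=0.8):
    """Approximate number of comparison pairs needed to distinguish two
    models with true win rates p1 and p2 against the same anchor.
    Standard normal-approximation formula for comparing two proportions
    (illustrative; not the paper's exact power analysis)."""
    z_a = 1.96   # two-sided alpha = 0.05
    z_b = 0.84   # power = 0.80
    pbar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pbar * (1 - pbar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Closely matched models (50% vs. 55% win rate) need far more pairs
# than clearly separated ones (50% vs. 60%).
print(min_pairs(0.50, 0.55))
print(min_pairs(0.50, 0.60))
```

The qualitative takeaway matches the paper: distinguishing competitive models reliably requires comparison budgets well beyond the ~200 pairs common in public benchmarks.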
Limitations & Future Work
- Judge model dependency – The study kept the judge LLM fixed; different judges may interact with anchors in non‑linear ways, which warrants a broader cross‑judge analysis.
- Domain coverage – Arena‑Hard‑v2.0 focuses on instruction‑following tasks; the findings may not transfer directly to code generation, reasoning‑heavy prompts, or multimodal outputs.
- Dynamic anchors – The authors suggest exploring adaptive anchor selection, where the anchor evolves during evaluation based on interim results—a promising direction for future research.
Authors
- Shachar Don-Yehiya
- Asaf Yehudai
- Leshem Choshen
- Omri Abend
Paper Information
- arXiv ID: 2603.16848v1
- Categories: cs.CL
- Published: March 17, 2026
- PDF: Download PDF