[Paper] Who can we trust? LLM-as-a-jury for Comparative Assessment
Source: arXiv - 2602.16610v1
Overview
The paper investigates a growing trend: using large language models (LLMs) as automated judges to compare the quality of generated text (e.g., summaries, translations, code comments). While it’s tempting to let a single LLM or a simple average of many LLMs decide which output is better, the authors show that these judges are far from consistent. They introduce BT‑sigma, a statistical framework that treats each LLM as a “juror” with its own reliability score, allowing the system to infer both the ranking of the candidate texts and how trustworthy each LLM’s opinions are—without any human‑labeled calibration data.
Key Contributions
- Empirical evidence of inconsistency: Demonstrates that LLMs produce biased and contradictory pairwise comparison probabilities across tasks.
- BT‑sigma model: Extends the classic Bradley‑Terry ranking model with a per‑judge discriminator parameter that captures each LLM’s reliability.
- Joint inference: Simultaneously learns item rankings and judge reliability solely from pairwise comparison data.
- Performance gains: Shows consistent improvement over naïve averaging methods on several NLG evaluation benchmarks.
- Interpretability: Finds a strong correlation between the learned discriminator and independent measures of LLM judgment consistency, effectively providing an unsupervised calibration tool.
Methodology
- Data collection: The authors generate pairwise comparison logs from multiple LLMs (e.g., GPT‑3.5, Claude, LLaMA) on standard NLG evaluation datasets. Each log records the probability that a judge prefers output A over B.
- Baseline: The common practice is to average these probabilities across judges and rank items by the resulting scores.
- Bradley‑Terry foundation: The classic Bradley‑Terry model assumes that the probability of A beating B depends on a latent “skill” score for each item.
- BT‑sigma extension:
  - Adds a discriminator σᵢ ∈ (0, 1] for each judge i.
  - The comparison probability becomes
    \[ P_{i}(A \succ B) = \sigma_i \cdot \frac{e^{\theta_A}}{e^{\theta_A}+e^{\theta_B}} + \frac{1-\sigma_i}{2} \]
    where θ_A and θ_B are the latent quality scores of the items; the second term mixes in uniform guessing so that the probabilities stay properly normalized.
  - σᵢ scales the influence of judge i: σᵢ = 1 recovers the plain Bradley‑Terry prediction, while a low σᵢ pulls the judge's probabilities toward 0.5, down‑weighting noisy or biased judges in the joint likelihood.
- Joint optimization: Using maximum likelihood over all observed pairwise outcomes, the algorithm iteratively updates item scores (θ) and judge discriminators (σ) until convergence.
- Evaluation: Rankings produced by BT‑sigma are compared against human judgments (the gold standard) using Kendall’s τ and pairwise accuracy.
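The joint optimization step can be sketched as a small gradient-ascent loop. This is a minimal illustration under one consistent reading of the per-judge discriminator (each judge's probability blends the Bradley‑Terry prediction with a 50/50 guess, weighted by σᵢ ∈ (0, 1]); it is not the paper's implementation, and all function and variable names here are mine.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_bt_sigma(comparisons, n_items, n_judges, lr=0.1, steps=4000):
    """Jointly fit item scores (theta) and judge discriminators (sigma)
    by maximum likelihood over soft pairwise outcomes.

    comparisons: list of (judge, a, b, p) tuples, where p is the judge's
    stated probability that item a beats item b.
    """
    theta = np.zeros(n_items)
    raw = np.zeros(n_judges)          # sigma = sigmoid(raw) keeps sigma in (0, 1)
    for _ in range(steps):
        g_theta = np.zeros(n_items)
        g_raw = np.zeros(n_judges)
        for j, a, b, p in comparisons:
            bt = _sigmoid(theta[a] - theta[b])   # plain Bradley-Terry probability
            s = _sigmoid(raw[j])                 # reliability sigma_j of judge j
            q = s * bt + (1.0 - s) / 2.0         # mixture with a coin flip
            dq = (p - q) / (q * (1.0 - q))       # d(log-likelihood)/dq
            g_theta[a] += dq * s * bt * (1.0 - bt)
            g_theta[b] -= dq * s * bt * (1.0 - bt)
            g_raw[j] += dq * (bt - 0.5) * s * (1.0 - s)
        theta += lr * g_theta
        raw += lr * g_raw
        theta -= theta.mean()                    # remove translation freedom in theta
    return theta, _sigmoid(raw)
```

On synthetic data generated from the same model, this loop recovers both the item ordering and the relative reliability of the judges; a real deployment would batch the comparisons and use a proper optimizer rather than fixed-step gradient ascent.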
Results & Findings
| Dataset | Avg. Baseline Accuracy | BT‑sigma Accuracy | Δ (pts) |
|---|---|---|---|
| SummEval (summarization) | 71.2% | 76.5% | +5.3 |
| MT-Bench (translation) | 68.9% | 73.8% | +4.9 |
| CodeEval (code comments) | 73.4% | 78.1% | +4.7 |
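For small item sets, the Kendall's τ agreement metric mentioned above can be computed directly from its definition: concordant minus discordant pairs, over all pairs. A toy helper (this is the τ‑a variant, which assumes no tied scores; the function name is mine):

```python
def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two rankings, given as score lists
    over the same items (assumes no tied scores)."""
    n = len(scores_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Do the two score lists order items i and j the same way?
            if (scores_a[i] > scores_a[j]) == (scores_b[i] > scores_b[j]):
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings score 1.0, fully reversed rankings score -1.0, and partial agreement falls in between.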
- Consistent uplift: BT‑sigma outperforms simple averaging across all benchmarks, narrowing the gap to human‑rated rankings.
- Discriminator validity: Judges with higher σ values also exhibit higher cycle consistency (i.e., A > B, B > C ⇒ A > C) when examined independently, confirming that σ captures genuine reliability.
- Robustness to missing supervision: The model does not require any human‑annotated calibration data; it learns reliability purely from the pattern of contradictions in the LLM judgments themselves.
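The cycle-consistency check used to validate σ can be reproduced in a few lines: given one judge's hard pairwise preferences, count how often A ≻ B and B ≻ C also yield A ≻ C. A minimal sketch (the function name and boolean-matrix encoding are mine, not the paper's):

```python
from itertools import permutations

def cycle_consistency(pref):
    """pref[a][b] is True if the judge prefers item a over item b.
    Returns the fraction of ordered triples (a, b, c) with a > b and
    b > c for which the judge also says a > c (transitivity rate)."""
    n = len(pref)
    total = consistent = 0
    for a, b, c in permutations(range(n), 3):
        if pref[a][b] and pref[b][c]:
            total += 1
            consistent += pref[a][c]
    return consistent / total if total else 1.0
```

A fully transitive judge scores 1.0; a judge with rock-paper-scissors preferences (A ≻ B ≻ C ≻ A) scores 0.0.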
Practical Implications
- Better automated evaluation pipelines: Teams building NLG systems can replace fragile “majority‑vote” or raw probability averaging with BT‑sigma, gaining rankings that align more closely with human preferences.
- Dynamic judge selection: The discriminator scores can be used to automatically prune or down‑weight underperforming LLMs in a multi‑model ensemble, saving compute budget.
- Unsupervised calibration: In scenarios where human evaluation is too costly (e.g., continuous integration testing of chat‑bot responses), BT‑sigma offers a self‑calibrating metric that flags when a model’s judgments become erratic.
- Cross‑model benchmarking: Researchers can compare new LLMs against established ones by looking at their σ scores on a shared set of pairwise tasks, providing a quick reliability fingerprint.
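Dynamic judge selection might look like the following heuristic: prune judges whose learned σ falls below a threshold, then σ‑weight the survivors when averaging their preference probabilities for a pair. This is an illustrative sketch of the idea, not an API from the paper (the paper instead folds σ into the joint likelihood):

```python
def weighted_preference(probs, sigma, min_sigma=0.5):
    """Combine per-judge preference probabilities for one (A, B) pair.

    probs: each judge's probability that A beats B.
    sigma: each judge's learned discriminator score.
    Judges with sigma below `min_sigma` are pruned; the rest are
    averaged with weights proportional to their sigma scores.
    """
    kept = [(p, s) for p, s in zip(probs, sigma) if s >= min_sigma]
    if not kept:
        raise ValueError("all judges pruned; lower min_sigma")
    total = sum(s for _, s in kept)
    return sum(p * s for p, s in kept) / total
```

The threshold also doubles as a compute saver: pruned judges need not be queried at all on future pairs.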
Limitations & Future Work
- Assumption of independence: BT‑sigma treats each judge’s errors as independent; correlated biases (e.g., two models fine‑tuned on the same data) could still skew results.
- Scalability: Joint inference becomes heavier as the number of items and judges grows; the paper suggests stochastic EM variants but leaves full‑scale deployment to future research.
- Domain transfer: The experiments focus on English NLG tasks; applying the method to multilingual or multimodal generation remains an open question.
- Human‑in‑the‑loop extensions: Incorporating a small set of human labels to further anchor σ values could improve robustness, a direction the authors plan to explore.
Authors
- Mengjie Qian
- Guangzhi Sun
- Mark J. F. Gales
- Kate M. Knill
Paper Information
- arXiv ID: 2602.16610v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 18, 2026