[Paper] Who can we trust? LLM-as-a-jury for Comparative Assessment
Source: arXiv - 2602.16610v1
Overview
The paper investigates a growing trend: using large language models (LLMs) as automated judges to compare the quality of generated text (e.g., summaries, translations, code comments). While it’s tempting to let a single LLM or a simple average of many LLMs decide which output is better, the authors show that these judges are far from consistent. They introduce BT‑sigma, a statistical framework that treats each LLM as a “juror” with its own reliability score, allowing the system to infer both the ranking of the candidate texts and how trustworthy each LLM’s opinions are—without any human‑labeled calibration data.
Key Contributions
- Empirical evidence of inconsistency: Demonstrates that LLMs produce biased and contradictory pairwise comparison probabilities across tasks.
- BT‑sigma model: Extends the classic Bradley‑Terry ranking model with a per‑judge discriminator parameter that captures each LLM’s reliability.
- Joint inference: Simultaneously learns item rankings and judge reliability solely from pairwise comparison data.
- Performance gains: Shows consistent improvement over naïve averaging methods on several NLG evaluation benchmarks.
- Interpretability: Finds a strong correlation between the learned discriminator and independent measures of LLM judgment consistency, effectively providing an unsupervised calibration tool.
Methodology
- Data collection: The authors generate pairwise comparison logs from multiple LLMs (e.g., GPT‑3.5, Claude, LLaMA) on standard NLG evaluation datasets. Each log records the probability that a judge prefers output A over B.
- Baseline: The common practice is to average these probabilities across judges and rank items by the resulting scores.
- Bradley‑Terry foundation: The classic Bradley‑Terry model assumes that the probability of A beating B depends on a latent “skill” score for each item.
- BT‑sigma extension:
  - Adds a discriminator σᵢ ∈ (0, 1] for each judge i.
  - The comparison probability becomes
    \[ P_{i}(A \succ B) = \sigma_i \cdot \frac{e^{\theta_A}}{e^{\theta_A}+e^{\theta_B}} + \frac{1-\sigma_i}{2} \]
    where θ_A and θ_B are the latent quality scores of the items; the second term mixes in uniform guessing so that the probabilities stay properly normalized.
  - σᵢ scales the influence of judge i: σᵢ = 1 recovers the plain Bradley‑Terry prediction, while a low σᵢ pulls the judge's probabilities toward 0.5, down‑weighting noisy or biased judges in the joint likelihood.
- Joint optimization: Using maximum likelihood over all observed pairwise outcomes, the algorithm iteratively updates item scores (θ) and judge discriminators (σ) until convergence.
- Evaluation: Rankings produced by BT‑sigma are compared against human judgments (the gold standard) using Kendall’s τ and pairwise accuracy.
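The joint optimization step can be sketched as a small gradient-ascent loop. This is a minimal illustration under one consistent reading of the per-judge discriminator (each judge's probability blends the Bradley‑Terry prediction with a 50/50 guess, weighted by σᵢ ∈ (0, 1]); it is not the paper's implementation, and all function and variable names here are mine.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_bt_sigma(comparisons, n_items, n_judges, lr=0.1, steps=4000):
    """Jointly fit item scores (theta) and judge discriminators (sigma)
    by maximum likelihood over soft pairwise outcomes.

    comparisons: list of (judge, a, b, p) tuples, where p is the judge's
    stated probability that item a beats item b.
    """
    theta = np.zeros(n_items)
    raw = np.zeros(n_judges)          # sigma = sigmoid(raw) keeps sigma in (0, 1)
    for _ in range(steps):
        g_theta = np.zeros(n_items)
        g_raw = np.zeros(n_judges)
        for j, a, b, p in comparisons:
            bt = _sigmoid(theta[a] - theta[b])   # plain Bradley-Terry probability
            s = _sigmoid(raw[j])                 # reliability sigma_j of judge j
            q = s * bt + (1.0 - s) / 2.0         # mixture with a coin flip
            dq = (p - q) / (q * (1.0 - q))       # d(log-likelihood)/dq
            g_theta[a] += dq * s * bt * (1.0 - bt)
            g_theta[b] -= dq * s * bt * (1.0 - bt)
            g_raw[j] += dq * (bt - 0.5) * s * (1.0 - s)
        theta += lr * g_theta
        raw += lr * g_raw
        theta -= theta.mean()                    # remove translation freedom in theta
    return theta, _sigmoid(raw)
```

On synthetic data generated from the same model, this loop recovers both the item ordering and the relative reliability of the judges; a real deployment would batch the comparisons and use a proper optimizer rather than fixed-step gradient ascent.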
Results & Findings
| Dataset | Avg. Baseline Accuracy | BT‑sigma Accuracy | Δ (pts) |
|---|---|---|---|
| SummEval (summarization) | 71.2% | 76.5% | +5.3 |
| MT-Bench (translation) | 68.9% | 73.8% | +4.9 |
| CodeEval (code comments) | 73.4% | 78.1% | +4.7 |
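For small item sets, the Kendall's τ agreement metric mentioned above can be computed directly from its definition: concordant minus discordant pairs, over all pairs. A toy helper (this is the τ‑a variant, which assumes no tied scores; the function name is mine):

```python
def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two rankings, given as score lists
    over the same items (assumes no tied scores)."""
    n = len(scores_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Do the two score lists order items i and j the same way?
            if (scores_a[i] > scores_a[j]) == (scores_b[i] > scores_b[j]):
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings score 1.0, fully reversed rankings score -1.0, and partial agreement falls in between.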
- Consistent uplift: BT‑sigma outperforms simple averaging across all benchmarks, narrowing the gap to human‑rated rankings.
- Discriminator validity: Judges with higher σ values also exhibit higher cycle consistency (i.e., A > B, B > C ⇒ A > C) when examined independently, confirming that σ captures genuine reliability.
- Robustness to missing supervision: The model does not require any human‑annotated calibration data; it learns reliability purely from the pattern of contradictions in the LLM judgments themselves.
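The cycle-consistency check used to validate σ can be reproduced in a few lines: given one judge's hard pairwise preferences, count how often A ≻ B and B ≻ C also yield A ≻ C. A minimal sketch (the function name and boolean-matrix encoding are mine, not the paper's):

```python
from itertools import permutations

def cycle_consistency(pref):
    """pref[a][b] is True if the judge prefers item a over item b.
    Returns the fraction of ordered triples (a, b, c) with a > b and
    b > c for which the judge also says a > c (transitivity rate)."""
    n = len(pref)
    total = consistent = 0
    for a, b, c in permutations(range(n), 3):
        if pref[a][b] and pref[b][c]:
            total += 1
            consistent += pref[a][c]
    return consistent / total if total else 1.0
```

A fully transitive judge scores 1.0; a judge with rock-paper-scissors preferences (A ≻ B ≻ C ≻ A) scores 0.0.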
Practical Implications
- Better automated evaluation pipelines: Teams building NLG systems can replace fragile “majority‑vote” or raw probability averaging with BT‑sigma, gaining rankings that align more closely with human preferences.
- Dynamic judge selection: The discriminator scores can be used to automatically prune or down‑weight underperforming LLMs in a multi‑model ensemble, saving compute budget.
- Unsupervised calibration: In scenarios where human evaluation is too costly (e.g., continuous integration testing of chat‑bot responses), BT‑sigma offers a self‑calibrating metric that flags when a model’s judgments become erratic.
- Cross‑model benchmarking: Researchers can compare new LLMs against established ones by looking at their σ scores on a shared set of pairwise tasks, providing a quick reliability fingerprint.
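Dynamic judge selection might look like the following heuristic: prune judges whose learned σ falls below a threshold, then σ‑weight the survivors when averaging their preference probabilities for a pair. This is an illustrative sketch of the idea, not an API from the paper (the paper instead folds σ into the joint likelihood):

```python
def weighted_preference(probs, sigma, min_sigma=0.5):
    """Combine per-judge preference probabilities for one (A, B) pair.

    probs: each judge's probability that A beats B.
    sigma: each judge's learned discriminator score.
    Judges with sigma below `min_sigma` are pruned; the rest are
    averaged with weights proportional to their sigma scores.
    """
    kept = [(p, s) for p, s in zip(probs, sigma) if s >= min_sigma]
    if not kept:
        raise ValueError("all judges pruned; lower min_sigma")
    total = sum(s for _, s in kept)
    return sum(p * s for p, s in kept) / total
```

The threshold also doubles as a compute saver: pruned judges need not be queried at all on future pairs.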
Limitations & Future Work
- Assumption of independence: BT‑sigma treats each judge’s errors as independent; correlated biases (e.g., two models fine‑tuned on the same data) could still skew results.
- Scalability: Joint inference becomes heavier as the number of items and judges grows; the paper suggests stochastic EM variants but leaves full‑scale deployment to future research.
- Domain transfer: The experiments focus on English NLG tasks; applying the method to multilingual or multimodal generation remains an open question.
- Human‑in‑the‑loop extensions: Incorporating a small set of human labels to further anchor σ values could improve robustness, a direction the authors plan to explore.
Authors
- Mengjie Qian
- Guangzhi Sun
- Mark J. F. Gales
- Kate M. Knill
Paper Information
- arXiv ID: 2602.16610v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 18, 2026