[Paper] Who can we trust? LLM-as-a-jury for Comparative Assessment

Published: February 18, 2026 at 12:04 PM EST
4 min read
Source: arXiv


Overview

The paper investigates a growing trend: using large language models (LLMs) as automated judges to compare the quality of generated text (e.g., summaries, translations, code comments). While it’s tempting to let a single LLM or a simple average of many LLMs decide which output is better, the authors show that these judges are far from consistent. They introduce BT‑sigma, a statistical framework that treats each LLM as a “juror” with its own reliability score, allowing the system to infer both the ranking of the candidate texts and how trustworthy each LLM’s opinions are—without any human‑labeled calibration data.

Key Contributions

  • Empirical evidence of inconsistency: Demonstrates that LLMs produce biased and contradictory pairwise comparison probabilities across tasks.
  • BT‑sigma model: Extends the classic Bradley‑Terry ranking model with a per‑judge discriminator parameter that captures each LLM’s reliability.
  • Joint inference: Simultaneously learns item rankings and judge reliability solely from pairwise comparison data.
  • Performance gains: Shows consistent improvement over naïve averaging methods on several NLG evaluation benchmarks.
  • Interpretability: Finds a strong correlation between the learned discriminator and independent measures of LLM judgment consistency, effectively providing an unsupervised calibration tool.

Methodology

  1. Data collection: The authors generate pairwise comparison logs from multiple LLMs (e.g., GPT‑3.5, Claude, LLaMA) on standard NLG evaluation datasets. Each log records the probability that a judge prefers output A over B.
  2. Baseline: The common practice is to average these probabilities across judges and rank items by the resulting scores.
  3. Bradley‑Terry foundation: The classic Bradley‑Terry model assumes that the probability of A beating B depends on a latent “skill” score for each item.
  4. BT‑sigma extension:
    • Adds a discriminator σᵢ for each judge i.

    • The comparison probability becomes:

      [ P_i(A \succ B) = \frac{e^{\sigma_i \theta_A}}{e^{\sigma_i \theta_A} + e^{\sigma_i \theta_B}} ]

      where θ_A and θ_B are the latent quality scores of the items.

    • σᵢ ∈ (0, 1] scales the influence of judge i: as σᵢ shrinks toward 0, the judge's predicted preferences flatten toward 50/50, so noisy or biased judges are down‑weighted; σᵢ = 1 recovers the standard Bradley‑Terry model for that judge.

  5. Joint optimization: Using maximum likelihood over all observed pairwise outcomes, the algorithm iteratively updates item scores (θ) and judge discriminators (σ) until convergence.
  6. Evaluation: Rankings produced by BT‑sigma are compared against human judgments (the gold standard) using Kendall’s τ and pairwise accuracy.
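The joint optimization in step 5 can be sketched with plain gradient descent. This is an illustrative reconstruction, not the authors' code: it assumes the model form P_i(A ≻ B) = sigmoid(σᵢ·(θ_A − θ_B)) and soft preference probabilities as training targets; the learning rate, step count, and synthetic judges below are all made up for the demo.

```python
# Sketch of BT-sigma-style joint inference of item scores (theta) and
# per-judge discriminators (sigma) by minimizing cross-entropy between
# the model's preference probabilities and the judges' stated ones.
import math
import numpy as np

def fit_bt_sigma(comparisons, n_items, n_judges, lr=0.1, steps=2000):
    """comparisons: iterable of (judge, item_a, item_b, p_a_wins) tuples,
    where p_a_wins is judge's stated probability that item_a is better."""
    J, A, B, P = (np.array(x) for x in zip(*comparisons))
    theta = np.zeros(n_items)        # latent item quality scores
    log_sigma = np.zeros(n_judges)   # per-judge log-discriminator
    for _ in range(steps):
        sigma = np.exp(log_sigma)
        z = sigma[J] * (theta[A] - theta[B])
        q = 1.0 / (1.0 + np.exp(-z))          # model's P(A > B)
        err = q - P                            # d(cross-entropy)/dz
        g_theta = np.zeros(n_items)
        np.add.at(g_theta, A, err * sigma[J])  # dz/dtheta_A = sigma_i
        np.add.at(g_theta, B, -err * sigma[J]) # dz/dtheta_B = -sigma_i
        g_logsig = np.zeros(n_judges)
        np.add.at(g_logsig, J, err * z)        # dz/d(log sigma_i) = z
        theta -= lr * g_theta
        log_sigma -= lr * g_logsig
        theta -= theta.mean()                  # fix the additive gauge
    return theta, np.exp(log_sigma)

# Synthetic demo: judge 0 gives sharp, consistent preferences;
# judge 1 is pure noise (always 50/50).
true_theta = [2.0, 1.0, 0.0, -1.0]
comps = []
for a in range(4):
    for b in range(a + 1, 4):
        p = 1 / (1 + math.exp(-2 * (true_theta[a] - true_theta[b])))
        comps.append((0, a, b, p))
        comps.append((1, a, b, 0.5))
theta, sigma = fit_bt_sigma(comps, n_items=4, n_judges=2)
```

After fitting, the recovered θ ordering matches the true quality ordering, and the noisy judge ends up with a markedly smaller discriminator than the consistent one, which is the behaviour the paper attributes to σ.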

Results & Findings

| Dataset | Avg. Baseline Accuracy | BT‑sigma Accuracy | Δ |
|---|---|---|---|
| SummEval (summarization) | 71.2 % | 76.5 % | +5.3 % |
| MT-Bench (translation) | 68.9 % | 73.8 % | +4.9 % |
| CodeEval (code comment) | 73.4 % | 78.1 % | +4.7 % |
  • Consistent uplift: BT‑sigma outperforms simple averaging across all benchmarks, narrowing the gap to human‑rated rankings.
  • Discriminator validity: Judges with higher σ values also exhibit higher cycle consistency (i.e., A > B, B > C ⇒ A > C) when examined independently, confirming that σ captures genuine reliability.
  • Robustness to missing supervision: The model does not require any human‑annotated calibration data; it learns reliability purely from the pattern of contradictions in the LLM judgments themselves.
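Kendall's τ, the rank-correlation metric used above to score rankings against human judgments, can be computed directly from two score lists. A minimal tie-free sketch (illustrative, not the paper's evaluation code; the example scores are hypothetical):

```python
# Kendall's tau between model scores and human scores (no tie handling).
from itertools import combinations

def kendall_tau(model_scores, human_scores):
    concordant = discordant = 0
    for i, j in combinations(range(len(model_scores)), 2):
        # A pair is concordant if both score lists order items i and j the same way.
        s = (model_scores[i] - model_scores[j]) * (human_scores[i] - human_scores[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical scores: the two rankings disagree on exactly 1 of the 10 pairs.
tau = kendall_tau([0.9, 0.7, 0.5, 0.3, 0.1], [5, 3, 4, 2, 1])  # -> 0.8
```

Pairwise accuracy, the other reported metric, is simply the fraction of concordant pairs rather than the concordant-minus-discordant difference.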

Practical Implications

  • Better automated evaluation pipelines: Teams building NLG systems can replace fragile “majority‑vote” or raw probability averaging with BT‑sigma, gaining rankings that align more closely with human preferences.
  • Dynamic judge selection: The discriminator scores can be used to automatically prune or down‑weight underperforming LLMs in a multi‑model ensemble, saving compute budget.
  • Unsupervised calibration: In scenarios where human evaluation is too costly (e.g., continuous integration testing of chat‑bot responses), BT‑sigma offers a self‑calibrating metric that flags when a model’s judgments become erratic.
  • Cross‑model benchmarking: Researchers can compare new LLMs against established ones by looking at their σ scores on a shared set of pairwise tasks, providing a quick reliability fingerprint.
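The judge-selection idea above reduces to filtering an ensemble by its learned σ scores. A minimal sketch; the 0.5 cutoff and the judge names are illustrative assumptions, not values from the paper:

```python
# Prune judges whose learned discriminator (sigma) falls below a cutoff,
# so downstream comparisons only query the reliable members of the ensemble.
def select_judges(judges, sigmas, cutoff=0.5):
    """Return the names of judges whose sigma is at least the cutoff."""
    return [name for name, s in zip(judges, sigmas) if s >= cutoff]

kept = select_judges(["judge_a", "judge_b", "judge_c"], [0.92, 0.31, 0.77])
# kept == ["judge_a", "judge_c"]
```

In practice one might instead keep the top-k judges by σ, or use σ as soft weights rather than a hard cutoff.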

Limitations & Future Work

  • Assumption of independence: BT‑sigma treats each judge’s errors as independent; correlated biases (e.g., two models fine‑tuned on the same data) could still skew results.
  • Scalability: Joint inference becomes heavier as the number of items and judges grows; the paper suggests stochastic EM variants but leaves full‑scale deployment to future research.
  • Domain transfer: The experiments focus on English NLG tasks; applying the method to multilingual or multimodal generation remains an open question.
  • Human‑in‑the‑loop extensions: Incorporating a small set of human labels to further anchor σ values could improve robustness, a direction the authors plan to explore.

Authors

  • Mengjie Qian
  • Guangzhi Sun
  • Mark J. F. Gales
  • Kate M. Knill

Paper Information

  • arXiv ID: 2602.16610v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 18, 2026