[Paper] Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Source: arXiv - 2604.15302v1
Overview
Large language models (LLMs) are increasingly being used as judges to automatically score generated text (e.g., summaries, translations). While they work well on average, we still don’t know how trustworthy a single model’s score is for a particular document. This paper introduces two diagnostic tools—transitivity analysis and split‑conformal prediction sets—that expose per‑instance reliability issues in LLM‑as‑judge pipelines, using the widely‑cited SummEval benchmark as a testbed.
Key Contributions
- Transitivity diagnostic: Detects directed 3‑cycles (A > B, B > C, C > A) in pairwise judgments, revealing that 33‑67 % of documents contain at least one inconsistency even though aggregate violation rates appear low (0.8‑4.1 %).
- Conformal prediction sets for Likert scores: Generates per‑document sets of plausible scores with a finite‑sample marginal coverage guarantee (≥ 1 − α). The width of each set (the number of scores it contains) serves as a per‑document “difficulty” signal.
- Cross‑judge consistency of set width: Shows that prediction‑set width correlates across four independent judges (average Pearson r ≈ 0.35), indicating it captures intrinsic document difficulty rather than judge‑specific noise.
- Criterion‑level reliability ranking: Finds that relevance judgments are the most stable (average set size ≈ 3.0), coherence is moderate (≈ 3.9), while fluency and consistency are the least reliable (≈ 4.9).
- Open‑source release: All code, prompts, and cached LLM responses are made publicly available, enabling reproducibility and further research.
Methodology
- Dataset & Judges – The authors use SummEval, a benchmark of machine‑generated news summaries with human ratings on four criteria (relevance, coherence, fluency, consistency). Four separate LLM prompts act as “judges.”
- Transitivity analysis – For each document, the judge makes pairwise comparisons among that document's candidate summaries, and every triple of summaries is checked for a directed 3‑cycle (e.g., the model says S1 > S2, S2 > S3, yet S3 > S1). Both the share of cyclic triples and the proportion of documents with at least one cycle are reported.
- Split conformal prediction – The dataset is split into a calibration set and a test set. For each test instance, the model predicts a probability distribution over the 1‑5 Likert scale. Using the calibration residuals, the method builds a prediction set that contains the true score with probability ≥ 1 − α (typically α = 0.1). The set width (number of scores in the interval) is taken as a per‑instance reliability metric.
- Correlation analysis – Pearson correlation of set widths across judges quantifies whether the metric reflects document difficulty rather than random judge variance.
- Statistical validation – Correlations are aggregated over 1,918 judgments, yielding a highly significant result (p < 10⁻¹⁰⁰).
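The split conformal step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released code: the nonconformity score used here (1 minus the probability the judge assigns to the reference score) is a common choice assumed for the example, and the names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the split-conformal threshold from calibration data.

    cal_probs:  list of dicts mapping a Likert score to the judge's probability.
    cal_labels: list of reference (human) scores, one per calibration item.
    """
    # Assumed nonconformity score: 1 - probability of the reference score.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration items: only the full score set is guaranteed.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """Return every Likert score whose nonconformity is within the threshold."""
    return sorted(y for y, p in probs.items() if 1.0 - p <= q_hat)
```

A judge that spreads its probability mass over several scores produces a larger set, which is the "width" signal the paper uses as a per-document reliability indicator.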
Results & Findings
- Transitivity violations: Although the aggregate violation rate is modest (0.8‑4.1 %), a large share of documents (33‑67 %) contain at least one 3‑cycle, exposing inconsistency that the aggregate figure hides.
- Prediction‑set coverage: Conformal sets achieve the promised coverage (≥ 90 % for α = 0.1) across all judges and criteria.
- Set‑width as reliability signal: Wider sets (≈ 5 scores) correspond to low confidence, while narrower sets (≈ 3 scores) indicate higher confidence. The correlation of set widths across judges (r ≈ 0.32‑0.38) confirms that the signal is document‑specific.
- Criterion hierarchy:
  - Relevance: most reliable (average set size ≈ 3.0).
  - Coherence: moderately reliable (≈ 3.9).
  - Fluency & Consistency: least reliable (≈ 4.9).
- Judge vs. criterion effect: The choice of evaluation criterion matters more than the particular LLM judge, suggesting that some aspects of text quality are inherently harder for LLMs to assess.
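The gap between a low aggregate violation rate and a high share of affected documents comes from counting at two levels: per triple of summaries and per document. The Python sketch below illustrates this under the assumption that each document has several candidate summaries and a set of (winner, loser) verdicts. The function names and data layout are hypothetical, not taken from the paper's code.

```python
from itertools import combinations

def count_cycles(prefers, items):
    """Count directed 3-cycles among `items`.

    prefers: set of (winner, loser) pairs from the judge's pairwise verdicts.
    Returns (number of cyclic triples, number of triples checked).
    """
    cyclic = 0
    total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # A triple is cyclic if the loop is present in either direction.
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            cyclic += 1
    return cyclic, total

def violation_rates(docs):
    """docs: dict doc_id -> (prefers, items), with at least one triple overall.

    Returns (share of cyclic triples, share of documents with any cycle).
    """
    cyclic_triples = 0
    all_triples = 0
    docs_with_cycle = 0
    for prefers, items in docs.values():
        c, t = count_cycles(prefers, items)
        cyclic_triples += c
        all_triples += t
        if c > 0:
            docs_with_cycle += 1
    return cyclic_triples / all_triples, docs_with_cycle / len(docs)
```

Because a document with many summaries contains many triples, a single cyclic triple barely moves the per-triple rate but still marks the whole document as inconsistent.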
Practical Implications
- Better automated evaluation pipelines – Developers can flag low‑confidence judgments (wide conformal sets) and either request human review or discard them, improving overall evaluation quality.
- Model selection & prompt engineering – Knowing that relevance is the most stable criterion can guide teams to prioritize LLM‑based relevance scoring while treating fluency/consistency scores with caution.
- Dynamic budgeting for human annotation – By estimating per‑document difficulty, teams can allocate human annotators only where the LLM’s confidence is low, reducing labeling costs.
- Benchmark design – Future NLG benchmarks can incorporate transitivity checks and conformal set reporting as standard diagnostics, leading to more transparent leaderboards.
- Tooling – The released code can be integrated into CI pipelines for continuous monitoring of LLM‑as‑judge reliability in production systems (e.g., summarization‑as‑a‑service platforms).
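Flagging wide prediction sets for human review can be as simple as ranking documents by set size and spending a fixed annotation budget from the top. The Python sketch below assumes set sizes have already been computed per document. The `triage` helper and its tie-breaking rule are illustrative assumptions, not part of the released tooling.

```python
def triage(set_sizes, budget):
    """Select the documents to route to human annotators.

    set_sizes: dict doc_id -> conformal prediction-set size (1 to 5).
    budget:    maximum number of documents humans can review.
    """
    # Widest sets first; ties are broken by doc_id for a deterministic order.
    ranked = sorted(set_sizes, key=lambda doc_id: (-set_sizes[doc_id], doc_id))
    return ranked[:budget]
```

A threshold rule (for example, review every document whose set covers four or more scores) is an equally reasonable design. The budget form is shown because it keeps annotation cost fixed.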
Limitations & Future Work
- Scope limited to SummEval – The diagnostics are demonstrated on a single benchmark; broader validation on other tasks (e.g., translation, dialogue) is needed.
- Dependence on calibration set size – Split conformal prediction requires a sufficiently large, representative calibration split; small or highly skewed datasets may yield less reliable intervals.
- Prompt variability – The study uses fixed prompts; exploring how prompt engineering influences transitivity and set‑width could uncover additional robustness strategies.
- Extension to multi‑dimensional scores – Current work treats each Likert dimension independently; joint modeling of criteria might improve reliability estimates.
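The dependence on calibration set size can be made concrete with the standard split conformal result, stated here as general background rather than as a figure from the paper. With n exchangeable calibration scores s_1, ..., s_n sorted in increasing order, the threshold is the k-th smallest score, and the rank k must not exceed n for the set to be informative:

```latex
\hat{q} = s_{(k)}, \qquad k = \left\lceil (n+1)(1-\alpha) \right\rceil, \qquad
\Pr\bigl(Y_{n+1} \in C(X_{n+1})\bigr) \ge 1 - \alpha

k \le n \iff (n+1)(1-\alpha) \le n \iff n \ge \frac{1}{\alpha} - 1
```

For α = 0.1 this gives n ≥ 9: with fewer calibration items, the only set that meets the guarantee is the full 1‑5 scale. The guarantee is also marginal (averaged over documents), so a skewed calibration split can leave some subgroups under-covered even when overall coverage holds.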
Overall, the paper equips developers with concrete, statistically sound tools to gauge when an LLM judge can be trusted—and when it can’t—paving the way for more reliable, cost‑effective automated text evaluation.
Authors
- Manan Gupta
- Dhruv Kumar
Paper Information
- arXiv ID: 2604.15302v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: April 16, 2026