[Paper] SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Source: arXiv - 2602.13110v1
Overview
Large language models (LLMs) are increasingly being used as automated judges for pairwise comparisons—deciding which of two model outputs is better—so that developers can avoid costly human labeling. The paper “SCOPE: Selective Conformal Optimized Pairwise LLM Judging” introduces a statistically grounded framework that lets LLM judges abstain when they are uncertain, while guaranteeing that the error rate of the judgments they do make stays below a user‑defined threshold.
Key Contributions
- SCOPE framework: A selective‑prediction system that couples conformal calibration with a user‑specified risk level α, ensuring that the proportion of wrong judgments among the accepted ones never exceeds α (with finite‑sample guarantees).
- Bidirectional Preference Entropy (BPE): A novel uncertainty metric that queries the LLM twice—once with each candidate placed in the “first” position—aggregates the implied preference probabilities, and converts them into an entropy‑based score that is invariant to answer ordering.
- Empirical validation: Extensive experiments on three widely used evaluation suites (MT‑Bench, RewardBench, Chatbot Arena) show that BPE provides a stronger selection signal than raw confidence scores, allowing SCOPE to meet the target risk while keeping high coverage (up to 98 % of judgments retained).
- Scalability across model sizes: Demonstrated consistent performance from 7 B‑parameter models up to 32 B‑parameter models, highlighting that the method works for both small and large LLM judges.
Methodology
- Pairwise Judging as a Binary Decision
- For each pair (A, B) the LLM outputs a preference probability p that A is better than B.
- Bidirectional Querying
- The same pair is fed to the LLM twice, swapping the order (A‑first, B‑first). This yields two probabilities, p₁ and p₂, which are combined to produce a symmetrized preference distribution.
- Entropy‑Based Uncertainty (BPE)
- The symmetrized distribution is transformed into an entropy value: higher entropy → greater uncertainty about the true preference.
- Conformal Calibration
- A calibration set is used to learn a threshold τ such that the empirical error of all judgments with entropy ≤ τ stays below α. This is done via the classic split‑conformal method, guaranteeing the risk bound even with a finite number of samples.
- Selective Acceptance
- At inference time, the LLM’s BPE for a pair is compared to τ. If the entropy is low (i.e., the model is confident), the judgment is accepted; otherwise the system abstains, leaving the pair to be evaluated by a human or a higher‑cost oracle.
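The bidirectional querying and entropy steps above can be sketched in a few lines. Note that the exact aggregation of the two preference probabilities is an assumption here (a simple average); the paper may use a different symmetrization.

```python
import math

def bpe(p_first: float, p_swapped: float) -> float:
    """Bidirectional Preference Entropy (illustrative sketch, not the
    paper's exact formula).

    p_first:   P(A is better) when A is presented first.
    p_swapped: P(A is better) when the order is swapped (B first).

    The two views are averaged into a symmetrized preference probability,
    which is then converted to binary entropy (in bits). Higher entropy
    means greater uncertainty about the true preference.
    """
    p = 0.5 * (p_first + p_swapped)  # order-invariant preference for A
    if p in (0.0, 1.0):
        return 0.0  # fully confident either way: zero entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Because the average is symmetric in its two arguments, swapping which candidate is shown first leaves the score unchanged, which is the order-invariance property L7 attributes to BPE.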
The whole pipeline is lightweight: it only requires two forward passes per pair and a simple threshold lookup, making it practical for large‑scale evaluation pipelines.
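The calibration and acceptance steps can be sketched as follows. This is a simplified split-conformal-style sketch: the paper's exact finite-sample correction for the risk bound may differ, and the helper names are hypothetical.

```python
def calibrate_threshold(entropies, errors, alpha):
    """Pick the largest entropy threshold tau such that the empirical
    error rate among calibration pairs with entropy <= tau stays at or
    below alpha.

    entropies: BPE score for each calibration pair.
    errors:    1 if the LLM's judgment on that pair was wrong, else 0.
    """
    pairs = sorted(zip(entropies, errors))  # ascending by entropy
    best_tau, wrong = float("-inf"), 0
    for i, (tau, err) in enumerate(pairs, start=1):
        wrong += err
        if wrong / i <= alpha:  # error rate among the i most confident pairs
            best_tau = tau
    return best_tau

def judge(entropy, tau):
    """Accept the LLM's judgment when its BPE is at or below tau,
    otherwise abstain and defer to a human or stronger oracle."""
    return "accept" if entropy <= tau else "abstain"
```

At inference time only `judge` runs, so the per-pair cost is exactly the two forward passes plus a threshold comparison, matching the lightweight pipeline described above.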
Results & Findings
| Benchmark | Model (size) | Target α | Empirical Risk | Coverage (accepted judgments) |
|---|---|---|---|---|
| MT‑Bench | Qwen‑7B | 0.10 | 0.098 | 0.71 |
| RewardBench | Qwen‑14B | 0.10 | 0.097 | 0.89 |
| RewardBench | Qwen‑32B | 0.10 | 0.099 | 0.98 |
| Chatbot Arena | Various | 0.10 | ≈0.10 | 0.80‑0.95 (depending on model) |
- Risk guarantee: Across all settings the observed error stays within the prescribed α = 0.10, confirming the finite‑sample conformal guarantee.
- Coverage boost: Compared to a naïve baseline that uses raw softmax confidence, SCOPE accepts up to 2.4× more judgments on MT‑Bench with the 7 B model while still respecting the risk bound.
- BPE vs. confidence: Entropy derived from bidirectional querying consistently yields a tighter correlation with actual mistake probability, making it a more reliable abstention trigger.
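The two columns reported in the table (empirical risk and coverage) are straightforward to compute from a SCOPE-style run; the sketch below is a hypothetical helper, not code from the paper.

```python
def risk_and_coverage(accepted, errors):
    """Compute the two quantities reported per benchmark row:

    empirical risk = error rate among *accepted* judgments,
    coverage       = fraction of all pairs whose judgment was accepted.

    accepted: True/False per pair (False means the judge abstained).
    errors:   1 if the judgment on that pair was wrong, else 0.
    """
    n = len(accepted)
    kept = [e for a, e in zip(accepted, errors) if a]
    coverage = len(kept) / n
    risk = sum(kept) / len(kept) if kept else 0.0
    return risk, coverage
```

The conformal guarantee says only that `risk` stays below α; `coverage` is what BPE improves over raw-confidence abstention.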
Practical Implications
- Cost‑effective evaluation pipelines: Teams can replace a large fraction of human pairwise annotations with LLM judges, only falling back to humans when the model signals high uncertainty. This reduces labeling spend without sacrificing evaluation reliability.
- Safety‑aware model ranking: In settings where a wrong ranking could have downstream risks (e.g., selecting a dialogue model for customer support), SCOPE’s risk guarantee provides a quantifiable safety net.
- Plug‑and‑play component: Because BPE only needs two forward passes and conformal calibration is model‑agnostic, developers can integrate SCOPE into existing benchmarking suites (e.g., OpenAI’s evals, Hugging Face datasets) with minimal engineering effort.
- Scalable to any LLM size: The method works from 7 B to 32 B parameters, meaning even smaller, cheaper LLM judges can be used effectively, expanding applicability to edge or on‑premise environments.
Limitations & Future Work
- Exchangeability assumption: The conformal guarantee relies on the calibration and test pairs being exchangeable (i.i.d.). In practice, data drift or domain shifts could weaken the risk bound.
- Calibration cost: A separate calibration set is needed for each model and α value; generating this set still requires some human judgments.
- Binary preference only: The current formulation handles pairwise “A > B” decisions. Extending to multi‑candidate ranking or graded preference (e.g., “A is slightly better than B”) is left for future research.
- Potential bias in BPE: While BPE mitigates order bias, any systematic bias present in the underlying LLM (e.g., cultural or toxicity bias) will still affect the final judgments. Investigating bias‑aware uncertainty metrics is an open direction.
Bottom line: SCOPE offers a practical, statistically sound way for developers to lean on LLMs for large‑scale pairwise evaluation, keeping error rates in check while dramatically cutting human labeling costs. If you’re building a model‑ranking pipeline or need a trustworthy automated judge, SCOPE is well worth evaluating.
Authors
- Sher Badshah
- Ali Emami
- Hassan Sajjad
Paper Information
- arXiv ID: 2602.13110v1
- Categories: cs.CL, cs.AI
- Published: February 13, 2026