[Paper] Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

Published: December 2, 2025 at 01:46 PM EST
4 min read

Source: arXiv - 2512.03019v1

Overview

The paper investigates how to turn noisy, single‑shot judgments from large language models (LLMs) into reliable “ratings” when the models are used as judges for pairwise preference tasks. By allocating more inference‑time compute (i.e., generating multiple independent “thinking” samples per item) and applying a new, distribution‑calibrated aggregation rule, the authors dramatically improve the consistency and accuracy of LLM‑based evaluations.

Key Contributions

  • Distribution‑calibrated aggregation: Introduces a Bradley‑Terry‑Davidson (BTD)‑based scheme that jointly exploits the margin among decisive votes (polarity) and the fraction of votes that are decisive rather than ties (decisiveness).
  • Inference‑time compute (ITC) budgeting: Systematically studies how many thinking‑rating samples per item are needed to trade off latency vs. rating quality.
  • Empirical validation: Demonstrates consistent MAE reductions and higher pairwise accuracy across several benchmark evaluation datasets, often matching or surpassing individual human raters.
  • Robustness to ties: Shows that the BTD aggregation remains well‑behaved even when a substantial fraction of model outputs are “ties,” a scenario in which majority vote or soft self‑consistency breaks down.
  • Open‑source reference implementation: Provides code and scripts for reproducing the experiments and plugging the method into existing LLM‑as‑judge pipelines.

Methodology

  1. Thinking‑rating generation: For each item (e.g., a pair of candidate responses to a prompt), the LLM is prompted to produce n independent “thinking” samples, each followed by a rating (prefer A, prefer B, or tie). The samples are drawn with temperature > 0 to encourage diversity.
  2. Count‑based representation: The n outputs are summarized as a three‑way count vector [c_A, c_B, c_tie].
  3. Bradley‑Terry‑Davidson model:
    • The classic Bradley‑Terry model estimates a latent “skill” score for each option based on pairwise win counts.
    • The Davidson extension adds a parameter for the probability of a tie, allowing the model to capture decisiveness directly.
    • The authors fit the BTD model to the observed count vector, yielding a calibrated probability that A is preferred over B (or vice versa); a minimal illustrative sketch of such a fit appears after this list.
  4. Inference‑time compute budgeting: Experiments sweep n (e.g., 1, 3, 5, 9, 15) to quantify the marginal gain in rating quality per extra sample, guiding practical deployment decisions.
  5. Baselines: Compared against majority vote, soft self‑consistency (averaging logits), and instruction‑based self‑aggregation (prompting the model to “re‑vote”).
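
To make step 3 concrete, here is a minimal Python sketch, not the authors' released implementation, of fitting Davidson's tie‑extended Bradley‑Terry model to one item's count vector and reading off a calibrated preference probability. The parameterization (a single skill gap d plus a tie parameter nu), the 0.5‑count smoothing, and the choice of optimizer are illustrative assumptions rather than details taken from the paper.

```python
"""Illustrative BTD aggregation sketch (not the paper's reference code).

Davidson's tie extension of the Bradley-Terry model:
    P(A wins) = pi_A / D,   P(B wins) = pi_B / D,
    P(tie)    = nu * sqrt(pi_A * pi_B) / D,
    where D  = pi_A + pi_B + nu * sqrt(pi_A * pi_B).
"""
import numpy as np
from scipy.optimize import minimize


def btd_negative_log_likelihood(params, counts):
    """Negative log-likelihood of a count vector [c_A, c_B, c_tie] under Davidson's model."""
    d, log_nu = params                                  # skill gap and log tie-propensity
    pi_a, pi_b = np.exp(d / 2.0), np.exp(-d / 2.0)      # chosen so that sqrt(pi_a * pi_b) == 1
    nu = np.exp(log_nu)
    probs = np.array([pi_a, pi_b, nu]) / (pi_a + pi_b + nu)
    return -np.sum(counts * np.log(probs + 1e-12))


def aggregate_btd(c_a, c_b, c_tie):
    """Fit the BTD model to one item's vote counts and return P(A preferred over B)."""
    counts = np.array([c_a, c_b, c_tie], dtype=float) + 0.5   # light smoothing for zero counts
    fit = minimize(btd_negative_log_likelihood, x0=np.zeros(2),
                   args=(counts,), method="Nelder-Mead")
    d_hat = fit.x[0]
    # Preference among decisive outcomes: pi_A / (pi_A + pi_B) = sigmoid of the fitted skill gap.
    return 1.0 / (1.0 + np.exp(-d_hat))


if __name__ == "__main__":
    # Example: 15 thinking samples -> 7 prefer A, 3 prefer B, 5 ties.
    print(f"P(A > B) ≈ {aggregate_btd(7, 3, 5):.3f}")
```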

Results & Findings

| Metric | Majority Vote | Soft Self‑Consistency | Instruction‑Based | BTD‑Calibrated |
|---|---|---|---|---|
| MAE (on benchmark X) | 0.27 | 0.24 | 0.23 | 0.18 |
| Pairwise Accuracy | 71.2 % | 73.5 % | 74.1 % | 78.9 % |
| Human‑consensus match (avg.) | 0.62 | 0.66 | 0.68 | 0.73 |
  • Tie handling: When >30 % of model outputs were ties, majority vote’s accuracy dropped sharply, while BTD remained stable.
  • Compute vs. gain: Going from 1 to 5 samples cut MAE by ~30 %; beyond 9 samples, improvements plateaued, suggesting a sweet spot for many real‑time services.
  • Human parity: On a meta‑label set built from multiple human annotators, the calibrated BTD scores matched the performance of the best individual human rater and exceeded the average.

Practical Implications

  • More reliable LLM‑as‑judge services: Platforms that automatically rank model outputs (e.g., code generation, summarization, or content moderation) can adopt a modest ITC budget (5–9 samples) and the BTD aggregator to achieve near‑human consistency without massive latency.
  • Cost‑effective quality control: Because the method extracts maximal information from each sample, developers can avoid over‑provisioning compute; the diminishing returns curve helps set clear SLAs.
  • Robustness in noisy domains: In tasks where LLMs frequently produce “I’m not sure” or tie responses (e.g., ethical judgments, ambiguous prompts), the calibrated approach prevents the aggregation from collapsing.
  • Plug‑and‑play: The provided implementation works with any decoder‑only LLM (GPT‑3.5, LLaMA‑2, Claude, etc.) and can be wrapped around existing evaluation pipelines with a few lines of code.
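
As a rough illustration of that “few lines of code” integration, the snippet below assumes a hypothetical judge_once callable (one temperature > 0 judge call returning "A", "B", or "tie") and reuses the aggregate_btd sketch from the Methodology section; the names and signature here are assumptions, not the paper's reference API.

```python
from collections import Counter

# `judge_once` is any user-supplied function that performs a single thinking-rating
# call and returns "A", "B", or "tie"; `aggregate_btd` is the sketch shown earlier.
def calibrated_judgment(prompt, response_a, response_b, judge_once, n_samples=7):
    """Draw n independent thinking-rating samples and BTD-aggregate the votes."""
    votes = Counter(judge_once(prompt, response_a, response_b) for _ in range(n_samples))
    return aggregate_btd(votes["A"], votes["B"], votes["tie"])
```

The default of 7 samples here is only a placeholder sitting inside the 5–9 sample range the results identify as a practical sweet spot.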

Limitations & Future Work

  • Model‑specific calibration: The BTD parameters are fit per‑model and per‑task; transferring them across very different LLM families may require re‑estimation.
  • Latency for high‑throughput services: While 5–9 samples are modest, ultra‑low‑latency scenarios (e.g., real‑time chat) might still find the extra inference steps prohibitive.
  • Scope of benchmarks: Experiments focus on pairwise preference tasks; extending the method to multi‑option ranking or open‑ended quality scoring remains open.
  • Human alignment: The paper matches human consensus but does not address systematic biases that both humans and LLMs might share; future work could integrate debiasing layers into the aggregation.

Bottom line: By thoughtfully allocating inference‑time compute and using a distribution‑aware aggregation rule, developers can transform noisy LLM judgments into trustworthy evaluation signals—bridging the gap between raw model output and actionable quality metrics.

Authors

  • Hamid Dadkhahi
  • Firas Trabelsi
  • Parker Riley
  • Juraj Juraska
  • Mehdi Mirzazadeh

Paper Information

  • arXiv ID: 2512.03019v1
  • Categories: cs.LG, cs.AI
  • Published: December 2, 2025
