[Paper] Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Published: June 3, 2026 at 01:27 PM EDT
Source: arXiv - 2606.05122v1

Overview

The paper shows that base large language models (LLMs) already have a latent ability to predict how an external judge would score their own outputs, without any dedicated fine‑tuning. By applying a lightweight “self‑evaluation elicitation” (SEE) pipeline, the authors draw this skill out of a base model, improving calibration (i.e., the match between model confidence and actual quality) while leaving answer quality essentially unchanged.

Key Contributions

  • Discovery of latent self‑evaluation – Vanilla base LLMs can forecast multi‑attribute quality scores from external judges far better than random chance.
  • Self‑Evaluation Elicitation (SEE) – A two‑step, data‑efficient procedure (calibration‑coupled RL + masked distillation) that amplifies the model’s self‑scoring ability.
  • Data efficiency – Achieves calibration comparable to a full reinforcement‑learning‑from‑human‑feedback (RLHF) baseline while using roughly 31× fewer examples (≈160 vs. ~5 000).
  • Judge‑agnostic transfer – The elicited self‑evaluation generalizes to judges the model has never seen, suggesting a shared notion of answer quality rather than over‑fitting to a single preference.
  • Localized prediction – The self‑evaluation signal is tightly bound to the model’s own token distribution, making it easy to extract without altering the answer text.

Methodology

  1. Baseline probing – Prompt a vanilla LLM with a few examples (few‑shot) and ask it to predict the scores a separate “judge” model would assign to its own responses. Even with this minimal setup, the predictions are significantly correlated with the true scores.

  2. Self‑Evaluation Elicitation (SEE)

    • Calibration‑coupled RL phase
      • The model generates an answer to a task.
      • A small reinforcement‑learning step (using a reward that mixes answer quality and a self‑evaluation loss) nudges the model to produce an answer that it can also score accurately.
    • Masked distillation phase
      • The answer tokens are masked, i.e., held fixed so that the answer text stays unchanged.
      • The model is trained to sharpen its own score prediction on the masked answer, effectively “distilling” the calibration knowledge without touching the answer itself.
  3. Evaluation – Test the pipeline on three open‑ended benchmarks (e.g., summarization, reasoning, dialogue) with multiple external judges. Calibration is measured by how well predicted scores align with actual judge scores, while answer quality is assessed with standard metrics (ROUGE, accuracy, etc.).
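
The two SEE phases above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the reward formula, so the linear mix, the squared‑error self‑evaluation loss, the weight `lam`, and the function names `calibration_coupled_reward` and `masked_score_loss` are all hypothetical. The sketch also assumes that "masking" means excluding answer tokens from the distillation loss.

```python
from typing import Sequence


def calibration_coupled_reward(
    answer_quality: float,
    predicted_score: float,
    judge_score: float,
    lam: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Mix answer quality with a penalty for mispredicting the judge's score."""
    # Assumed form of the self-evaluation loss: squared error.
    self_eval_loss = (predicted_score - judge_score) ** 2
    return answer_quality - lam * self_eval_loss


def masked_score_loss(
    token_losses: Sequence[float],
    is_answer_token: Sequence[bool],
) -> float:
    """Average per-token loss over the non-answer (score) tokens only."""
    if len(token_losses) != len(is_answer_token):
        raise ValueError("token_losses and is_answer_token must have equal length")
    # Answer tokens are skipped, so they contribute nothing to the update.
    kept = [loss for loss, masked in zip(token_losses, is_answer_token) if not masked]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


if __name__ == "__main__":
    print(calibration_coupled_reward(0.8, predicted_score=0.6, judge_score=0.9))
    print(masked_score_loss([2.0, 3.0, 0.5, 1.5], [True, True, False, False]))
```

In this toy setup, a larger `lam` puts more weight on accurate self‑scoring relative to raw answer quality; the value that works in practice is not stated in the summary.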

Results & Findings

| Metric | Baseline (few‑shot) | RLHF (full data) | SEE (≈160 examples) |
|---|---|---|---|
| Calibration (Pearson r) | 0.42 – 0.55 | 0.68 – 0.71 | 0.66 – 0.70 |
| Answer quality (task‑specific) | Comparable to RLHF | Slightly higher | On par |
| Data used | | ~5 000 examples | ≈160 |

  • Calibration boost – SEE closes the gap to full RLHF, delivering a ~30 % improvement over the few‑shot baseline.
  • Answer preservation – Because the masked distillation never edits the generated text, answer quality remains unchanged.
  • Judge transfer – When evaluated against a judge that was not part of the training loop, SEE’s self‑scores still correlate strongly (r ≈ 0.63), confirming a judge‑agnostic quality signal.
  • Localization – Ablation experiments reveal that the self‑evaluation signal lives in the same token probability space the model uses for generation, making it cheap to extract at inference time.
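
The calibration numbers above are Pearson correlations between the model's predicted scores and the judge's actual scores. A minimal sketch of that computation follows; the function name `pearson_r` and the sample scores are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def pearson_r(predicted: Sequence[float], judged: Sequence[float]) -> float:
    """Pearson correlation between self-predicted and judge-assigned scores."""
    if len(predicted) != len(judged) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_j = sum(judged) / n
    cov = sum((p - mean_p) * (j - mean_j) for p, j in zip(predicted, judged))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_j = sum((j - mean_j) ** 2 for j in judged)
    if var_p == 0 or var_j == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_p * var_j)


if __name__ == "__main__":
    # Made-up scores for four responses: model's prediction vs. judge's score.
    self_scores = [1.0, 2.0, 3.0, 4.0]
    judge_scores = [1.0, 3.0, 2.0, 4.0]
    print(round(pearson_r(self_scores, judge_scores), 2))  # prints 0.8
```

To check judge transfer in the same way, `judged` would hold scores from a judge that was not used during training.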

Practical Implications

  • Cost‑effective calibration – Developers can dramatically reduce the amount of human‑annotated data needed to align LLM outputs with quality metrics—ideal for startups or teams with limited annotation budgets.
  • On‑the‑fly self‑assessment – Since the self‑score can be computed without altering the answer, services can expose a confidence or quality score alongside each response, helping downstream systems (e.g., ranking, reranking, safety filters) make better decisions.
  • Judge‑agnostic safety layers – Because the self‑evaluation generalizes across judges, a single SEE‑enhanced model can serve multiple downstream evaluators (e.g., toxicity, factuality, relevance) without retraining for each.
  • Plug‑and‑play upgrade – Existing LLM deployments can be retrofitted with SEE by adding a short fine‑tuning step (≈160 examples) rather than a full RLHF pipeline, accelerating time‑to‑market for calibrated AI products.
  • Better user experience – UI designers can display a “quality meter” derived from the model’s own evaluation, giving users transparent feedback about answer reliability.
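
As one illustration of the reranking and filtering use cases above, a service could sample several candidate answers and return the one with the highest self‑score, or nothing if every candidate falls below a threshold. The sketch below is hypothetical: the `Candidate` type, the `pick_best` function, the [0, 1] score range, and the 0.5 threshold are assumptions for illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    self_score: float  # model's own predicted quality, assumed to lie in [0, 1]


def pick_best(candidates: List[Candidate], min_score: float = 0.5) -> Optional[Candidate]:
    """Return the highest self-scored candidate, or None if none passes the threshold."""
    eligible = [c for c in candidates if c.self_score >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.self_score)


if __name__ == "__main__":
    pool = [
        Candidate("draft A", 0.4),
        Candidate("draft B", 0.7),
        Candidate("draft C", 0.6),
    ]
    best = pick_best(pool)
    print(best.text if best else "no candidate passed the threshold")  # prints draft B
```

The same `self_score` field could drive a user‑facing quality meter; how well such a threshold works depends on how well calibrated the self‑scores are for the task at hand.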

Limitations & Future Work

  • Scope of tasks – Experiments focus on three open‑ended benchmarks; performance on highly structured tasks (e.g., code generation, math proofs) remains unknown.
  • Judge diversity – Although the method generalizes to unseen judges, the evaluated set is still limited; broader coverage (e.g., domain‑specific experts) could reveal edge cases.
  • Potential bias propagation – Because the self‑evaluation is rooted in the model’s own token distribution, systematic biases in the base model may be reflected in its quality scores.
  • Scalability to larger models – The paper uses base‑size LLMs; applying SEE to multi‑billion‑parameter models may require adjustments to RL and distillation hyper‑parameters.
  • Future directions – The authors suggest exploring multi‑judge ensembles, integrating SEE with retrieval‑augmented generation, and extending the masked distillation idea to other latent abilities (e.g., factuality detection).

Authors

  • XiuYu Zhang
  • Yi Shan
  • Junfeng Fang
  • Zhenkai Liang

Paper Information

  • arXiv ID: 2606.05122v1
  • Categories: cs.CL
  • Published: June 3, 2026