[Paper] Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Published: June 3, 2026 at 01:27 PM EDT

Source: arXiv - 2606.05122v1

Overview

The paper shows that base large language models (LLMs) already possess a latent ability to predict how an external judge would score their own outputs, without any dedicated fine‑tuning. A lightweight “self‑evaluation elicitation” (SEE) pipeline draws this skill out of a base model. It improves calibration (here, the agreement between the scores the model predicts for its answers and the scores the judge actually assigns) while keeping the generated answers just as good.

Key Contributions

Discovery of latent self‑evaluation – Vanilla base LLMs can forecast multi‑attribute quality scores from external judges far better than random chance.
Self‑Evaluation Elicitation (SEE) – A two‑step, data‑efficient procedure (calibration‑coupled RL + masked distillation) that amplifies the model’s self‑scoring ability.
Data efficiency – Achieves comparable or superior calibration to a full reinforcement‑learning‑from‑human‑feedback (RLHF) baseline with roughly 31× fewer examples (≈160 vs. ~5 000).
Judge‑agnostic transfer – The elicited self‑evaluation generalizes to judges the model has never seen, suggesting a shared notion of answer quality rather than over‑fitting to a single preference.
Localized prediction – The self‑evaluation signal is tightly bound to the model’s own token distribution, making it easy to extract without altering the answer text.
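
To make the "localized prediction" point concrete, here is a minimal sketch of one way a score could be read off a token distribution without changing the answer text. It is an illustration, not the paper's procedure: the function name `expected_score`, the 1 to 5 score scale, and the example logits are all assumptions made for this example.

```python
import math
from typing import Dict


def expected_score(score_logits: Dict[int, float]) -> float:
    """Turn logits over candidate score tokens into a single expected score.

    `score_logits` maps each possible score (e.g. 1..5) to the logit the model
    assigns to the corresponding token. Both the mapping and the scale are
    hypothetical; the source does not specify how scores are decoded.
    """
    if not score_logits:
        raise ValueError("score_logits must not be empty")
    # Numerically stable softmax restricted to the score tokens.
    peak = max(score_logits.values())
    weights = {s: math.exp(logit - peak) for s, logit in score_logits.items()}
    normalizer = sum(weights.values())
    # Probability-weighted mean of the score values.
    return sum(s * w for s, w in weights.items()) / normalizer


if __name__ == "__main__":
    # Made-up logits that favor scores 4 and 5.
    print(expected_score({1: -2.0, 2: -1.0, 3: 0.0, 4: 1.5, 5: 1.0}))
```

Reading an expectation rather than the single most likely score token keeps the output continuous, which is convenient when the score is later correlated with a judge's ratings.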

Methodology

Baseline probing – Prompt a vanilla LLM with a few examples (few‑shot) and ask it to predict the scores a separate “judge” model would assign to its own responses. Even with this minimal setup, the predictions are significantly correlated with the true scores.
Self‑Evaluation Elicitation (SEE)
- Calibration‑coupled RL phase
  - The model generates an answer to a task.
  - A small reinforcement‑learning step (using a reward that mixes answer quality and a self‑evaluation loss) nudges the model to produce an answer that it can also score accurately.
- Masked distillation phase
  - The answer tokens are masked, meaning they are held fixed and left unchanged during this phase.
  - The model is trained to sharpen its own score prediction on the masked answer, effectively “distilling” the calibration knowledge without touching the answer itself.
Evaluation – Test the pipeline on three open‑ended benchmarks (e.g., summarization, reasoning, dialogue) with multiple external judges. Calibration is measured by how well predicted scores align with actual judge scores, while answer quality is assessed with standard metrics (ROUGE, accuracy, etc.).
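
The two SEE phases can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' implementation: the summary only says the reward "mixes answer quality and a self‑evaluation loss", so the linear form, the squared error, and the weight `lam` below are guesses, and reading "masked" as a loss mask that excludes prompt and answer tokens is likewise an interpretation. All function names are hypothetical.

```python
from typing import List, Sequence


def coupled_reward(quality: float, predicted: float, judged: float,
                   lam: float = 0.5) -> float:
    """Assumed reward for the RL phase: answer quality minus a weighted
    penalty for mispredicting the judge's score."""
    self_eval_loss = (predicted - judged) ** 2  # assumed squared error
    return quality - lam * self_eval_loss


def score_only_loss_mask(n_prompt: int, n_answer: int, n_score: int) -> List[int]:
    """Assumed mask for the distillation phase: 0 for prompt and answer
    tokens (no training signal), 1 for the score-prediction tokens."""
    return [0] * (n_prompt + n_answer) + [1] * n_score


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-token loss over the positions the mask keeps."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    print(coupled_reward(quality=0.8, predicted=0.9, judged=0.5, lam=1.0))
    mask = score_only_loss_mask(n_prompt=2, n_answer=3, n_score=1)
    print(mask, masked_mean_loss([5.0, 5.0, 1.0, 1.0, 1.0, 0.2], mask))
```

Under this reading, the answer tokens contribute nothing to the distillation loss, which is one way the score prediction could be sharpened while the answer text is left alone.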

Results & Findings

Metric	Baseline (few‑shot)	RLHF (full data)	SEE (≈160 examples)
Calibration (Pearson r)	0.42 – 0.55	0.68 – 0.71	0.66 – 0.70
Answer quality (task‑specific)	Comparable to RLHF	Slightly higher	On par
Data used	–	~5 000 examples	≈160

Calibration boost – SEE nearly closes the gap to full RLHF (Pearson r of 0.66 – 0.70 vs. 0.68 – 0.71) and improves on the few‑shot baseline (0.42 – 0.55) by a reported ~30 %.
Answer preservation – Because the masked distillation never edits the generated text, answer quality remains unchanged.
Judge transfer – When evaluated against a judge that was not part of the training loop, SEE’s self‑scores still correlate strongly (r ≈ 0.63), confirming a judge‑agnostic quality signal.
Localization – Ablation experiments reveal that the self‑evaluation signal lives in the same token probability space the model uses for generation, making it cheap to extract at inference time.
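
For readers who want to reproduce this kind of measurement on their own data, the sketch below computes Pearson r between self‑predicted and judge‑assigned scores, plus the relative gain implied by two correlation values. The helper names are hypothetical and the toy score lists are invented; only the range endpoints passed to `relative_gain` come from the table above.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted scores and judge scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before` (0.25 means +25 %)."""
    return after / before - 1.0


if __name__ == "__main__":
    # Invented toy scores, not data from the paper.
    self_scores = [3.0, 4.5, 2.0, 5.0, 3.5]
    judge_scores = [2.5, 4.0, 2.5, 4.5, 4.0]
    print(f"r = {pearson_r(self_scores, judge_scores):.3f}")
    # Lower and upper endpoints of the baseline and SEE ranges in the table.
    print(f"low end:  {relative_gain(0.42, 0.66):.1%}")
    print(f"high end: {relative_gain(0.55, 0.70):.1%}")
```

Comparing endpoints this way gives a relative gain of about 57 % at the low end of the ranges and about 27 % at the high end, so the reported ~30 % sits near the conservative end of what the table implies.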

Practical Implications

Cost‑effective calibration – Calibrating with ≈160 examples instead of ~5 000 sharply reduces the human‑annotated data needed to align LLM outputs with quality metrics, which suits startups or teams with limited annotation budgets.
On‑the‑fly self‑assessment – Since the self‑score can be computed without altering the answer, services can expose a confidence or quality score alongside each response, helping downstream systems (e.g., ranking, reranking, safety filters) make better decisions.
Judge‑agnostic safety layers – Because the self‑evaluation generalizes across judges, a single SEE‑enhanced model can serve multiple downstream evaluators (e.g., toxicity, factuality, relevance) without retraining for each.
Plug‑and‑play upgrade – Existing LLM deployments can be retrofitted with SEE by adding a short fine‑tuning step (≈160 examples) rather than a full RLHF pipeline, accelerating time‑to‑market for calibrated AI products.
Better user experience – UI designers can display a “quality meter” derived from the model’s own evaluation, giving users transparent feedback about answer reliability.
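
As a rough illustration of the on‑the‑fly self‑assessment idea, a downstream service could filter and rerank candidate responses by their self‑scores. The sketch below assumes each candidate already carries a self‑score on a 0 to 1 scale; the `Candidate` type, the `rerank` function, and the threshold value are hypothetical and not part of the paper.

```python
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    text: str
    self_score: float  # model's own predicted quality, assumed to lie in [0, 1]


def rerank(candidates: Sequence[Candidate], min_score: float = 0.0) -> List[Candidate]:
    """Drop candidates below `min_score`, then sort best-first by self-score."""
    kept = [c for c in candidates if c.self_score >= min_score]
    return sorted(kept, key=lambda c: c.self_score, reverse=True)


if __name__ == "__main__":
    # Invented candidates and scores for illustration.
    pool = [
        Candidate("draft A", 0.41),
        Candidate("draft B", 0.88),
        Candidate("draft C", 0.73),
    ]
    for cand in rerank(pool, min_score=0.5):
        print(f"{cand.self_score:.2f}  {cand.text}")
```

The same score could feed a user‑facing quality meter, although any threshold would need tuning against real judge scores before it is trusted.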

Limitations & Future Work

Scope of tasks – Experiments focus on three open‑ended benchmarks; performance on highly structured tasks (e.g., code generation, math proofs) remains unknown.
Judge diversity – Although the method generalizes to unseen judges, the evaluated set is still limited; broader coverage (e.g., domain‑specific experts) could reveal edge cases.
Potential bias propagation – Because the self‑evaluation is rooted in the model’s own token distribution, systematic biases in the base model may be reflected in its quality scores.
Scalability to larger models – The paper uses base‑size LLMs; applying SEE to multi‑billion‑parameter models may require adjustments to RL and distillation hyper‑parameters.
Future directions – The authors suggest exploring multi‑judge ensembles, integrating SEE with retrieval‑augmented generation, and extending the masked distillation idea to other latent abilities (e.g., factuality detection).

Authors

XiuYu Zhang
Yi Shan
Junfeng Fang
Zhenkai Liang

Paper Information

arXiv ID: 2606.05122v1
Categories: cs.CL
Published: June 3, 2026
