[Paper] Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Published: February 5, 2026 at 01:55 PM EST
4 min read
Source: arXiv - 2602.06022v1

Overview

The paper presents CORAL (Correctness‑Optimized Residual Activation Lens), a lightweight inference‑time technique that nudges large language models (LLMs) toward more accurate and better‑calibrated answers without any extra training. By probing the hidden activations of a model with a regularized MLP, CORAL extracts distributed “correctness signals” and uses them to steer the model’s final prediction, delivering sizable gains in both accuracy and calibration for multiple‑choice QA tasks.

Key Contributions

  • Inference‑time steering focused on actual correctness rather than proxy objectives (e.g., likelihood or reward models).
  • Weight‑decay MLP probes that capture distributed correctness information from internal activations, avoiding reliance on single “magic neurons.”
  • Model‑agnostic and transferable: the same probes improve three distinct 7B‑parameter LLMs and generalize to four held‑out benchmarks without any retraining.
  • Significant empirical gains – average +10 % accuracy and –50 % expected calibration error (ECE) on in‑domain tests; +14 % accuracy and –49 % ECE on out‑of‑domain benchmarks.
  • Compute‑efficient solution: only a few forward passes through a small probe network, making it practical for production inference pipelines.

Methodology

  1. Collect activation snapshots – For each input (a multiple‑choice question), the hidden states from several layers of the base LLM are recorded.
  2. Train a regularized probe – A shallow MLP with strong weight decay (L2 regularization) is trained on a modest labeled set to predict whether a given answer choice is correct, using the collected activations as features. The heavy regularization forces the probe to rely on distributed patterns rather than memorizing individual neurons.
  3. Residual steering at inference – When the model processes a new question, the probe evaluates each answer candidate’s activation snapshot and produces a “correctness score.” This score is added (as a residual) to the model’s original logits before the softmax, effectively re‑ranking the choices toward those the probe deems more likely to be right.
  4. Calibration‑aware adjustment – Because the probe’s output is calibrated (trained with a proper loss such as cross‑entropy), the resulting logits inherit better confidence estimates, reducing ECE.

The whole pipeline requires no gradient updates to the base LLM, only a forward pass through the tiny probe.
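The four steps above can be sketched end to end. Everything in this sketch is illustrative rather than taken from the paper: the activation features are random stand-ins for real hidden states, the probe is a logistic classifier instead of the paper's shallow MLP, and the dimensions, hyperparameters, and steering weight `alpha` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: in the paper these would be activation
# snapshots from several LLM layers; dims and sizes here are made up.
d_act = 32       # activation feature dimension (assumption)
n_train = 500    # labeled probe-training examples (assumption)

X = rng.normal(size=(n_train, d_act))
true_w = rng.normal(size=d_act)
y = (X @ true_w + 0.5 * rng.normal(size=n_train) > 0).astype(float)

# Step 2: train a probe with strong weight decay (L2 regularization).
# CORAL uses a shallow MLP; a logistic probe keeps this sketch short.
w, b = np.zeros(d_act), 0.0
lr, weight_decay = 0.1, 1e-2
for _ in range(2000):
    z = np.clip(X @ w + b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))                  # sigmoid
    w -= lr * (X.T @ (p - y) / n_train + weight_decay * w)
    b -= lr * np.mean(p - y)

# Step 3: residual steering at inference.
def steer(base_logits, choice_activations, alpha=1.0):
    """Add probe correctness scores to the model's answer logits."""
    probe_scores = choice_activations @ w + b     # one score per answer choice
    steered = base_logits + alpha * probe_scores  # residual addition
    s = steered - steered.max()                   # numerically stable softmax
    return np.exp(s) / np.exp(s).sum()

base_logits = np.array([2.0, 1.8, 0.3, 0.1])      # 4 answer choices
acts = rng.normal(size=(4, d_act))
probs = steer(base_logits, acts)                  # re-ranked distribution, sums to 1
```

The key design point this illustrates is that the base model's weights are never touched: only the small probe is trained, and its scores enter as an additive residual on the final logits.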

Results & Findings

| Setting | Accuracy Δ | ECE Δ |
| --- | --- | --- |
| In‑domain (same data used for probe training), three 7B models | +10 % avg. | −50 % avg. |
| Out‑of‑domain (four held‑out MCQA benchmarks) | +14 % avg. | −49 % avg. |

  • Consistency across architectures – The same probe design worked for three distinct 7B‑parameter models (e.g., LLaMA‑7B, Falcon‑7B, and an OpenAI‑style model).
  • Transferability – Probes trained on one benchmark (e.g., ARC‑Easy) still delivered improvements on completely different tasks (Math‑MC, HellaSwag).
  • Calibration – Expected Calibration Error dropped roughly by half, meaning the model’s confidence scores aligned much better with actual correctness.

The authors interpret these outcomes as evidence that correctness information is distributed across many hidden units, and a regularized probe can reliably extract it.
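To make the calibration claim concrete, here is a standard computation of Expected Calibration Error, the metric the paper reports. The equal-width binning scheme and the toy data are illustrative conventions, not details from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    each bin's accuracy and its mean confidence, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy cases: well-calibrated (80% confident, 80% correct) vs. overconfident
well = expected_calibration_error(np.full(10, 0.8), [1]*8 + [0]*2)  # near 0
over = expected_calibration_error(np.full(10, 0.9), [1]*5 + [0]*5)  # near 0.4
```

Halving ECE, as reported, means the model's stated confidence tracks its actual hit rate about twice as closely.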

Practical Implications

  • Plug‑and‑play improvement: Deploy CORAL as a thin wrapper around any existing LLM inference service; no fine‑tuning or model weight changes required.
  • Cost‑effective scaling: Since the probe is tiny (a few hundred KB) and inference adds only marginal latency, large‑scale APIs can boost performance without extra GPU hours.
  • Better user experience: Lower ECE translates to more trustworthy confidence scores, which is crucial for downstream systems that act on model probabilities (e.g., automated tutoring, decision support).
  • Cross‑task robustness: Teams can train a single probe on a modest internal QA dataset and reap benefits across a suite of downstream MCQA benchmarks, reducing the need for task‑specific data collection.
  • Safety & alignment: Improved calibration helps mitigate over‑confident hallucinations, a common failure mode in instruction‑tuned LLMs.
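The "thin wrapper" deployment pattern above might look like the following sketch. `CoralWrapper`, its callables, and the toy model/probe are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

class CoralWrapper:
    """Hypothetical thin wrapper: re-ranks a base model's answer logits with
    a pre-trained correctness probe. `base_model` and `probe` are assumed to
    be any callables with the signatures noted below."""

    def __init__(self, base_model, probe, alpha=1.0):
        self.base_model = base_model  # question -> (logits, per-choice activations)
        self.probe = probe            # activations -> correctness scores
        self.alpha = alpha            # steering strength (assumption)

    def answer(self, question):
        logits, acts = self.base_model(question)
        steered = np.asarray(logits) + self.alpha * np.asarray(self.probe(acts))
        s = steered - steered.max()
        probs = np.exp(s) / np.exp(s).sum()
        return int(np.argmax(probs)), probs

# Toy usage: the base model slightly prefers choice 0, but the probe's
# correctness scores flip the ranking toward choice 1.
fake_model = lambda q: (np.array([2.0, 1.9, 0.1]), np.eye(3))
fake_probe = lambda acts: acts @ np.array([0.0, 2.0, 0.0])
idx, probs = CoralWrapper(fake_model, fake_probe).answer("Which option...?")
```

Because the wrapper only needs logits and activations from the base service, it can sit in front of an existing inference endpoint without modifying model weights.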

Limitations & Future Work

  • Scope limited to multiple‑choice QA – The current experiments focus on MCQA; extending CORAL to open‑ended generation or other output formats remains an open question.
  • Probe training data requirement – While modest, the method still needs a labeled calibration set; performance may degrade if the set is too small or domain‑mismatched.
  • Potential for probe overfitting – Even with strong weight decay, probes could capture dataset‑specific quirks, so systematic evaluation on truly unseen domains is needed.
  • Future directions suggested by the authors include:
    1. Exploring hierarchical probes that operate across multiple layers simultaneously.
    2. Adapting the residual steering concept to token‑level generation.
    3. Integrating CORAL with reinforcement‑learning‑from‑human‑feedback pipelines to jointly improve correctness and alignment.

Authors

  • Miranda Muqing Miao
  • Young‑Min Cho
  • Lyle Ungar

Paper Information

  • arXiv ID: 2602.06022v1
  • Categories: cs.LG, cs.AI
  • Published: February 5, 2026