[Paper] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Published: February 10, 2026
Source: arXiv - 2602.10067v1

Overview

The paper introduces RLFR (Reinforcement Learning from Feature Rewards), a new way to teach large language models (LLMs) to avoid hallucinations by turning internal “feature detectors” into reward signals. By probing a model for signs of factual uncertainty and then using those signals as supervision, the authors show that a 12‑billion‑parameter model can become dramatically more reliable without sacrificing its overall capabilities.

Key Contributions

  • Feature‑based reward formulation: Proposes using interpretability‑derived features (e.g., “factuality confidence”) as scalable reward functions for RL, rather than hand‑crafted scalar rewards.
  • Probing framework for hallucination detection: Develops a lightweight probe that flags potentially fabricated claims in model outputs, providing the raw signal for the reward.
  • End‑to‑end RL pipeline (RLFR): Integrates the probe, reward computation, and policy fine‑tuning into a single reinforcement‑learning loop.
  • Empirical validation on Gemma‑3‑12B‑IT: Demonstrates a 58 % reduction in hallucination rate while leaving benchmark scores (e.g., MMLU, GSM‑8K) essentially unchanged.
  • Test‑time compute savings: Shows that the same feature‑based signal can be used at inference time to decide when to intervene, cutting unnecessary computation.

Methodology

  1. Feature Extraction via Probing

    • A small, trainable classifier (the “probe”) is attached to hidden layers of the base LLM.
    • The probe is trained on a curated dataset of factual vs. fabricated statements, learning to output a factuality score for each token or sentence.
  2. Reward Construction

    • The factuality score is transformed into a scalar reward: high scores → positive reward, low scores → penalty.
    • Additional shaping terms (e.g., length regularization) keep the model from gaming the reward by truncating output.
  3. Reinforcement Learning Loop

    • The base model generates a completion.
    • The probe evaluates the completion, producing the reward.
    • Proximal Policy Optimization (PPO) updates the model’s policy to maximize expected reward, encouraging it to self‑correct or request clarification when uncertain.
  4. Test‑time Intervention

    • During inference, the same probe runs in real time.
    • If the factuality score drops below a threshold, the model either (a) rewrites the segment, (b) asks the user for clarification, or (c) flags the output.
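Steps 1–3 above can be sketched as a minimal toy loop. This is an illustrative sketch only: the probe weights, the sigmoid probe architecture, the reward mapping, and the length‑shaping coefficients below are assumptions for demonstration, not the paper's exact implementation (the actual probe is a small classifier trained on the LLM's hidden layers, and the reward feeds into PPO).

```python
import math

def probe_score(hidden_state):
    """Toy linear probe: map a hidden-state vector to a factuality
    probability with a fixed weight vector and a sigmoid. In the paper the
    probe is a small trained classifier; these weights are placeholders."""
    weights = [0.8, -0.5, 0.3]  # hypothetical "learned" probe weights
    z = sum(w * h for w, h in zip(weights, hidden_state))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> score in (0, 1)

def feature_reward(scores, length, target_length=32, length_coef=0.01):
    """Turn per-token factuality scores into a scalar RL reward.
    High scores -> positive reward; the length-regularization term keeps
    the policy from gaming the reward by truncating its output. Both
    shaping choices here are illustrative assumptions."""
    factuality = sum(scores) / len(scores)               # mean score in (0, 1)
    reward = 2.0 * factuality - 1.0                      # map to [-1, 1]
    reward -= length_coef * abs(length - target_length)  # length shaping
    return reward

# One step of the loop: probe a (fake) completion's hidden states,
# then compute the scalar reward that a PPO update would maximize.
hidden_states = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.1], [0.7, 0.3, 0.5]]
scores = [probe_score(h) for h in hidden_states]
r = feature_reward(scores, length=len(scores))
```

In the full pipeline, `r` would be the per‑completion reward consumed by PPO; the probe runs alongside the frozen base model, so computing it adds little overhead relative to generation.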

Results & Findings

| Metric | Baseline (Gemma‑3‑12B‑IT) | RLFR‑fine‑tuned |
| --- | --- | --- |
| Hallucination rate (synthetic fact‑checking suite) | 1.00 (baseline) | 0.42 (‑58 %) |
| MMLU (accuracy) | 71.3 % | 71.1 % (≈ unchanged) |
| GSM‑8K (numeric reasoning) | 68.5 % | 68.2 % |
| Average inference latency (per token) | 1.0× | 0.88× (thanks to early‑stop intervention) |

What it means: The RLFR policy learns to recognize when it is unsure and either corrects itself or signals uncertainty, cutting hallucinations by more than half while leaving core performance untouched. Moreover, the same feature signal can be reused at inference time to avoid wasteful computation.

Practical Implications

  • Safer AI assistants: Deployments that need high factual reliability (e.g., medical triage bots, legal drafting tools) can integrate RLFR to automatically self‑audit answers.
  • Cost‑effective scaling: Because the probe is tiny compared to the full model, the reward computation adds negligible overhead, making the approach viable for production‑scale services.
  • Plug‑and‑play supervision: Teams can reuse existing interpretability probes (e.g., toxicity, bias detectors) as reward functions for other open‑ended behaviors, turning interpretability research into a direct training asset.
  • Dynamic inference control: The test‑time intervention mechanism enables “on‑the‑fly” quality gating, allowing APIs to return a confidence flag or request clarification without a separate post‑processing step.
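The "on‑the‑fly" quality gating described above can be sketched as a simple threshold policy over probe scores. The three actions (rewrite, ask for clarification, flag) mirror the paper's description of test‑time intervention, but the concrete thresholds, the `gate_output` helper, and the toy probe below are illustrative assumptions.

```python
def gate_output(segments, probe, threshold=0.5):
    """Test-time intervention sketch: score each generated segment with the
    factuality probe and decide what to do with low-scoring ones.
    Thresholds and actions here are hypothetical, not the paper's values."""
    actions = []
    for text in segments:
        score = probe(text)
        if score >= threshold:
            actions.append(("keep", text))
        elif score >= threshold - 0.2:
            actions.append(("flag", text))     # borderline: surface a warning
        else:
            actions.append(("rewrite", text))  # low confidence: regenerate
    return actions

# Hypothetical stand-in probe: treat hedging language as low-confidence.
toy_probe = lambda s: 0.2 if "maybe" in s else 0.9

result = gate_output(
    ["Paris is in France.", "maybe the moon is cheese"], toy_probe
)
```

Because the gate is just a comparison against the probe's score, an API can attach a confidence flag or trigger a rewrite without a separate post‑processing model, which is where the latency savings in the results table come from.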

Limitations & Future Work

  • Probe quality dependence: The RLFR pipeline’s success hinges on the probe’s ability to correctly flag hallucinations; noisy or biased probes could misguide the policy.
  • Domain transfer: The current experiments focus on general‑purpose QA; extending to highly specialized domains (e.g., scientific literature) may require domain‑specific probing data.
  • Reward shaping complexity: Simple factuality scores work for hallucination reduction, but more nuanced tasks (e.g., creativity vs. accuracy trade‑offs) may need multi‑objective reward designs.
  • Scalability to larger models: While demonstrated on a 12 B parameter model, it remains an open question how the approach scales to 100 B+ models where probe insertion and PPO stability become more challenging.

Future directions include building a library of reusable feature‑based rewards (bias, privacy, energy consumption), exploring hierarchical RL where multiple feature rewards are balanced, and testing RLFR on multimodal models where visual‑textual consistency is critical.

Authors

  • Aaditya Vikram Prasad
  • Connor Watts
  • Jack Merullo
  • Dhruvil Gala
  • Owen Lewis
  • Thomas McGrath
  • Ekdeep Singh Lubana

Paper Information

  • arXiv ID: 2602.10067v1
  • Categories: cs.LG
  • Published: February 10, 2026
