[Paper] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Published: February 10, 2026
Source: arXiv - 2602.10067v1

Overview

The paper introduces RLFR (Reinforcement Learning from Feature Rewards), a new way to teach large language models (LLMs) to avoid hallucinations by turning internal “feature detectors” into reward signals. By probing a model for signs of factual uncertainty and then using those signals as supervision, the authors show that a 12‑billion‑parameter model can become dramatically more reliable without sacrificing its overall capabilities.

Key Contributions

  • Feature‑based reward formulation: Proposes using interpretability‑derived features (e.g., “factuality confidence”) as scalable reward functions for RL, rather than hand‑crafted scalar rewards.
  • Probing framework for hallucination detection: Develops a lightweight probe that flags potentially fabricated claims in model outputs, providing the raw signal for the reward.
  • End‑to‑end RL pipeline (RLFR): Integrates the probe, reward computation, and policy fine‑tuning into a single reinforcement‑learning loop.
  • Empirical validation on Gemma‑3‑12B‑IT: Demonstrates a 58 % reduction in hallucination rate while leaving benchmark scores (e.g., MMLU, GSM‑8K) essentially unchanged.
  • Test‑time compute savings: Shows that the same feature‑based signal can be used at inference time to decide when to intervene, cutting unnecessary computation.

Methodology

  1. Feature Extraction via Probing

    • A small, trainable classifier (the “probe”) is attached to hidden layers of the base LLM.
    • The probe is trained on a curated dataset of factual vs. fabricated statements, learning to output a factuality score for each token or sentence.
  2. Reward Construction

    • The factuality score is transformed into a scalar reward: high scores → positive reward, low scores → penalty.
    • Additional shaping terms (e.g., length regularization) keep the model from gaming the reward by truncating output.
  3. Reinforcement Learning Loop

    • The base model generates a completion.
    • The probe evaluates the completion, producing the reward.
    • Proximal Policy Optimization (PPO) updates the model’s policy to maximize expected reward, encouraging it to self‑correct or request clarification when uncertain.
  4. Test‑time Intervention

    • During inference, the same probe runs in real time.
    • If the factuality score drops below a threshold, the model either (a) rewrites the segment, (b) asks the user for clarification, or (c) flags the output.
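Steps 1–3 above can be sketched as a minimal toy loop. This is an illustrative sketch only: the probe weights, the sigmoid probe architecture, the reward mapping, and the length‑shaping coefficients below are assumptions for demonstration, not the paper's exact implementation (the actual probe is a small classifier trained on the LLM's hidden layers, and the reward feeds into PPO).

```python
import math

def probe_score(hidden_state):
    """Toy linear probe: map a hidden-state vector to a factuality
    probability with a fixed weight vector and a sigmoid. In the paper the
    probe is a small trained classifier; these weights are placeholders."""
    weights = [0.8, -0.5, 0.3]  # hypothetical "learned" probe weights
    z = sum(w * h for w, h in zip(weights, hidden_state))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> score in (0, 1)

def feature_reward(scores, length, target_length=32, length_coef=0.01):
    """Turn per-token factuality scores into a scalar RL reward.
    High scores -> positive reward; the length-regularization term keeps
    the policy from gaming the reward by truncating its output. Both
    shaping choices here are illustrative assumptions."""
    factuality = sum(scores) / len(scores)               # mean score in (0, 1)
    reward = 2.0 * factuality - 1.0                      # map to [-1, 1]
    reward -= length_coef * abs(length - target_length)  # length shaping
    return reward

# One step of the loop: probe a (fake) completion's hidden states,
# then compute the scalar reward that a PPO update would maximize.
hidden_states = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.1], [0.7, 0.3, 0.5]]
scores = [probe_score(h) for h in hidden_states]
r = feature_reward(scores, length=len(scores))
```

In the full pipeline, `r` would be the per‑completion reward consumed by PPO; the probe runs alongside the frozen base model, so computing it adds little overhead relative to generation.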

Results & Findings

| Metric | Baseline (Gemma‑3‑12B‑IT) | RLFR‑fine‑tuned |
| --- | --- | --- |
| Hallucination rate (synthetic fact‑checking suite) | 1.00 (baseline) | 0.42 (‑58 %) |
| MMLU (accuracy) | 71.3 % | 71.1 % (≈ unchanged) |
| GSM‑8K (numeric reasoning) | 68.5 % | 68.2 % |
| Average inference latency (per token) | 1.0× | 0.88× (thanks to early‑stop intervention) |

What it means: The RLFR policy learns to recognize when it is unsure and either corrects itself or signals uncertainty, cutting hallucinations by more than half while leaving core performance untouched. Moreover, the same feature signal can be reused at inference time to avoid wasteful computation.

Practical Implications

  • Safer AI assistants: Deployments that need high factual reliability (e.g., medical triage bots, legal drafting tools) can integrate RLFR to automatically self‑audit answers.
  • Cost‑effective scaling: Because the probe is tiny compared to the full model, the reward computation adds negligible overhead, making the approach viable for production‑scale services.
  • Plug‑and‑play supervision: Teams can reuse existing interpretability probes (e.g., toxicity, bias detectors) as reward functions for other open‑ended behaviors, turning interpretability research into a direct training asset.
  • Dynamic inference control: The test‑time intervention mechanism enables “on‑the‑fly” quality gating, allowing APIs to return a confidence flag or request clarification without a separate post‑processing step.
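The "on‑the‑fly" quality gating described above can be sketched as a simple threshold policy over probe scores. The three actions (rewrite, ask for clarification, flag) mirror the paper's description of test‑time intervention, but the concrete thresholds, the `gate_output` helper, and the toy probe below are illustrative assumptions.

```python
def gate_output(segments, probe, threshold=0.5):
    """Test-time intervention sketch: score each generated segment with the
    factuality probe and decide what to do with low-scoring ones.
    Thresholds and actions here are hypothetical, not the paper's values."""
    actions = []
    for text in segments:
        score = probe(text)
        if score >= threshold:
            actions.append(("keep", text))
        elif score >= threshold - 0.2:
            actions.append(("flag", text))     # borderline: surface a warning
        else:
            actions.append(("rewrite", text))  # low confidence: regenerate
    return actions

# Hypothetical stand-in probe: treat hedging language as low-confidence.
toy_probe = lambda s: 0.2 if "maybe" in s else 0.9

result = gate_output(
    ["Paris is in France.", "maybe the moon is cheese"], toy_probe
)
```

Because the gate is just a comparison against the probe's score, an API can attach a confidence flag or trigger a rewrite without a separate post‑processing model, which is where the latency savings in the results table come from.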

Limitations & Future Work

  • Probe quality dependence: The RLFR pipeline’s success hinges on the probe’s ability to correctly flag hallucinations; noisy or biased probes could misguide the policy.
  • Domain transfer: The current experiments focus on general‑purpose QA; extending to highly specialized domains (e.g., scientific literature) may require domain‑specific probing data.
  • Reward shaping complexity: Simple factuality scores work for hallucination reduction, but more nuanced tasks (e.g., creativity vs. accuracy trade‑offs) may need multi‑objective reward designs.
  • Scalability to larger models: While demonstrated on a 12 B parameter model, it remains an open question how the approach scales to 100 B+ models where probe insertion and PPO stability become more challenging.

Future directions include building a library of reusable feature‑based rewards (bias, privacy, energy consumption), exploring hierarchical RL where multiple feature rewards are balanced, and testing RLFR on multimodal models where visual‑textual consistency is critical.

Authors

  • Aaditya Vikram Prasad
  • Connor Watts
  • Jack Merullo
  • Dhruvil Gala
  • Owen Lewis
  • Thomas McGrath
  • Ekdeep Singh Lubana

Paper Information

  • arXiv ID: 2602.10067v1
  • Categories: cs.LG
  • Published: February 10, 2026
