[Paper] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Published: December 9, 2025 at 01:33 PM EST
4 min read
Source: arXiv - 2512.08892v1

Overview

Retrieval‑augmented generation (RAG) promises more factual outputs by grounding a language model’s responses in external documents, yet it still hallucinates—producing statements that contradict or go beyond the retrieved evidence. This paper introduces RAGLens, a lightweight detector that taps into the model’s own internal activations (via sparse autoencoders) to spot those unfaithful generations, delivering both higher detection accuracy and human‑readable explanations.

Key Contributions

  • Sparse Autoencoder‑Based Feature Extraction: Shows how to disentangle an LLM’s hidden states into sparse, interpretable features that light up specifically during RAG hallucinations.
  • RAGLens Detector: A compact, training‑free hallucination detector built on information‑theoretic feature selection and additive modeling, outperforming prior detector baselines.
  • Interpretability & Post‑hoc Mitigation: Provides per‑token rationales (which internal features triggered) that can be used to edit or reject unfaithful outputs.
  • Empirical Validation: Benchmarks on multiple RAG setups (e.g., Retrieval‑Augmented GPT‑2, LLaMA‑2) demonstrate superior precision/recall while keeping inference overhead minimal.
  • Open‑source Release: Full code, pretrained autoencoders, and analysis scripts are released for reproducibility.

Methodology

  1. Collect Activation Snapshots: The authors run a base LLM (e.g., LLaMA‑2) on a set of RAG prompts, recording hidden‑state activations from several transformer layers for both faithful and hallucinated outputs (identified via a small human‑annotated validation set).
  2. Train Sparse Autoencoders (SAEs): For each layer, a shallow autoencoder with a strong sparsity penalty learns a compressed representation where each neuron corresponds to a feature that is active only on a few inputs (a minimal sketch of steps 1–2 follows this list).
  3. Feature Selection via Mutual Information: They compute the mutual information between each SAE feature and the binary hallucination label, selecting the top‑k most informative features across layers.
  4. Additive Feature Modeling: A simple logistic regression (or linear probe) combines the selected features into a hallucination score. Because the features are sparse and interpretable, the model stays lightweight (≈ a few hundred parameters); steps 3–5 are sketched after this list.
  5. RAGLens Inference: At test time, the LLM processes a new RAG prompt, the SAEs encode its activations, the selected features are extracted, and the linear probe flags the output as faithful or not. The activated features are also reported as a rationale.
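
To make steps 1–2 concrete, here is a minimal sketch that caches per‑layer hidden states with Hugging Face Transformers and fits a shallow autoencoder with an L1 sparsity penalty. The model name, layer index, feature width, and loss coefficients are illustrative assumptions, not the paper’s actual settings.

```python
# Minimal sketch of steps 1-2: cache per-layer hidden states for RAG prompts,
# then fit a shallow sparse autoencoder on them. Model name, layer index,
# feature width, and the L1 coefficient are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed base model for illustration
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
llm.eval()

@torch.no_grad()
def cache_hidden_states(prompt: str, layer: int = 8) -> torch.Tensor:
    """Return (num_tokens, d_model) activations from one layer for one RAG prompt."""
    inputs = tok(prompt, return_tensors="pt")
    out = llm(**inputs)                       # out.hidden_states: one tensor per layer (+ embeddings)
    return out.hidden_states[layer][0]

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))       # sparse, non-negative feature activations
        return self.decoder(z), z

def train_sae(acts: torch.Tensor, d_features: int = 4096,
              l1_coeff: float = 1e-3, epochs: int = 20, lr: float = 1e-4):
    """acts: (num_tokens, d_model) activations cached from a single layer."""
    sae = SparseAutoencoder(acts.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * z.abs().mean()  # reconstruction + sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```

Each SAE unit then becomes a candidate feature for the selection step sketched next.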
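
Steps 3–5 reduce to standard tooling over per‑example SAE feature activations. A minimal sketch with scikit‑learn, assuming a feature matrix and binary faithful/hallucinated labels (variable names and top_k are illustrative):

```python
# Minimal sketch of steps 3-5: mutual-information feature selection plus a
# lightweight logistic probe. Variable names and top_k are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

def fit_probe(features: np.ndarray, labels: np.ndarray, top_k: int = 200):
    """features: (num_examples, num_sae_features); labels: 1 = hallucinated, 0 = faithful."""
    mi = mutual_info_classif(features, labels)     # MI between each SAE feature and the label
    top_idx = np.argsort(mi)[-top_k:]              # keep the k most informative features
    probe = LogisticRegression(max_iter=1000).fit(features[:, top_idx], labels)
    return probe, top_idx

def hallucination_score(probe, top_idx, features: np.ndarray) -> np.ndarray:
    """Probability that each output is unfaithful; active features in top_idx give the rationale."""
    return probe.predict_proba(features[:, top_idx])[:, 1]
```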

Results & Findings

| Metric (held‑out RAG benchmark) | RAGLens | Prior LLM‑based judge | Fine‑tuned hallucination detector |
| --- | --- | --- | --- |
| F1 score | 0.84 | 0.71 | 0.78 |
| Precision | 0.86 | 0.73 | 0.80 |
| Recall | 0.82 | 0.69 | 0.77 |
| Inference overhead (ms) | 12 | 150 (LLM query) | 35 (small classifier) |

  • Higher detection quality with a fraction of the compute cost compared to calling an external LLM as a judge.
  • Interpretability: In > 70 % of flagged cases, the top activated feature corresponds to a concrete linguistic cue (e.g., “unsupported citation”, “numeric mismatch”).
  • Layer distribution: Hallucination‑related features concentrate in the early‑to‑middle transformer layers (layers 6‑9 of a 24‑layer model), suggesting that factual grounding is resolved well before the final layers of the forward pass.

Practical Implications

  • Plug‑and‑play safety layer: Developers can attach RAGLens to any existing RAG pipeline (e.g., LangChain, Retrieval‑QA bots) without retraining the underlying LLM, gaining a cheap “faithfulness guard” (illustrated after this list).
  • Cost‑effective moderation: Since RAGLens runs in ~10 ms on a single GPU, it scales to high‑throughput services where calling a separate LLM for verification would be prohibitive.
  • Debugging & data collection: The interpretable feature flags help engineers pinpoint systematic failure modes (e.g., missing citations, numeric errors) and curate better retrieval corpora.
  • Fine‑grained control: By exposing which internal features triggered a flag, downstream systems can decide whether to request additional evidence, re‑rank retrieved documents, or simply refuse to answer.
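
To illustrate the plug‑and‑play point, a guard of this kind could wrap an existing pipeline roughly as follows. The detector object, its score() method, and the 0.5 threshold are hypothetical placeholders, not the released RAGLens API.

```python
# Hypothetical wrapper showing where a RAGLens-style faithfulness guard could sit
# in a RAG pipeline. The detector object, its score() method, and the threshold
# are illustrative placeholders, not the paper's released API.

def answer_with_guard(query: str, retriever, llm, detector, threshold: float = 0.5) -> dict:
    docs = retriever.retrieve(query)                        # any retrieval backend
    answer = llm.generate(query, context=docs)              # the underlying RAG generation
    risk, rationale = detector.score(query, docs, answer)   # hallucination score + feature rationale
    if risk >= threshold:
        # Downstream policy is application-specific: refuse, re-retrieve, or ask for more evidence.
        return {"answer": None, "flagged": True, "rationale": rationale}
    return {"answer": answer, "flagged": False, "rationale": rationale}
```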

Limitations & Future Work

  • Dependence on a small annotated seed set: An initial set of outputs labeled as faithful vs. hallucinated is needed to select informative features and fit the probe; the quality of this seed set influences detection performance.
  • Model‑specific encoders: SAEs are trained per‑layer per‑model; transferring a trained RAGLens from one LLM (e.g., LLaMA‑2) to another (e.g., GPT‑4) would need fresh autoencoders.
  • Scope of hallucination types: The study focuses on factual contradictions and unsupported extensions; more subtle forms (e.g., tone drift, biased framing) remain unaddressed.
  • Future directions: The authors suggest exploring multi‑task autoencoders that capture a broader spectrum of unfaithfulness, and integrating RAGLens into the retrieval step itself (e.g., to re‑rank documents based on predicted hallucination risk).

RAGLens demonstrates that we don’t always need massive external judges or huge labeled datasets to keep retrieval‑augmented generation honest—sometimes the model’s own sparse internal signals are enough.

Authors

  • Guangzhi Xiong
  • Zhenghao He
  • Bohan Liu
  • Sanchit Sinha
  • Aidong Zhang

Paper Information

  • arXiv ID: 2512.08892v1
  • Categories: cs.CL, cs.AI
  • Published: December 9, 2025