[Paper] MoRFI: Monotonic Sparse Autoencoder Feature Identification

Published: April 29, 2026 at 12:32 PM EDT
5 min read
Source: arXiv - 2604.26866v1

Overview

This paper investigates why large language models (LLMs) start “hallucinating” facts after they are fine‑tuned on new knowledge. By running controlled fine‑tuning experiments on several 7‑9 B‑parameter models, the authors uncover latent directions in the model’s internal activations that are directly responsible for the degradation of factual recall. They introduce Monotonic Relationship Feature Identification (MoRFI), a technique that isolates those directions using sparse autoencoders (SAEs) and shows that intervening on a single latent can restore lost knowledge.

Key Contributions

  • Controlled fine‑tuning protocol that isolates the effect of new factual knowledge and training duration on hallucination rates.
  • Empirical evidence that incremental exposure to unknown facts systematically harms closed‑book QA performance, especially with longer training.
  • MoRFI algorithm: a monotonic filtering method that extracts SAE features whose activation strength varies consistently with the amount of new knowledge introduced.
  • Cross‑model validation on Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3, demonstrating that the same latent directions are implicated across architectures.
  • Single‑latent intervention experiments that recover correct answers, confirming a causal link between the identified features and factual retrieval.

Methodology

  1. Dataset & Fine‑tuning Setup
    • Seven distinct closed‑book QA datasets (each containing facts unknown to the base model).
    • For each model, fine‑tune on a mixture of the original pre‑training distribution and one QA dataset, varying the proportion of new facts (0 % → 100 %) and the number of epochs (1 → 5).
  2. Performance Measurement
    • Evaluate on a held‑out test set of the same QA domain to quantify hallucination (drop in exact‑match accuracy).
  3. Sparse Autoencoder (SAE) Extraction
    • Train an SAE on the residual‑stream activations of the base model (before any fine‑tuning), so that it learns a compact, interpretable basis of “features” (latent dimensions).
  4. MoRFI Filtering
    • For each checkpoint, compute the activation of every SAE feature across the fine‑tuning mixtures.
    • Keep only those features whose activation monotonically increases or decreases with the proportion of new knowledge (Spearman |ρ| > 0.8, p < 0.01); a code sketch of this filter appears after the pipeline summary below.
  5. Causal Intervention
    • Manipulate the identified latent(s) at inference time (e.g., set to the activation value seen before fine‑tuning) and observe whether the model’s answer reverts to the correct fact.

The pipeline is fully automated and requires only the residual stream, the SAE, and the fine‑tuning schedule—no gradient access to the original model.
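
To make the filtering step (4) concrete, here is a minimal sketch of the MoRFI monotonic filter under the thresholds quoted above. The array names, shapes, and the `morfi_filter` helper are illustrative assumptions, not the paper's code.

```python
# Sketch of the MoRFI monotonic filter (step 4). Assumes mean SAE feature
# activations have already been collected for each fine-tuning mixture:
# `activations` has shape (num_mixtures, num_features) and `proportions`
# holds the fraction of new facts in each mixture. Names are illustrative.
import numpy as np
from scipy.stats import spearmanr

def morfi_filter(proportions, activations, rho_min=0.8, p_max=0.01):
    """Return indices of SAE features whose activation varies monotonically
    (up or down) with the proportion of new knowledge."""
    kept = []
    for j in range(activations.shape[1]):
        rho, p = spearmanr(proportions, activations[:, j])
        if abs(rho) > rho_min and p < p_max:  # monotonic rise or fall
            kept.append(j)
    return kept

# Example with 6 mixtures (0% -> 100% new facts) and 4096 SAE features;
# random data stands in for real measurements here.
props = np.linspace(0.0, 1.0, 6)
acts = np.random.rand(6, 4096)
candidate_latents = morfi_filter(props, acts)
```

Because the filter only needs per-checkpoint activation statistics and a rank correlation, it runs without any gradient access, matching the pipeline description above.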

Results & Findings

| Model | Max QA accuracy (pre‑fine‑tune) | Accuracy after full fine‑tune | Hallucination increase | # MoRFI latents discovered |
|---|---|---|---|---|
| Llama 3.1 8B | 78 % | 55 % | +23 % | 12 |
| Gemma 2 9B | 81 % | 58 % | +23 % | 10 |
| Mistral 7B v0.3 | 79 % | 57 % | +22 % | 11 |
  • Monotonic trend: As the fraction of new facts rises, the activation of the MoRFI latents changes in a predictable direction, and the QA performance drops correspondingly.
  • Training length effect: Longer fine‑tuning (more epochs) amplifies the disruption, confirming that the problem is not just a data‑distribution shift but a parameter drift in specific subspaces.
  • Intervention success: Zero‑shot editing of a single MoRFI latent restores the original answer in ~85 % of cases, demonstrating a causal relationship rather than mere correlation (a minimal sketch of such an edit follows this list).
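
For intuition, this is one way such a single‑latent edit could be implemented with a PyTorch forward hook. The SAE's `encode`/`decode` interface, the layer index, and the target value are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical single-latent intervention via a forward hook (illustrative;
# not the paper's code). Assumes `sae.encode`/`sae.decode` map residual-stream
# activations into and out of the SAE feature basis.
import torch

def make_clamp_hook(sae, latent_idx, target_value):
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(resid)               # (batch, seq, num_features)
        feats[..., latent_idx] = target_value   # reset to pre-fine-tune level
        patched = sae.decode(feats)             # back to residual stream
        return (patched, *output[1:]) if isinstance(output, tuple) else patched
    return hook

# Usage (layer index and latent values are placeholders):
# handle = model.model.layers[20].register_forward_hook(
#     make_clamp_hook(sae, latent_idx=137, target_value=0.0))
# output = model.generate(**inputs)
# handle.remove()
```

Note that decoding the full SAE output replaces the residual stream with its reconstruction; a gentler variant would add only the change along the identified feature's decoder direction, leaving the SAE's reconstruction error untouched.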

Practical Implications

  • Debugging fine‑tuned LLMs: MoRFI provides a lightweight diagnostic tool for engineers to pinpoint which internal directions are being corrupted when adding new knowledge.
  • Safe model updates: Instead of retraining from scratch, developers can monitor MoRFI latents during incremental updates and halt training before hallucination spikes (see the sketch after this list).
  • Targeted editing: The single‑latent intervention suggests a path toward parameter‑efficient knowledge injection—modify only the identified directions rather than full‑model fine‑tuning.
  • Model‑agnostic safety layers: Since the method works across three distinct architectures, it can be integrated into deployment pipelines (e.g., as a post‑hoc check that rewrites problematic latents before answering).
  • Tooling prospects: Open‑source libraries could expose the MoRFI pipeline (SAE loading, monotonic filtering, latent editing) as a plug‑in for popular frameworks like Hugging Face Transformers.
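
As a sketch of what the "halt before hallucination spikes" check might look like in practice, here is a hypothetical per‑checkpoint guard; the drift threshold and helper names are invented for illustration and do not come from the paper.

```python
# Hypothetical checkpoint guard: stop fine-tuning once tracked MoRFI latents
# drift too far from their pre-fine-tune baseline. The threshold and array
# names are placeholders, not values from the paper.
import numpy as np

def should_halt(baseline, current, tracked, max_rel_drift=0.5):
    """True if any tracked latent's mean activation moved by more than
    `max_rel_drift` relative to its baseline magnitude."""
    drift = np.abs(current[tracked] - baseline[tracked])
    scale = np.abs(baseline[tracked]) + 1e-8  # avoid division by zero
    return bool(np.any(drift / scale > max_rel_drift))

# After each checkpoint:
# if should_halt(baseline_acts, checkpoint_acts, candidate_latents):
#     stop_training()
```
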

Limitations & Future Work

  • Scope of tasks: Experiments focus on closed‑book QA; it remains unclear how MoRFI behaves on generation‑heavy tasks (e.g., summarization, dialogue).
  • SAE dependence: The quality of identified latents hinges on the SAE’s capacity and training data; sub‑optimal SAEs may miss relevant features.
  • Scalability: While feasible for 7‑9 B models, applying the pipeline to 70 B+ models may require more efficient autoencoder architectures or dimensionality‑reduction tricks.
  • Causal granularity: Interventions are currently limited to single latents; future work could explore combinatorial edits or learn a mapping from natural language instructions to latent adjustments.
  • Long‑term stability: The paper does not assess whether fixing a latent once prevents future hallucinations after subsequent fine‑tuning cycles.

Overall, MoRFI opens a promising avenue for making LLM fine‑tuning more transparent and controllable, turning “black‑box” hallucinations into diagnosable and fixable internal dynamics.

Authors

  • Dimitris Dimakopoulos
  • Shay B. Cohen
  • Ioannis Konstas

Paper Information

  • arXiv ID: 2604.26866v1
  • Categories: cs.CL, cs.LG
  • Published: April 29, 2026