[Paper] MoRFI: Monotonic Sparse Autoencoder Feature Identification

Published: April 29, 2026 at 12:32 PM EDT
5 min read
Source: arXiv - 2604.26866v1

Overview

This paper investigates why large language models (LLMs) start “hallucinating” facts after they are fine‑tuned on new knowledge. By running controlled fine‑tuning experiments on several 7‑9 B‑parameter models, the authors uncover latent directions in the model’s internal activations that are directly responsible for the degradation of factual recall. They introduce Monotonic Relationship Feature Identification (MoRFI), a technique that isolates those directions using sparse autoencoders (SAEs) and shows that intervening on a single latent can restore lost knowledge.

Key Contributions

  • Controlled fine‑tuning protocol that isolates the effect of new factual knowledge and training duration on hallucination rates.
  • Empirical evidence that incremental exposure to unknown facts systematically harms closed‑book QA performance, especially with longer training.
  • MoRFI algorithm: a monotonic filtering method that extracts SAE features whose activation strength varies consistently with the amount of new knowledge introduced.
  • Cross‑model validation on Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3, demonstrating that the same latent directions are implicated across architectures.
  • Single‑latent intervention experiments that recover correct answers, confirming a causal link between the identified features and factual retrieval.

Methodology

  1. Dataset & Fine‑tuning Setup
    • Seven distinct closed‑book QA datasets (each containing facts unknown to the base model).
    • For each model, fine‑tune on a mixture of the original pre‑training distribution and one QA dataset, varying the proportion of new facts (0 % → 100 %) and the number of epochs (1 → 5).
  2. Performance Measurement
    • Evaluate on a held‑out test set of the same QA domain to quantify hallucination (drop in exact‑match accuracy).
  3. Sparse Autoencoder (SAE) Extraction
    • Train an SAE on the residual‑stream activations of the base model (before any fine‑tuning), so that it learns a compact, interpretable basis of “features” (latent dimensions).
  4. MoRFI Filtering
    • For each checkpoint, compute the activation of every SAE feature across the fine‑tuning mixtures.
    • Keep only those features whose activation monotonically increases or decreases with the proportion of new knowledge (Spearman |ρ| > 0.8, p < 0.01); a code sketch of this filter appears after the pipeline summary below.
  5. Causal Intervention
    • Manipulate the identified latent(s) at inference time (e.g., set to the activation value seen before fine‑tuning) and observe whether the model’s answer reverts to the correct fact.

The pipeline is fully automated and requires only the residual stream, the SAE, and the fine‑tuning schedule—no gradient access to the original model.
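
To make the filtering step (4) concrete, here is a minimal sketch of the MoRFI monotonic filter under the thresholds quoted above. The array names, shapes, and the `morfi_filter` helper are illustrative assumptions, not the paper's code.

```python
# Sketch of the MoRFI monotonic filter (step 4). Assumes mean SAE feature
# activations have already been collected for each fine-tuning mixture:
# `activations` has shape (num_mixtures, num_features) and `proportions`
# holds the fraction of new facts in each mixture. Names are illustrative.
import numpy as np
from scipy.stats import spearmanr

def morfi_filter(proportions, activations, rho_min=0.8, p_max=0.01):
    """Return indices of SAE features whose activation varies monotonically
    (up or down) with the proportion of new knowledge."""
    kept = []
    for j in range(activations.shape[1]):
        rho, p = spearmanr(proportions, activations[:, j])
        if abs(rho) > rho_min and p < p_max:  # monotonic rise or fall
            kept.append(j)
    return kept

# Example with 6 mixtures (0% -> 100% new facts) and 4096 SAE features;
# random data stands in for real measurements here.
props = np.linspace(0.0, 1.0, 6)
acts = np.random.rand(6, 4096)
candidate_latents = morfi_filter(props, acts)
```

Because the filter only needs per-checkpoint activation statistics and a rank correlation, it runs without any gradient access, matching the pipeline description above.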

Results & Findings

| Model | Max QA accuracy (pre‑fine‑tune) | Accuracy after full fine‑tune | Hallucination increase | # MoRFI latents discovered |
|---|---|---|---|---|
| Llama 3.1 8B | 78 % | 55 % | +23 % | 12 |
| Gemma 2 9B | 81 % | 58 % | +23 % | 10 |
| Mistral 7B v0.3 | 79 % | 57 % | +22 % | 11 |
  • Monotonic trend: As the fraction of new facts rises, the activation of the MoRFI latents changes in a predictable direction, and the QA performance drops correspondingly.
  • Training length effect: Longer fine‑tuning (more epochs) amplifies the disruption, confirming that the problem is not just a data‑distribution shift but a parameter drift in specific subspaces.
  • Intervention success: Zero‑shot editing of a single MoRFI latent restores the original answer in ~85 % of cases, demonstrating a causal relationship rather than mere correlation (a minimal sketch of such an edit follows this list).
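
For intuition, this is one way such a single‑latent edit could be implemented with a PyTorch forward hook. The SAE's `encode`/`decode` interface, the layer index, and the target value are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical single-latent intervention via a forward hook (illustrative;
# not the paper's code). Assumes `sae.encode`/`sae.decode` map residual-stream
# activations into and out of the SAE feature basis.
import torch

def make_clamp_hook(sae, latent_idx, target_value):
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(resid)               # (batch, seq, num_features)
        feats[..., latent_idx] = target_value   # reset to pre-fine-tune level
        patched = sae.decode(feats)             # back to residual stream
        return (patched, *output[1:]) if isinstance(output, tuple) else patched
    return hook

# Usage (layer index and latent values are placeholders):
# handle = model.model.layers[20].register_forward_hook(
#     make_clamp_hook(sae, latent_idx=137, target_value=0.0))
# output = model.generate(**inputs)
# handle.remove()
```

Note that decoding the full SAE output replaces the residual stream with its reconstruction; a gentler variant would add only the change along the identified feature's decoder direction, leaving the SAE's reconstruction error untouched.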

Practical Implications

  • Debugging fine‑tuned LLMs: MoRFI provides a lightweight diagnostic tool for engineers to pinpoint which internal directions are being corrupted when adding new knowledge.
  • Safe model updates: Instead of retraining from scratch, developers can monitor MoRFI latents during incremental updates and halt training before hallucination spikes (see the sketch after this list).
  • Targeted editing: The single‑latent intervention suggests a path toward parameter‑efficient knowledge injection—modify only the identified directions rather than full‑model fine‑tuning.
  • Model‑agnostic safety layers: Since the method works across three distinct architectures, it can be integrated into deployment pipelines (e.g., as a post‑hoc check that rewrites problematic latents before answering).
  • Tooling prospects: Open‑source libraries could expose the MoRFI pipeline (SAE loading, monotonic filtering, latent editing) as a plug‑in for popular frameworks like Hugging Face Transformers.
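
As a sketch of what the "halt before hallucination spikes" check might look like in practice, here is a hypothetical per‑checkpoint guard; the drift threshold and helper names are invented for illustration and do not come from the paper.

```python
# Hypothetical checkpoint guard: stop fine-tuning once tracked MoRFI latents
# drift too far from their pre-fine-tune baseline. The threshold and array
# names are placeholders, not values from the paper.
import numpy as np

def should_halt(baseline, current, tracked, max_rel_drift=0.5):
    """True if any tracked latent's mean activation moved by more than
    `max_rel_drift` relative to its baseline magnitude."""
    drift = np.abs(current[tracked] - baseline[tracked])
    scale = np.abs(baseline[tracked]) + 1e-8  # avoid division by zero
    return bool(np.any(drift / scale > max_rel_drift))

# After each checkpoint:
# if should_halt(baseline_acts, checkpoint_acts, candidate_latents):
#     stop_training()
```
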

Limitations & Future Work

  • Scope of tasks: Experiments focus on closed‑book QA; it remains unclear how MoRFI behaves on generation‑heavy tasks (e.g., summarization, dialogue).
  • SAE dependence: The quality of identified latents hinges on the SAE’s capacity and training data; sub‑optimal SAEs may miss relevant features.
  • Scalability: While feasible for 7‑9 B models, applying the pipeline to 70 B+ models may require more efficient autoencoder architectures or dimensionality‑reduction tricks.
  • Causal granularity: Interventions are currently limited to single latents; future work could explore combinatorial edits or learn a mapping from natural language instructions to latent adjustments.
  • Long‑term stability: The paper does not assess whether fixing a latent once prevents future hallucinations after subsequent fine‑tuning cycles.

Overall, MoRFI opens a promising avenue for making LLM fine‑tuning more transparent and controllable, turning “black‑box” hallucinations into diagnosable and fixable internal dynamics.

Authors

  • Dimitris Dimakopoulos
  • Shay B. Cohen
  • Ioannis Konstas

Paper Information

  • arXiv ID: 2604.26866v1
  • Categories: cs.CL, cs.LG
  • Published: April 29, 2026