[Paper] Why Fine-Tuning Encourages Hallucinations and How to Fix It
Source: arXiv - 2604.15574v1
Overview
Fine‑tuning large language models (LLMs) on task‑specific data often makes them more useful, but it also introduces a subtle side‑effect: the models start to “hallucinate” facts that contradict what they learned during pre‑training. The paper Why Fine‑Tuning Encourages Hallucinations and How to Fix It investigates why this happens and proposes concrete, low‑overhead fixes that can be adopted by developers building or maintaining LLM‑based products.
Key Contributions
- Identify fine‑tuning as a source of factual degradation – shows that supervised fine‑tuning (SFT) can overwrite or interfere with pre‑training knowledge, leading to more hallucinations.
- Adapt continual‑learning tools to LLM fine‑tuning – introduces a self‑distillation regularizer that penalizes drift in the model’s output distribution, preserving existing facts while still learning new ones.
- Parameter‑freezing strategy for “static” knowledge – demonstrates that freezing selected parameter groups (reducing “factual plasticity”) retains task performance and cuts hallucinations when no new factual knowledge is needed.
- Empirical analysis of three hypothesized mechanisms (capacity limits, behavior cloning, localized interference) and evidence that interference among overlapping semantic representations is the dominant cause.
- Open‑source implementation and reproducible benchmarks on standard LLMs (e.g., LLaMA‑7B, Falcon‑40B) and factual evaluation suites (e.g., TruthfulQA, MMLU).
Methodology
- Baseline Fine‑Tuning – The authors start with a pre‑trained LLM and fine‑tune it on a supervised dataset (e.g., instruction following). They measure hallucination rates by probing the model on factual questions that were not part of the fine‑tuning data.
- Self‑Distillation Regularizer
  - Before fine‑tuning, the original model (the “teacher”) generates probability distributions over tokens for each fine‑tuning example.
  - During fine‑tuning, the model (the “student”) is trained simultaneously on the supervised loss and a KL‑divergence loss that forces its output distribution to stay close to the teacher’s.
  - This discourages the model from moving too far away from its pre‑training knowledge while still allowing it to adapt to the new task.
- Selective Parameter Freezing
  - The authors identify layers that contribute most to factual recall (typically lower‑mid layers).
  - When new data does not contain novel facts, they freeze these layers, letting only the higher‑level “task‑specific” layers update.
- Diagnostic Experiments
  - Capacity Test: Vary model size to see if larger capacity reduces hallucinations.
  - Behavior‑Cloning Test: Compare SFT to pure behavior cloning on the same data.
  - Interference Test: Use probing classifiers to measure overlap between semantic representations before and after fine‑tuning.
All experiments are run on publicly available models and datasets, and the code is released under an MIT license.
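The self‑distillation objective described in the methodology can be sketched in a few lines. This is a minimal, framework‑free illustration and not the authors' released code: the function names, the `kl_weight` coefficient, and the direction of the KL term (teacher relative to student) are all assumptions, since the summary only states that a KL‑divergence loss keeps the student close to the teacher.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position (assumed direction)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def self_distillation_loss(student_logits, teacher_logits, target_ids, kl_weight=0.5):
    """Average over positions of: cross-entropy + kl_weight * KL(teacher || student).

    student_logits / teacher_logits: one list of vocabulary logits per position.
    target_ids: the supervised target token id for each position.
    kl_weight: hypothetical coefficient balancing the two terms.
    """
    total = 0.0
    for s, t, y in zip(student_logits, teacher_logits, target_ids):
        cross_entropy = -log_softmax(s)[y]  # supervised fine-tuning term
        drift_penalty = kl_divergence(t, s)  # zero when student matches teacher
        total += cross_entropy + kl_weight * drift_penalty
    return total / len(target_ids)
```

When the student's logits equal the teacher's, the KL term vanishes and the loss reduces to the ordinary supervised loss; the penalty grows as the student's output distribution drifts away from the teacher's. In a real pipeline the teacher logits would be precomputed once from the frozen pre‑trained model, as the methodology describes.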
Results & Findings
| Setting | Baseline SFT Hallucination Rate* | Self‑Distillation | Freezing (no new facts) |
|---|---|---|---|
| LLaMA‑7B on TruthfulQA | 28% | 19% (≈30% reduction) | 15% |
| Falcon‑40B on MMLU factual subset | 22% | 14% | 12% |
| Zero‑shot factual recall (no fine‑tuning) | 9% | 9% (unchanged) | 9% |
*A hallucination is counted when the model answers a factual question incorrectly with more than 70% confidence.
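The hallucination metric defined in the table footnote can be computed as follows. This is a sketch under stated assumptions: the input format (one `(is_correct, confidence)` pair per question), the function names, and the choice of denominator (all questions asked, not only the incorrect ones) are illustrative and not taken from the paper.

```python
def hallucination_rate(predictions, threshold=0.70):
    """Fraction of questions answered incorrectly with confidence above threshold.

    predictions: list of (is_correct, confidence) pairs, confidence in [0, 1].
    The denominator is the total number of questions (an assumption).
    """
    if not predictions:
        return 0.0
    flagged = sum(
        1 for is_correct, confidence in predictions
        if not is_correct and confidence > threshold
    )
    return flagged / len(predictions)


def relative_reduction(baseline_rate, improved_rate):
    """Relative drop in hallucination rate compared with the baseline."""
    return (baseline_rate - improved_rate) / baseline_rate
```

Applied to the LLaMA‑7B row of the table, `relative_reduction(0.28, 0.19)` gives roughly 0.32, which is consistent with the "≈30% reduction" reported there.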
- Self‑distillation consistently lowers hallucinations while keeping downstream task accuracy within 0.5–1% of the vanilla fine‑tuned model.
- Freezing the “knowledge‑critical” layers yields the biggest drop in hallucinations when the fine‑tuning data does not introduce new facts, with negligible loss in task performance.
- Interference analysis shows that token‑level embeddings for semantically related concepts become more entangled after SFT; the KL regularizer reduces this entanglement, which supports the interference hypothesis.
- Capacity alone (using larger models) only modestly improves factual stability, indicating that the problem is not simply “not enough parameters”.
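The layer‑freezing result above relies on selecting which parameters stop receiving updates. The helper below is a minimal sketch of that selection step, not the authors' procedure: it assumes parameter names contain a `.layers.<index>.` segment (a common naming scheme in transformer implementations), and the layer range passed in is hypothetical, since the summary only says the knowledge‑critical layers are "typically lower‑mid".

```python
import re

# Assumed naming scheme: "...layers.<index>...." somewhere in the parameter name.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def names_to_freeze(param_names, first_layer, last_layer):
    """Return parameter names whose layer index lies in [first_layer, last_layer].

    Names without a layer index (embeddings, output head) are left trainable.
    """
    frozen = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match and first_layer <= int(match.group(1)) <= last_layer:
            frozen.append(name)
    return frozen


# With a PyTorch model, the selection could then be applied roughly like this:
#   frozen = set(names_to_freeze([n for n, _ in model.named_parameters()], 8, 16))
#   for name, param in model.named_parameters():
#       param.requires_grad = name not in frozen
```

Keeping the selection logic separate from the framework call makes it easy to inspect exactly which parameters would be frozen before training starts. Which layers actually carry factual recall has to be determined empirically for each model.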
Practical Implications
- Safer AI assistants: Developers can plug the self‑distillation loss into existing fine‑tuning pipelines (e.g., Hugging Face Trainer) to get a “hallucination‑aware” model without redesigning the whole architecture.
- Cost‑effective deployment: Freezing lower layers reduces GPU memory and compute during fine‑tuning, which is valuable for on‑premise or edge deployments where resources are limited.
- Regulatory compliance: Lower hallucination rates help meet emerging AI transparency and reliability standards (e.g., EU AI Act), especially for domains like healthcare, finance, or legal advice.
- Continuous updates: Organizations that regularly update LLMs with new data can adopt the continual‑learning style regularizer to prevent knowledge drift, making model roll‑outs smoother.
Limitations & Future Work
- Scope of factual domains: The study focuses on English‑language, general‑knowledge benchmarks. Performance on highly specialized domains (e.g., biomedical literature) remains untested.
- Trade‑off granularity: Freezing entire layers is coarse; finer‑grained strategies (e.g., selective neuron freezing or low‑rank adapters) could preserve more flexibility while still curbing hallucinations.
- Long‑term stability: The paper evaluates hallucination rates shortly after fine‑tuning. It is unclear how the regularizer behaves after multiple successive fine‑tuning cycles.
- User‑controlled plasticity: Future work could expose a tunable “factual plasticity” hyperparameter, letting product teams decide how much new knowledge the model should absorb versus retain.
Overall, the paper offers a pragmatic toolkit for developers who want to keep their fine‑tuned LLMs truthful without sacrificing the benefits of task‑specific adaptation.
Authors
- Guy Kaplan
- Zorik Gekhman
- Zhen Zhu
- Lotem Rozner
- Yuval Reif
- Swabha Swayamdipta
- Derek Hoiem
- Roy Schwartz
Paper Information
- arXiv ID: 2604.15574v1
- Categories: cs.CL, cs.AI, cs.LG, cs.NE
- Published: April 16, 2026