[Paper] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Source: arXiv - 2601.05201v1
Overview
Large vision‑language models (VLMs) can answer open‑ended questions about images, but they sometimes hallucinate: in particular, they repeat or “copy” the wording of a textual prompt even when the visual evidence contradicts it. This paper investigates why that happens, using a simple object‑counting task to expose the phenomenon and pinpoint the internal components responsible.
Key Contributions
- Controlled experimental setup: Introduces a clean object‑counting benchmark where prompts deliberately overstate the number of objects, making hallucinations easy to detect.
- Mechanistic discovery: Identifies a small set of attention heads (the “PIH‑heads”) whose ablation cuts prompt‑induced hallucinations (PIH) by ≥ 40 % across three state‑of‑the‑art VLMs, without any extra training.
- Model‑specific analysis: Shows that the same heads behave differently in each architecture, revealing distinct ways that prompt copying is implemented.
- Empirical validation: Demonstrates that removing PIH‑heads nudges the model toward relying on visual evidence, improving count accuracy especially for higher object numbers.
- Open‑source tooling: Provides code for the counting benchmark and head‑ablation experiments, enabling reproducibility and further exploration.
Methodology
- Task design – Images contain a known number of identical objects (e.g., waterlilies). The prompt asks the model to “describe N objects,” where N is greater than the true count.
- Models evaluated – Three popular VLMs (a CLIP‑based encoder‑decoder, a BLIP‑style model, and a Flamingo‑inspired architecture).
- Prompt‑induced hallucination metric – The model’s output is parsed for the numeric count it mentions; a hallucination occurs when this count matches the inflated prompt rather than the visual ground truth (a minimal code sketch of this check follows the list).
- Attention‑head probing – Using gradient‑based attribution and causal mediation analysis, the authors locate heads whose activations correlate strongly with the hallucinated count.
- Ablation experiments – Those heads are zeroed out at inference time, and the impact on hallucination rate and overall answer quality is measured (see the head‑masking sketch after the next paragraph).
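Below is a minimal sketch of the hallucination check from the metric step above, assuming answers are parsed with a simple regex plus a number‑word lookup; the function names and labels are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Small number-word lookup for answers that spell out the count.
_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def extract_count(answer: str):
    """Return the first numeric count mentioned in the model's answer, if any."""
    match = re.search(r"\b(\d+)\b", answer)
    if match:
        return int(match.group(1))
    for word, value in _WORDS.items():
        if re.search(rf"\b{word}\b", answer.lower()):
            return value
    return None


def classify(answer: str, prompt_count: int, true_count: int) -> str:
    """Label a response as correct, prompt-induced hallucination (PIH), or other."""
    predicted = extract_count(answer)
    if predicted == true_count:
        return "correct"
    if predicted == prompt_count:
        return "prompt_induced_hallucination"
    return "other"


# Example: the prompt claims 7 waterlilies, but the image shows only 4.
print(classify("I can see seven waterlilies in the pond.", prompt_count=7, true_count=4))
```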
The approach is deliberately lightweight: no fine‑tuning, just a targeted “surgical” removal of a handful of attention heads.
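To make the “surgical” removal concrete, here is a self-contained sketch of zeroing selected heads inside a standard multi-head attention block. In a real VLM the same effect is usually obtained with forward hooks on the relevant attention modules; the module, dimensions, and head indices below are placeholders rather than the PIH-heads reported in the paper.

```python
import torch
import torch.nn as nn


class MaskableMultiHeadAttention(nn.Module):
    """Standard multi-head self-attention with a per-head keep/zero mask."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # 1.0 = keep head, 0.0 = ablate head; set at inference time, never trained.
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, d) -> (b, heads, t, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads_out = attn @ v  # (b, heads, t, head_dim)
        # Ablation: zero each masked head's contribution before the output projection.
        heads_out = heads_out * self.head_mask.view(1, -1, 1, 1)
        return self.out_proj(heads_out.transpose(1, 2).reshape(b, t, d))


# Example: ablate hypothetical "PIH heads" 2 and 5 in one attention block.
block = MaskableMultiHeadAttention(embed_dim=64, num_heads=8)
block.head_mask[[2, 5]] = 0.0
with torch.no_grad():
    out = block(torch.randn(1, 10, 64))  # forward pass with the two heads removed
```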
Results & Findings
| Model | Baseline PIH rate (high object counts) | PIH rate after head ablation | Accuracy gain |
|---|---|---|---|
| CLIP‑Encoder‑Decoder | 68 % | 38 % | +12 % correct counts |
| BLIP‑style | 71 % | 34 % | +15 % correct counts |
| Flamingo‑like | 65 % | 31 % | +13 % correct counts |
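As a quick sanity check of how these figures relate to the “≥ 40 %” reduction quoted in the contributions (reader's arithmetic, not a number taken from the paper), the drops correspond to relative reductions in PIH rate:

```python
# Relative PIH reduction implied by the table above (baseline -> after ablation).
rates = {
    "CLIP-Encoder-Decoder": (0.68, 0.38),
    "BLIP-style": (0.71, 0.34),
    "Flamingo-like": (0.65, 0.31),
}
for model, (before, after) in rates.items():
    print(f"{model}: {(before - after) / before:.0%} relative reduction")
# Prints roughly 44%, 52%, and 52% -- all at or above the 40% mark.
```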
- Head count: Only 3–5 heads per model needed to be removed to achieve the reported drop.
- Prompt copying mechanism:
- CLIP‑based models: heads act as a shortcut that directly injects the numeric token from the prompt into the decoder’s language stream.
- BLIP: heads amplify the prompt embedding before cross‑attention.
- Flamingo: heads bias the visual‑to‑text fusion layer.
- No side effects: General language fluency and image‑captioning quality remain largely unchanged, confirming that the heads are specialized for the hallucination pathway.
Practical Implications
- Debugging VLMs: Developers can instrument their models to monitor the activity of the identified PIH‑heads, using it as an early warning signal for hallucination‑prone queries (a minimal monitoring sketch follows this list).
- Lightweight mitigation: Instead of costly fine‑tuning or reinforcement learning from human feedback, a simple inference‑time head mask can be deployed in production pipelines to improve reliability on tasks where numeric fidelity matters (e.g., inventory counting, medical imaging reports).
- Design guidelines: Model architects might deliberately decouple prompt encoding from visual grounding, or add regularization that discourages direct prompt copying in early attention layers.
- Safety & compliance: Reducing hallucinations helps meet regulatory standards for AI systems that must provide fact‑based outputs (e.g., autonomous inspection, compliance reporting).
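The monitoring idea from the first bullet above could look roughly like the following PyTorch hook. The module path, head index, token position, and threshold are all hypothetical placeholders for whatever PIH-head a developer has identified in their own model, and the hook assumes the attention module returns its attention weights as the second element of its output.

```python
def make_pih_monitor(head_idx: int, numeric_token_pos: int, threshold: float = 0.5):
    """Build a forward hook that flags strong attention from a suspected PIH head
    to the prompt's numeric token. All indices here are deployment-specific."""
    records = []

    def hook(module, inputs, output):
        # Assumes output == (hidden_states, attn_weights), with attn_weights of
        # shape (batch, heads, query_len, key_len); adapt to your model's convention.
        attn_weights = output[1]
        score = attn_weights[:, head_idx, :, numeric_token_pos].max().item()
        records.append(score)
        if score > threshold:
            print(f"warning: suspected PIH head {head_idx} attends strongly to the "
                  f"prompt count (score={score:.2f}); the answer may copy the prompt")

    return hook, records


# Usage sketch (hypothetical module path inside a VLM's language decoder):
# attn = model.language_model.layers[LAYER].self_attn
# hook_fn, scores = make_pih_monitor(head_idx=3, numeric_token_pos=COUNT_TOKEN_POS)
# handle = attn.register_forward_hook(hook_fn)
# ...run inference, inspect `scores`, then handle.remove()
```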
Limitations & Future Work
- Scope of tasks: The study focuses on a synthetic counting scenario; hallucination dynamics could differ for more complex, open‑ended descriptions.
- Model diversity: Only three VLM families were examined; newer multimodal transformers (e.g., GPT‑4‑V, LLaVA) may exhibit other hallucination pathways.
- Ablation side effects: While language fluency stayed stable in the tested benchmarks, subtle biases could emerge in downstream tasks not covered here.
- Future directions: Extending the analysis to real‑world datasets, exploring training‑time regularizers that suppress PIH‑heads, and investigating whether similar “copy‑shortcut” heads exist for other modalities (audio, video).
Authors
- William Rudman
- Michal Golovanevsky
- Dana Arad
- Yonatan Belinkov
- Ritambhara Singh
- Carsten Eickhoff
- Kyle Mahowald
Paper Information
- arXiv ID: 2601.05201v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: January 8, 2026