[Paper] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Source: arXiv - 2601.05201v1
Overview
Large vision‑language models (VLMs) can answer open‑ended questions about images, but they sometimes hallucinate: in particular, they repeat or “copy” the wording of a textual prompt even when the visual evidence contradicts it. This paper investigates why that happens, using a simple object‑counting task to expose the phenomenon and pinpoint the internal components responsible.
Key Contributions
- Controlled experimental setup: Introduces a clean object‑counting benchmark where prompts deliberately overstate the number of objects, making hallucinations easy to detect.
- Mechanistic discovery: Identifies a small set of attention heads (the “PIH‑heads”) whose ablation cuts prompt‑induced hallucinations (PIH) by ≥ 40 % across three state‑of‑the‑art VLMs, without any extra training.
- Model‑specific analysis: Shows that the same heads behave differently in each architecture, revealing distinct ways that prompt copying is implemented.
- Empirical validation: Demonstrates that removing PIH‑heads nudges the model toward relying on visual evidence, improving count accuracy especially for higher object numbers.
- Open‑source tooling: Provides code for the counting benchmark and head‑ablation experiments, enabling reproducibility and further exploration.
Methodology
- Task design – Images contain a known number of identical objects (e.g., waterlilies). The prompt asks the model to “describe N objects,” where N is greater than the true count.
- Models evaluated – Three popular VLMs (a CLIP‑based encoder‑decoder, a BLIP‑style model, and a Flamingo‑inspired architecture).
- Prompt‑induced hallucination metric – The model’s output is parsed for the numeric count it mentions; a hallucination occurs when this count matches the inflated prompt rather than the visual ground truth (a minimal code sketch of this check follows the list).
- Attention‑head probing – Using gradient‑based attribution and causal mediation analysis, the authors locate heads whose activations correlate strongly with the hallucinated count.
- Ablation experiments – Those heads are zeroed out at inference time, and the impact on hallucination rate and overall answer quality is measured (see the head‑masking sketch after the next paragraph).
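Below is a minimal sketch of the hallucination check from the metric step above, assuming answers are parsed with a simple regex plus a number‑word lookup; the function names and labels are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Small number-word lookup for answers that spell out the count.
_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def extract_count(answer: str):
    """Return the first numeric count mentioned in the model's answer, if any."""
    match = re.search(r"\b(\d+)\b", answer)
    if match:
        return int(match.group(1))
    for word, value in _WORDS.items():
        if re.search(rf"\b{word}\b", answer.lower()):
            return value
    return None


def classify(answer: str, prompt_count: int, true_count: int) -> str:
    """Label a response as correct, prompt-induced hallucination (PIH), or other."""
    predicted = extract_count(answer)
    if predicted == true_count:
        return "correct"
    if predicted == prompt_count:
        return "prompt_induced_hallucination"
    return "other"


# Example: the prompt claims 7 waterlilies, but the image shows only 4.
print(classify("I can see seven waterlilies in the pond.", prompt_count=7, true_count=4))
```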
The approach is deliberately lightweight: no fine‑tuning, just a targeted “surgical” removal of a handful of attention heads.
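To make the “surgical” removal concrete, here is a self-contained sketch of zeroing selected heads inside a standard multi-head attention block. In a real VLM the same effect is usually obtained with forward hooks on the relevant attention modules; the module, dimensions, and head indices below are placeholders rather than the PIH-heads reported in the paper.

```python
import torch
import torch.nn as nn


class MaskableMultiHeadAttention(nn.Module):
    """Standard multi-head self-attention with a per-head keep/zero mask."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # 1.0 = keep head, 0.0 = ablate head; set at inference time, never trained.
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, d) -> (b, heads, t, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads_out = attn @ v  # (b, heads, t, head_dim)
        # Ablation: zero each masked head's contribution before the output projection.
        heads_out = heads_out * self.head_mask.view(1, -1, 1, 1)
        return self.out_proj(heads_out.transpose(1, 2).reshape(b, t, d))


# Example: ablate hypothetical "PIH heads" 2 and 5 in one attention block.
block = MaskableMultiHeadAttention(embed_dim=64, num_heads=8)
block.head_mask[[2, 5]] = 0.0
with torch.no_grad():
    out = block(torch.randn(1, 10, 64))  # forward pass with the two heads removed
```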
Results & Findings
| Model | Baseline PIH rate (high object counts) | PIH rate after head ablation | Accuracy gain |
|---|---|---|---|
| CLIP‑Encoder‑Decoder | 68 % | 38 % | +12 % correct counts |
| BLIP‑style | 71 % | 34 % | +15 % correct counts |
| Flamingo‑like | 65 % | 31 % | +13 % correct counts |
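As a quick sanity check of how these figures relate to the “≥ 40 %” reduction quoted in the contributions (reader's arithmetic, not a number taken from the paper), the drops correspond to relative reductions in PIH rate:

```python
# Relative PIH reduction implied by the table above (baseline -> after ablation).
rates = {
    "CLIP-Encoder-Decoder": (0.68, 0.38),
    "BLIP-style": (0.71, 0.34),
    "Flamingo-like": (0.65, 0.31),
}
for model, (before, after) in rates.items():
    print(f"{model}: {(before - after) / before:.0%} relative reduction")
# Prints roughly 44%, 52%, and 52% -- all at or above the 40% mark.
```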
- Head count: Only 3–5 heads per model needed to be removed to achieve the reported drop.
- Prompt copying mechanism:
- CLIP‑based models: heads act as a shortcut that directly injects the numeric token from the prompt into the decoder’s language stream.
- BLIP: heads amplify the prompt embedding before cross‑attention.
- Flamingo: heads bias the visual‑to‑text fusion layer.
- No side effects: General language fluency and image‑captioning quality remain largely unchanged, confirming that the heads are specialized for the hallucination pathway.
Practical Implications
- Debugging VLMs: Developers can instrument their models to monitor the activity of the identified PIH‑heads, using it as an early warning signal for hallucination‑prone queries (a minimal monitoring sketch follows this list).
- Lightweight mitigation: Instead of costly fine‑tuning or reinforcement learning from human feedback, a simple inference‑time head mask can be deployed in production pipelines to improve reliability on tasks where numeric fidelity matters (e.g., inventory counting, medical imaging reports).
- Design guidelines: Model architects might deliberately decouple prompt encoding from visual grounding, or add regularization that discourages direct prompt copying in early attention layers.
- Safety & compliance: Reducing hallucinations helps meet regulatory standards for AI systems that must provide fact‑based outputs (e.g., autonomous inspection, compliance reporting).
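The monitoring idea from the first bullet above could look roughly like the following PyTorch hook. The module path, head index, token position, and threshold are all hypothetical placeholders for whatever PIH-head a developer has identified in their own model, and the hook assumes the attention module returns its attention weights as the second element of its output.

```python
def make_pih_monitor(head_idx: int, numeric_token_pos: int, threshold: float = 0.5):
    """Build a forward hook that flags strong attention from a suspected PIH head
    to the prompt's numeric token. All indices here are deployment-specific."""
    records = []

    def hook(module, inputs, output):
        # Assumes output == (hidden_states, attn_weights), with attn_weights of
        # shape (batch, heads, query_len, key_len); adapt to your model's convention.
        attn_weights = output[1]
        score = attn_weights[:, head_idx, :, numeric_token_pos].max().item()
        records.append(score)
        if score > threshold:
            print(f"warning: suspected PIH head {head_idx} attends strongly to the "
                  f"prompt count (score={score:.2f}); the answer may copy the prompt")

    return hook, records


# Usage sketch (hypothetical module path inside a VLM's language decoder):
# attn = model.language_model.layers[LAYER].self_attn
# hook_fn, scores = make_pih_monitor(head_idx=3, numeric_token_pos=COUNT_TOKEN_POS)
# handle = attn.register_forward_hook(hook_fn)
# ...run inference, inspect `scores`, then handle.remove()
```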
Limitations & Future Work
- Scope of tasks: The study focuses on a synthetic counting scenario; hallucination dynamics could differ for more complex, open‑ended descriptions.
- Model diversity: Only three VLM families were examined; newer multimodal transformers (e.g., GPT‑4‑V, LLaVA) may exhibit other hallucination pathways.
- Ablation side effects: While language fluency stayed stable in the tested benchmarks, subtle biases could emerge in downstream tasks not covered here.
- Future directions: Extending the analysis to real‑world datasets, exploring training‑time regularizers that suppress PIH‑heads, and investigating whether similar “copy‑shortcut” heads exist for other modalities (audio, video).
Authors
- William Rudman
- Michal Golovanevsky
- Dana Arad
- Yonatan Belinkov
- Ritambhara Singh
- Carsten Eickhoff
- Kyle Mahowald
Paper Information
- arXiv ID: 2601.05201v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: January 8, 2026