[Paper] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Published: January 8, 2026 at 01:23 PM EST
3 min read

Source: arXiv - 2601.05201v1

Overview

Large vision‑language models (VLMs) can answer open‑ended questions about images, but they sometimes hallucinate—they repeat or “copy” the wording of a textual prompt even when the visual evidence contradicts it. This paper investigates why that happens, using a simple object‑counting task to expose the phenomenon and pinpoint the internal components responsible.

Key Contributions

  • Controlled experimental setup: Introduces a clean object‑counting benchmark where prompts deliberately overstate the number of objects, making hallucinations easy to detect.
  • Mechanistic discovery: Identifies a small set of attention heads (the “PIH‑heads”) whose ablation cuts prompt‑induced hallucinations (PIH) by ≥ 40 % across three state‑of‑the‑art VLMs, without any extra training.
  • Model‑specific analysis: Shows that the same heads behave differently in each architecture, revealing distinct ways that prompt copying is implemented.
  • Empirical validation: Demonstrates that removing PIH‑heads nudges the model toward relying on visual evidence, improving count accuracy especially for higher object numbers.
  • Open‑source tooling: Provides code for the counting benchmark and head‑ablation experiments, enabling reproducibility and further exploration.

Methodology

  1. Task design – Images contain a known number of identical objects (e.g., waterlilies). The prompt asks the model to “describe N objects,” where N is greater than the true count.
  2. Models evaluated – Three popular VLMs (a CLIP‑based encoder‑decoder, a BLIP‑style model, and a Flamingo‑inspired architecture).
  3. Prompt‑induced hallucination metric – The model’s output is parsed for the numeric count it mentions; a hallucination occurs when this count matches the inflated prompt rather than the visual ground truth (a sketch of this check follows the list).
  4. Attention‑head probing – Using gradient‑based attribution and causal mediation analysis, the authors locate heads whose activations correlate strongly with the hallucinated count.
  5. Ablation experiments – Those heads are zeroed out at inference time, and the impact on the hallucination rate and overall answer quality is measured.
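
To make step 3 concrete, below is a minimal Python sketch of how such a check could be implemented, assuming the model’s answer states the count as a digit or a common number word; the function names and the number‑word table are illustrative and not taken from the paper’s released code.

```python
import re

# Small number-word table for parsing spelled-out counts (illustrative; extend as needed).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def extract_count(answer: str) -> int | None:
    """Return the first count mentioned in the model's answer, or None if no count is found."""
    match = re.search(r"\b(\d+)\b", answer)
    if match:
        return int(match.group(1))
    for word, value in NUMBER_WORDS.items():
        if re.search(rf"\b{word}\b", answer.lower()):
            return value
    return None

def is_prompt_induced_hallucination(answer: str, prompt_count: int, true_count: int) -> bool:
    """A PIH occurs when the answer echoes the inflated prompt count instead of the true count."""
    predicted = extract_count(answer)
    return predicted is not None and predicted == prompt_count and predicted != true_count

# Example: the prompt claims five waterlilies, the image actually contains three.
print(is_prompt_induced_hallucination(
    "The five waterlilies float near the shore.", prompt_count=5, true_count=3))  # True
```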

The approach is deliberately lightweight: no fine‑tuning, just a targeted “surgical” removal of a handful of attention heads.
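
A rough sketch of what this inference‑time ablation could look like in PyTorch follows. It assumes a transformer whose attention layers concatenate per‑head outputs along the hidden dimension before an output projection reachable as `model.layers[i].attn.out_proj`; that layout, and the layer and head indices, are placeholders rather than details from the paper.

```python
import torch

# Hypothetical layer -> head indices identified by the probing step (illustrative values only).
PIH_HEADS = {10: [3, 7], 14: [1]}

def make_head_mask_hook(head_indices, head_dim):
    """Zero the hidden-state slices belonging to the given heads before the output projection."""
    def hook(module, inputs):
        hidden, *rest = inputs
        hidden = hidden.clone()
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden, *rest)
    return hook

def ablate_pih_heads(model: torch.nn.Module, head_dim: int):
    """Attach pre-hooks that mask PIH-heads; assumes a `model.layers[i].attn.out_proj` layout."""
    handles = []
    for layer_idx, heads in PIH_HEADS.items():
        out_proj = model.layers[layer_idx].attn.out_proj  # module path is an assumption
        handles.append(out_proj.register_forward_pre_hook(make_head_mask_hook(heads, head_dim)))
    return handles  # call handle.remove() on each to restore the original model
```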

Results & Findings

| Model | Baseline PIH rate (high count) | PIH rate after head ablation | Accuracy gain |
| --- | --- | --- | --- |
| CLIP‑based encoder‑decoder | 68 % | 38 % | +12 % correct counts |
| BLIP‑style | 71 % | 34 % | +15 % correct counts |
| Flamingo‑like | 65 % | 31 % | +13 % correct counts |
  • Head count: Only 3–5 heads per model needed to be removed to achieve the reported drop.
  • Prompt copying mechanism:
    • CLIP‑based models: heads act as a shortcut that directly injects the numeric token from the prompt into the decoder’s language stream.
    • BLIP: heads amplify the prompt embedding before cross‑attention.
    • Flamingo: heads bias the visual‑to‑text fusion layer.
  • No side effects: General language fluency and image‑captioning quality remain largely unchanged, confirming that the heads are specialized for the hallucination pathway.

Practical Implications

  • Debugging VLMs: Developers can instrument their models to monitor the activity of the identified PIH‑heads and use it as an early warning signal for hallucination‑prone queries (a monitoring sketch follows this list).
  • Lightweight mitigation: Instead of costly fine‑tuning or reinforcement learning from human feedback, a simple inference‑time head mask can be deployed in production pipelines to improve reliability on tasks where numeric fidelity matters (e.g., inventory counting, medical imaging reports).
  • Design guidelines: Model architects might deliberately decouple prompt encoding from visual grounding, or add regularization that discourages direct prompt copying in early attention layers.
  • Safety & compliance: Reducing hallucinations helps meet regulatory standards for AI systems that must provide fact‑based outputs (e.g., autonomous inspection, compliance reporting).
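
As a rough illustration of the first point, the sketch below flags a query when the identified heads place a large share of their attention on the prompt’s numeric token during generation. The attention‑tensor layout, the layer/head dictionary, and the threshold are all assumptions for illustration, not values from the paper.

```python
import torch

def pih_warning(attentions: dict[int, torch.Tensor],
                pih_heads: dict[int, list[int]],
                prompt_numeral_positions: list[int],
                threshold: float = 0.5) -> bool:
    """
    Flag a query as hallucination-prone when the identified PIH-heads concentrate
    attention on the prompt's numeric token(s).

    attentions: layer index -> attention weights of shape (num_heads, tgt_len, src_len)
    pih_heads: layer index -> head indices found by the probing step
    prompt_numeral_positions: source-token positions of the number stated in the prompt
    threshold: calibrated on a validation set; 0.5 here is purely illustrative
    """
    for layer, heads in pih_heads.items():
        attn = attentions[layer]
        for h in heads:
            # Average attention mass from all target positions onto the prompt numeral.
            mass = attn[h][:, prompt_numeral_positions].sum(dim=-1).mean().item()
            if mass > threshold:
                return True
    return False
```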

Limitations & Future Work

  • Scope of tasks: The study focuses on a synthetic counting scenario; hallucination dynamics could differ for more complex, open‑ended descriptions.
  • Model diversity: Only three VLM families were examined; newer multimodal transformers (e.g., GPT‑4‑V, LLaVA) may exhibit other hallucination pathways.
  • Ablation side‑effects: While language fluency stayed stable in the tested benchmarks, subtle biases could emerge in downstream tasks not covered here.
  • Future directions: Extending the analysis to real‑world datasets, exploring training‑time regularizers that suppress PIH‑heads, and investigating whether similar “copy‑shortcut” heads exist for other modalities (audio, video).

Authors

  • William Rudman
  • Michal Golovanevsky
  • Dana Arad
  • Yonatan Belinkov
  • Ritambhara Singh
  • Carsten Eickhoff
  • Kyle Mahowald

Paper Information

  • arXiv ID: 2601.05201v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: January 8, 2026