[Paper] When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Published: April 23, 2026 at 01:54 PM EDT
5 min read
Source: arXiv - 2604.21911v1

Overview

Large Vision‑Language Models (LVLMs) have become impressively capable at answering visual questions, describing images, and even reasoning across modalities. Yet they still suffer from hallucinations—answers that sound plausible but aren’t actually grounded in the picture. This paper introduces Halluscope, a diagnostic benchmark that isolates why LVLMs hallucinate, and proposes HallU‑VL‑DPO, a fine‑tuning recipe that teaches models to trust the visual input over overly strong textual priors.

Key Contributions

  • Halluscope benchmark – a systematic suite of prompts that disentangles hallucinations caused by (a) visual backbone limits, (b) language dominance, and (c) textual instruction priors.
  • Empirical diagnosis – shows that the biggest culprit is the model’s reliance on textual priors injected via prompts and instructions, rather than deficiencies in the vision encoder.
  • HallU‑VL‑DPO framework – leverages Direct Preference Optimization (DPO) on a curated “grounded vs. hallucinated” dataset to re‑weight the model’s decision‑making toward visual fidelity.
  • Comprehensive evaluation – demonstrates that the DPO‑fine‑tuned LVLM reduces prompt‑induced hallucinations while maintaining or improving scores on existing hallucination and visual‑reasoning benchmarks.
  • Open resources – releases the Halluscope benchmark, the preference training set, and code to enable reproducibility and community extensions.

Methodology

1. Benchmark Design (Halluscope)

  • Constructed three families of test cases:
    1. Vision‑only questions (minimal textual bias).
    2. Language‑heavy prompts that embed strong world knowledge (e.g., “Describe the Eiffel Tower in the picture”).
    3. Instruction‑driven prompts where the system is told to “explain the scene as if you were a historian”.
  • Each case is paired with a ground‑truth visual answer and a plausible hallucinated distractor.
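The three prompt families and their paired answers can be represented as a simple record. This is an illustrative sketch, not the paper's actual data format; the field names (`family`, `grounded_answer`, `distractor`) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class HalluscopeCase:
    family: str            # "vision_only" | "language_heavy" | "instruction_driven"
    image_path: str
    prompt: str
    grounded_answer: str   # ground-truth answer supported by the image
    distractor: str        # plausible but hallucinated alternative

# Example case from the instruction-driven family (hypothetical content)
case = HalluscopeCase(
    family="instruction_driven",
    image_path="scene_001.jpg",
    prompt="Explain the scene as if you were a historian.",
    grounded_answer="A modern glass office building at dusk.",
    distractor="The medieval cathedral shown dates to the 12th century.",
)
```

Pairing every prompt with both a grounded answer and a distractor is what lets the benchmark score hallucination as a forced choice rather than by fuzzy string matching.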

2. Diagnosing the Failure Mode

  • Ran several off‑the‑shelf LVLMs (e.g., LLaVA, MiniGPT‑4) on Halluscope.
  • Measured hallucination rates per prompt family and performed ablation studies (e.g., removing the instruction, swapping vision backbones).
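Measuring hallucination rates per prompt family amounts to a grouped average over pass/fail judgments. A minimal sketch (the record format is assumed, not taken from the paper):

```python
from collections import defaultdict

def hallucination_rates(records):
    """records: iterable of (family, hallucinated: bool) pairs.
    Returns the fraction of hallucinated responses per prompt family."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for family, hallucinated in records:
        totals[family] += 1
        fails[family] += int(hallucinated)
    return {f: fails[f] / totals[f] for f in totals}

rates = hallucination_rates([
    ("vision_only", False), ("vision_only", True),
    ("instruction_driven", True), ("instruction_driven", True),
])
# rates["vision_only"] == 0.5; rates["instruction_driven"] == 1.0
```

Comparing these per-family rates before and after an ablation (e.g., with the instruction removed) is what isolates the contribution of textual priors.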

3. Preference‑Based Fine‑Tuning (HallU‑VL‑DPO)

  • Collected a preference dataset: for each image‑prompt pair, annotators ranked a grounded response higher than a hallucinated one.
  • Applied Direct Preference Optimization (DPO), a reinforcement‑learning‑free method that trains the model to assign higher probability to the preferred (grounded) response than to the rejected (hallucinated) one, measured relative to a frozen reference model.
  • Fine‑tuned only the language head, keeping the vision encoder frozen, which makes the approach lightweight and compatible with existing LVLM checkpoints.
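The standard DPO objective compares log-probability ratios under the policy and a frozen reference model. A minimal pure-Python sketch for a single preference pair (the paper's exact hyperparameters are not given; `beta=0.1` is a common default, not a value from the paper):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.
    pi_*  : log-probs under the policy being fine-tuned
    ref_* : log-probs under the frozen reference model"""
    pi_logratio = pi_chosen - pi_rejected
    ref_logratio = ref_chosen - ref_rejected
    margin = beta * (pi_logratio - ref_logratio)
    # -log sigmoid(margin): small when the policy prefers the grounded answer
    # more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Since the loss depends only on log-probabilities from the language head, freezing the vision encoder (as the paper does) requires no change to the objective.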

4. Evaluation

  • Tested the fine‑tuned model on Halluscope and on three public hallucination benchmarks (e.g., VQA‑Hallucination, MME‑Hallucination).
  • Also measured standard visual‑language metrics (VQA accuracy, image captioning BLEU/ROUGE) to ensure no regression in overall capability.
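The "no regression" check can be expressed as a simple comparison of metric dictionaries before and after fine-tuning; the numbers below are the VQA and CIDEr scores reported in the results table, while the helper itself and its tolerance are illustrative:

```python
def no_regression(before, after, tolerance=0.5):
    """True if no metric drops by more than `tolerance` points after fine-tuning."""
    return all(after[m] >= before[m] - tolerance for m in before)

before = {"vqa_accuracy": 78.3, "cider": 112.5}
after = {"vqa_accuracy": 79.1, "cider": 113.8}
# no_regression(before, after) -> True: both metrics improved
```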

Results & Findings

| Metric | Off‑the‑shelf LVLM | HallU‑VL‑DPO (fine‑tuned) |
|---|---|---|
| Halluscope hallucination rate (overall) | 38% | 12% |
| Instruction‑driven hallucinations | 52% | 14% |
| Vision‑only hallucinations | 22% | 10% |
| VQA accuracy (standard test set) | 78.3% | 79.1% |
| Image captioning CIDEr | 112.5 | 113.8 |

  • Primary insight: textual instructions dramatically amplify hallucination; when those cues are removed, the same LVLM hallucinates far less.
  • HallU‑VL‑DPO cuts the targeted failure mode by ~75% while slightly improving core visual‑language performance, indicating that the model learns to prioritize visual evidence without sacrificing language fluency.
  • Ablations confirm that the vision backbone is not the bottleneck—freezing it during DPO still yields large gains, reinforcing the “language dominance” hypothesis.

Practical Implications

  • Safer AI assistants: Developers building multimodal chatbots (e.g., for e‑commerce, medical imaging) can integrate HallU‑VL‑DPO to reduce the risk of confidently wrong visual statements.
  • Prompt engineering guidelines: The study suggests avoiding overly prescriptive instructions; instead, steer models with neutral queries (“What do you see?”) to keep hallucinations low.
  • Plug‑and‑play fine‑tuning: Because only the language head is updated, existing LVLM deployments can be upgraded with a few hours of DPO training on modest GPU resources.
  • Benchmark‑driven QA pipelines: Halluscope can serve as a regression test for any new LVLM release, ensuring that improvements in raw accuracy don’t come at the cost of visual grounding.
  • Regulatory compliance: For domains where factual correctness is mandated (e.g., autonomous inspection, legal document analysis), the approach offers a concrete mitigation strategy that can be audited.
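Using Halluscope as a release gate, as suggested above, could look like the following sketch; the per-family budgets are hypothetical values chosen for illustration, not thresholds from the paper:

```python
def halluscope_gate(model_rates, budgets, default_budget=0.15):
    """Return the prompt families whose hallucination rate exceeds its budget."""
    return [
        family
        for family, rate in model_rates.items()
        if rate > budgets.get(family, default_budget)
    ]

violations = halluscope_gate(
    {"instruction_driven": 0.14, "vision_only": 0.10},
    {"instruction_driven": 0.20, "vision_only": 0.12},
)
# An empty list means the release passes the grounding regression check.
```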

Limitations & Future Work

  • Scope of hallucinations: Halluscope focuses on prompt‑induced hallucinations; other failure modes (e.g., occlusion, low‑resolution inputs) remain less explored.
  • Dataset bias: The preference set is curated from a limited set of image domains (mostly everyday scenes); performance on specialized domains (medical, satellite) may differ.
  • Model size dependence: Experiments were run on 7B‑13B LVLMs; it is unclear how the method scales to larger (30B+) models where language priors might be even stronger.
  • User‑controlled trade‑offs: The current DPO loss treats grounding as always preferable; future work could let developers balance creativity vs. fidelity per application.

The authors plan to expand Halluscope with more diverse visual domains, explore multi‑modal DPO (including audio), and investigate adaptive prompting techniques that automatically detect and suppress high‑risk instruction patterns.

Authors

  • Pegah Khayatan
  • Jayneel Parekh
  • Arnaud Dapogny
  • Mustafa Shukor
  • Alasdair Newson
  • Matthieu Cord

Paper Information

  • arXiv ID: 2604.21911v1
  • Categories: cs.CV, cs.AI, cs.CL, cs.LG
  • Published: April 23, 2026