[Paper] Context-Aware Decoding for Faithful Vision-Language Generation
Source: arXiv - 2601.05939v1
Overview
Large vision‑language models (LVLMs) have made impressive strides in tasks like image captioning and visual reasoning, but they still suffer from hallucinations, producing text that does not match the visual input. This paper traces where those errors arise inside the model's decoder and proposes a training‑free fix that substantially reduces hallucinations on three benchmark datasets.
Key Contributions
- Mechanistic insight: Using the Logit Lens, the authors reveal a “commitment‑depth gap” where truthful tokens gain confidence earlier in the decoder than hallucinated ones.
- Context Embedding Injection (CEI): A lightweight, plug‑and‑play technique that injects the hidden state of the last visual token (the context embedding) into every decoder layer to keep the generation grounded.
- Training‑free mitigation: CEI works without any additional fine‑tuning, making it easy to drop into existing LVLM pipelines.
- Strong empirical results: Across three LVLMs and three hallucination benchmarks (CHAIR, AMBER, MMHal‑Bench), CEI (and its dynamic variant) achieves the lowest hallucination rates, even for long outputs (up to 512 tokens).
Methodology
- Probing with the Logit Lens – The authors inspect the probability distribution over the next token at each decoder layer. This reveals that “truthful” words start to dominate the distribution much earlier than hallucinated words (see the probing sketch after this list).
- Designing CEI – The hidden state of the final visual token (the context embedding) is repeatedly added to the decoder’s hidden states at every layer. Think of it as a constant reminder of “what the image actually shows” (see the injection sketch after this list, which also covers the dynamic variant).
- Dynamic CEI variant – Instead of a fixed injection strength, the dynamic version scales the injection based on how uncertain the model is, further tightening the grounding when the model is likely to drift.
- Evaluation – The method is tested on three widely used hallucination benchmarks, measuring both the frequency of hallucinated tokens and overall caption quality. No extra training data or epochs are required.
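To make the Logit Lens analysis concrete, here is a minimal probing sketch. It assumes a HuggingFace-style Llama-like decoder (e.g., the language model inside an LVLM) that exposes `model.model.norm` and `model.lm_head`; those attribute names, and the choice of probing only the last position, are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal Logit Lens probe: project each decoder layer's hidden state through the
# final norm + LM head and track when a target token starts to dominate.
# Assumes a HuggingFace Llama-style causal LM; attribute names are illustrative.
import torch

@torch.no_grad()
def logit_lens_probe(model, input_ids, target_token_id):
    """Return the target token's probability at the last position, layer by layer."""
    out = model(input_ids=input_ids, output_hidden_states=True)
    probs_per_layer = []
    for hidden in out.hidden_states[1:]:          # skip the embedding layer
        h = model.model.norm(hidden[:, -1, :])    # apply the final norm to the intermediate state
        logits = model.lm_head(h)                 # "unembed" it into vocabulary space
        probs = torch.softmax(logits.float(), dim=-1)
        probs_per_layer.append(probs[0, target_token_id].item())
    return probs_per_layer  # the "commitment depth" is the first layer where this becomes large
```

Comparing these per-layer curves for ground-truth versus hallucinated object tokens is what exposes the commitment-depth gap described above.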
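The injection itself can be sketched with PyTorch forward hooks. The hook placement, the injection strength `alpha`, and the entropy-based scaling for the dynamic variant are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch of context-embedding injection: add the last visual token's hidden state
# to every decoder layer's output during generation. `alpha` and the
# entropy-based scaling are illustrative assumptions, not the paper's exact rule.
import torch

def install_cei_hooks(decoder_layers, context_embedding, alpha=0.1):
    """Register hooks that add `alpha * context_embedding` to each layer's hidden states."""
    handles = []
    for layer in decoder_layers:
        def hook(module, inputs, output, alpha=alpha):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * context_embedding.to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        handles.append(layer.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to restore the original model

def dynamic_alpha(next_token_logits, base_alpha=0.1, max_alpha=0.3):
    """Scale injection strength with predictive entropy (more uncertainty -> stronger grounding)."""
    probs = torch.softmax(next_token_logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    norm_entropy = entropy / torch.log(torch.tensor(float(probs.shape[-1])))
    return base_alpha + (max_alpha - base_alpha) * norm_entropy.item()
```

Because the hooks only add a fixed vector at inference time, the base model's weights are untouched, which is what makes the method training-free.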
Results & Findings
| Model (Benchmark) | Baseline hallucination rate | CEI (static) | CEI (dynamic) |
|---|---|---|---|
| LVLM‑A (CHAIR) | 23.7% | 15.2% | 13.1% |
| LVLM‑B (AMBER) | 19.4% | 11.8% | 10.5% |
| LVLM‑C (MMHal‑Bench) | 27.1% | 18.3% | 16.0% |
- Earlier commitment: Truthful tokens reach high probability in early decoder layers, while hallucinations only surface near the final layers.
- CEI effectiveness: Injecting the context embedding consistently pushes the model to keep the “correct” visual grounding throughout decoding, cutting hallucination rates by roughly 30–45% relative to strong baselines (the short calculation after this list reproduces these figures from the table).
- Minimal impact on fluency: BLEU/ROUGE scores remain on par with baselines, indicating that grounding does not sacrifice natural language quality.
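For a quick sanity check, the relative reductions implied by the table above can be recomputed directly; this tiny script uses only the numbers reported there.

```python
# Relative hallucination-rate reduction implied by the results table above.
baselines   = {"LVLM-A (CHAIR)": 23.7, "LVLM-B (AMBER)": 19.4, "LVLM-C (MMHal-Bench)": 27.1}
cei_static  = {"LVLM-A (CHAIR)": 15.2, "LVLM-B (AMBER)": 11.8, "LVLM-C (MMHal-Bench)": 18.3}
cei_dynamic = {"LVLM-A (CHAIR)": 13.1, "LVLM-B (AMBER)": 10.5, "LVLM-C (MMHal-Bench)": 16.0}

for name, base in baselines.items():
    static_drop  = 100 * (base - cei_static[name])  / base
    dynamic_drop = 100 * (base - cei_dynamic[name]) / base
    print(f"{name}: static -{static_drop:.1f}%, dynamic -{dynamic_drop:.1f}%")
# Static reductions come out around 32-39%, dynamic around 41-46%.
```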
Practical Implications
- Plug‑and‑play for production: Since CEI requires no extra training, developers can integrate it into existing LVLM services (e.g., captioning APIs, visual assistants) with a single code change (see the integration sketch after this list).
- Improved reliability for downstream apps: Reducing hallucinations is crucial for safety‑critical domains such as medical imaging reports, autonomous vehicle perception, and accessibility tools for the visually impaired.
- Scalable to long outputs: The method works even when generating up to 512 tokens, making it suitable for detailed scene descriptions or multi‑step visual reasoning.
- Potential for other modalities: The same “context‑embedding injection” idea could be adapted to audio‑language or video‑language models where grounding is equally important.
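As a rough picture of what that "single code change" could look like, the sketch below wraps a HuggingFace-style LVLM's `generate` call with the hooks from the earlier sketch. The processor usage, the way the last visual token's position and hidden state are obtained, and the `model.language_model.model.layers` path are assumptions; adapt them to the actual API of your model.

```python
# Sketch: wrapping generation with CEI hooks in an existing pipeline.
# `model`, `processor`, and the way the last visual token's hidden state is
# located are illustrative assumptions; adapt to your LVLM's actual API.
import torch

@torch.no_grad()
def generate_with_cei(model, processor, image, prompt, alpha=0.1, max_new_tokens=512):
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    # 1) One forward pass to grab the hidden state of the last visual token
    #    (assumed here to sit just before the text prompt's tokens).
    out = model(**inputs, output_hidden_states=True)
    prompt_len = len(processor.tokenizer(prompt)["input_ids"])
    last_visual_idx = inputs["input_ids"].shape[1] - prompt_len - 1
    context_embedding = out.hidden_states[-1][0, last_visual_idx]

    # 2) Install injection hooks, generate, then always clean up.
    layers = model.language_model.model.layers   # assumption: Llama-style decoder inside the LVLM
    handles = install_cei_hooks(layers, context_embedding, alpha=alpha)
    try:
        ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    finally:
        for h in handles:
            h.remove()
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```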
Limitations & Future Work
- Scope of benchmarks: The evaluation focuses on three hallucination benchmarks; broader real‑world testing (e.g., user‑generated content) is needed.
- Static injection strength: While the dynamic variant helps, the optimal scaling strategy may vary across tasks and model sizes, suggesting room for adaptive mechanisms.
- Interpretability trade‑off: Adding the context embedding changes the internal dynamics of the decoder, which could complicate further mechanistic analyses.
- Future directions: The authors propose exploring learned injection weights, extending CEI to multimodal transformers with cross‑attention, and investigating how the approach interacts with reinforcement‑learning‑based alignment methods.
Authors
- Mehrdad Fazli
- Bowen Wei
- Ziwei Zhu
Paper Information
- arXiv ID: 2601.05939v1
- Categories: cs.CV
- Published: January 9, 2026