[Paper] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Source: arXiv - 2602.22144v1
Overview
Large Vision‑Language Models (LVLMs) have become the go‑to backbone for multimodal assistants, but they often “hallucinate” objects that aren’t actually in the picture. This paper digs into why that happens and proposes a lightweight, training‑free decoding tweak—NoLan—that dramatically cuts hallucinations without sacrificing performance.
Key Contributions
- Root‑cause analysis: Systematic experiments show that the language decoder’s strong priors, not the vision encoder, are the primary driver of object hallucinations.
- NoLan framework: Introduces a dynamic, inference‑time suppression of language priors based on the discrepancy between multimodal and text‑only output distributions.
- Training‑free solution: No additional model parameters or fine‑tuning are required; the method works as a plug‑in to any existing LVLM.
- Broad validation: Demonstrates consistent hallucination reduction across multiple LVLMs (e.g., LLaVA‑1.5 7B, Qwen‑VL 7B) and tasks (POPE, VQA, captioning).
- Open‑source release: Code and integration scripts are publicly available, encouraging rapid adoption.
Methodology
- Decomposing the pipeline – To isolate the decoder's contribution, the authors run the same language decoder on the text prompt alone (without visual input) and compare its next‑token distribution to that of the full LVLM.
- Measuring prior influence – They compute the KL‑divergence between the multimodal output distribution and the text‑only baseline. A large divergence signals that the language decoder is injecting strong priors.
- Dynamic suppression – During decoding, NoLan scales down the logits (raw token scores) that are overly boosted by language priors. The scaling factor is a function of the observed divergence: the bigger the gap, the stronger the suppression.
- Implementation – The technique is a thin wrapper around the standard beam‑search or sampling decoder; it requires no extra training data, gradients, or architectural changes.
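The divergence‑driven suppression described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' released code: the function name `nolan_step`, the bounded exponential mapping from divergence to suppression strength, and the hyperparameter `alpha_max` are all placeholders.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over the token vocabulary
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def nolan_step(mm_logits, txt_logits, alpha_max=1.0):
    """One NoLan-style decoding step (sketch).

    mm_logits  : next-token logits from the full LVLM (image + text)
    txt_logits : next-token logits from the same decoder run text-only
    """
    p_mm = softmax(mm_logits)
    p_txt = softmax(txt_logits)
    div = kl_divergence(p_mm, p_txt)
    # Map divergence to a bounded suppression factor (a modeling choice,
    # assumed here): the bigger the gap, the stronger the suppression.
    alpha = alpha_max * (1.0 - np.exp(-div))
    # Subtract the scaled text-only logits: tokens boosted mainly by
    # language priors lose score; visually grounded tokens keep theirs.
    adjusted = mm_logits - alpha * txt_logits
    return adjusted, alpha
```

When the multimodal and text‑only distributions agree, the divergence is zero, the suppression factor vanishes, and decoding proceeds unchanged; suppression only kicks in where the language prior visibly distorts the output.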
Results & Findings
| Model | Task | Baseline (%) | NoLan (%) | Δ (pp) |
|---|---|---|---|---|
| LLaVA‑1.5 7B | POPE (hallucination benchmark) | 71.3 | 77.8 | +6.5 |
| Qwen‑VL 7B | POPE | 68.9 | 76.1 | +7.2 |
| Various LVLMs | VQA & image captioning | comparable | same or higher | no loss, often +1–2 |
Key takeaways
- NoLan consistently lowers the rate of fabricated objects across models and tasks.
- Because the method only modifies the decoding logits, there is virtually no overhead (≈ 1 ms per inference).
- The approach does not degrade the model’s ability to generate fluent, context‑aware language.
Practical Implications
- Deploy‑ready safety layer: Teams can integrate NoLan into existing LVLM services (e.g., chat‑bots, visual assistants) to make outputs more trustworthy without retraining.
- Regulatory compliance: Reducing hallucinations helps meet emerging AI transparency standards that require verifiable outputs.
- Cost‑effective improvement: Since NoLan is inference‑only, it avoids the compute expense of fine‑tuning large multimodal models.
- Better user experience: Fewer false object mentions mean clearer instructions for downstream pipelines (e.g., robotics, AR overlays) that rely on accurate visual grounding.
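Because NoLan is a thin wrapper around the decoder, integration can be pictured as a greedy decoding loop that takes two next‑token‑logit callables, one multimodal and one text‑only, and applies the divergence‑scaled suppression described in the Methodology section at each step. Everything here (function names, the exponential divergence‑to‑strength mapping, the hyperparameters) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def _softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode_with_suppression(mm_fn, txt_fn, prompt_ids,
                                   max_new=10, alpha_max=1.0, eos_id=None):
    """Greedy decoding with NoLan-style prior suppression (sketch).

    mm_fn(ids)  -> next-token logits from the multimodal model
    txt_fn(ids) -> next-token logits from a text-only pass
    Both callables are placeholders for real model forward passes.
    """
    ids = list(prompt_ids)
    for _ in range(max_new):
        mm = np.asarray(mm_fn(ids), dtype=float)
        tx = np.asarray(txt_fn(ids), dtype=float)
        p, q = _softmax(mm), _softmax(tx)
        div = float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
        alpha = alpha_max * (1.0 - np.exp(-div))
        # Suppress tokens that owe their score mainly to language priors.
        nxt = int(np.argmax(mm - alpha * tx))
        ids.append(nxt)
        if eos_id is not None and nxt == eos_id:
            break
    return ids
```

The key deployment property is visible in the signature: the wrapper needs only logit access, so it can sit in front of any served LVLM without touching weights or training infrastructure.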
Limitations & Future Work
- Scope of hallucinations: The study focuses on object hallucinations; other types (e.g., attribute or relational hallucinations) remain unaddressed.
- Dependency on baseline text‑only model: The effectiveness of the suppression factor hinges on the quality of the text‑only decoder used for comparison.
- Potential over‑suppression: In edge cases where the language prior is actually correct (e.g., commonsense inference), NoLan might dampen useful information.
- Future directions: Extending the dynamic suppression concept to handle attribute hallucinations, exploring adaptive thresholds per token type, and integrating visual grounding checks for a tighter vision‑language feedback loop.
Authors
- Lingfeng Ren
- Weihao Yu
- Runpeng Yu
- Xinchao Wang
Paper Information
- arXiv ID: 2602.22144v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: February 25, 2026