[Paper] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Source: arXiv - 2603.03192v1
Overview
Omni‑modal large language models (LLMs) can reason over text, images, and audio in a single system, but they often “hallucinate” – producing answers that are inconsistent with the visual or auditory input. The paper MoD‑DPO introduces a lightweight training recipe that explicitly teaches these models to stay grounded in the right modality while ignoring irrelevant signals, dramatically cutting down cross‑modal hallucinations.
Key Contributions
- Modality‑Decoupled Direct Preference Optimization (MoD‑DPO): a new fine‑tuning framework that adds modality‑aware regularizers to the standard DPO loss.
- Invariance & Sensitivity Regularization: forces the model to be invariant to corruptions in non‑relevant modalities (e.g., noisy audio when answering a visual question) and sensitive to perturbations in the relevant modality.
- Language‑Prior Debiasing Penalty: a term that penalizes text‑only responses that are likely driven by language priors rather than multimodal evidence.
- Empirical Validation: state‑of‑the‑art reductions in hallucination rates on several audiovisual benchmarks, achieved with the same compute budget as prior preference‑optimization methods.
- Scalable Design: the approach works as a drop‑in add‑on to existing omni‑LLMs without architectural changes.
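Putting the contributions together, the overall objective is the standard DPO loss plus the three modality‑aware terms. A plausible sketch (the weights λ and term names are my notation, not necessarily the paper's exact formulation):

```latex
\mathcal{L}_{\text{MoD-DPO}}
  = \mathcal{L}_{\text{DPO}}
  + \lambda_{\text{inv}}\,\mathcal{L}_{\text{invariance}}
  + \lambda_{\text{sens}}\,\mathcal{L}_{\text{sensitivity}}
  + \lambda_{\text{prior}}\,\mathcal{L}_{\text{debias}}
```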
Methodology
- Base Model – Start from a pretrained omni‑modal LLM (e.g., Flamingo‑2, LLaVA‑Video) that already supports text‑image‑audio inputs.
- Preference Data – Collect pairs of model outputs: a “good” response that correctly references the relevant modality, and a “bad” response that either ignores the modality or leans on language priors.
- Direct Preference Optimization (DPO) – Optimize the model to assign higher likelihood to the good response using a binary cross‑entropy loss over the preference pair.
- Modality‑Decoupled Regularizers
  - Irrelevant‑Modality Invariance: Randomly corrupt the non‑relevant modality (e.g., blur the image when the task is audio‑question answering) and enforce that the model's logits for the good response stay unchanged.
  - Relevant‑Modality Sensitivity: Apply a mild perturbation to the relevant modality (e.g., add background noise to the audio) and require the model's logits to shift proportionally, encouraging true grounding.
- Language‑Prior Debiasing – Add a penalty proportional to the probability that the model would produce the same answer when fed only the textual prompt, discouraging “text‑only shortcuts.”
- Training Loop – The final loss is a weighted sum of the DPO term, the two regularizers, and the debiasing penalty. Training proceeds for a few epochs on the preference dataset, which is far cheaper than full‑scale multimodal pre‑training.
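The weighted‑sum loss above can be sketched in PyTorch. The specific term forms, argument names, and default weights below are illustrative assumptions, since the summary does not give the exact formulation:

```python
import torch
import torch.nn.functional as F

def mod_dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad,
                 logp_good_corrupt, logp_good_perturb, logp_text_only,
                 beta=0.1, lam_inv=1.0, lam_sens=0.5, lam_prior=0.5,
                 sens_margin=0.05):
    """Illustrative MoD-DPO objective over per-example response log-probs.
    Weights and term shapes are assumptions, not the authors' exact recipe."""
    # Standard DPO term: log-sigmoid of the scaled preference margin
    # relative to a frozen reference policy.
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    dpo = -F.logsigmoid(margin)

    # Irrelevant-modality invariance: the good response's log-likelihood
    # should stay put when the non-relevant modality is corrupted.
    inv = (logp_good - logp_good_corrupt).pow(2)

    # Relevant-modality sensitivity: perturbing the relevant modality
    # should shift the log-likelihood by at least a small margin (hinge).
    sens = F.relu(sens_margin - (logp_good - logp_good_perturb).abs())

    # Language-prior debiasing: penalize answers the model would also
    # produce from the text prompt alone (text-only shortcut).
    prior = logp_text_only.exp()

    return (dpo + lam_inv * inv + lam_sens * sens + lam_prior * prior).mean()
```

In practice the corrupted and perturbed log-probabilities would come from extra forward passes on augmented inputs, so the method trades a few additional passes per step for grounding, which is still far cheaper than multimodal pre-training.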
Results & Findings
| Benchmark | Baseline DPO Hallucination Rate | MoD‑DPO Hallucination Rate | Perception Accuracy (↑) |
|---|---|---|---|
| AVQA‑Hallucination (audio‑visual QA) | 23% | 12% | +5.4 pts |
| Video‑Storytelling (visual‑only) | 18% | 9% | +4.1 pts |
| Multimodal NLI (text + image) | 21% | 11% | +6.2 pts |
- Consistent Gains: Across all three datasets, MoD‑DPO cuts hallucinations by roughly 40‑50% while improving answer correctness.
- Compute‑Efficient: The method matches or exceeds prior DPO baselines using the same number of GPU‑hours (≈ 2‑3 k GPU‑h).
- Ablation Insights: Removing the invariance term leads to a 15% jump in hallucinations; dropping the language‑prior penalty inflates text‑only shortcuts by 8%.
- Robustness: The model remains stable when faced with out‑of‑distribution modality corruptions, indicating better generalization.
Practical Implications
- More Reliable Assistants: Developers building multimodal chatbots (e.g., video‑support agents, audio‑guided editors) can integrate MoD‑DPO to ensure the assistant’s replies truly reflect the supplied media, reducing user frustration.
- Safety & Compliance: In regulated domains (medical imaging, autonomous driving), grounded outputs are essential; MoD‑DPO offers a tractable fine‑tuning step that reduces the risk of hallucinated outputs.
- Cost‑Effective Fine‑Tuning: Since the method works on top of existing foundation models and needs only preference data (which can be generated via human‑in‑the‑loop or LLM‑self‑ranking), teams can improve multimodal fidelity without massive pre‑training budgets.
- Tooling Integration: The regularizers are simple PyTorch modules; they can be wrapped into popular libraries (e.g., 🤗 Transformers) as a “modality‑aware DPO” trainer, lowering the barrier for adoption.
- Better User Experience: Applications like video summarization, captioning, or multimodal search will produce results that align with the visual/audio cues, leading to higher engagement and trust.
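As a concrete illustration of the preference data such a pipeline consumes, a minimal record schema might look like the following (the field names are hypothetical; the paper does not publish a schema):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One MoD-DPO training example (hypothetical schema for illustration)."""
    prompt: str              # textual question or instruction
    media: dict              # e.g. {"image": path, "audio": path}
    relevant_modality: str   # which modality the answer must be grounded in
    chosen: str              # "good" response grounded in the relevant modality
    rejected: str            # "bad" response driven by language priors

# Example record for an audio-grounded question.
pair = PreferencePair(
    prompt="What instrument is playing?",
    media={"audio": "clip_0001.wav"},
    relevant_modality="audio",
    chosen="A cello plays a slow, sustained melody.",
    rejected="A violin plays a fast passage.",  # plausible but ungrounded
)
```

Records like this can be produced by human annotators or by LLM self-ranking, then fed directly to a preference-optimization trainer.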
Limitations & Future Work
- Preference Data Dependency – MoD‑DPO still requires high‑quality preference pairs; generating these at scale for niche domains may be labor‑intensive.
- Modality Scope – The paper focuses on audio‑visual tasks; extending the regularizers to other modalities (e.g., depth, 3‑D point clouds, sensor data) remains an open question.
- Perturbation Design – The effectiveness of invariance/sensitivity regularization hinges on the choice of corruptions; suboptimal perturbations could either over‑constrain the model or fail to capture subtle cross‑modal cues.
- Long‑Form Consistency – While short QA and captioning improve, maintaining modality fidelity over long narratives or multi‑turn dialogues needs further investigation.
- Future Directions – The authors suggest exploring automated curriculum learning for perturbation strength, integrating contrastive multimodal objectives, and evaluating MoD‑DPO on emerging foundation models that fuse language with video, 3‑D, and sensor streams.
Authors
- Ashutosh Chaubey
- Jiacheng Pang
- Mohammad Soleymani
Paper Information
- arXiv ID: 2603.03192v1
- Categories: cs.CV, cs.CL, cs.LG
- Published: March 3, 2026