[Paper] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Source: arXiv - 2603.03192v1
Overview
Omni‑modal large language models (LLMs) can reason over text, images, and audio in a single system, but they often “hallucinate” – producing answers that are inconsistent with the visual or auditory input. The paper MoD‑DPO introduces a lightweight training recipe that explicitly teaches these models to stay grounded in the right modality while ignoring irrelevant signals, dramatically cutting down cross‑modal hallucinations.
Key Contributions
- Modality‑Decoupled Direct Preference Optimization (MoD‑DPO): a new fine‑tuning framework that adds modality‑aware regularizers to the standard DPO loss.
- Invariance & Sensitivity Regularization: forces the model to be invariant to corruptions in non‑relevant modalities (e.g., noisy audio when answering a visual question) and sensitive to perturbations in the relevant modality.
- Language‑Prior Debiasing Penalty: a term that penalizes text‑only responses that are likely driven by language priors rather than multimodal evidence.
- Empirical Validation: state‑of‑the‑art reductions in hallucination rates on several audiovisual benchmarks, achieved with the same compute budget as prior preference‑optimization methods.
- Scalable Design: the approach works as a drop‑in add‑on to existing omni‑LLMs without architectural changes.
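Putting the contributions together, the overall objective is the standard DPO loss plus the three modality‑aware terms. A plausible sketch (the weights λ and term names are my notation, not necessarily the paper's exact formulation):

```latex
\mathcal{L}_{\text{MoD-DPO}}
  = \mathcal{L}_{\text{DPO}}
  + \lambda_{\text{inv}}\,\mathcal{L}_{\text{invariance}}
  + \lambda_{\text{sens}}\,\mathcal{L}_{\text{sensitivity}}
  + \lambda_{\text{prior}}\,\mathcal{L}_{\text{debias}}
```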
Methodology
- Base Model – Start from a pretrained omni‑modal LLM (e.g., Flamingo‑2, LLaVA‑Video) that already supports text‑image‑audio inputs.
- Preference Data – Collect pairs of model outputs: a “good” response that correctly references the relevant modality, and a “bad” response that either ignores the modality or leans on language priors.
- Direct Preference Optimization (DPO) – Optimize the model to assign higher likelihood to the good response using a binary cross‑entropy loss over the preference pair.
- Modality‑Decoupled Regularizers
  - Irrelevant‑Modality Invariance: Randomly corrupt the non‑relevant modality (e.g., blur the image when the task is audio‑question answering) and enforce that the model's logits for the good response stay unchanged.
  - Relevant‑Modality Sensitivity: Apply a mild perturbation to the relevant modality (e.g., add background noise to the audio) and require the model's logits to shift proportionally, encouraging true grounding.
- Language‑Prior Debiasing – Add a penalty proportional to the probability that the model would produce the same answer when fed only the textual prompt, discouraging “text‑only shortcuts.”
- Training Loop – The final loss is a weighted sum of the DPO term, the two regularizers, and the debiasing penalty. Training proceeds for a few epochs on the preference dataset, which is far cheaper than full‑scale multimodal pre‑training.
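The weighted‑sum loss above can be sketched in PyTorch. The specific term forms, argument names, and default weights below are illustrative assumptions, since the summary does not give the exact formulation:

```python
import torch
import torch.nn.functional as F

def mod_dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad,
                 logp_good_corrupt, logp_good_perturb, logp_text_only,
                 beta=0.1, lam_inv=1.0, lam_sens=0.5, lam_prior=0.5,
                 sens_margin=0.05):
    """Illustrative MoD-DPO objective over per-example response log-probs.
    Weights and term shapes are assumptions, not the authors' exact recipe."""
    # Standard DPO term: log-sigmoid of the scaled preference margin
    # relative to a frozen reference policy.
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    dpo = -F.logsigmoid(margin)

    # Irrelevant-modality invariance: the good response's log-likelihood
    # should stay put when the non-relevant modality is corrupted.
    inv = (logp_good - logp_good_corrupt).pow(2)

    # Relevant-modality sensitivity: perturbing the relevant modality
    # should shift the log-likelihood by at least a small margin (hinge).
    sens = F.relu(sens_margin - (logp_good - logp_good_perturb).abs())

    # Language-prior debiasing: penalize answers the model would also
    # produce from the text prompt alone (text-only shortcut).
    prior = logp_text_only.exp()

    return (dpo + lam_inv * inv + lam_sens * sens + lam_prior * prior).mean()
```

In practice the corrupted and perturbed log-probabilities would come from extra forward passes on augmented inputs, so the method trades a few additional passes per step for grounding, which is still far cheaper than multimodal pre-training.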
Results & Findings
| Benchmark | Baseline DPO Hallucination Rate | MoD‑DPO Hallucination Rate | Perception Accuracy (↑) |
|---|---|---|---|
| AVQA‑Hallucination (audio‑visual QA) | 23% | 12% | +5.4 pts |
| Video‑Storytelling (visual‑only) | 18% | 9% | +4.1 pts |
| Multimodal NLI (text + image) | 21% | 11% | +6.2 pts |
- Consistent Gains: Across all three datasets, MoD‑DPO cuts hallucinations by roughly 40‑50% while improving answer correctness.
- Compute‑Efficient: The method matches or exceeds prior DPO baselines using the same number of GPU‑hours (≈ 2‑3 k GPU‑h).
- Ablation Insights: Removing the invariance term leads to a 15% jump in hallucinations; dropping the language‑prior penalty inflates text‑only shortcuts by 8%.
- Robustness: The model remains stable when faced with out‑of‑distribution modality corruptions, indicating better generalization.
Practical Implications
- More Reliable Assistants: Developers building multimodal chatbots (e.g., video‑support agents, audio‑guided editors) can integrate MoD‑DPO to ensure the assistant’s replies truly reflect the supplied media, reducing user frustration.
- Safety & Compliance: In regulated domains (medical imaging, autonomous driving), grounded outputs are essential; MoD‑DPO offers a tractable fine‑tuning step that reduces the risk of hallucinated outputs.
- Cost‑Effective Fine‑Tuning: Since the method works on top of existing foundation models and needs only preference data (which can be generated via human‑in‑the‑loop or LLM‑self‑ranking), teams can improve multimodal fidelity without massive pre‑training budgets.
- Tooling Integration: The regularizers are simple PyTorch modules; they can be wrapped into popular libraries (e.g., 🤗 Transformers) as a “modality‑aware DPO” trainer, lowering the barrier for adoption.
- Better User Experience: Applications like video summarization, captioning, or multimodal search will produce results that align with the visual/audio cues, leading to higher engagement and trust.
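As a concrete illustration of the preference data such a pipeline consumes, a minimal record schema might look like the following (the field names are hypothetical; the paper does not publish a schema):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One MoD-DPO training example (hypothetical schema for illustration)."""
    prompt: str              # textual question or instruction
    media: dict              # e.g. {"image": path, "audio": path}
    relevant_modality: str   # which modality the answer must be grounded in
    chosen: str              # "good" response grounded in the relevant modality
    rejected: str            # "bad" response driven by language priors

# Example record for an audio-grounded question.
pair = PreferencePair(
    prompt="What instrument is playing?",
    media={"audio": "clip_0001.wav"},
    relevant_modality="audio",
    chosen="A cello plays a slow, sustained melody.",
    rejected="A violin plays a fast passage.",  # plausible but ungrounded
)
```

Records like this can be produced by human annotators or by LLM self-ranking, then fed directly to a preference-optimization trainer.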
Limitations & Future Work
- Preference Data Dependency – MoD‑DPO still requires high‑quality preference pairs; generating these at scale for niche domains may be labor‑intensive.
- Modality Scope – The paper focuses on audio‑visual tasks; extending the regularizers to other modalities (e.g., depth, 3‑D point clouds, sensor data) remains an open question.
- Perturbation Design – The effectiveness of invariance/sensitivity regularization hinges on the choice of corruptions; suboptimal perturbations could either over‑constrain the model or fail to capture subtle cross‑modal cues.
- Long‑Form Consistency – While short QA and captioning improve, maintaining modality fidelity over long narratives or multi‑turn dialogues needs further investigation.
- Future Directions – The authors suggest exploring automated curriculum learning for perturbation strength, integrating contrastive multimodal objectives, and evaluating MoD‑DPO on emerging foundation models that fuse language with video, 3‑D, and sensor streams.
Authors
- Ashutosh Chaubey
- Jiacheng Pang
- Mohammad Soleymani
Paper Information
- arXiv ID: 2603.03192v1
- Categories: cs.CV, cs.CL, cs.LG
- Published: March 3, 2026