[Paper] Information Router for Mitigating Modality Dominance in Vision-Language Models

Published: April 17, 2026 at 01:20 PM EDT
4 min read
Source: arXiv


Overview

Vision‑Language Models (VLMs) have become the workhorse for tasks that blend images and text, from captioning to visual question answering. However, they often fall into modality dominance – the model leans heavily on either the visual or the textual stream, ignoring useful cues from the other side. The paper “Information Router for Mitigating Modality Dominance in Vision‑Language Models” introduces MoIR (Multi‑modal Information Router), a lightweight plug‑in that equalizes the information content of each modality before the two streams are fused, leading to more balanced reasoning and stronger robustness when one modality is noisy or missing.

Key Contributions

  • Information‑level fusion: MoIR detects “weak” tokens (low‑information visual or textual patches) and routes complementary information from the stronger modality to enrich them.
  • Modality‑agnostic design: Works with any backbone that produces token‑wise embeddings (e.g., CLIP, ViLT, BLIP) and any downstream language model.
  • Empirical validation: Improves performance on three standard multimodal benchmarks (VQA, NLVR2, and COCO‑Caption) across multiple model sizes.
  • Robustness gains: Demonstrates consistent accuracy lifts when one modality is deliberately degraded (blurred images, corrupted captions).
  • Interpretability: Provides token‑level visualizations showing how information is re‑distributed, making modality dominance observable and controllable.

Methodology

  1. Token‑wise information scoring – For each modality, MoIR computes a simple “information density” score per token (e.g., entropy of the token embedding or a learned confidence head). Low‑scoring tokens are flagged as under‑informative.
  2. Cross‑modal routing – When a visual token is weak, MoIR extracts the most relevant textual token(s) (and vice versa) using a lightweight similarity lookup. The selected complementary token is then added (or concatenated) to the weak token, producing an enriched representation.
  3. Router module placement – The enriched token set replaces the original token stream before it is fed to the large language model (LLM) that performs multimodal reasoning. The router itself contains only a few linear layers, so it adds negligible latency.
  4. Training – MoIR is trained end‑to‑end with the downstream task loss; the router learns to identify which tokens need help and which source tokens are most useful. No extra supervision about modality dominance is required.
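Steps 1–2 can be sketched as follows. This is a minimal illustrative sketch: the entropy proxy, the median threshold, and the function names are assumptions for demonstration, not the paper's exact formulation (which may use a learned confidence head and a trained router).

```python
import numpy as np

def info_scores(tokens, eps=1e-9):
    # Treat each token embedding as a softmax distribution and use its
    # entropy as a crude "information density" proxy (illustrative
    # stand-in for the paper's scoring function or confidence head).
    p = np.exp(tokens - tokens.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + eps)).sum(axis=-1)

def moir_route(vis, txt):
    """Enrich weak visual tokens with their nearest textual token.
    Only one direction is shown; the text-side pass is symmetric."""
    scores = info_scores(vis)
    weak = scores < np.median(scores)      # simple threshold: below-median tokens
    vn = vis / np.linalg.norm(vis, axis=-1, keepdims=True)
    tn = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    best = (vn @ tn.T).argmax(axis=-1)     # cosine-similarity lookup per visual token
    enriched = vis.copy()
    enriched[weak] += txt[best[weak]]      # additive enrichment of weak tokens only
    return enriched, weak
```

Strong tokens pass through unchanged, so the router only intervenes where the information score flags a deficit.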

The overall pipeline can be visualized as:

Image   → Vision Encoder → Visual Token Embeddings  ─┐
                                                     ├→ MoIR → LLM → Output
Caption → Text Encoder   → Textual Token Embeddings ─┘
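In code, the router placement (step 3) might look like the sketch below; `router` and `llm` are placeholders for the actual modules, and all names and interfaces here are assumptions, not the paper's API.

```python
import numpy as np

def run_pipeline(vis_tokens, txt_tokens, router, llm):
    # Step 3: the router's enriched tokens *replace* the raw streams
    # before the language model sees them.
    enr_vis, enr_txt = router(vis_tokens, txt_tokens)
    return llm(np.concatenate([enr_vis, enr_txt], axis=0))

# Stubs for demonstration: a pass-through router and a toy "LLM"
# that just averages its input tokens.
identity_router = lambda v, t: (v, t)
mean_llm = lambda tokens: tokens.mean(axis=0)
```

Because the router sits between the encoders and the LLM as a drop-in stage, it can be retrofitted without touching either backbone.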

Results & Findings

| Benchmark | Baseline (no MoIR) | + MoIR | Δ (↑) |
| --- | --- | --- | --- |
| VQA‑2.0 (accuracy) | 71.3% | 73.9% | +2.6 pts |
| NLVR2 (accuracy) | 78.1% | 80.5% | +2.4 pts |
| COCO‑Caption (CIDEr) | 124.6 | 129.8 | +5.2 pts |
  • Balanced modality contribution: Attribution analysis shows a ~30% reduction in the dominance ratio (visual vs. textual attention) across tasks.
  • Degradation robustness: With 50% image blur, MoIR recovers ~1.8% absolute accuracy against the 4% drop suffered by the baseline; with 30% caption word dropout, MoIR gains ~2.1% over the baseline.
  • Efficiency: The router adds < 5 ms inference overhead on a single V100 GPU, making it suitable for real‑time services.

These numbers confirm that explicitly enriching weak tokens is more effective than merely re‑weighting attention heads.
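The ~30% dominance-ratio reduction above could be measured with a diagnostic along these lines; the paper's exact attribution method is not given in this summary, so the attention-mass definition and token layout below are assumptions.

```python
import numpy as np

def dominance_ratio(attn, n_vis):
    """Ratio of total attention mass on visual vs. textual tokens.

    attn:  (num_queries, num_tokens) attention weights, rows summing to 1.
    n_vis: number of leading columns that are visual tokens
           (assumed layout: visual tokens first, textual tokens after).
    A ratio near 1.0 means balanced modality contribution.
    """
    vis_mass = attn[:, :n_vis].sum()
    txt_mass = attn[:, n_vis:].sum()
    return vis_mass / txt_mass
```

Tracking this ratio before and after inserting the router makes the dominance effect directly observable per model and per task.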

Practical Implications

  • More reliable multimodal assistants: Voice‑enabled image search or chat‑bots that must handle low‑quality photos (e.g., from mobile cameras) can maintain answer quality.
  • Edge deployment: Since MoIR is a small plug‑in, developers can retrofit existing VLMs on edge devices without retraining the entire backbone.
  • Data‑efficiency: In scenarios where one modality is cheap to collect (text) but the other is noisy (sensor images), MoIR can automatically compensate, reducing the need for costly data cleaning.
  • Safety & bias mitigation: By preventing a model from over‑relying on a single modality, MoIR can lower the risk of hallucinations caused by missing visual cues or misinterpretations of ambiguous text.
  • Cross‑modal debugging: The token‑level routing map offers a new diagnostic tool for engineers to spot where the model is starving for information and to guide data collection efforts.

Limitations & Future Work

  • Scoring simplicity: The current information density metric is heuristic; more sophisticated uncertainty estimators could improve routing decisions.
  • Limited to token‑wise backbones: Models that fuse modalities at the feature‑map level (e.g., early‑fusion CNN‑RNN hybrids) may need architectural adjustments to benefit from MoIR.
  • Potential over‑reliance on the stronger modality: In extreme cases where one modality is completely corrupted, the router may overly copy from the other side, which could mask underlying data quality issues.
  • Future directions: The authors suggest exploring adaptive routing policies that consider task‑specific relevance, extending MoIR to video‑text settings, and integrating it with self‑supervised pre‑training to learn modality‑agnostic information scores.

Authors

  • Seulgi Kim
  • Mohit Prabhushankar
  • Ghassan AlRegib

Paper Information

  • arXiv ID: 2604.16264v1
  • Categories: cs.CV, cs.LG
  • Published: April 17, 2026