[Paper] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Source: arXiv - 2512.08923v1
Overview
The paper “Same Content, Different Answers: Cross‑Modal Inconsistency in MLLMs” uncovers a surprising blind spot in today’s multimodal large language models (MLLMs): even when presented with identical semantic information in text, image, or a mix of both, the models often give different answers. To diagnose and quantify this problem, the authors introduce two new benchmarks—REST and REST+ (Render‑Equivalence Stress Tests)—that systematically probe how consistently MLLMs reason across modalities.
Key Contributions
- Two novel benchmarks (REST & REST+): Curated sets of triplets (text, image, mixed) that convey the same factual content, enabling direct measurement of cross‑modal consistency.
- Comprehensive evaluation of 15 state‑of‑the‑art MLLMs: Includes popular open‑source and commercial models, revealing wide variability in consistency scores.
- In‑depth analysis of visual factors: Demonstrates that text colour, resolution, and the number of vision tokens affect performance, while font style does not.
- Mechanistic link to modality gap: Shows that a model’s consistency score correlates with the embedding‑space distance between its text and image representations, offering a quantitative diagnostic.
- Open‑source release: Benchmark data, evaluation scripts, and consistency metrics are publicly available for the community.
Methodology
Benchmark Construction
- REST: 1,200 semantic facts (e.g., “The Eiffel Tower is in Paris”), each presented as plain text, as a rendered image of the same sentence, and as a mixed prompt (image + text).
- REST+: Extends REST with stress‑test variations—different text colours, resolutions, and token counts—to probe visual robustness.
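The benchmark's actual rendering pipeline ships with the released data, but the idea of a triplet is easy to illustrate. Below is a minimal sketch assuming Pillow for the image rendering; the canvas size, font, and colours are placeholder assumptions, not the authors' settings.

```python
# Minimal sketch of a REST-style triplet: the same fact as plain text,
# as a rendered image of that sentence, and as a mixed (image + text) prompt.
# Canvas size, font, and colours are illustrative assumptions.
from PIL import Image, ImageDraw

def render_fact(fact: str, size=(640, 80), fg="black", bg="white") -> Image.Image:
    """Render a sentence onto a blank canvas using PIL's default font."""
    img = Image.new("RGB", size, color=bg)
    ImageDraw.Draw(img).text((10, 30), fact, fill=fg)
    return img

fact = "The Eiffel Tower is in Paris."
triplet = {
    "text": fact,                        # text-only condition
    "image": render_fact(fact),          # image-only condition (rendered sentence)
    "mixed": (render_fact(fact), fact),  # mixed condition: image plus the same text
}
```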
Model Selection & Prompting
- 15 MLLMs spanning vision‑language transformers (e.g., BLIP‑2, LLaVA), proprietary instruction‑tuned models (e.g., GPT‑4V), and open‑source alternatives (e.g., MiniGPT‑4).
- Uniform prompting: “Answer the question based on the content provided.” The same question is asked for each modality of a given fact.
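How a model is called depends on its API, but the protocol itself is simple. The sketch below shows the uniform-prompting loop over one triplet; `query_model` and `model.generate` are hypothetical stand-ins for whatever interface the evaluated MLLM exposes, not calls from the paper.

```python
# Sketch of the uniform prompting protocol: the same instruction and question
# are posed once per modality of a fact. `query_model` and `model.generate`
# are hypothetical wrappers around the evaluated model's real API.
INSTRUCTION = "Answer the question based on the content provided."

def query_model(model, question: str, text=None, image=None) -> str:
    prompt = f"{INSTRUCTION}\n{question}"
    return model.generate(prompt=prompt, text=text, image=image)

def answers_per_modality(model, triplet: dict, question: str) -> dict:
    """Ask the same question once for each modality of a single fact."""
    mixed_image, mixed_text = triplet["mixed"]
    return {
        "text": query_model(model, question, text=triplet["text"]),
        "image": query_model(model, question, image=triplet["image"]),
        "mixed": query_model(model, question, text=mixed_text, image=mixed_image),
    }
```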
Consistency Scoring
- Answers are normalized (case‑folding, synonym mapping) and compared pairwise across modalities.
- Consistency Score = 1 – average pairwise disagreement (0 = completely inconsistent, 1 = perfectly consistent).
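Concretely, the scoring step can be reproduced in a few lines; the synonym table below is an illustrative stand-in for whatever normalization the authors apply.

```python
# Sketch of the consistency score defined above: normalize the answers, then
# take 1 minus the average pairwise disagreement across modalities.
from itertools import combinations

SYNONYMS = {"paris, france": "paris"}  # illustrative only

def normalize(answer: str) -> str:
    a = answer.strip().lower().rstrip(".")
    return SYNONYMS.get(a, a)

def consistency_score(answers: dict) -> float:
    """1 - average pairwise disagreement (0 = completely inconsistent, 1 = perfectly consistent)."""
    norm = [normalize(a) for a in answers.values()]
    pairs = list(combinations(norm, 2))
    disagreements = sum(a != b for a, b in pairs)
    return 1.0 - disagreements / len(pairs)

print(consistency_score({"text": "Paris", "image": "Paris.", "mixed": "paris, France"}))  # -> 1.0
```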
Controlled Analyses
- OCR accuracy is measured separately to isolate pure visual‑embedding effects.
- Ablation studies vary colour, resolution, and token count while keeping the underlying text constant.
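A REST+‑style stress grid can be generated by sweeping the rendering parameters while keeping the sentence fixed. The sketch below reuses the `render_fact` helper from the earlier triplet example; the grid values are illustrative, not the paper's exact settings.

```python
# Sketch of a REST+-style ablation grid: the same sentence rendered under
# varying colours and resolutions while the text stays constant.
COLOURS = [("black", "white"), ("#d3d3d3", "white"), ("red", "white")]  # "#d3d3d3" = light grey
RESOLUTIONS = [(640, 80), (320, 40), (160, 20)]

def stress_variants(fact: str, render_fn):
    """Yield (condition, image) pairs; render_fn is e.g. render_fact from the earlier sketch."""
    for fg, bg in COLOURS:
        for size in RESOLUTIONS:
            condition = {"fg": fg, "bg": bg, "size": size}
            yield condition, render_fn(fact, size=size, fg=fg, bg=bg)
```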
Results & Findings
| Model | Avg. Consistency (REST) | Avg. Consistency (REST+) |
|---|---|---|
| GPT‑4V (proprietary) | 0.78 | 0.71 |
| LLaVA‑1.5‑13B | 0.55 | 0.48 |
| MiniGPT‑4‑7B | 0.42 | 0.35 |
| BLIP‑2‑FlanT5‑XXL | 0.61 | 0.54 |
- Large variance: even among state‑of‑the‑art models, average consistency scores differ by more than 30 percentage points.
- OCR isn’t the whole story: After correcting OCR errors, inconsistency persists, indicating deeper representation gaps.
- Visual attributes matter: Low‑contrast text (e.g., light grey on white) and low‑resolution renders cause up to a 15 % drop in consistency; font style has negligible impact.
- Token count effect: Images that require more vision tokens (larger or more complex scenes) lead to higher inconsistency, suggesting capacity limits in the visual encoder.
- Modality gap correlation: Pearson r = 0.68 between consistency score and the Euclidean distance of a model’s text vs. image embeddings, supporting the hypothesis that a larger embedding gap drives inconsistency.
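The gap analysis itself is straightforward once per‑model embeddings and consistency scores are available. The sketch below assumes those are already computed (how text and image embeddings are extracted is model‑specific and not shown here) and uses NumPy/SciPy for the distance and correlation.

```python
# Sketch of the modality-gap analysis: per model, the Euclidean distance
# between mean text and image embeddings, correlated against its consistency
# score. Embedding extraction is assumed to have been done elsewhere.
import numpy as np
from scipy.stats import pearsonr

def modality_gap(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Euclidean distance between a model's mean text and image embeddings."""
    return float(np.linalg.norm(text_emb - image_emb))

def gap_consistency_correlation(models: list) -> float:
    """Pearson r between per-model modality gap and average consistency score."""
    gaps = [modality_gap(m["text_emb"], m["image_emb"]) for m in models]
    scores = [m["consistency"] for m in models]
    r, _ = pearsonr(gaps, scores)
    return r
```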
Practical Implications
- Reliability in mixed‑modal pipelines: Developers building applications that switch between OCR‑based text extraction and direct image understanding (e.g., document AI, visual assistants) should not assume interchangeable performance.
- Benchmark‑driven model selection: REST/REST+ can be incorporated into CI/CD testing to pick models that meet a required consistency threshold for critical use‑cases (see the sketch after this list).
- Prompt engineering: Adding explicit modality‑agnostic prompts (e.g., “Treat the following content as factual regardless of format”) can modestly improve consistency, but cannot replace architectural fixes.
- Model design guidance: The correlation with modality gap suggests that future MLLM architectures should enforce tighter alignment between visual and textual encoders—e.g., joint contrastive training with cross‑modal consistency losses.
- User experience: For end‑users, inconsistent answers can erode trust. UI designers might display a “confidence” indicator that reflects cross‑modal agreement, warning users when the model’s answer varies across formats.
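As an example of the benchmark‑driven gate mentioned above, a CI step could aggregate per‑item consistency scores for a candidate model and fail the build below a project‑specific threshold. The threshold and helper names here are assumptions, not part of the paper's released tooling.

```python
# Sketch of a consistency gate for CI/CD, as suggested above: fail the check
# if a candidate model's mean REST consistency falls below a chosen threshold.
# The 0.7 threshold is an arbitrary example, not a recommendation from the paper.
def mean_consistency(per_item_scores: list) -> float:
    return sum(per_item_scores) / len(per_item_scores)

def passes_consistency_gate(per_item_scores: list, threshold: float = 0.7) -> bool:
    """True if the model meets the required cross-modal consistency."""
    return mean_consistency(per_item_scores) >= threshold

# The same aggregate could also back a user-facing "cross-modal agreement" indicator.
assert passes_consistency_gate([0.90, 0.80, 0.75])
```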
Limitations & Future Work
- Scope of content: Benchmarks focus on factual statements; reasoning‑heavy or narrative content may exhibit different inconsistency patterns.
- Language coverage: All prompts are in English; multilingual consistency remains unexplored.
- Static evaluation: The study does not assess how fine‑tuning on consistency‑oriented data would shift the modality gap.
- Hardware constraints: Some large models could not be evaluated on the full benchmark due to GPU memory limits, potentially biasing the sample toward smaller models.
Future research directions include extending REST+ to multilingual and multimodal‑reasoning tasks, developing training objectives that directly minimize the modality gap, and exploring dynamic token‑allocation strategies to mitigate vision‑token bottlenecks.
Authors
- Angela van Sprang
- Laurens Samson
- Ana Lucic
- Erman Acar
- Sennay Ghebreab
- Yuki M. Asano
Paper Information
- arXiv ID: 2512.08923v1
- Categories: cs.AI
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08923v1