[Paper] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Source: arXiv - 2512.08923v1
Overview
The paper “Same Content, Different Answers: Cross‑Modal Inconsistency in MLLMs” uncovers a surprising blind spot in today’s multimodal large language models (MLLMs): even when presented with identical semantic information in text, image, or a mix of both, the models often give different answers. To diagnose and quantify this problem, the authors introduce two new benchmarks—REST and REST+ (Render‑Equivalence Stress Tests)—that systematically probe how consistently MLLMs reason across modalities.
Key Contributions
- Two novel benchmarks (REST & REST+): Curated sets of triplets (text, image, mixed) that convey the same factual content, enabling direct measurement of cross‑modal consistency.
- Comprehensive evaluation of 15 state‑of‑the‑art MLLMs: Includes popular open‑source and commercial models, revealing wide variability in consistency scores.
- In‑depth analysis of visual factors: Demonstrates that text colour, resolution, and the number of vision tokens affect performance, while font style does not.
- Mechanistic link to modality gap: Shows that a model’s consistency score correlates with the embedding‑space distance between its text and image representations, offering a quantitative diagnostic.
- Open‑source release: Benchmark data, evaluation scripts, and consistency metrics are publicly available for the community.
Methodology
Benchmark Construction
- REST: 1,200 semantic facts (e.g., “The Eiffel Tower is in Paris”), each presented as plain text, as a rendered image of the same sentence, and as a mixed prompt (image + text).
- REST+: Extends REST with stress‑test variations—different text colours, resolutions, and token counts—to probe visual robustness.
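The benchmark's actual rendering pipeline ships with the released data, but the idea of a triplet is easy to illustrate. Below is a minimal sketch assuming Pillow for the image rendering; the canvas size, font, and colours are placeholder assumptions, not the authors' settings.

```python
# Minimal sketch of a REST-style triplet: the same fact as plain text,
# as a rendered image of that sentence, and as a mixed (image + text) prompt.
# Canvas size, font, and colours are illustrative assumptions.
from PIL import Image, ImageDraw

def render_fact(fact: str, size=(640, 80), fg="black", bg="white") -> Image.Image:
    """Render a sentence onto a blank canvas using PIL's default font."""
    img = Image.new("RGB", size, color=bg)
    ImageDraw.Draw(img).text((10, 30), fact, fill=fg)
    return img

fact = "The Eiffel Tower is in Paris."
triplet = {
    "text": fact,                        # text-only condition
    "image": render_fact(fact),          # image-only condition (rendered sentence)
    "mixed": (render_fact(fact), fact),  # mixed condition: image plus the same text
}
```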
Model Selection & Prompting
- 15 MLLMs spanning vision‑language transformers (e.g., BLIP‑2, LLaVA), proprietary instruction‑tuned models (e.g., GPT‑4V), and open‑source alternatives (e.g., MiniGPT‑4).
- Uniform prompting: “Answer the question based on the content provided.” The same question is asked for each modality of a given fact.
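How a model is called depends on its API, but the protocol itself is simple. The sketch below shows the uniform-prompting loop over one triplet; `query_model` and `model.generate` are hypothetical stand-ins for whatever interface the evaluated MLLM exposes, not calls from the paper.

```python
# Sketch of the uniform prompting protocol: the same instruction and question
# are posed once per modality of a fact. `query_model` and `model.generate`
# are hypothetical wrappers around the evaluated model's real API.
INSTRUCTION = "Answer the question based on the content provided."

def query_model(model, question: str, text=None, image=None) -> str:
    prompt = f"{INSTRUCTION}\n{question}"
    return model.generate(prompt=prompt, text=text, image=image)

def answers_per_modality(model, triplet: dict, question: str) -> dict:
    """Ask the same question once for each modality of a single fact."""
    mixed_image, mixed_text = triplet["mixed"]
    return {
        "text": query_model(model, question, text=triplet["text"]),
        "image": query_model(model, question, image=triplet["image"]),
        "mixed": query_model(model, question, text=mixed_text, image=mixed_image),
    }
```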
Consistency Scoring
- Answers are normalized (case‑folding, synonym mapping) and compared pairwise across modalities.
- Consistency Score = 1 – average pairwise disagreement (0 = completely inconsistent, 1 = perfectly consistent).
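Concretely, the scoring step can be reproduced in a few lines; the synonym table below is an illustrative stand-in for whatever normalization the authors apply.

```python
# Sketch of the consistency score defined above: normalize the answers, then
# take 1 minus the average pairwise disagreement across modalities.
from itertools import combinations

SYNONYMS = {"paris, france": "paris"}  # illustrative only

def normalize(answer: str) -> str:
    a = answer.strip().lower().rstrip(".")
    return SYNONYMS.get(a, a)

def consistency_score(answers: dict) -> float:
    """1 - average pairwise disagreement (0 = completely inconsistent, 1 = perfectly consistent)."""
    norm = [normalize(a) for a in answers.values()]
    pairs = list(combinations(norm, 2))
    disagreements = sum(a != b for a, b in pairs)
    return 1.0 - disagreements / len(pairs)

print(consistency_score({"text": "Paris", "image": "Paris.", "mixed": "paris, France"}))  # -> 1.0
```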
Controlled Analyses
- OCR accuracy is measured separately to isolate pure visual‑embedding effects.
- Ablation studies vary colour, resolution, and token count while keeping the underlying text constant.
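A REST+‑style stress grid can be generated by sweeping the rendering parameters while keeping the sentence fixed. The sketch below reuses the `render_fact` helper from the earlier triplet example; the grid values are illustrative, not the paper's exact settings.

```python
# Sketch of a REST+-style ablation grid: the same sentence rendered under
# varying colours and resolutions while the text stays constant.
COLOURS = [("black", "white"), ("#d3d3d3", "white"), ("red", "white")]  # "#d3d3d3" = light grey
RESOLUTIONS = [(640, 80), (320, 40), (160, 20)]

def stress_variants(fact: str, render_fn):
    """Yield (condition, image) pairs; render_fn is e.g. render_fact from the earlier sketch."""
    for fg, bg in COLOURS:
        for size in RESOLUTIONS:
            condition = {"fg": fg, "bg": bg, "size": size}
            yield condition, render_fn(fact, size=size, fg=fg, bg=bg)
```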
Results & Findings
| Model | Avg. Consistency (REST) | Avg. Consistency (REST+) |
|---|---|---|
| GPT‑4V (proprietary) | 0.78 | 0.71 |
| LLaVA‑1.5‑13B | 0.55 | 0.48 |
| MiniGPT‑4‑7B | 0.42 | 0.35 |
| BLIP‑2‑FlanT5‑XXL | 0.61 | 0.54 |
- Large variance: even among state‑of‑the‑art models, average consistency scores differ by more than 30 percentage points.
- OCR isn’t the whole story: After correcting OCR errors, inconsistency persists, indicating deeper representation gaps.
- Visual attributes matter: Low‑contrast text (e.g., light grey on white) and low‑resolution renders cause up to a 15 % drop in consistency; font style has negligible impact.
- Token count effect: Images that require more vision tokens (larger or more complex scenes) lead to higher inconsistency, suggesting capacity limits in the visual encoder.
- Modality gap correlation: Pearson r = 0.68 between consistency score and the Euclidean distance of a model’s text vs. image embeddings, supporting the hypothesis that a larger embedding gap drives inconsistency.
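The gap analysis itself is straightforward once per‑model embeddings and consistency scores are available. The sketch below assumes those are already computed (how text and image embeddings are extracted is model‑specific and not shown here) and uses NumPy/SciPy for the distance and correlation.

```python
# Sketch of the modality-gap analysis: per model, the Euclidean distance
# between mean text and image embeddings, correlated against its consistency
# score. Embedding extraction is assumed to have been done elsewhere.
import numpy as np
from scipy.stats import pearsonr

def modality_gap(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Euclidean distance between a model's mean text and image embeddings."""
    return float(np.linalg.norm(text_emb - image_emb))

def gap_consistency_correlation(models: list) -> float:
    """Pearson r between per-model modality gap and average consistency score."""
    gaps = [modality_gap(m["text_emb"], m["image_emb"]) for m in models]
    scores = [m["consistency"] for m in models]
    r, _ = pearsonr(gaps, scores)
    return r
```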
Practical Implications
- Reliability in mixed‑modal pipelines: Developers building applications that switch between OCR‑based text extraction and direct image understanding (e.g., document AI, visual assistants) should not assume interchangeable performance.
- Benchmark‑driven model selection: REST/REST+ can be incorporated into CI/CD testing to pick models that meet a required consistency threshold for critical use‑cases (see the sketch after this list).
- Prompt engineering: Adding explicit modality‑agnostic prompts (e.g., “Treat the following content as factual regardless of format”) can modestly improve consistency, but cannot replace architectural fixes.
- Model design guidance: The correlation with modality gap suggests that future MLLM architectures should enforce tighter alignment between visual and textual encoders—e.g., joint contrastive training with cross‑modal consistency losses.
- User experience: For end‑users, inconsistent answers can erode trust. UI designers might display a “confidence” indicator that reflects cross‑modal agreement, warning users when the model’s answer varies across formats.
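As an example of the benchmark‑driven gate mentioned above, a CI step could aggregate per‑item consistency scores for a candidate model and fail the build below a project‑specific threshold. The threshold and helper names here are assumptions, not part of the paper's released tooling.

```python
# Sketch of a consistency gate for CI/CD, as suggested above: fail the check
# if a candidate model's mean REST consistency falls below a chosen threshold.
# The 0.7 threshold is an arbitrary example, not a recommendation from the paper.
def mean_consistency(per_item_scores: list) -> float:
    return sum(per_item_scores) / len(per_item_scores)

def passes_consistency_gate(per_item_scores: list, threshold: float = 0.7) -> bool:
    """True if the model meets the required cross-modal consistency."""
    return mean_consistency(per_item_scores) >= threshold

# The same aggregate could also back a user-facing "cross-modal agreement" indicator.
assert passes_consistency_gate([0.90, 0.80, 0.75])
```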
Limitations & Future Work
- Scope of content: Benchmarks focus on factual statements; reasoning‑heavy or narrative content may exhibit different inconsistency patterns.
- Language coverage: All prompts are in English; multilingual consistency remains unexplored.
- Static evaluation: The study does not assess how fine‑tuning on consistency‑oriented data would shift the modality gap.
- Hardware constraints: Some large models could not be evaluated on the full benchmark due to GPU memory limits, potentially biasing the sample toward smaller models.
Future research directions include extending REST+ to multilingual and multimodal‑reasoning tasks, developing training objectives that directly minimize the modality gap, and exploring dynamic token‑allocation strategies to mitigate vision‑token bottlenecks.
Authors
- Angela van Sprang
- Laurens Samson
- Ana Lucic
- Erman Acar
- Sennay Ghebreab
- Yuki M. Asano
Paper Information
- arXiv ID: 2512.08923v1
- Categories: cs.AI
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08923v1