[Paper] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

Published: December 9, 2025 at 01:57 PM EST
4 min read
Source: arXiv - 2512.08923v1

Overview

The paper “Same Content, Different Answers: Cross‑Modal Inconsistency in MLLMs” uncovers a surprising blind spot in today’s multimodal large language models (MLLMs): even when presented with identical semantic information in text, image, or a mix of both, the models often give different answers. To diagnose and quantify this problem, the authors introduce two new benchmarks—REST and REST+ (Render‑Equivalence Stress Tests)—that systematically probe how consistently MLLMs reason across modalities.

Key Contributions

  • Two novel benchmarks (REST & REST+): Curated sets of triplets (text, image, mixed) that convey the same factual content, enabling direct measurement of cross‑modal consistency.
  • Comprehensive evaluation of 15 state‑of‑the‑art MLLMs: Includes popular open‑source and commercial models, revealing wide variability in consistency scores.
  • In‑depth analysis of visual factors: Demonstrates that text colour, resolution, and the number of vision tokens affect performance, while font style does not.
  • Mechanistic link to modality gap: Shows that a model’s consistency score correlates with the embedding‑space distance between its text and image representations, offering a quantitative diagnostic.
  • Open‑source release: Benchmark data, evaluation scripts, and consistency metrics are publicly available for the community.

Methodology

  1. Benchmark Construction

    • REST: 1,200 semantic facts (e.g., “The Eiffel Tower is in Paris”) presented in three equivalent forms: as plain text, as a rendered image of the same sentence, and as a mixed prompt (image + text).
    • REST+: Extends REST with stress‑test variations—different text colours, resolutions, and token counts—to probe visual robustness (a minimal rendering sketch appears after this list).
  2. Model Selection & Prompting

    • 15 MLLMs spanning vision‑language transformers (e.g., BLIP‑2, LLaVA), instruction‑tuned models (e.g., GPT‑4V), and open‑source alternatives (e.g., MiniGPT‑4).
    • Uniform prompting: “Answer the question based on the content provided.” The same question is asked for each modality of a given fact.
  3. Consistency Scoring

    • Answers are normalized (case‑folding, synonym mapping) and compared pairwise across modalities.
    • Consistency Score = 1 – average pairwise disagreement (0 = completely inconsistent, 1 = perfectly consistent); a scoring sketch also appears after this list.
  4. Controlled Analyses

    • OCR accuracy is measured separately to isolate pure visual‑embedding effects.
    • Ablation studies vary colour, resolution, and token count while keeping the underlying text constant.
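
To make the construction step concrete, here is a minimal sketch of how a factual sentence could be rendered as an image, with REST+‑style colour and resolution variations. The function name, default sizes, and colour values are illustrative assumptions, not the authors’ actual pipeline.

```python
from PIL import Image, ImageDraw, ImageFont

def render_fact(text: str, color: str = "black",
                width: int = 640, height: int = 120) -> Image.Image:
    """Render a factual sentence as an image (REST-style triplet member).
    Colour and resolution are the kinds of attributes REST+ varies."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, height // 3), text, fill=color, font=ImageFont.load_default())
    return img

# Plain rendering plus a REST+-style low-contrast, low-resolution variant.
render_fact("The Eiffel Tower is in Paris.").save("fact_plain.png")
render_fact("The Eiffel Tower is in Paris.", color="#d3d3d3",
            width=320, height=60).save("fact_stress.png")
```

And a minimal sketch of the pairwise consistency score defined above, assuming the answers for each modality have already been collected. The normalization here is deliberately simplified (the paper also applies synonym mapping).

```python
from itertools import combinations

def normalize(answer: str) -> str:
    # Simplified normalization: case-folding and trimming only;
    # the paper additionally applies synonym mapping.
    return answer.strip().strip(".").casefold()

def consistency_score(answers_by_modality: dict) -> float:
    # Consistency = 1 - average pairwise disagreement across modalities.
    normalized = [normalize(a) for a in answers_by_modality.values()]
    pairs = list(combinations(normalized, 2))
    disagreements = sum(a != b for a, b in pairs)
    return 1.0 - disagreements / len(pairs)

# Example: the image-rendered prompt elicits a different answer.
print(consistency_score({"text": "Paris", "image": "Lyon", "mixed": "Paris"}))  # ≈ 0.33
```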

Results & Findings

| Model | Avg. Consistency (REST) | Avg. Consistency (REST+) |
| --- | --- | --- |
| GPT‑4V (proprietary) | 0.78 | 0.71 |
| LLaVA‑1.5‑13B | 0.55 | 0.48 |
| MiniGPT‑4‑7B | 0.42 | 0.35 |
| BLIP‑2‑FlanT5‑XXL | 0.61 | 0.54 |

  • Large variance: Even top‑tier models differ by >30 % in consistency.
  • OCR isn’t the whole story: After correcting OCR errors, inconsistency persists, indicating deeper representation gaps.
  • Visual attributes matter: Low‑contrast text (e.g., light grey on white) and low‑resolution renders cause up to a 15 % drop in consistency; font style has negligible impact.
  • Token count effect: Images that require more vision tokens (larger or more complex scenes) lead to higher inconsistency, suggesting capacity limits in the visual encoder.
  • Modality gap correlation: Pearson r = 0.68 between consistency score and the Euclidean distance of a model’s text vs. image embeddings, supporting the hypothesis that a larger embedding gap drives inconsistency (see the correlation sketch below).
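
As a rough illustration of this correlation analysis, the sketch below computes a simple modality‑gap measure and its Pearson correlation with consistency, assuming per‑item text and image embeddings are already available from each model. The numeric arrays are arbitrary placeholders, not the paper’s measurements.

```python
import numpy as np
from scipy.stats import pearsonr

def modality_gap(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    # Mean Euclidean distance between paired text and image embeddings
    # of the same fact (one row per REST item).
    return float(np.linalg.norm(text_embs - image_embs, axis=1).mean())

# Placeholder per-model values (one gap and one consistency score per MLLM),
# chosen arbitrarily; the study correlates these quantities across 15 models.
gaps = np.array([11.2, 13.5, 10.4, 12.8])
consistencies = np.array([0.78, 0.55, 0.42, 0.61])

r, p = pearsonr(gaps, consistencies)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```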

Practical Implications

  • Reliability in mixed‑modal pipelines: Developers building applications that switch between OCR‑based text extraction and direct image understanding (e.g., document AI, visual assistants) should not assume interchangeable performance.
  • Benchmark‑driven model selection: REST/REST+ can be incorporated into CI/CD testing to pick models that meet a required consistency threshold for critical use‑cases (a minimal CI gate sketch follows this list).
  • Prompt engineering: Adding explicit modality‑agnostic prompts (e.g., “Treat the following content as factual regardless of format”) can modestly improve consistency, but cannot replace architectural fixes.
  • Model design guidance: The correlation with modality gap suggests that future MLLM architectures should enforce tighter alignment between visual and textual encoders—e.g., joint contrastive training with cross‑modal consistency losses.
  • User experience: For end‑users, inconsistent answers can erode trust. UI designers might display a “confidence” indicator that reflects cross‑modal agreement, warning users when the model’s answer varies across formats.
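
For the CI/CD suggestion above, a consistency gate could be as simple as the pytest‑style check below; `run_rest_benchmark`, the stub scores, and the 0.70 threshold are all hypothetical placeholders rather than anything prescribed by the paper.

```python
from statistics import mean

CONSISTENCY_THRESHOLD = 0.70  # illustrative threshold, not taken from the paper

def run_rest_benchmark(model_name: str) -> list:
    # Placeholder: in a real pipeline this would feed every REST triplet
    # (text, image, mixed) to the candidate model and return one
    # consistency score per item (e.g., via consistency_score() above).
    return [0.82, 0.74, 0.69, 0.77]

def test_cross_modal_consistency():
    scores = run_rest_benchmark("candidate-mllm")
    avg = mean(scores)
    assert avg >= CONSISTENCY_THRESHOLD, (
        f"Cross-modal consistency {avg:.2f} is below {CONSISTENCY_THRESHOLD}"
    )

test_cross_modal_consistency()
```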

Limitations & Future Work

  • Scope of content: Benchmarks focus on factual statements; reasoning‑heavy or narrative content may exhibit different inconsistency patterns.
  • Language coverage: All prompts are in English; multilingual consistency remains unexplored.
  • Static evaluation: The study does not assess how fine‑tuning on consistency‑oriented data would shift the modality gap.
  • Hardware constraints: Some large models could not be evaluated on the full benchmark due to GPU memory limits, potentially biasing the sample toward smaller models.

Future research directions include extending REST+ to multilingual and multimodal‑reasoning tasks, developing training objectives that directly minimize the modality gap, and exploring dynamic token‑allocation strategies to mitigate vision‑token bottlenecks.

Authors

  • Angela van Sprang
  • Laurens Samson
  • Ana Lucic
  • Erman Acar
  • Sennay Ghebreab
  • Yuki M. Asano

Paper Information

  • arXiv ID: 2512.08923v1
  • Categories: cs.AI
  • Published: December 9, 2025