[Paper] Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Published: April 17, 2026 at 01:15 PM EDT

Source: arXiv - 2604.16256v1

Overview

The paper investigates whether modern vision‑language models (VLMs) truly reason over visual content or simply lean on the powerful language models that sit underneath them. By building a tightly controlled benchmark called CrossMath, the authors compare model performance on identical problems presented as pure text, pure image, or a combination of both. Their findings reveal a striking “modality gap”: VLMs perform best on text‑only inputs and often get worse when visual information is added, suggesting that current VLMs do most of their reasoning in the textual domain.

Key Contributions

  • CrossMath benchmark: a multimodal reasoning dataset where each problem is rendered in three perfectly aligned formats (text‑only, image‑only, image + text) verified by human annotators.
  • Systematic modality analysis: extensive evaluation of state‑of‑the‑art VLMs (e.g., CLIP‑based, Flamingo, LLaVA) that isolates visual vs. textual reasoning capabilities.
  • Empirical discovery of a modality gap: VLMs consistently outperform on text‑only inputs and frequently degrade when visual data is added.
  • CrossMath fine‑tuning set: a curated training corpus derived from the benchmark that, when used for fine‑tuning, narrows the gap and improves performance across all modalities.
  • Generalization evidence: fine‑tuned models also show measurable gains on two unrelated visual reasoning tasks, indicating broader applicability.

Methodology

  1. Problem Construction

    • Authors design math‑style reasoning questions that can be expressed both as LaTeX‑style equations (text) and as rendered images (pictures of the same equations).
    • Each question is duplicated in three modalities:
      • T – plain text (e.g., “If 3 × x = 12, what is x?”)
      • I – image of the same equation (no surrounding text)
      • TI – image plus the original textual prompt.
    • Human annotators verify that the visual and textual versions convey exactly the same information, eliminating hidden cues.
  2. Model Evaluation

    • A suite of recent VLMs is tested on all three formats.
    • Performance is measured using accuracy on the multiple‑choice answers.
    • The “modality gap” is quantified as the difference between the text‑only score and the image‑plus‑text score.
  3. Fine‑tuning with CrossMath

    • The authors extract a large set of aligned multimodal examples (the CrossMath training set) from the benchmark.
    • VLMs are further fine‑tuned on this data, then re‑evaluated on the original test splits and on two external visual reasoning benchmarks (e.g., VQA‑Math, Diagram‑QA).
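The triplet construction above can be sketched in a few lines. The schema below is illustrative only (the paper's actual data format is not given in this summary): each problem carries a textual prompt and a rendered image of the same equation, and is expanded into the three evaluation inputs T, I, and TI.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One CrossMath-style item with aligned text and image renderings.

    Field names are hypothetical -- they stand in for whatever schema
    the benchmark actually uses.
    """
    question_text: str   # T: plain-text prompt, e.g. "If 3 * x = 12, what is x?"
    image_path: str      # I: rendered image of the same equation
    choices: list[str]   # multiple-choice options
    answer: str          # gold label


def build_variants(p: Problem) -> dict[str, dict]:
    """Expand one problem into the three modality-controlled inputs."""
    return {
        "T":  {"text": p.question_text, "image": None},          # text only
        "I":  {"text": None,            "image": p.image_path},  # image only
        "TI": {"text": p.question_text, "image": p.image_path},  # both
    }
```

Because the three variants are generated from the same underlying item, any performance difference between them can be attributed to the modality rather than to the problem content, which is the point of the human alignment check.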

Results & Findings

| Modality | Avg. Accuracy (pre‑fine‑tune) | Avg. Accuracy (post‑fine‑tune) |
| --- | --- | --- |
| Text‑only (T) | 78 % | 84 % |
| Image‑only (I) | 42 % | 58 % |
| Image + Text (TI) | 70 % (often < T) | 80 % |
  • Consistent modality gap: Across all models, performance on TI is lower than on T, sometimes by more than 10 %.
  • Visual reasoning is weak: Even the best VLMs barely exceed random guessing on pure‑image inputs.
  • Fine‑tuning helps: Adding the CrossMath training set lifts image‑only scores by ~15 % and narrows the TI‑vs‑T gap to < 5 %.
  • Transfer gains: On two unrelated visual reasoning tasks, fine‑tuned models improve by 4–6 % absolute accuracy, suggesting the training set teaches a more balanced multimodal reasoning skill.
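Plugging the averages from the table into the paper's gap definition (text-only accuracy minus image-plus-text accuracy) shows how fine-tuning narrows the gap. The numbers below are the summary's reported averages, not per-model scores:

```python
def modality_gap(acc_text: float, acc_text_image: float) -> float:
    """Gap = text-only accuracy minus image+text accuracy (the paper's metric)."""
    return acc_text - acc_text_image


# Average accuracies (in percent) reported in the results table above.
pre  = {"T": 78.0, "I": 42.0, "TI": 70.0}
post = {"T": 84.0, "I": 58.0, "TI": 80.0}

gap_pre    = modality_gap(pre["T"], pre["TI"])    # 8.0 points before fine-tuning
gap_post   = modality_gap(post["T"], post["TI"])  # 4.0 points after, i.e. < 5 %
image_lift = post["I"] - pre["I"]                 # 16.0 points on image-only inputs
```

Note these are averages; per-model gaps can exceed 10 points, as the first bullet above states.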

Practical Implications

  • Tooling for developers: If you’re building an assistant that needs to interpret diagrams, charts, or handwritten equations, relying on off‑the‑shelf VLMs may give you a false sense of security—most of the reasoning will still be text‑driven.
  • Data‑centric improvement: Incorporating aligned multimodal examples (like CrossMath) into your fine‑tuning pipeline can substantially boost a model’s ability to use visual cues, without redesigning the architecture.
  • Evaluation best practices: When benchmarking VLMs for a product, include modality‑controlled tests to detect hidden reliance on the language backbone.
  • Model selection: For tasks where visual grounding is critical (e.g., OCR‑based decision making, scientific diagram analysis), prioritize models that have been explicitly fine‑tuned on balanced multimodal data rather than raw pre‑trained VLMs.
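A modality-controlled test of the kind recommended above can be a thin harness around whatever VLM you deploy. In this sketch, `model_fn(text, image, choices)` is a hypothetical wrapper (not a real API) that returns the predicted answer string; a large T-vs-TI spread in the returned scores signals hidden reliance on the language backbone:

```python
def evaluate_by_modality(model_fn, problems):
    """Score a model separately on T, I, and TI versions of each item.

    `problems` is a list of dicts with keys: text, image, choices, answer.
    Returns per-modality accuracy as a fraction in [0, 1].
    """
    correct = {"T": 0, "I": 0, "TI": 0}
    for p in problems:
        variants = {
            "T":  (p["text"], None),         # text only
            "I":  (None, p["image"]),        # image only
            "TI": (p["text"], p["image"]),   # both
        }
        for name, (text, image) in variants.items():
            if model_fn(text, image, p["choices"]) == p["answer"]:
                correct[name] += 1
    n = len(problems)
    return {name: c / n for name, c in correct.items()}
```

If the I and TI scores trail T by a wide margin, the model is answering from the prompt rather than the pixels, and a raw pre-trained VLM may not be the right choice for a vision-critical task.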

Limitations & Future Work

  • Domain specificity: CrossMath focuses on math‑style equations; the findings may not fully generalize to other visual domains such as natural scenes or complex infographics.
  • Scale of fine‑tuning data: The training set, while effective, is relatively modest compared to the massive corpora used for pre‑training; larger, more diverse multimodal corpora could yield further gains.
  • Model architecture constraints: The study evaluates existing VLM designs; future work could explore architectures that fuse visual and textual representations more tightly rather than treating vision as a peripheral input.
  • Human‑in‑the‑loop evaluation: While annotators verified modality alignment, deeper user studies could assess how end‑users perceive the reasoning quality of VLM outputs in real applications.

Bottom line: The paper shines a light on a hidden weakness in today’s vision‑language models—their reasoning is still largely textual. By using a rigorously aligned benchmark and targeted fine‑tuning, developers can start to close that gap and build systems that truly “see” and reason about visual information.

Authors

  • Yige Xu
  • Yongjie Wang
  • Zizhuo Wu
  • Kaisong Song
  • Jun Lin
  • Zhiqi Shen

Paper Information

  • arXiv ID: 2604.16256v1
  • Categories: cs.CV, cs.CL
  • Published: April 17, 2026
  • PDF: Download PDF