[Paper] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Published: February 26, 2026 at 01:54 PM EST
4 min read
Source: arXiv

Overview

Vision‑Language Models (VLMs) such as OpenCLIP, LLaVA‑1.5, and Molmo have achieved impressive performance on image captioning and multimodal retrieval, yet they still stumble when asked to reason about space, time, negation, or counting. This paper argues that the root cause is reporting bias in the massive web‑scale datasets used to train these models: people tend to describe only the “interesting” parts of an image, leaving out the tacit information that would be needed for deeper reasoning. By examining the data through a pragmatic lens, the authors show that simply scaling up data or model size does not magically fill this gap.

Key Contributions

  • Identify reporting bias as a systematic omission of tacit visual details in caption corpora, linking it to four core reasoning skills (spatial, temporal, negation, counting).
  • Quantify the bias across three widely‑used VLM training corpora (OpenCLIP, LLaVA‑1.5, Molmo) using pragmatic theory‑inspired metrics.
  • Curate targeted benchmarks that isolate each of the four reasoning abilities, revealing consistent performance drops across model sizes and languages.
  • Demonstrate that scaling alone fails: larger datasets, bigger models, and multilingual pre‑training do not lead to emergent reasoning capabilities.
  • Show that explicit annotations help: adding a modest amount of “tacit‑information” labels dramatically improves reasoning performance, confirming the need for intentional data curation.

Methodology

  1. Pragmatic Lens – The authors borrow concepts from linguistic pragmatics (e.g., Grice’s maxims) to define what counts as “tacit” information that speakers typically leave unsaid.
  2. Bias Audits – For each training corpus, they compute the frequency of captions that contain explicit spatial descriptors, temporal cues, negations, or numeric counts versus those that omit them.
  3. Benchmark Construction – Four diagnostic suites are built:
    • Spatial: questions about relative positions (e.g., “Is the cat left of the couch?”)
    • Temporal: ordering events (e.g., “Did the person arrive before the rain started?”)
    • Negation: detecting absence (e.g., “Is there no dog in the scene?”)
    • Counting: exact object numbers (e.g., “How many chairs are visible?”)
      Each suite contains image‑question pairs where the answer hinges on the missing tacit detail.
  4. Model Evaluation – State‑of‑the‑art VLMs are tested on these suites using zero‑shot prompting and few‑shot fine‑tuning.
  5. Intervention Study – The authors augment the original training data with a small, manually curated set of “tacit‑rich” annotations and re‑train/fine‑tune the models to measure gains.
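The bias-audit step (2) can be sketched with a simple keyword lexicon. The paper's actual pragmatics-inspired metrics are not spelled out in this summary, so the cue lists and the `audit` helper below are illustrative assumptions, not the authors' implementation:

```python
import re

# Hypothetical cue lexicons -- the paper's metrics are more sophisticated;
# this sketch only illustrates the counting step of a bias audit.
CUES = {
    "spatial":  r"\b(left|right|above|below|behind|in front of|next to)\b",
    "temporal": r"\b(before|after|while|then|during|until)\b",
    "negation": r"\b(no|not|none|without|never)\b",
    "counting": r"\b(one|two|three|four|five|\d+)\b",
}

def audit(captions):
    """Return, per reasoning skill, the fraction of captions with an explicit cue."""
    total = len(captions)
    return {
        skill: sum(bool(re.search(pat, c.lower())) for c in captions) / total
        for skill, pat in CUES.items()
    }

captions = [
    "a cat sleeping on a couch",             # no explicit cues
    "two dogs playing left of the fence",    # spatial + counting cues
    "a man running before the rain starts",  # temporal cue
]
rates = audit(captions)
```

A real audit would run this over millions of captions; the point is that the output directly measures how often each cue type is left unsaid.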

Results & Findings

| Reasoning Skill | Baseline VLM (zero-shot) | After Scaling (larger data/model) | With Tacit-Rich Annotations |
|---|---|---|---|
| Spatial | 58 % accuracy | 60 % (no significant jump) | 78 % |
| Temporal | 52 % | 53 % | 71 % |
| Negation | 49 % | 50 % | 69 % |
| Counting | 45 % | 46 % | 73 % |
  • Reporting bias is pervasive: even in the largest web‑scale corpora, fewer than 30 % of captions contain explicit spatial or temporal cues.
  • Scaling does not compensate: models with up to 1 B parameters and trained on >10 B image‑text pairs still fail to close the gap.
  • Targeted data fixes the problem: adding as little as 0.5 % of tacit‑rich examples yields 15–25 % absolute improvements across all reasoning categories.
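As a sanity check on the "as little as 0.5 %" figure, the number of tacit-rich examples needed for a given corpus follows from a one-line mixing calculation (the helper name and the 10 M corpus size are illustrative, not from the paper):

```python
def tacit_examples_needed(corpus_size, tacit_fraction=0.005):
    """How many tacit-rich annotations must be added so they make up
    `tacit_fraction` of the augmented (corpus + additions) training set."""
    # Solve n / (corpus_size + n) = f for n:  n = f * corpus_size / (1 - f)
    return round(tacit_fraction * corpus_size / (1 - tacit_fraction))

# For a 10M-caption corpus, roughly 50k curated examples suffice.
n = tacit_examples_needed(10_000_000)
```

Even at web scale, the required curation effort stays in the tens of thousands of examples, which is what makes the intervention practical.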

Practical Implications

  • Data Curation Over Scale – Teams building VLMs for applications like robotics, AR/VR, or content moderation should prioritize quality of annotations (e.g., explicit spatial tags, event timestamps) rather than just amassing more web data.
  • Prompt Engineering Limits – Relying on clever prompts to coax reasoning from existing VLMs is unlikely to succeed unless the underlying training data already contains the needed tacit cues.
  • Fine‑Tuning Strategies – A lightweight “reasoning head” trained on a modest, well‑annotated dataset can dramatically boost performance, offering a cost‑effective path for product teams.
  • Evaluation Standards – Incorporating pragmatic‑focused benchmarks into CI pipelines can catch reasoning blind spots early, preventing downstream failures in safety‑critical systems.
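The CI idea in the last bullet can be as simple as an accuracy-floor gate over the diagnostic suites. The suite names mirror the paper's four skills, but the threshold values and function below are illustrative assumptions:

```python
# Hypothetical accuracy floors for each pragmatic-reasoning suite.
THRESHOLDS = {"spatial": 0.70, "temporal": 0.65, "negation": 0.65, "counting": 0.65}

def failing_suites(results):
    """Return the suites whose accuracy falls below their floor (empty = gate passes).

    Missing suites count as 0.0, so an absent benchmark also fails the gate.
    """
    return [s for s, floor in THRESHOLDS.items() if results.get(s, 0.0) < floor]

# Example: the tacit-rich-annotation numbers reported above clear these floors.
failures = failing_suites(
    {"spatial": 0.78, "temporal": 0.71, "negation": 0.69, "counting": 0.73}
)
```

Treating missing suites as failures is a deliberate design choice: a model that was never tested on negation should not ship as if it had passed.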

Limitations & Future Work

  • Scope of Bias – The study focuses on English‑centric web captions; other languages and domains (e.g., medical imaging) may exhibit different bias patterns.
  • Annotation Cost – While the required tacit‑rich data is small, creating high‑quality annotations still demands expert effort.
  • Model Architecture – The experiments use existing VLM backbones; future work could explore architectures that explicitly model pragmatic inference (e.g., joint vision‑language pragmatics modules).
  • Long‑Term Reasoning – The benchmarks target short‑range reasoning; extending to multi‑step or commonsense chains remains an open challenge.

Bottom line: Bigger datasets won’t magically give VLMs the ability to “read between the lines.” Intentional, pragmatically‑aware data collection is the key to unlocking reliable visual reasoning for real‑world AI systems.

Authors

  • Amita Kamath
  • Jack Hessel
  • Khyathi Chandu
  • Jena D. Hwang
  • Kai-Wei Chang
  • Ranjay Krishna

Paper Information

  • arXiv ID: 2602.23351v1
  • Categories: cs.CL, cs.CV
  • Published: February 26, 2026
