[Paper] More Images, More Problems? A Controlled Analysis of VLM Failure Modes
Source: arXiv - 2601.07812v1
Overview
Large Vision‑Language Models (LVLMs) have become the go‑to tools for tasks that blend images and text, but most research has focused on single‑image inputs. The new MIMIC benchmark shines a light on how these models behave when they have to reason over multiple images—a scenario that’s increasingly common in real‑world applications such as product catalogs, medical reports, and visual QA systems. By systematically probing LVLMs, the authors expose key failure modes and propose concrete fixes that push the state of the art forward.
Key Contributions
- MIMIC benchmark: a rigorously curated suite of multi‑image tasks that isolates specific reasoning challenges (e.g., cross‑image aggregation, simultaneous concept tracking).
- Diagnostic analysis: extensive experiments that map out where current LVLMs stumble, revealing systematic weaknesses in attention and information fusion.
- Procedural multi‑image data generation: a scalable recipe that transforms single‑image annotations into rich, targeted multi‑image training examples without manual labeling.
- Layer‑wise attention‑masking scheme: an optimization technique that reshapes the model’s attention patterns to better handle multiple visual streams.
- Empirical gains: the combined data‑ and optimization‑level interventions improve cross‑image reasoning on MIMIC and lift performance on existing multi‑image benchmarks, setting a new SOTA across several tasks.
Methodology
Benchmark Construction
- The authors start from existing single‑image datasets (e.g., COCO, Visual Genome) and programmatically stitch together sets of 2–5 images, pairing each set with a common query (e.g., “compare the colors of the two shirts”).
- Each MIMIC instance includes a natural‑language prompt, the image set, and a ground‑truth answer, allowing precise measurement of specific capabilities (aggregation, tracking, etc.); a construction sketch follows this list.
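A minimal sketch of how such an instance could be assembled, assuming single‑image annotations that carry an `image_id` and per‑image `object_labels`; the field names, the counting‑style prompt, and the `MultiImageInstance` container are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch: assemble a MIMIC-style instance from single-image annotations.
# Field names (image_id, object_labels) and the prompt template are assumptions.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class MultiImageInstance:
    images: List[str]    # IDs of the 2-5 grouped images
    prompt: str          # natural-language query over the whole set
    answer: str          # ground-truth answer derived from the annotations
    capability: str      # the skill this instance isolates

def build_aggregation_instance(annotations: List[dict], concept: str = "dog") -> MultiImageInstance:
    """Group 2-5 annotated images and ask a question whose answer requires
    aggregating facts across all of them."""
    group = random.sample(annotations, k=random.randint(2, 5))
    prompt = f"How many of the {len(group)} images contain a {concept}?"
    # The answer is computed directly from existing single-image labels,
    # so no manual multi-image annotation is needed.
    answer = sum(concept in ann["object_labels"] for ann in group)
    return MultiImageInstance(
        images=[ann["image_id"] for ann in group],
        prompt=prompt,
        answer=str(answer),
        capability="cross_image_aggregation",
    )
```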
Diagnostic Experiments
- Using off‑the‑shelf LVLMs (e.g., BLIP‑2, InstructBLIP), they probe four failure axes:
(a) inability to aggregate facts across images,
(b) loss of individual object references,
(c) attention collapse onto a single image, and
(d) confusion when multiple concepts appear simultaneously.
- Attention maps and hidden‑state analyses are visualized layer‑by‑layer to pinpoint where the breakdown occurs; a measurement sketch follows below.
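One such diagnostic can be reproduced with a few lines of tensor code. This is a sketch under stated assumptions: `attentions` is the tuple of per‑layer attention tensors a Hugging Face‑style model returns when called with `output_attentions=True`, and `image_ids` is a caller‑supplied mapping from token position to source image; it illustrates the measurement, not the authors' analysis code.

```python
# For each layer, measure how much attention mass flows across image
# boundaries versus staying inside a single image.
import torch

def cross_image_attention_share(attentions, image_ids: torch.Tensor):
    """Return one scalar per layer: the share of attention that queries on
    image tokens place on tokens from a *different* image.
    image_ids: (seq,) image index per token, -1 for text tokens."""
    is_image = image_ids >= 0
    # different_image[i, j] is True when tokens i and j belong to two
    # distinct images (text tokens are excluded on both sides).
    different_image = (
        is_image[:, None] & is_image[None, :]
        & (image_ids[:, None] != image_ids[None, :])
    )
    shares = []
    for layer_attn in attentions:                    # (batch, heads, seq, seq)
        attn = layer_attn.mean(dim=(0, 1))           # average over batch and heads
        cross = (attn * different_image).sum(dim=-1) # cross-image mass per query
        shares.append(cross[is_image].mean().item())
    return shares
```

Tracking this quantity per layer, before and after an intervention, is the kind of signal behind the observation that early layers stay largely intra‑image.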
Remedy 1 – Procedural Data Generation
- A script automatically creates multi‑image training pairs by concatenating single‑image captions and injecting relational cues (e.g., “while the left image shows X, the right image shows Y”).
- This synthetic data is mixed with the original single‑image corpus, exposing the model to the multi‑image pattern during pre‑training; a generation sketch follows below.
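A minimal version of such a generation script, assuming plain (image, caption) pairs as input; the relational templates and the output format are illustrative assumptions, not the released pipeline.

```python
# Turn single-image captions into one multi-image training example with
# explicit relational cues between the images in the group.
import random

RELATIONAL_TEMPLATES = [
    "While image {i} shows {cap_i}, image {j} shows {cap_j}.",
    "Image {i} depicts {cap_i}, whereas image {j} depicts {cap_j}.",
]

def make_multi_image_example(captioned_images, k=3):
    """captioned_images: list of (image_path, caption) pairs."""
    group = random.sample(captioned_images, k)
    sentences = []
    for a in range(len(group)):
        for b in range(a + 1, len(group)):
            template = random.choice(RELATIONAL_TEMPLATES)
            sentences.append(template.format(
                i=a + 1, j=b + 1,
                cap_i=group[a][1].rstrip("."),
                cap_j=group[b][1].rstrip("."),
            ))
    return {"images": [path for path, _ in group],
            "target_text": " ".join(sentences)}
```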
Remedy 2 – Attention‑Masking for Multi‑Image Inputs
- The authors examine the self‑attention matrix and discover that early transformer layers tend to focus on intra‑image tokens, ignoring cross‑image connections.
- They introduce a lightweight mask that forces a fraction of attention heads to attend across image boundaries, encouraging the model to learn cross‑image relationships without altering the overall architecture; a mask‑construction sketch follows below.
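The gist of the scheme can be sketched as a mask‑construction helper. The head fraction, the additive `-inf` convention, and the choice to keep self‑attention open are assumptions made for illustration; how the mask is injected into a specific LVLM's layers is model‑dependent.

```python
# Build an additive attention mask that bars a fraction of heads from
# intra-image attention, so those heads must look across image boundaries
# (and at text tokens).
import torch

def cross_image_head_mask(image_ids: torch.Tensor, num_heads: int,
                          cross_fraction: float = 0.25) -> torch.Tensor:
    """image_ids: (seq,) image index per token, -1 for text tokens.
    Returns an additive mask of shape (num_heads, seq, seq)."""
    seq = image_ids.shape[0]
    is_image = image_ids >= 0
    # same_image[i, j] is True when tokens i and j come from one image.
    same_image = (
        is_image[:, None] & is_image[None, :]
        & (image_ids[:, None] == image_ids[None, :])
    )
    # Keep self-attention open so no query ends up with an empty key set.
    same_image &= ~torch.eye(seq, dtype=torch.bool)
    mask = torch.zeros(num_heads, seq, seq)
    num_cross_heads = int(num_heads * cross_fraction)
    # The first num_cross_heads heads are barred from intra-image attention.
    mask[:num_cross_heads, same_image] = float("-inf")
    return mask
```

At fine‑tuning time the returned tensor would simply be added to the attention logits of the chosen layers before the softmax, which is why the technique introduces no new parameters.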
Training & Evaluation
- Models are fine‑tuned on the combined dataset with the masking scheme applied.
- Performance is reported on MIMIC and on three public multi‑image benchmarks (e.g., Multi‑Modal VQA, Image‑Set Retrieval) to verify generalization; a scoring sketch follows below.
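For completeness, scoring MIMIC‑style instances reduces to exact‑match accuracy grouped by the capability each instance isolates. In the sketch below, `model.answer(images, prompt)` is a hypothetical inference call, and the paper's own metrics may differ (e.g., F1 for concept tracking).

```python
# Illustrative scoring loop over MIMIC-style instances (see the dataclass
# sketched earlier); reports exact-match accuracy per isolated capability.
from collections import defaultdict

def evaluate(model, instances):
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        # model.answer(...) is a placeholder for the evaluated LVLM's interface.
        prediction = model.answer(inst.images, inst.prompt).strip().lower()
        total[inst.capability] += 1
        correct[inst.capability] += int(prediction == inst.answer.strip().lower())
    return {cap: correct[cap] / total[cap] for cap in total}
```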
Results & Findings
| Metric | Baseline LVLM | + Procedural Data | + Attention Mask | + Both (Full Method) |
|---|---|---|---|---|
| Cross‑image aggregation accuracy (MIMIC) | 42.1 % | 55.8 % | 58.3 % | 68.9 % |
| Multi‑image VQA (overall) | 61.4 % | 66.2 % | 67.0 % | 73.5 % |
| Concept‑tracking F1 (MIMIC) | 48.7 % | 60.1 % | 61.4 % | 71.2 % |
- Cross‑image aggregation jumps by > 25 points when both remedies are combined, confirming that the model learns to synthesize information across images.
- Attention analysis shows a 30 % increase in cross‑image attention weights after masking, aligning the qualitative observations with quantitative gains.
- The improvements transfer to other benchmarks, indicating that the fixes are not over‑fitted to MIMIC alone.
Practical Implications
- E‑commerce & Catalog Management – Systems that need to compare product images (e.g., “which of these shoes is more durable?”) can now rely on LVLMs that truly aggregate visual evidence, reducing the need for handcrafted feature pipelines.
- Medical Imaging – Radiology reports often reference multiple scans (CT, MRI, X‑ray). A multi‑image‑aware LVLM can generate more coherent summaries and assist in differential diagnosis.
- Content Moderation – Detecting policy violations that span several images (e.g., coordinated misinformation memes) becomes feasible when the model can reason across the set.
- Developer Tooling – The procedural data‑generation script is open‑source, enabling teams to augment their own training corpora with multi‑image examples without costly annotation.
- Model Architecture Choices – The attention‑masking technique is lightweight (no extra parameters) and can be dropped into existing transformer‑based LVLMs, offering an easy win for products that already use such models.
Limitations & Future Work
- Synthetic vs. Real‑World Data – The procedural generation pipeline creates plausible multi‑image scenarios, but it may not capture the full distribution of natural multi‑image queries found in the wild.
- Scalability of Masking – The current mask is static; dynamic, query‑dependent masking could further improve efficiency, especially for very large image sets.
- Evaluation Scope – While MIMIC covers a broad set of reasoning tasks, it still focuses on relatively short prompts. Longer, dialog‑style interactions across images remain an open challenge.
- Cross‑Modal Generalization – Extending the analysis to video (temporal sequences) or multimodal inputs that include audio could reveal additional failure modes and opportunities for similar remedies.
The authors promise to release the MIMIC benchmark, data‑generation scripts, and code at https://github.com/anurag-198/MIMIC, making it straightforward for the community to build on these findings.
Authors
- Anurag Das
- Adrian Bulat
- Alberto Baldrati
- Ioannis Maniadis Metaxas
- Bernt Schiele
- Georgios Tzimiropoulos
- Brais Martinez
Paper Information
- arXiv ID: 2601.07812v1
- Categories: cs.CV
- Published: January 12, 2026