[Paper] OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

Published: April 22, 2026 at 01:37 PM EDT
5 min read
Source: arXiv - 2604.20806v1

Overview

The paper introduces OMIBench, a new benchmark that pushes large vision‑language models (LVLMs) to reason across multiple images—a skill that mirrors how scientists solve Olympiad‑level problems by piecing together evidence from several diagrams, graphs, or experimental photos. By focusing on multi‑image contexts, the authors expose a blind spot in existing evaluations, which mostly test single‑image understanding.

Key Contributions

  • Multi‑image reasoning benchmark: Curated 1,200+ Olympiad‑style questions from biology, chemistry, mathematics, and physics that require synthesizing information spread over two or more images.
  • Human‑annotated rationales: Each problem includes step‑by‑step explanations, enabling fine‑grained analysis of model reasoning paths.
  • Dual evaluation protocol: Provides both exact‑match scoring and semantic‑match scoring (using LLM‑based answer equivalence) to capture nuanced correctness.
  • Comprehensive LVLM assessment: Benchmarks a wide spectrum of models—from open‑source LLaVA‑13B to proprietary Gemini‑3‑Pro—revealing a consistent ~50% ceiling even for the strongest systems.
  • Open‑source release: Dataset, annotation files, and evaluation scripts are publicly available, encouraging community‑driven improvements.

Methodology

  1. Problem Collection – The authors mined past Olympiad exams and selected questions whose solutions explicitly reference multiple visual artifacts (e.g., a chemical reaction diagram plus a microscopy image).
  2. Annotation Pipeline – Domain experts wrote detailed rationales, marking which image contributes which piece of evidence. These rationales serve both as ground truth and as a training signal for future fine‑tuning.
  3. Prompt Design – For each test item, the model receives a concatenated prompt containing all relevant images (encoded as vision tokens) and the textual question. No extra “image‑index” hints are given, forcing the model to discover cross‑image links autonomously.
  4. Scoring
    • Exact match: The model’s textual answer is compared verbatim to the gold answer.
    • Semantic match: An LLM (GPT‑4) judges whether the answer conveys the same scientific conclusion, tolerating paraphrases.
  5. Baseline Experiments – The authors evaluated 12 LVLMs, measuring both overall accuracy and per‑domain performance, and performed ablation studies (e.g., removing one image) to quantify the contribution of multi‑image context.
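The dual scoring step can be sketched in a few lines. This is a minimal illustration of the protocol described above, not the paper's released code: the function names, the normalization, and the judge-prompt wording are assumptions, and `judge` stands in for any LLM completion callable (the paper uses GPT‑4).

```python
# Hedged sketch of OMIBench-style dual scoring. All names here are
# illustrative; only the exact-match / semantic-match split comes from
# the paper.

def exact_match(prediction: str, gold: str) -> bool:
    """Verbatim comparison after light case/whitespace normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(gold)

def semantic_match(prediction: str, gold: str, judge) -> bool:
    """Ask an LLM judge whether the answer conveys the same conclusion.
    `judge` is any callable mapping a prompt string to a completion."""
    prompt = (
        "Do these two answers express the same scientific conclusion? "
        f"Answer yes or no.\nGold: {gold}\nPrediction: {prediction}"
    )
    return judge(prompt).strip().lower().startswith("yes")

def score(items, predict, judge):
    """Return (exact_acc, semantic_acc) over a list of benchmark items.
    Each item is assumed to carry `images`, `question`, and `answer`."""
    exact = semantic = 0
    for item in items:
        pred = predict(item["images"], item["question"])
        exact += exact_match(pred, item["answer"])
        semantic += semantic_match(pred, item["answer"], judge)
    n = len(items)
    return exact / n, semantic / n
```

Semantic accuracy is always at least as high as exact accuracy in the reported results, which is what this split would predict: every verbatim match also passes the judge, while paraphrases pass only the semantic check.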

Results & Findings

| Model | Exact‑Match Accuracy | Semantic‑Match Accuracy |
| --- | --- | --- |
| LLaVA‑13B | 22% | 31% |
| InstructBLIP‑7B | 28% | 38% |
| Gemini‑1‑Pro | 44% | 52% |
| Gemini‑3‑Pro (best) | 48% | 55% |
  • Performance Gap: Even top‑tier LVLMs fall short of human‑level performance (~95% on the same set).
  • Domain Variance: Physics and chemistry questions see the biggest drop, likely because they rely heavily on interpreting multiple plots or experimental setups.
  • Ablation Insight: Removing any single image drops accuracy by ~12‑15%, confirming that models truly need to fuse information rather than guessing from a dominant visual cue.
  • Rationale Alignment: Models that generate intermediate reasoning steps (e.g., chain‑of‑thought prompting) achieve modest gains (~4% absolute), suggesting that explicit reasoning helps but is not sufficient.
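The leave-one-image-out ablation can be sketched as follows. This is an assumed reconstruction of the procedure, not the authors' script: the item schema and the `evaluate` hook (which maps a list of items to an accuracy) are hypothetical.

```python
# Illustrative leave-one-image-out ablation: for each image position,
# re-evaluate the benchmark with that image removed and report the drop
# in accuracy relative to the full multi-image context.

def ablation_drops(items, evaluate):
    """Return a list of accuracy drops, one per ablated image position.
    `evaluate(items) -> float` is any accuracy-computing callable."""
    full_acc = evaluate(items)
    max_images = max(len(it["images"]) for it in items)
    drops = []
    for i in range(max_images):
        # Drop the i-th image from every item that has one at that position.
        ablated = [
            {**it, "images": it["images"][:i] + it["images"][i + 1:]}
            for it in items
            if len(it["images"]) > i
        ]
        drops.append(full_acc - evaluate(ablated))
    return drops
```

A uniform ~12–15% drop across positions, as reported above, is the signature of genuine fusion: no single image carries the answer on its own.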

Practical Implications

  • Productivity Tools: Developers building AI assistants for scientific research (lab notebooks, educational platforms) should anticipate that current LVLMs may miss critical cross‑image cues, leading to incomplete or incorrect suggestions.
  • Safety‑Critical Systems: In domains like medical imaging or industrial inspection, decisions often depend on correlating multiple scans (e.g., MRI slices, before/after photos). OMIBench highlights that relying on off‑the‑shelf LVLMs could be risky without additional verification layers.
  • Fine‑Tuning Strategies: The annotated rationales provide a ready‑made curriculum for supervised fine‑tuning or reinforcement learning from human feedback (RLHF) focused on multi‑image reasoning.
  • Benchmark‑Driven Development: Companies can adopt OMIBench as a regression suite to track improvements in their vision‑language pipelines, ensuring that new model releases genuinely advance multi‑image understanding.
  • API Design: When exposing LVLM capabilities via APIs, offering explicit “multi‑image context” flags or allowing developers to supply image ordering metadata could help models allocate attention more effectively.
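The regression-suite idea above can be made concrete with a small CI gate. The baseline numbers below are taken from the reported Gemini‑1‑Pro row purely for illustration; the tolerance value and the `check_regression` helper are assumptions, not part of the OMIBench release.

```python
# Minimal sketch of using OMIBench scores as a release gate: fail the
# build if either metric regresses beyond a small tolerance.

BASELINE = {"exact": 0.44, "semantic": 0.52}  # previous release's scores
TOLERANCE = 0.02                              # run-to-run variance budget

def check_regression(scores: dict,
                     baseline: dict = BASELINE,
                     tolerance: float = TOLERANCE) -> list:
    """Return the names of metrics where `scores` fell more than
    `tolerance` below `baseline`; an empty list means no regression."""
    return [
        metric for metric, base in baseline.items()
        if scores.get(metric, 0.0) < base - tolerance
    ]
```

In CI this would run after each candidate model's evaluation, with a non-empty return value blocking the release.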

Limitations & Future Work

  • Scope of Olympiad Problems: While Olympiad questions are challenging, they represent a narrow slice of real‑world tasks; extending the benchmark to industrial case studies (e.g., multi‑camera surveillance) would broaden relevance.
  • Image Quantity: Most items involve two or three images; scaling to larger sets (dozens of satellite tiles, video frames) may uncover additional bottlenecks.
  • Evaluation Dependence on LLM Judgement: Semantic matching relies on a separate LLM, which could introduce bias; future work could incorporate human verification for a subset of answers.
  • Model Architecture: The study focuses on transformer‑based LVLMs; exploring hybrid architectures (e.g., graph‑based reasoning over image embeddings) could yield better multi‑image fusion.
  • Training Data Gaps: The authors note that many public LVLM pre‑training corpora contain few multi‑image examples, suggesting that curated multi‑image datasets are needed for pre‑training, not just fine‑tuning.

By spotlighting these gaps, OMIBench sets the stage for the next generation of vision‑language models that can truly “see the whole picture.”

Authors

  • Qiguang Chen
  • Chengyu Luan
  • Jiajun Wu
  • Qiming Yu
  • Yi Yang
  • Yizhuo Li
  • Jingqi Tong
  • Xiachong Feng
  • Libo Qin
  • Wanxiang Che

Paper Information

  • arXiv ID: 2604.20806v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: April 22, 2026