[Paper] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Source: arXiv - 2602.12196v1
Overview
A new benchmark called Visual Reasoning Benchmark (VRB) puts Multimodal Large Language Models (MLLMs) to the test on authentic primary‑school visual math problems. By pulling 701 real exam items from Zambia and India, the authors expose a “spatial ceiling” in current models – they can count and scale, but they stumble on operations such as folding, reflection, and rotation that are commonplace in early‑grade maths.
Key Contributions
- Real‑world dataset: 701 unedited classroom‑style visual questions covering analogy, pattern completion, spatial matching, and more.
- Multimodal evaluation protocol: Standardized prompts and scoring that treat the image as a first‑class input, mirroring how teachers would present a problem.
- Capability map: Empirical evidence of a “jagged frontier” – strong performance on static visual skills, sharp drop‑off on dynamic spatial transformations.
- Risk analysis for education: Discussion of how mis‑scored answers could reinforce misconceptions, providing a concrete motivation for domain‑specific benchmarks.
- Open‑source release: Dataset, evaluation scripts, and baseline results are publicly available for reproducibility and community extensions.
Methodology
- Data collection – Exam questions were harvested from publicly released primary‑school assessments in Zambia and India. Images were kept in their original, low‑resolution form (no cropping, no added annotations).
- Task definition – Each item is framed as a four‑option multiple‑choice question where the model must output the correct letter (A–D).
- Model suite – The authors evaluated several state‑of‑the‑art MLLMs (e.g., GPT‑4V, LLaVA, MiniGPT‑4) using a zero‑shot prompt that simply presents the image and asks for the answer.
- Scoring – Accuracy is computed per skill category (counting, scaling, folding, etc.) to surface fine‑grained strengths and weaknesses.
- Error analysis – Qualitative inspection of failure cases highlights systematic misunderstandings of geometric transformations.
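The per‑category scoring step can be sketched as follows. This is a minimal illustration, not the authors' released code; the item keys (`category`, `predicted`, `answer`) and the toy data are invented for the example.

```python
from collections import defaultdict

def score_by_category(items):
    """Compute per-skill-category accuracy for graded items.

    Each item is a dict with hypothetical keys:
      'category'  - skill label, e.g. 'counting' or 'folding'
      'predicted' - the letter the model returned ('A'-'D')
      'answer'    - the gold letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        # Normalize the model output before comparing to the gold letter.
        if item["predicted"].strip().upper() == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy data (invented, not from the paper):
items = [
    {"category": "counting", "predicted": "B", "answer": "B"},
    {"category": "counting", "predicted": "A", "answer": "B"},
    {"category": "folding",  "predicted": "c", "answer": "C"},
]
print(score_by_category(items))  # {'counting': 0.5, 'folding': 1.0}
```

Breaking accuracy out per category, rather than reporting one aggregate number, is what surfaces the jagged frontier the paper describes.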
Results & Findings
| Skill Category | Best Model Accuracy | Gap to Assumed Human Baseline (≈100%) |
|---|---|---|
| Counting & scaling | 92% | ~8% |
| Analogy (static patterns) | 78% | ~22% |
| Folding / unfolding | 41% | ~59% |
| Reflection / rotation | 38% | ~62% |
| Multi‑step spatial reasoning | 33% | ~67% |
- Static visual reasoning (e.g., “how many apples?”) is near‑human.
- Dynamic transformations (folding a shape, mirroring a pattern) cause a steep performance drop, confirming the “spatial ceiling.”
- Errors are not random; models often treat a folded shape as the original, or they misinterpret the axis of symmetry, leading to consistent mis‑marking patterns.
Practical Implications
- Education tech – Companies building AI‑assisted grading tools should treat visual‑reasoning scores as provisional; a fallback to human review is advisable for any problem involving transformations.
- Developer tooling – When integrating MLLMs into classroom assistants (e.g., “show me how to solve this geometry puzzle”), developers must guard against over‑confidence by adding confidence thresholds or explicit verification steps.
- Curriculum design – The benchmark highlights which visual concepts are already well‑supported by AI (counting, basic scaling) and which still need human expertise, informing where to focus human‑in‑the‑loop interventions.
- Model improvement – The fine‑grained breakdown offers a roadmap for researchers: augment training data with synthetic folding/rotation tasks, incorporate geometry‑aware modules, or fuse symbolic reasoning engines with vision back‑ends.
Limitations & Future Work
- Geographic scope – The dataset is limited to two countries; cultural variations in diagram style may affect generalizability.
- Zero‑shot setting – No fine‑tuning was performed; future work could explore whether task‑specific adapters close the spatial gap.
- Modalities – Only static images were used; extending to interactive or 3‑D visualizations (e.g., AR manipulatives) could better reflect modern classroom tools.
- Human baseline – While the authors assume near‑perfect human performance, a formal study with teachers would solidify the benchmark’s upper bound.
The VRB opens a practical pathway for developers and educators to gauge where multimodal LLMs truly help—and where they still need a human touch.
Authors
- Mohamed Huti
- Alasdair Mackintosh
- Amy Waldock
- Dominic Andrews
- Maxime Lelièvre
- Moritz Boos
- Tobias Murray
- Paul Atherton
- Robin A. A. Ince
- Oliver G. B. Garrod
Paper Information
- arXiv ID: 2602.12196v1
- Categories: cs.CL, cs.AI
- Published: February 12, 2026