[Paper] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Source: arXiv - 2602.12196v1
Overview
A new benchmark called Visual Reasoning Benchmark (VRB) puts Multimodal Large Language Models (MLLMs) to the test on authentic primary‑school visual math problems. By pulling 701 real exam items from Zambia and India, the authors expose a “spatial ceiling” in current models – they can count and scale, but they stumble on operations such as folding, reflection, and rotation that are commonplace in early‑grade maths.
Key Contributions
- Real‑world dataset: 701 unedited classroom‑style visual questions covering analogy, pattern completion, spatial matching, and more.
- Multimodal evaluation protocol: Standardized prompts and scoring that treat the image as a first‑class input, mirroring how teachers would present a problem.
- Capability map: Empirical evidence of a “jagged frontier” – strong performance on static visual skills, sharp drop‑off on dynamic spatial transformations.
- Risk analysis for education: Discussion of how mis‑scored answers could reinforce misconceptions, providing a concrete motivation for domain‑specific benchmarks.
- Open‑source release: Dataset, evaluation scripts, and baseline results are publicly available for reproducibility and community extensions.
Methodology
- Data collection – Exam questions were harvested from publicly released primary‑school assessments in Zambia and India. Images were kept in their original, low‑resolution form (no cropping, no added annotations).
- Task definition – Each item is framed as a four‑option multiple‑choice question where the model must output the correct letter (A–D).
- Model suite – The authors evaluated several state‑of‑the‑art MLLMs (e.g., GPT‑4V, LLaVA, MiniGPT‑4) using a zero‑shot prompt that simply presents the image and asks for the answer.
- Scoring – Accuracy is computed per skill category (counting, scaling, folding, etc.) to surface fine‑grained strengths and weaknesses.
- Error analysis – Qualitative inspection of failure cases highlights systematic misunderstandings of geometric transformations.
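The per‑category scoring step can be sketched as follows. This is a minimal illustration, not the authors' released code; the item keys (`category`, `predicted`, `answer`) and the toy data are invented for the example.

```python
from collections import defaultdict

def score_by_category(items):
    """Compute per-skill-category accuracy for graded items.

    Each item is a dict with hypothetical keys:
      'category'  - skill label, e.g. 'counting' or 'folding'
      'predicted' - the letter the model returned ('A'-'D')
      'answer'    - the gold letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        # Normalize the model output before comparing to the gold letter.
        if item["predicted"].strip().upper() == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy data (invented, not from the paper):
items = [
    {"category": "counting", "predicted": "B", "answer": "B"},
    {"category": "counting", "predicted": "A", "answer": "B"},
    {"category": "folding",  "predicted": "c", "answer": "C"},
]
print(score_by_category(items))  # {'counting': 0.5, 'folding': 1.0}
```

Breaking accuracy out per category, rather than reporting one aggregate number, is what surfaces the jagged frontier the paper describes.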
Results & Findings
| Skill Category | Best Model Accuracy | Gap to Assumed Human Baseline (≈100%) |
|---|---|---|
| Counting & scaling | 92% | ~8% |
| Analogy (static patterns) | 78% | ~22% |
| Folding / unfolding | 41% | ~59% |
| Reflection / rotation | 38% | ~62% |
| Multi‑step spatial reasoning | 33% | ~67% |
- Static visual reasoning (e.g., “how many apples?”) is near‑human.
- Dynamic transformations (folding a shape, mirroring a pattern) cause a steep performance drop, confirming the “spatial ceiling.”
- Errors are not random; models often treat a folded shape as the original, or they misinterpret the axis of symmetry, leading to consistent mis‑marking patterns.
Practical Implications
- Education tech – Companies building AI‑assisted grading tools should treat visual‑reasoning scores as provisional; a fallback to human review is advisable for any problem involving transformations.
- Developer tooling – When integrating MLLMs into classroom assistants (e.g., “show me how to solve this geometry puzzle”), developers must guard against over‑confidence by adding confidence thresholds or explicit verification steps.
- Curriculum design – The benchmark highlights which visual concepts are already well‑supported by AI (counting, basic scaling) and which still need human expertise, informing where to focus human‑in‑the‑loop interventions.
- Model improvement – The fine‑grained breakdown offers a roadmap for researchers: augment training data with synthetic folding/rotation tasks, incorporate geometry‑aware modules, or fuse symbolic reasoning engines with vision back‑ends.
Limitations & Future Work
- Geographic scope – The dataset is limited to two countries; cultural variations in diagram style may affect generalizability.
- Zero‑shot setting – No fine‑tuning was performed; future work could explore whether task‑specific adapters close the spatial gap.
- Modalities – Only static images were used; extending to interactive or 3‑D visualizations (e.g., AR manipulatives) could better reflect modern classroom tools.
- Human baseline – While the authors assume near‑perfect human performance, a formal study with teachers would solidify the benchmark’s upper bound.
The VRB opens a practical pathway for developers and educators to gauge where multimodal LLMs truly help—and where they still need a human touch.
Authors
- Mohamed Huti
- Alasdair Mackintosh
- Amy Waldock
- Dominic Andrews
- Maxime Lelièvre
- Moritz Boos
- Tobias Murray
- Paul Atherton
- Robin A. A. Ince
- Oliver G. B. Garrod
Paper Information
- arXiv ID: 2602.12196v1
- Categories: cs.CL, cs.AI
- Published: February 12, 2026