[Paper] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Source: arXiv - 2512.21329v1
Overview
The paper Your Reasoning Benchmark May Not Test Reasoning examines why modern vision‑language models (VLMs) stumble on ARC‑style abstract‑reasoning suites such as Mini‑ARC, ACRE, and Bongard‑LOGO. Rather than attributing these failures to "weak reasoning," the authors show that most errors stem from the models' inability to perceive the visual input accurately. By explicitly separating perception from reasoning, they reveal a hidden bottleneck that inflates the apparent gap between human and machine reasoning abilities.
Key Contributions
- Two‑stage evaluation pipeline that first converts each image into a natural‑language description (perception) and then runs a rule‑induction model on the textual descriptions (reasoning).
- Systematic comparison of the two‑stage pipeline against traditional end‑to‑end VLMs across three ARC‑style benchmarks, quantifying the relative impact of perception vs. reasoning.
- Empirical evidence that ≈ 80 % of failures in VLMs are traceable to perception errors, not to flawed logical inference.
- Critical analysis of why current abstract‑reasoning benchmarks conflate visual perception and logical reasoning, calling for redesigned evaluation protocols.
Methodology
- Dataset Selection – The authors work with three widely used abstract‑reasoning datasets: Mini‑ARC, ACRE, and Bongard‑LOGO. Each task presents a set of input images and asks the model to infer the underlying rule and produce the correct answer.
- Perception Stage – For every image, a strong vision encoder (e.g., CLIP‑ViT or a fine‑tuned object detector) generates a concise natural‑language caption describing shapes, colors, spatial relations, etc. This step is performed independently for each image, guaranteeing no cross‑image leakage.
- Reasoning Stage – A language‑only model (e.g., GPT‑4 or a fine‑tuned T5) receives the textual descriptions of the inputs (and of the target outputs, when available) and is tasked with inferring the underlying rule and applying it to produce a description of the answer; a sketch of this two‑stage setup follows this list.
- Baseline Comparison – The same tasks are also solved with a conventional end‑to‑end VLM that directly maps raw pixels to the answer image, representing the “one‑stage” approach used in most prior work.
- Error Analysis – The authors manually inspect reasoning traces (the chain‑of‑thought generated by the language model) to categorize failures as perception‑related or reasoning‑related.
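Below is a minimal sketch of how such a two‑stage evaluation could be wired together. The `Captioner` and `TextReasoner` interfaces, the `Task` structure, and the prompt wording are hypothetical stand‑ins, not the authors' implementation; the sketch only illustrates the strict separation of perception and reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any image captioner and any text-only model
# with these signatures could stand in for the perception and reasoning modules.
Captioner = Callable[[bytes], str]      # image bytes -> natural-language description
TextReasoner = Callable[[str], str]     # text prompt -> predicted answer description


@dataclass
class Task:
    input_images: List[bytes]   # raw bytes of the task's panels
    target_description: str     # ground-truth answer, expressed as text


def solve_two_stage(task: Task, caption: Captioner, reason: TextReasoner) -> str:
    """Two-stage evaluation: caption each image independently (perception),
    then ask a language-only model to induce and apply the rule (reasoning)."""
    # Perception stage: each image is captioned in isolation,
    # so no visual information leaks across panels.
    descriptions = [caption(img) for img in task.input_images]

    # Reasoning stage: only text reaches the reasoner.
    prompt = (
        "Each line below describes one panel of an abstract-reasoning task.\n"
        + "\n".join(f"Panel {i + 1}: {d}" for i, d in enumerate(descriptions))
        + "\nInfer the underlying rule and describe the correct output."
    )
    return reason(prompt)


def is_correct(prediction: str, task: Task) -> bool:
    """Placeholder scoring: exact match on normalized text
    (the paper's actual matching criterion may differ)."""
    return prediction.strip().lower() == task.target_description.strip().lower()
```

Any off‑the‑shelf captioner and text‑only model can be dropped into these slots, which is what makes the comparison against the end‑to‑end baseline straightforward to set up.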
Results & Findings
| Benchmark | End‑to‑End VLM Accuracy | Two‑Stage (Perception + Reasoning) Accuracy |
|---|---|---|
| Mini‑ARC | ~12 % | ~45 % (≈ 3.7× boost) |
| ACRE | ~8 % | ~38 % (≈ 4.8× boost) |
| Bongard‑LOGO | ~15 % | ~52 % (≈ 3.5× boost) |
- Perception dominates: When the perception module is strong (high‑quality captions), the reasoning model solves a large fraction of tasks that the end‑to‑end VLM cannot.
- Error breakdown: Manual inspection of 500 failed VLM attempts shows that ~80 % stem from missed or mis‑described visual elements (e.g., a caption that omits a small red triangle); only ~20 % are genuine reasoning mistakes (a minimal tallying sketch follows this list).
- Leakage control: Because each image is captioned independently, no visual information leaks across inputs during perception, confirming that the performance gain comes from better perception rather than from cross‑image shortcuts.
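As a rough illustration of how per‑failure labels can be turned into the reported split, a minimal tallying sketch follows; the example labels are invented for illustration and do not reproduce the paper's annotations.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Aggregate manually assigned failure labels ('perception' or 'reasoning')
    into percentages, mirroring the structure of the paper's error analysis."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Toy usage (labels are invented, not the paper's annotations):
print(failure_breakdown(["perception"] * 4 + ["reasoning"]))
# -> {'perception': 80.0, 'reasoning': 20.0}
```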
Practical Implications
- Benchmark redesign – Developers building AI agents for “general intelligence” should treat ARC‑style suites as perception‑augmented tasks, not pure logic tests. Future benchmarks might provide explicit visual descriptors or separate perception scores.
- Model architecture – Investing in stronger, modular vision encoders (e.g., region‑level detectors, scene graph generators) can yield outsized gains on abstract reasoning problems, often more cost‑effective than scaling up the reasoning component.
- Debugging pipelines – The two‑stage framework offers a clear diagnostic tool: if a model fails, check the caption first (a minimal diagnostic sketch follows this list). This can shorten iteration cycles for VLM developers.
- Transfer learning – High‑quality visual descriptions can be reused across downstream tasks (e.g., program synthesis from screenshots, robotic instruction following), making the perception module a reusable asset.
- Evaluation hygiene – Companies benchmarking their VLMs should report both perception accuracy (caption quality) and reasoning accuracy to avoid over‑claiming “reasoning” capabilities.
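A minimal sketch of the "check the caption first" diagnostic and the dual‑accuracy reporting is shown below; the token‑overlap heuristic and the `diagnose_failure` / `report` helpers are hypothetical conveniences, not part of the paper.

```python
from typing import List


def diagnose_failure(caption: str, reference_description: str) -> str:
    """For a task the model got wrong, check the caption first: if it omits
    elements named in a human-written reference description, attribute the
    failure to perception; otherwise attribute it to reasoning.
    Token overlap is a crude stand-in for manual inspection."""
    missing = sorted(
        tok for tok in set(reference_description.lower().split())
        if tok not in caption.lower()
    )
    if missing:
        return "perception error (caption missing: " + ", ".join(missing[:5]) + ")"
    return "reasoning error"


def report(diagnoses: List[str]) -> None:
    """Report perception- and reasoning-attributed failure rates separately,
    so 'reasoning' claims are not inflated by perception mistakes."""
    n = len(diagnoses)
    perception = sum(d.startswith("perception") for d in diagnoses)
    print(f"perception-attributed failures: {perception / n:.0%}")
    print(f"reasoning-attributed failures:  {(n - perception) / n:.0%}")
```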
Limitations & Future Work
- Caption quality ceiling – The study relies on existing vision models for captioning; any residual perception errors still limit the upper bound of reasoning performance.
- Dataset scope – Only three ARC‑style datasets were examined; other abstract‑reasoning benchmarks (e.g., CLEVR, RAVEN) may exhibit different perception‑reasoning balances.
- Human‑like abstraction – Converting images to text may discard low‑level visual nuances that humans use implicitly; future work could explore richer symbolic representations such as scene graphs or programmatic sketches (a toy encoding is sketched after this list).
- End‑to‑end integration – While modularization clarifies bottlenecks, the ultimate goal remains a unified model that jointly learns perception and reasoning without performance loss; bridging the gap is an open research direction.
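To make "richer symbolic representations" concrete, a toy scene‑graph encoding of a single panel could look like the following; the schema is invented for illustration and is not proposed in the paper.

```python
# A toy scene-graph encoding of one panel: objects carry attributes,
# relations link object ids. Schema is illustrative only.
panel_scene_graph = {
    "objects": [
        {"id": 0, "shape": "triangle", "color": "red",  "size": "small"},
        {"id": 1, "shape": "square",   "color": "blue", "size": "large"},
    ],
    "relations": [
        {"subject": 0, "predicate": "inside", "object": 1},
    ],
}
```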
Authors
- Xinhe Wang
- Jin Huang
- Xingjian Zhang
- Tianhao Wang
- Jiaqi W. Ma
Paper Information
- arXiv ID: 2512.21329v1
- Categories: cs.CL
- Published: December 24, 2025
- PDF: https://arxiv.org/pdf/2512.21329v1