[Paper] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Source: arXiv - 2512.21329v1
Overview
The paper Your Reasoning Benchmark May Not Test Reasoning examines why modern vision‑language models (VLMs) stumble on ARC‑style abstract‑reasoning suites such as Mini‑ARC, ACRE, and Bongard‑LOGO. Rather than attributing these failures to "weak reasoning," the authors show that most errors stem from the models' inability to perceive the visual input accurately. By explicitly separating perception from reasoning, they reveal a hidden bottleneck that inflates the apparent gap between human and machine reasoning abilities.
Key Contributions
- Two‑stage evaluation pipeline that first converts each image into a natural‑language description (perception) and then runs a rule‑induction model on the textual descriptions (reasoning).
- Systematic comparison of the two‑stage pipeline against traditional end‑to‑end VLMs across three ARC‑style benchmarks, quantifying the relative impact of perception vs. reasoning.
- Empirical evidence that ≈ 80 % of failures in VLMs are traceable to perception errors, not to flawed logical inference.
- Critical analysis of why current abstract‑reasoning benchmarks conflate visual perception and logical reasoning, calling for redesigned evaluation protocols.
Methodology
- Dataset Selection – The authors work with three widely used abstract‑reasoning datasets: Mini‑ARC, ACRE, and Bongard‑LOGO. Each task presents a set of input images and asks the model to infer the underlying rule and produce the correct answer.
- Perception Stage – For every image, a strong vision encoder (e.g., CLIP‑ViT or a fine‑tuned object detector) generates a concise natural‑language caption describing shapes, colors, spatial relations, etc. This step is performed independently for each image, guaranteeing no cross‑image leakage.
- Reasoning Stage – A language‑only model (e.g., GPT‑4 or a fine‑tuned T5) receives the textual descriptions of the inputs (and of the target outputs, when available) and is tasked with inferring the underlying rule and applying it to produce a description of the answer; a sketch of this two‑stage setup follows this list.
- Baseline Comparison – The same tasks are also solved with a conventional end‑to‑end VLM that directly maps raw pixels to the answer image, representing the “one‑stage” approach used in most prior work.
- Error Analysis – The authors manually inspect reasoning traces (the chain‑of‑thought generated by the language model) to categorize failures as perception‑related or reasoning‑related.
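Below is a minimal sketch of how such a two‑stage evaluation could be wired together. The `Captioner` and `TextReasoner` interfaces, the `Task` structure, and the prompt wording are hypothetical stand‑ins, not the authors' implementation; the sketch only illustrates the strict separation of perception and reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any image captioner and any text-only model
# with these signatures could stand in for the perception and reasoning modules.
Captioner = Callable[[bytes], str]      # image bytes -> natural-language description
TextReasoner = Callable[[str], str]     # text prompt -> predicted answer description


@dataclass
class Task:
    input_images: List[bytes]   # raw bytes of the task's panels
    target_description: str     # ground-truth answer, expressed as text


def solve_two_stage(task: Task, caption: Captioner, reason: TextReasoner) -> str:
    """Two-stage evaluation: caption each image independently (perception),
    then ask a language-only model to induce and apply the rule (reasoning)."""
    # Perception stage: each image is captioned in isolation,
    # so no visual information leaks across panels.
    descriptions = [caption(img) for img in task.input_images]

    # Reasoning stage: only text reaches the reasoner.
    prompt = (
        "Each line below describes one panel of an abstract-reasoning task.\n"
        + "\n".join(f"Panel {i + 1}: {d}" for i, d in enumerate(descriptions))
        + "\nInfer the underlying rule and describe the correct output."
    )
    return reason(prompt)


def is_correct(prediction: str, task: Task) -> bool:
    """Placeholder scoring: exact match on normalized text
    (the paper's actual matching criterion may differ)."""
    return prediction.strip().lower() == task.target_description.strip().lower()
```

Any off‑the‑shelf captioner and text‑only model can be dropped into these slots, which is what makes the comparison against the end‑to‑end baseline straightforward to set up.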
Results & Findings
| Benchmark | End‑to‑End VLM Accuracy | Two‑Stage (Perception + Reasoning) Accuracy |
|---|---|---|
| Mini‑ARC | ~12 % | ~45 % (≈ 3.7× boost) |
| ACRE | ~8 % | ~38 % (≈ 4.8× boost) |
| Bongard‑LOGO | ~15 % | ~52 % (≈ 3.5× boost) |
- Perception dominates: When the perception module is strong (high‑quality captions), the reasoning model solves a large fraction of tasks that the end‑to‑end VLM cannot.
- Error breakdown: Manual inspection of 500 failed VLM attempts shows that ~80 % stem from missed or mis‑described visual elements (e.g., a caption that omits a small red triangle); only ~20 % are genuine reasoning mistakes (a minimal tallying sketch follows this list).
- Leakage control: Because each image is captioned independently, no visual information leaks across inputs during perception, confirming that the performance gain comes from better perception rather than from cross‑image shortcuts.
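As a rough illustration of how per‑failure labels can be turned into the reported split, a minimal tallying sketch follows; the example labels are invented for illustration and do not reproduce the paper's annotations.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Aggregate manually assigned failure labels ('perception' or 'reasoning')
    into percentages, mirroring the structure of the paper's error analysis."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Toy usage (labels are invented, not the paper's annotations):
print(failure_breakdown(["perception"] * 4 + ["reasoning"]))
# -> {'perception': 80.0, 'reasoning': 20.0}
```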
Practical Implications
- Benchmark redesign – Developers building AI agents for “general intelligence” should treat ARC‑style suites as perception‑augmented tasks, not pure logic tests. Future benchmarks might provide explicit visual descriptors or separate perception scores.
- Model architecture – Investing in stronger, modular vision encoders (e.g., region‑level detectors, scene graph generators) can yield outsized gains on abstract reasoning problems, often more cost‑effective than scaling up the reasoning component.
- Debugging pipelines – The two‑stage framework offers a clear diagnostic tool: if a model fails, check the caption first (a minimal diagnostic sketch follows this list). This can shorten iteration cycles for VLM developers.
- Transfer learning – High‑quality visual descriptions can be reused across downstream tasks (e.g., program synthesis from screenshots, robotic instruction following), making the perception module a reusable asset.
- Evaluation hygiene – Companies benchmarking their VLMs should report both perception accuracy (caption quality) and reasoning accuracy to avoid over‑claiming “reasoning” capabilities.
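A minimal sketch of the "check the caption first" diagnostic and the dual‑accuracy reporting is shown below; the token‑overlap heuristic and the `diagnose_failure` / `report` helpers are hypothetical conveniences, not part of the paper.

```python
from typing import List


def diagnose_failure(caption: str, reference_description: str) -> str:
    """For a task the model got wrong, check the caption first: if it omits
    elements named in a human-written reference description, attribute the
    failure to perception; otherwise attribute it to reasoning.
    Token overlap is a crude stand-in for manual inspection."""
    missing = sorted(
        tok for tok in set(reference_description.lower().split())
        if tok not in caption.lower()
    )
    if missing:
        return "perception error (caption missing: " + ", ".join(missing[:5]) + ")"
    return "reasoning error"


def report(diagnoses: List[str]) -> None:
    """Report perception- and reasoning-attributed failure rates separately,
    so 'reasoning' claims are not inflated by perception mistakes."""
    n = len(diagnoses)
    perception = sum(d.startswith("perception") for d in diagnoses)
    print(f"perception-attributed failures: {perception / n:.0%}")
    print(f"reasoning-attributed failures:  {(n - perception) / n:.0%}")
```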
Limitations & Future Work
- Caption quality ceiling – The study relies on existing vision models for captioning; any residual perception errors still limit the upper bound of reasoning performance.
- Dataset scope – Only three ARC‑style datasets were examined; other abstract‑reasoning benchmarks (e.g., CLEVR, RAVEN) may exhibit different perception‑reasoning balances.
- Human‑like abstraction – Converting images to text may discard low‑level visual nuances that humans use implicitly; future work could explore richer symbolic representations such as scene graphs or programmatic sketches (a toy encoding is sketched after this list).
- End‑to‑end integration – While modularization clarifies bottlenecks, the ultimate goal remains a unified model that jointly learns perception and reasoning without performance loss; bridging the gap is an open research direction.
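To make "richer symbolic representations" concrete, a toy scene‑graph encoding of a single panel could look like the following; the schema is invented for illustration and is not proposed in the paper.

```python
# A toy scene-graph encoding of one panel: objects carry attributes,
# relations link object ids. Schema is illustrative only.
panel_scene_graph = {
    "objects": [
        {"id": 0, "shape": "triangle", "color": "red",  "size": "small"},
        {"id": 1, "shape": "square",   "color": "blue", "size": "large"},
    ],
    "relations": [
        {"subject": 0, "predicate": "inside", "object": 1},
    ],
}
```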
Authors
- Xinhe Wang
- Jin Huang
- Xingjian Zhang
- Tianhao Wang
- Jiaqi W. Ma
Paper Information
- arXiv ID: 2512.21329v1
- Categories: cs.CL
- Published: December 24, 2025
- PDF: https://arxiv.org/pdf/2512.21329v1