[Paper] MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Published: February 2, 2026 at 01:49 PM EST
4 min read

Source: arXiv - 2602.02465v1

Overview

The paper introduces MentisOculi, a benchmark designed to test whether modern multimodal models can think with visual imagery the way humans do—forming, holding, and manipulating mental pictures to aid multi‑step reasoning. By probing state‑of‑the‑art unified multimodal models (UMMs) and large language models with visual extensions, the authors reveal that current visual “thoughts” rarely improve problem‑solving performance.

Key Contributions

  • MentisOculi benchmark – a procedurally generated, stratified suite of multi‑step reasoning tasks that can be solved either purely textually or with intermediate visualizations.
  • Comprehensive evaluation of a wide range of visual strategies, from latent token‑based “mental images” to explicit image generation, across several frontier models (e.g., GPT‑4V, LLaVA, Gemini).
  • Empirical finding that visual intermediates do not boost reasoning accuracy; in many cases they even degrade performance due to error compounding.
  • Diagnostic analysis showing that UMMs can often generate correct final answers and plausible visuals, yet they fail to integrate the two—e.g., they cannot leverage ground‑truth visualizations to improve textual reasoning.
  • Open‑source release of the benchmark code and a set of diagnostic tools to help the community measure and close the gap between visual generation and visual reasoning.

Methodology

  1. Task Design – Each problem is a multi‑step logical puzzle (e.g., geometry, spatial planning, diagram‑based deduction) that admits a visual solution. Tasks are grouped into difficulty tiers and automatically generated to ensure diversity and scalability.
  2. Model Variants – The authors test three families:
    • Pure LLMs (text‑only).
    • Latent‑visual models that keep an internal visual token stream (no explicit image output).
    • Explicit‑visual models that generate an image at each reasoning step.
  3. Prompting Protocol – For each step, models receive a prompt describing the current sub‑goal and, when applicable, the previously generated visual (or a ground‑truth visual for ablation).
  4. Evaluation Metrics – Accuracy of the final answer, visual fidelity (when an image is produced), and a visual‑integration score that measures whether the visual actually influences the subsequent textual reasoning (a minimal harness sketch follows this list).
  5. Error Analysis – The authors trace failure modes: token drift in latent representations, image generation artifacts, and the inability of the language component to condition on visual inputs.
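To make the protocol concrete, the sketch below shows a minimal evaluation harness in the spirit of steps 3–5: it runs one condition (text‑only, model‑generated visual, or the ground‑truth‑visual ablation) and reports final‑answer accuracy per difficulty tier. The Task fields, the solve callback, and all function names here are illustrative assumptions for this summary, not the benchmark's released API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Task:
    # Hypothetical task record; field names are illustrative, not MentisOculi's schema.
    prompt: str                        # textual description of the multi-step puzzle
    tier: int                          # difficulty tier used for stratification
    answer: str                        # ground-truth final answer
    gt_visual: Optional[bytes] = None  # ground-truth visualization for the ablation

# `solve` stands in for one model call: (task, optional visual intermediate)
# -> (final answer, visual the model produced at this step, if any).
SolveFn = Callable[[Task, Optional[bytes]], Tuple[str, Optional[bytes]]]

def evaluate(tasks: list, solve: SolveFn, use_gt_visual: bool = False) -> dict:
    """Run one evaluation condition and report final-answer accuracy per tier."""
    correct: dict = {}
    total: dict = {}
    for task in tasks:
        visual_in = task.gt_visual if use_gt_visual else None
        pred, _model_visual = solve(task, visual_in)
        total[task.tier] = total.get(task.tier, 0) + 1
        correct[task.tier] = correct.get(task.tier, 0) + int(pred.strip() == task.answer)
    return {tier: correct[tier] / n for tier, n in total.items()}
```

Comparing the per‑tier numbers from a text‑only run, a run with model‑generated visuals, and the ground‑truth‑visual ablation mirrors the comparisons reported in the results table below.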

Results & Findings

| Model family | Final answer accuracy (no visual) | Accuracy with latent visual | Accuracy with explicit visual |
|---|---|---|---|
| LLM‑only | 68 % | n/a | n/a |
| Latent‑visual | 70 % | 62 % (drop) | n/a |
| Explicit‑visual | 71 % | n/a | 58 % (drop) |
  • No performance gain from adding visual steps; in fact, both latent and explicit visual strategies suffer a 10–15 % drop in accuracy.
  • When fed ground‑truth visualizations, UMMs still fail to improve, indicating a disconnect between the visual encoder and the reasoning engine.
  • Visual outputs are often plausible (high image quality) but misaligned with the logical state needed for the next reasoning step, leading to compounding errors.
  • The benchmark’s stratification shows the gap widens on higher‑difficulty tiers, where multi‑step planning is more demanding.

Practical Implications

  • Tooling for developers – If you’re building AI assistants that rely on “thinking with pictures” (e.g., code‑to‑diagram generators, design assistants, robotics planners), this work warns that current UMMs won’t reliably use intermediate images to boost reasoning.
  • Model integration – Systems that chain a language model with a separate vision module (e.g., generate a diagram, then ask the LLM to interpret it) may need explicit hand‑off mechanisms rather than trusting the model to self‑coordinate (see the sketch after this list).
  • Benchmark adoption – MentisOculi can serve as a regression test for any new multimodal architecture that claims visual reasoning capabilities, helping teams spot integration bugs early.
  • Product roadmaps – Companies aiming for truly multimodal agents should prioritize joint training or cross‑modal attention mechanisms that tightly bind visual and textual pathways, rather than treating visual generation as an afterthought.
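As an illustration of the hand‑off point above, the sketch below chains a diagram generator and a text‑only LLM through an explicit, inspectable intermediate step instead of trusting a unified model to condition on its own generated image. The callbacks render_diagram, describe_diagram, and llm_answer are placeholders for whatever tooling a team actually uses; this is a design sketch, not the paper's method.

```python
from typing import Callable

def solve_with_explicit_handoff(problem: str,
                                render_diagram: Callable[[str], bytes],
                                describe_diagram: Callable[[bytes], str],
                                llm_answer: Callable[[str], str]) -> str:
    """Chain vision and language steps with an explicit, verifiable hand-off."""
    diagram = render_diagram(problem)            # 1. produce the visual artifact
    structured_view = describe_diagram(diagram)  # 2. convert it to text the LLM can condition on
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Diagram description:\n{structured_view}\n\n"
        "Solve the problem step by step and state the final answer."
    )
    return llm_answer(prompt)                    # 3. text-only reasoning over the hand-off
```

Because the intermediate description is ordinary text, it can be logged, validated, or regenerated when it drifts from the logical state, which is exactly the misalignment failure mode the benchmark highlights.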

Limitations & Future Work

  • The benchmark focuses on synthetic, geometry‑style problems; real‑world domains (e.g., medical imaging, CAD) may exhibit different dynamics.
  • Only a limited set of UMMs was evaluated; newer models released after the study could behave differently.
  • The analysis does not explore fine‑tuning strategies that explicitly teach models to condition on visual feedback—an avenue the authors suggest for future research.
  • Extending MentisOculi to interactive visual reasoning (e.g., agents that can edit images) and measuring efficiency (the compute cost of visual steps) are identified as next steps.

Authors

  • Jana Zeller
  • Thaddäus Wiedemer
  • Fanfei Li
  • Thomas Klein
  • Prasanna Mayilvahanan
  • Matthias Bethge
  • Felix Wichmann
  • Ryan Cotterell
  • Wieland Brendel

Paper Information

  • arXiv ID: 2602.02465v1
  • Categories: cs.AI, cs.CV, cs.LG
  • Published: February 2, 2026