[Paper] MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Published: February 2, 2026 at 01:49 PM EST
4 min read

Source: arXiv - 2602.02465v1

Overview

The paper introduces MentisOculi, a benchmark designed to test whether modern multimodal models can think with visual imagery the way humans do—forming, holding, and manipulating mental pictures to aid multi‑step reasoning. By probing state‑of‑the‑art unified multimodal models (UMMs) and large language models with visual extensions, the authors reveal that current visual “thoughts” rarely improve problem‑solving performance.

Key Contributions

  • MentisOculi benchmark – a procedurally generated, stratified suite of multi‑step reasoning tasks that can be solved either purely textually or with intermediate visualizations.
  • Comprehensive evaluation of a wide range of visual strategies, from latent token‑based “mental images” to explicit image generation, across several frontier models (e.g., GPT‑4V, LLaVA, Gemini).
  • Empirical finding that visual intermediates do not boost reasoning accuracy; in many cases they even degrade performance due to error compounding.
  • Diagnostic analysis showing that UMMs can often generate correct final answers and plausible visuals, yet they fail to integrate the two—e.g., they cannot leverage ground‑truth visualizations to improve textual reasoning.
  • Open‑source release of the benchmark code and a set of diagnostic tools to help the community measure and close the gap between visual generation and visual reasoning.

Methodology

  1. Task Design – Each problem is a multi‑step logical puzzle (e.g., geometry, spatial planning, diagram‑based deduction) that admits a visual solution. Tasks are grouped into difficulty tiers and automatically generated to ensure diversity and scalability.
  2. Model Variants – The authors test three families:
    • Pure LLMs (text‑only).
    • Latent‑visual models that keep an internal visual token stream (no explicit image output).
    • Explicit‑visual models that generate an image at each reasoning step.
  3. Prompting Protocol – For each step, models receive a prompt describing the current sub‑goal and, when applicable, the previously generated visual (or a ground‑truth visual for ablation).
  4. Evaluation Metrics – Accuracy of the final answer, visual fidelity (when an image is produced), and a visual‑integration score that measures whether the visual actually influences the subsequent textual reasoning (a minimal harness sketch follows this list).
  5. Error Analysis – The authors trace failure modes: token drift in latent representations, image generation artifacts, and the inability of the language component to condition on visual inputs.
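To make the protocol concrete, the sketch below shows a minimal evaluation harness in the spirit of steps 3–5: it runs one condition (text‑only, model‑generated visual, or the ground‑truth‑visual ablation) and reports final‑answer accuracy per difficulty tier. The Task fields, the solve callback, and all function names here are illustrative assumptions for this summary, not the benchmark's released API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Task:
    # Hypothetical task record; field names are illustrative, not MentisOculi's schema.
    prompt: str                        # textual description of the multi-step puzzle
    tier: int                          # difficulty tier used for stratification
    answer: str                        # ground-truth final answer
    gt_visual: Optional[bytes] = None  # ground-truth visualization for the ablation

# `solve` stands in for one model call: (task, optional visual intermediate)
# -> (final answer, visual the model produced at this step, if any).
SolveFn = Callable[[Task, Optional[bytes]], Tuple[str, Optional[bytes]]]

def evaluate(tasks: list, solve: SolveFn, use_gt_visual: bool = False) -> dict:
    """Run one evaluation condition and report final-answer accuracy per tier."""
    correct: dict = {}
    total: dict = {}
    for task in tasks:
        visual_in = task.gt_visual if use_gt_visual else None
        pred, _model_visual = solve(task, visual_in)
        total[task.tier] = total.get(task.tier, 0) + 1
        correct[task.tier] = correct.get(task.tier, 0) + int(pred.strip() == task.answer)
    return {tier: correct[tier] / n for tier, n in total.items()}
```

Comparing the per‑tier numbers from a text‑only run, a run with model‑generated visuals, and the ground‑truth‑visual ablation mirrors the comparisons reported in the results table below.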

Results & Findings

| Model family | Final answer accuracy (no visual) | Accuracy with latent visual | Accuracy with explicit visual |
|---|---|---|---|
| LLM‑only | 68 % | n/a | n/a |
| Latent‑visual | 70 % | 62 % (drop) | n/a |
| Explicit‑visual | 71 % | n/a | 58 % (drop) |
  • No performance gain from adding visual steps; in fact, both latent and explicit visual strategies suffer a 10–15 % drop in accuracy.
  • When fed ground‑truth visualizations, UMMs still fail to improve, indicating a disconnect between the visual encoder and the reasoning engine.
  • Visual outputs are often plausible (high image quality) but misaligned with the logical state needed for the next reasoning step, leading to compounding errors.
  • The benchmark’s stratification shows the gap widens on higher‑difficulty tiers, where multi‑step planning is more demanding.

Practical Implications

  • Tooling for developers – If you’re building AI assistants that rely on “thinking with pictures” (e.g., code‑to‑diagram generators, design assistants, robotics planners), this work warns that current UMMs won’t reliably use intermediate images to boost reasoning.
  • Model integration – Systems that chain a language model with a separate vision module (e.g., generate a diagram, then ask the LLM to interpret it) may need explicit hand‑off mechanisms rather than trusting the model to self‑coordinate (see the sketch after this list).
  • Benchmark adoption – MentisOculi can serve as a regression test for any new multimodal architecture that claims visual reasoning capabilities, helping teams spot integration bugs early.
  • Product roadmaps – Companies aiming for truly multimodal agents should prioritize joint training or cross‑modal attention mechanisms that tightly bind visual and textual pathways, rather than treating visual generation as an afterthought.
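As an illustration of the hand‑off point above, the sketch below chains a diagram generator and a text‑only LLM through an explicit, inspectable intermediate step instead of trusting a unified model to condition on its own generated image. The callbacks render_diagram, describe_diagram, and llm_answer are placeholders for whatever tooling a team actually uses; this is a design sketch, not the paper's method.

```python
from typing import Callable

def solve_with_explicit_handoff(problem: str,
                                render_diagram: Callable[[str], bytes],
                                describe_diagram: Callable[[bytes], str],
                                llm_answer: Callable[[str], str]) -> str:
    """Chain vision and language steps with an explicit, verifiable hand-off."""
    diagram = render_diagram(problem)            # 1. produce the visual artifact
    structured_view = describe_diagram(diagram)  # 2. convert it to text the LLM can condition on
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Diagram description:\n{structured_view}\n\n"
        "Solve the problem step by step and state the final answer."
    )
    return llm_answer(prompt)                    # 3. text-only reasoning over the hand-off
```

Because the intermediate description is ordinary text, it can be logged, validated, or regenerated when it drifts from the logical state, which is exactly the misalignment failure mode the benchmark highlights.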

Limitations & Future Work

  • The benchmark focuses on synthetic, geometry‑style problems; real‑world domains (e.g., medical imaging, CAD) may exhibit different dynamics.
  • Only a limited set of UMMs was evaluated; newer models released after the study could behave differently.
  • The analysis does not explore fine‑tuning strategies that explicitly teach models to condition on visual feedback—an avenue the authors suggest for future research.
  • Extending MentisOculi to interactive visual reasoning (e.g., agents that can edit images) and measuring efficiency (the compute cost of visual steps) are identified as next steps.

Authors

  • Jana Zeller
  • Thaddäus Wiedemer
  • Fanfei Li
  • Thomas Klein
  • Prasanna Mayilvahanan
  • Matthias Bethge
  • Felix Wichmann
  • Ryan Cotterell
  • Wieland Brendel

Paper Information

  • arXiv ID: 2602.02465v1
  • Categories: cs.AI, cs.CV, cs.LG
  • Published: February 2, 2026