[Paper] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Source: arXiv - 2511.21471v1
Overview
The paper introduces SpatialBench, the first large‑scale benchmark that evaluates how well multimodal large language models (MLLMs) understand and reason about space. By breaking spatial cognition into a hierarchy of five levels—from raw perception to strategic planning—the authors expose where current models excel and where they still fall short, offering a roadmap for building truly spatially aware AI systems.
Key Contributions
- Hierarchical spatial cognition framework: Defines five progressive levels (observation → grounding → symbolic reasoning → causal inference → planning) that capture the full spectrum of spatial intelligence; a code sketch of this hierarchy follows the list.
- SpatialBench benchmark: 15 carefully curated multimodal tasks (image‑text, video‑text, 3‑D scenes) aligned with the hierarchy, providing fine‑grained coverage of real‑world spatial scenarios.
- Capability‑oriented metric: A unified scoring system that aggregates performance across heterogeneous tasks while preserving the hierarchical structure.
- Comprehensive evaluation: Benchmarks dozens of state‑of‑the‑art MLLMs, revealing systematic strengths (perceptual grounding) and weaknesses (symbolic reasoning, planning).
- Human vs. model analysis: Shows that humans perform selective, goal‑directed abstraction, whereas models tend to over‑focus on surface details, highlighting a gap in intentional spatial reasoning.
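As a rough illustration of the hierarchy described above (not the authors' code), the five levels and a benchmark task record could be modeled as follows; the class names, numeric level values, and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class SpatialLevel(IntEnum):
    """Five-level hierarchy of spatial cognition, ordered from low-level
    perception to high-level planning (the numeric values are an illustrative
    convention, not taken from the paper)."""
    OBSERVATION = 1         # raw perception of the scene
    GROUNDING = 2           # localizing and naming objects
    SYMBOLIC_REASONING = 3  # relations such as "left of", "inside"
    CAUSAL_INFERENCE = 4    # how actions change the spatial layout
    PLANNING = 5            # multi-step navigation / manipulation plans


@dataclass
class Task:
    """One benchmark task, tagged with its cognitive level and modality."""
    name: str
    level: SpatialLevel
    modality: str  # e.g. "image-text", "video-text", "3d-scene"


# Example: a Level-5 task of the kind described in the paper.
route_planning = Task("plan a navigation route", SpatialLevel.PLANNING, "3d-scene")
```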
Methodology
- Taxonomy design – The authors consulted cognitive science literature and AI research to define five cognitive levels that reflect increasing abstraction and planning depth.
- Task construction – For each level, they created multiple tasks (e.g., “identify object locations,” “describe spatial relations,” “predict outcome of moving an object,” “plan a navigation route”). Data sources include existing vision‑language datasets, synthetic 3‑D environments, and custom video clips.
- Unified evaluation metric – Individual task scores are normalized and then weighted according to their cognitive level, yielding a single "spatial capability" score that respects the hierarchy (see the aggregation sketch after this list).
- Model testing – Over 30 publicly available MLLMs (e.g., GPT‑4V, LLaVA, Gemini‑Pro Vision) are run on the benchmark using zero‑shot prompts; results are aggregated per level.
- Human baseline – A crowd‑sourced study collects human responses on a subset of tasks, enabling direct comparison with model behavior.
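The level-weighted aggregation can be made concrete with a short sketch. The snippet below averages normalized task scores within each level and combines the per-level means into one capability score; the uniform weights and the example numbers are assumptions for illustration (the scores roughly mirror the reported per-level trend), not the paper's actual configuration.

```python
from collections import defaultdict

# Ordered level names, mirroring the five-level hierarchy.
LEVELS = ["observation", "grounding", "symbolic_reasoning",
          "causal_inference", "planning"]

# Hypothetical weights: the paper uses level-aware weighting, but the exact
# values are not given here, so uniform weights are assumed.
LEVEL_WEIGHTS = {name: 1.0 for name in LEVELS}


def capability_score(task_results):
    """task_results: list of (level_name, normalized_score in [0, 1]) pairs.

    Returns per-level mean accuracy and a single weighted capability score.
    A sketch of the aggregation idea, not the authors' implementation.
    """
    per_level = defaultdict(list)
    for level, score in task_results:
        per_level[level].append(score)

    level_means = {lvl: sum(s) / len(s) for lvl, s in per_level.items()}
    total_weight = sum(LEVEL_WEIGHTS[lvl] for lvl in level_means)
    overall = sum(LEVEL_WEIGHTS[lvl] * m
                  for lvl, m in level_means.items()) / total_weight
    return level_means, overall


# Illustrative numbers in the spirit of the reported per-level results.
results = [("grounding", 0.82), ("symbolic_reasoning", 0.45),
           ("causal_inference", 0.30), ("planning", 0.18)]
print(capability_score(results))
```

Because the hierarchy is preserved, the per-level means are returned alongside the single aggregate, matching the capability-oriented reporting described in the contributions.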
Results & Findings
- Perceptual grounding (Levels 1–2): Most MLLMs achieve >80% accuracy, indicating strong ability to locate and describe objects in images.
- Symbolic reasoning (Level 3): Scores drop to ~45%, showing difficulty in manipulating spatial symbols (e.g., "left of," "inside").
- Causal inference (Level 4): Performance hovers around 30%, reflecting limited understanding of how actions alter spatial configurations.
- Planning (Level 5): The hardest tier, with the best models scoring <20%, meaning they cannot reliably generate multi‑step navigation or manipulation plans.
- Human vs. model: Humans consistently ignore irrelevant visual clutter and focus on task‑relevant spatial cues, while models often “over‑attend” to details, leading to noisy or contradictory answers.
Practical Implications
- Robotics & autonomous agents: SpatialBench highlights that current MLLMs are not ready for high‑level planning tasks like robot navigation or manipulation without additional reasoning modules.
- AR/VR content creation: Developers can rely on MLLMs for quick object detection and description but should not expect them to generate coherent spatial narratives or layout suggestions.
- Geospatial analytics: The benchmark can serve as a diagnostic tool to choose the right model for tasks such as satellite image annotation versus complex terrain reasoning.
- Product roadmaps: Companies building multimodal assistants can use the hierarchical scores to prioritize research—e.g., adding symbolic reasoning layers or integrating external physics engines to boost causal inference.
Limitations & Future Work
- Dataset bias: Many tasks rely on synthetic or curated scenes; real‑world clutter and lighting variations may affect generalization.
- Prompt dependence: Zero‑shot performance can be highly sensitive to prompt phrasing; systematic prompt engineering was not explored.
- Metric granularity: While the capability‑oriented metric aggregates scores, it may mask nuanced failure modes within a level.
- Future directions: The authors suggest expanding SpatialBench to 3‑D video, incorporating interactive evaluation (e.g., embodied agents), and exploring hybrid architectures that combine LLM reasoning with dedicated spatial modules.
Authors
- Peiran Xu
- Sudong Wang
- Yao Zhu
- Jianing Li
- Yunjian Zhang
Paper Information
- arXiv ID: 2511.21471v1
- Categories: cs.AI
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21471v1