[Paper] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Published: November 26, 2025 at 10:04 AM EST
3 min read

Source: arXiv - 2511.21471v1

Overview

The paper introduces SpatialBench, the first large‑scale benchmark that evaluates how well multimodal large language models (MLLMs) understand and reason about space. By breaking spatial cognition into a hierarchy of five levels—from raw perception to strategic planning—the authors expose where current models excel and where they still fall short, offering a roadmap for building truly spatially aware AI systems.

Key Contributions

  • Hierarchical spatial cognition framework: Defines five progressive levels (observation → grounding → symbolic reasoning → causal inference → planning) that capture the full spectrum of spatial intelligence (sketched in code after this list).
  • SpatialBench benchmark: 15 carefully curated multimodal tasks (image‑text, video‑text, 3‑D scenes) aligned with the hierarchy, providing fine‑grained coverage of real‑world spatial scenarios.
  • Capability‑oriented metric: A unified scoring system that aggregates performance across heterogeneous tasks while preserving the hierarchical structure.
  • Comprehensive evaluation: Benchmarks dozens of state‑of‑the‑art MLLMs, revealing systematic strengths (perceptual grounding) and weaknesses (symbolic reasoning, planning).
  • Human vs. model analysis: Shows that humans perform selective, goal‑directed abstraction, whereas models tend to over‑focus on surface details, highlighting a gap in intentional spatial reasoning.
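
As a concrete reference for the hierarchy above, here is a minimal Python sketch of the five levels as an ordered data structure. The level names follow the list; the task groupings are illustrative assumptions drawn from the example tasks mentioned below, not the actual 15 SpatialBench tasks.

```python
from enum import IntEnum


class SpatialLevel(IntEnum):
    """Five-level hierarchy of spatial cognition from the paper's taxonomy.

    The numeric order encodes increasing abstraction and planning depth.
    """
    OBSERVATION = 1         # raw perception of the scene
    GROUNDING = 2           # locating and naming objects in space
    SYMBOLIC_REASONING = 3  # manipulating relations such as "left of", "inside"
    CAUSAL_INFERENCE = 4    # predicting how actions change the configuration
    PLANNING = 5            # multi-step navigation or manipulation plans


# Hypothetical mapping of example tasks to levels (names are illustrative,
# not the benchmark's real task list).
TASKS_BY_LEVEL = {
    SpatialLevel.GROUNDING: ["identify_object_locations"],
    SpatialLevel.SYMBOLIC_REASONING: ["describe_spatial_relations"],
    SpatialLevel.CAUSAL_INFERENCE: ["predict_outcome_of_moving_object"],
    SpatialLevel.PLANNING: ["plan_navigation_route"],
}
```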

Methodology

  1. Taxonomy design – The authors consulted cognitive science literature and AI research to define five cognitive levels that reflect increasing abstraction and planning depth.
  2. Task construction – For each level, they created multiple tasks (e.g., “identify object locations,” “describe spatial relations,” “predict outcome of moving an object,” “plan a navigation route”). Data sources include existing vision‑language datasets, synthetic 3‑D environments, and custom video clips.
  3. Unified evaluation metric – Individual task scores are normalized and then weighted according to their cognitive level, yielding a single “spatial capability” score that respects the hierarchy (see the sketch after this list).
  4. Model testing – Over 30 publicly available MLLMs (e.g., GPT‑4V, LLaVA, Gemini‑Pro Vision) are run on the benchmark using zero‑shot prompts; results are aggregated per level.
  5. Human baseline – A crowd‑sourced study collects human responses on a subset of tasks, enabling direct comparison with model behavior.
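
The summary describes the capability-oriented metric only as "normalize, then weight by cognitive level." The sketch below is one possible reading under that assumption: per-task scores already in [0, 1] are averaged within each level and then combined with level-dependent weights. The function name, the linear weights, and the example scores are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def capability_score(task_scores, task_levels, level_weights):
    """Aggregate heterogeneous task scores into one spatial-capability score.

    task_scores   : dict mapping task name -> score normalized to [0, 1]
    task_levels   : dict mapping task name -> cognitive level (1..5)
    level_weights : dict mapping level -> weight reflecting its hierarchical depth

    The exact weighting scheme is an assumption; the summary only states that
    task scores are normalized and weighted by cognitive level.
    """
    # Average the normalized scores of all tasks belonging to each level.
    per_level = defaultdict(list)
    for task, score in task_scores.items():
        per_level[task_levels[task]].append(score)
    level_means = {lvl: sum(s) / len(s) for lvl, s in per_level.items()}

    # Weighted average across levels preserves the hierarchical structure.
    total_weight = sum(level_weights[lvl] for lvl in level_means)
    return sum(level_weights[lvl] * m for lvl, m in level_means.items()) / total_weight


# Example with made-up scores that mirror the trend reported in Results & Findings.
scores = {"identify_object_locations": 0.85, "describe_spatial_relations": 0.45,
          "predict_outcome_of_moving_object": 0.30, "plan_navigation_route": 0.18}
levels = {"identify_object_locations": 2, "describe_spatial_relations": 3,
          "predict_outcome_of_moving_object": 4, "plan_navigation_route": 5}
weights = {lvl: lvl for lvl in range(1, 6)}  # deeper levels weigh more (assumption)
print(f"spatial capability: {capability_score(scores, levels, weights):.3f}")
```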

Results & Findings

  • Perceptual grounding (Levels 1‑2): Most MLLMs achieve >80 % accuracy, indicating strong ability to locate and describe objects in images.
  • Symbolic reasoning (Level 3): Scores drop to ~45 %, showing difficulty in manipulating spatial symbols (e.g., “left of,” “inside”).
  • Causal inference (Level 4): Performance hovers around 30 %, reflecting limited understanding of how actions alter spatial configurations.
  • Planning (Level 5): The hardest tier, with best models scoring <20 %, meaning they cannot reliably generate multi‑step navigation or manipulation plans.
  • Human vs. model: Humans consistently ignore irrelevant visual clutter and focus on task‑relevant spatial cues, while models often “over‑attend” to details, leading to noisy or contradictory answers.

Practical Implications

  • Robotics & autonomous agents: SpatialBench highlights that current MLLMs are not ready for high‑level planning tasks like robot navigation or manipulation without additional reasoning modules.
  • AR/VR content creation: Developers can rely on MLLMs for quick object detection and description but should not expect them to generate coherent spatial narratives or layout suggestions.
  • Geospatial analytics: The benchmark can serve as a diagnostic tool to choose the right model for tasks such as satellite image annotation versus complex terrain reasoning.
  • Product roadmaps: Companies building multimodal assistants can use the hierarchical scores to prioritize research—e.g., adding symbolic reasoning layers or integrating external physics engines to boost causal inference.

Limitations & Future Work

  • Dataset bias: Many tasks rely on synthetic or curated scenes; real‑world clutter and lighting variations may affect generalization.
  • Prompt dependence: Zero‑shot performance can be highly sensitive to prompt phrasing; systematic prompt engineering was not explored.
  • Metric granularity: While the capability‑oriented metric aggregates scores, it may mask nuanced failure modes within a level.
  • Future directions: The authors suggest expanding SpatialBench to 3‑D video, incorporating interactive evaluation (e.g., embodied agents), and exploring hybrid architectures that combine LLM reasoning with dedicated spatial modules.

Authors

  • Peiran Xu
  • Sudong Wang
  • Yao Zhu
  • Jianing Li
  • Yunjian Zhang

Paper Information

  • arXiv ID: 2511.21471v1
  • Categories: cs.AI
  • Published: November 26, 2025