[Paper] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Source: arXiv - 2511.21471v1
Overview
The paper introduces SpatialBench, the first large‑scale benchmark that evaluates how well multimodal large language models (MLLMs) understand and reason about space. By breaking spatial cognition into a hierarchy of five levels—from raw perception to strategic planning—the authors expose where current models excel and where they still fall short, offering a roadmap for building truly spatially aware AI systems.
Key Contributions
- Hierarchical spatial cognition framework: Defines five progressive levels (observation → grounding → symbolic reasoning → causal inference → planning) that capture the full spectrum of spatial intelligence; a code sketch of this hierarchy follows the list.
- SpatialBench benchmark: 15 carefully curated multimodal tasks (image‑text, video‑text, 3‑D scenes) aligned with the hierarchy, providing fine‑grained coverage of real‑world spatial scenarios.
- Capability‑oriented metric: A unified scoring system that aggregates performance across heterogeneous tasks while preserving the hierarchical structure.
- Comprehensive evaluation: Benchmarks dozens of state‑of‑the‑art MLLMs, revealing systematic strengths (perceptual grounding) and weaknesses (symbolic reasoning, planning).
- Human vs. model analysis: Shows that humans perform selective, goal‑directed abstraction, whereas models tend to over‑focus on surface details, highlighting a gap in intentional spatial reasoning.
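As a rough illustration of the hierarchy described above (not the authors' code), the five levels and a benchmark task record could be modeled as follows; the class names, numeric level values, and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class SpatialLevel(IntEnum):
    """Five-level hierarchy of spatial cognition, ordered from low-level
    perception to high-level planning (the numeric values are an illustrative
    convention, not taken from the paper)."""
    OBSERVATION = 1         # raw perception of the scene
    GROUNDING = 2           # localizing and naming objects
    SYMBOLIC_REASONING = 3  # relations such as "left of", "inside"
    CAUSAL_INFERENCE = 4    # how actions change the spatial layout
    PLANNING = 5            # multi-step navigation / manipulation plans


@dataclass
class Task:
    """One benchmark task, tagged with its cognitive level and modality."""
    name: str
    level: SpatialLevel
    modality: str  # e.g. "image-text", "video-text", "3d-scene"


# Example: a Level-5 task of the kind described in the paper.
route_planning = Task("plan a navigation route", SpatialLevel.PLANNING, "3d-scene")
```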
Methodology
- Taxonomy design – The authors consulted cognitive science literature and AI research to define five cognitive levels that reflect increasing abstraction and planning depth.
- Task construction – For each level, they created multiple tasks (e.g., “identify object locations,” “describe spatial relations,” “predict outcome of moving an object,” “plan a navigation route”). Data sources include existing vision‑language datasets, synthetic 3‑D environments, and custom video clips.
- Unified evaluation metric – Individual task scores are normalized and then weighted according to their cognitive level, yielding a single "spatial capability" score that respects the hierarchy (see the aggregation sketch after this list).
- Model testing – Over 30 publicly available MLLMs (e.g., GPT‑4V, LLaVA, Gemini‑Pro Vision) are run on the benchmark using zero‑shot prompts; results are aggregated per level.
- Human baseline – A crowd‑sourced study collects human responses on a subset of tasks, enabling direct comparison with model behavior.
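The level-weighted aggregation can be made concrete with a short sketch. The snippet below averages normalized task scores within each level and combines the per-level means into one capability score; the uniform weights and the example numbers are assumptions for illustration (the scores roughly mirror the reported per-level trend), not the paper's actual configuration.

```python
from collections import defaultdict

# Ordered level names, mirroring the five-level hierarchy.
LEVELS = ["observation", "grounding", "symbolic_reasoning",
          "causal_inference", "planning"]

# Hypothetical weights: the paper uses level-aware weighting, but the exact
# values are not given here, so uniform weights are assumed.
LEVEL_WEIGHTS = {name: 1.0 for name in LEVELS}


def capability_score(task_results):
    """task_results: list of (level_name, normalized_score in [0, 1]) pairs.

    Returns per-level mean accuracy and a single weighted capability score.
    A sketch of the aggregation idea, not the authors' implementation.
    """
    per_level = defaultdict(list)
    for level, score in task_results:
        per_level[level].append(score)

    level_means = {lvl: sum(s) / len(s) for lvl, s in per_level.items()}
    total_weight = sum(LEVEL_WEIGHTS[lvl] for lvl in level_means)
    overall = sum(LEVEL_WEIGHTS[lvl] * m
                  for lvl, m in level_means.items()) / total_weight
    return level_means, overall


# Illustrative numbers in the spirit of the reported per-level results.
results = [("grounding", 0.82), ("symbolic_reasoning", 0.45),
           ("causal_inference", 0.30), ("planning", 0.18)]
print(capability_score(results))
```

Because the hierarchy is preserved, the per-level means are returned alongside the single aggregate, matching the capability-oriented reporting described in the contributions.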
Results & Findings
- Perceptual grounding (Levels 1–2): Most MLLMs achieve >80% accuracy, indicating strong ability to locate and describe objects in images.
- Symbolic reasoning (Level 3): Scores drop to ~45%, showing difficulty in manipulating spatial symbols (e.g., "left of," "inside").
- Causal inference (Level 4): Performance hovers around 30%, reflecting limited understanding of how actions alter spatial configurations.
- Planning (Level 5): The hardest tier, with the best models scoring <20%, meaning they cannot reliably generate multi‑step navigation or manipulation plans.
- Human vs. model: Humans consistently ignore irrelevant visual clutter and focus on task‑relevant spatial cues, while models often “over‑attend” to details, leading to noisy or contradictory answers.
Practical Implications
- Robotics & autonomous agents: SpatialBench highlights that current MLLMs are not ready for high‑level planning tasks like robot navigation or manipulation without additional reasoning modules.
- AR/VR content creation: Developers can rely on MLLMs for quick object detection and description but should not expect them to generate coherent spatial narratives or layout suggestions.
- Geospatial analytics: The benchmark can serve as a diagnostic tool to choose the right model for tasks such as satellite image annotation versus complex terrain reasoning.
- Product roadmaps: Companies building multimodal assistants can use the hierarchical scores to prioritize research—e.g., adding symbolic reasoning layers or integrating external physics engines to boost causal inference.
Limitations & Future Work
- Dataset bias: Many tasks rely on synthetic or curated scenes; real‑world clutter and lighting variations may affect generalization.
- Prompt dependence: Zero‑shot performance can be highly sensitive to prompt phrasing; systematic prompt engineering was not explored.
- Metric granularity: While the capability‑oriented metric aggregates scores, it may mask nuanced failure modes within a level.
- Future directions: The authors suggest expanding SpatialBench to 3‑D video, incorporating interactive evaluation (e.g., embodied agents), and exploring hybrid architectures that combine LLM reasoning with dedicated spatial modules.
Authors
- Peiran Xu
- Sudong Wang
- Yao Zhu
- Jianing Li
- Yunjian Zhang
Paper Information
- arXiv ID: 2511.21471v1
- Categories: cs.AI
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21471v1