[Paper] SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Published: June 8, 2026 at 11:51 AM EDT

Source: arXiv - 2606.09669v1

Overview

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
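The evaluation protocol described above (a validated initial state, a reference trajectory, and a terminal-state verifier per task, scored by task success rate) can be sketched roughly as follows. This is only an illustrative sketch, not SpatialWorld's actual code: the `Task` and `Episode` classes, their field names, the dictionary-based state, and both helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state representation: a flat mapping of symbolic facts.
State = Dict[str, str]


@dataclass
class Task:
    """Hypothetical task record mirroring the three parts named in the abstract."""
    task_id: str
    domain: str                          # e.g. "household", "travel"
    initial_state: State                 # human-validated starting state
    reference_trajectory: List[str]      # reference solution as text actions
    verifier: Callable[[State], bool]    # terminal-state check


@dataclass
class Episode:
    """One agent attempt: the text actions it issued and the state it ended in."""
    task: Task
    actions: List[str]
    final_state: State


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes whose final state passes the task's verifier."""
    if not episodes:
        return 0.0
    passed = sum(1 for ep in episodes if ep.task.verifier(ep.final_state))
    return passed / len(episodes)


def per_domain_tsr(episodes: List[Episode]) -> Dict[str, float]:
    """Task success rate grouped by task domain."""
    by_domain: Dict[str, List[Episode]] = {}
    for ep in episodes:
        by_domain.setdefault(ep.task.domain, []).append(ep)
    return {domain: task_success_rate(eps) for domain, eps in by_domain.items()}
```

Scoring only the terminal state, as assumed here, means an agent is not penalized for deviating from the reference trajectory; a separate efficiency measure would be needed to capture the success-versus-efficiency mismatch the abstract mentions.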

Key Contributions

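According to the abstract, the paper contributes:

A unified benchmark of 760 human-annotated tasks spanning eight heterogeneous simulation backends under a shared, simulator-agnostic protocol
A vision-only, partially observable setting in which agents act through a unified, text-based action interface
An evaluation of 15 agents, with a best average task success rate of 17.4% (GPT-5)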

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

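The results identify active exploration and long-horizon planning as the main bottlenecks for current multimodal agents, and the authors position SpatialWorld as a testbed for future spatial agents.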

Authors

Hongcheng Gao
Hailong Qu
Jingyi Tang
Jiahao Wang
Zihao Huang
Hengkang Qiao
Shihong Huang
Junming Yang
Yi Li
Hongyixuan Yuan
Wenjie Li
Bohan Zeng
Wenbo Li
Bo Wang
Jianhui Liu
Olive Huang
Haoyang Huang
Wentao Zhang
Guoqing Huang
Nan Duan
Yinpeng Dong

Paper Information

arXiv ID: 2606.09669v1
Categories: cs.AI, cs.CL
Published: June 8, 2026
