[Paper] SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
Source: arXiv - 2605.27367v1
Overview
Spatial foundation models, large neural networks that understand 3‑D scenes, have posted impressive results on benchmark datasets. But can they handle the messiness of real‑world applications, from varying viewpoints to low‑resolution sensor data and tight hardware budgets? The paper "SpatialBench: Is Your Spatial Foundation Model an All‑Round Player?" introduces a benchmark of 19 datasets and 546 scenes, built on a deterministic sampling pipeline, that stress‑tests these models across domains, tasks, and input densities.
Key Contributions
- SpatialBench benchmark: 19 publicly available datasets, 546 distinct scenes, spanning 5 spatial domains (e.g., indoor, outdoor, aerial, embodied, egocentric).
- Deterministic sampling pipeline: Guarantees reproducible results across runs and eliminates hidden randomness that can mask true model capabilities.
- Comprehensive evaluation matrix: 41 models covering 6 architectural paradigms (full‑context attention, bounded‑memory attention, voxel‑based, point‑cloud, hybrid, and transformer‑CNN hybrids) tested on 5 task suites (scene reconstruction, pose estimation, navigation, semantic mapping, and embodied interaction) under 4 input‑density regimes.
- DA‑Next‑5M dataset: A newly curated 5‑million‑frame collection targeting the “large‑scale data gap” identified in prior work.
- DA‑Next baseline: A strong, open‑source model trained on DA‑Next‑5M that sets a new performance reference for future spatial foundation research.
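The deterministic sampling idea can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `deterministic_frame_indices`, the scene ids, and the choice of a SHA-256-derived seed are all assumptions made for this example.

```python
import hashlib
import random


def deterministic_frame_indices(scene_id: str, num_frames: int, k: int) -> list[int]:
    """Pick k frame indices for a scene, identically on every run."""
    if not 0 < k <= num_frames:
        raise ValueError("k must be in 1..num_frames")
    # Derive the seed from the scene identifier, so the selection depends
    # only on the scene and not on global RNG state or call order.
    digest = hashlib.sha256(scene_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k))


# The selection can be computed once and stored as a fixed manifest.
manifest = {
    scene: deterministic_frame_indices(scene, num_frames=300, k=8)
    for scene in ("scene_0001", "scene_0002")  # hypothetical scene ids
}
```

Because the seed depends only on the scene id, two evaluators get the same frames without sharing any state. Storing the computed manifest alongside the benchmark is the safer option in practice, since the output of `random.sample` for a given seed is not guaranteed to stay the same across Python versions.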
Methodology
- Deterministic data preparation – Instead of random frame selection, the authors pre‑compute a fixed set of camera poses and point‑cloud samplings for every scene. This removes stochastic variance and makes cross‑paper comparisons fair.
- Cross‑paradigm coverage – Models are grouped by how they handle spatial context (e.g., unlimited attention vs. sliding‑window memory). Each group is evaluated with the same inputs, so differences reflect architectural choices rather than data preprocessing.
- Multi‑density testing – Input point clouds are down‑sampled to 0.5 %, 1 %, 5 % and 10 % of the original density, mimicking low‑cost LiDAR or monocular depth sensors.
- Task suites – The benchmark runs each model through:
  - Reconstruction (recovering full 3‑D geometry),
  - Pose estimation (camera localization),
  - Navigation (path planning in simulated agents),
  - Semantic mapping (labeling scene parts), and
  - Embodied interaction (real‑time decision making for VR/AR agents).
- Metrics – Accuracy (e.g., Chamfer distance, pose error), latency, memory footprint, and energy consumption are logged to capture both performance and deployment cost.
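To make the multi‑density regimes and the Chamfer metric concrete, here is a small standard‑library sketch. It is an illustration under assumptions, not the benchmark's code: the helper names `downsample` and `chamfer_distance`, the synthetic point cloud, and the seed values are invented for this example, and the Chamfer definition used (sum of the two mean nearest‑neighbour distances, unsquared) is only one of several conventions in use.

```python
import math
import random

Point = tuple[float, float, float]


def downsample(points: list[Point], fraction: float, seed: int = 0) -> list[Point]:
    """Keep roughly `fraction` of the points using a fixed seed (at least one point)."""
    keep = max(1, round(len(points) * fraction))
    rng = random.Random(seed)
    return rng.sample(points, keep)


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions.

    Brute force, O(len(a) * len(b)); fine for a sketch, too slow for real scans.
    """
    def one_way(src: list[Point], dst: list[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Synthetic stand-in for a scene: 1000 random points in the unit cube.
rng = random.Random(42)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1000)]

# The four density regimes listed above.
for fraction in (0.005, 0.01, 0.05, 0.10):
    sparse = downsample(cloud, fraction)
    score = chamfer_distance(sparse, cloud)
    print(f"{fraction:.1%}: {len(sparse)} points, chamfer={score:.4f}")
```

Here the distance is measured between the sparse input and the full cloud only to show how the metric behaves as density drops. In an actual evaluation the comparison would be between a model's reconstruction and the ground‑truth geometry.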
Results & Findings
- Full‑context attention wins accuracy – Models that can attend to the entire scene (e.g., vanilla Transformers) consistently achieve the lowest reconstruction and pose errors, but they hit memory limits on long sequences.
- Bounded‑memory tricks enable scalability – Sliding‑window or hierarchical attention schemes keep memory under control and allow processing of more than 10,000 frames, at the cost of a modest accuracy drop of roughly 5‑10 %.
- Domain alignment > dataset size – When a model is trained on data that closely matches the test domain (e.g., indoor‑centric training for indoor tasks), performance jumps dramatically, even if the training set is smaller than a generic large‑scale corpus.
- Data quality matters – Noisy depth maps or heavily compressed point clouds degrade performance far more than reducing the number of training samples.
- DA‑Next baseline sets a new bar – Trained on the 5‑million‑frame DA‑Next‑5M dataset, the DA‑Next model outperforms prior state‑of‑the‑art models on 4 of the 5 task suites while staying within a 2 GB GPU memory budget.
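The memory gap between full‑context and bounded‑memory attention follows from simple arithmetic: a full score matrix grows quadratically with sequence length, while a fixed window grows linearly. The sketch below is a back‑of‑the‑envelope estimate, not a measurement from the paper. The function name, the window size of 512, the 2‑byte (fp16) values, and the use of tokens rather than frames as the unit are all assumptions, and the estimate covers only one head's score matrix.

```python
from typing import Optional


def attention_score_bytes(num_tokens: int, window: Optional[int] = None,
                          bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one head's attention score matrix.

    window=None models full-context attention (every token attends to all tokens).
    Otherwise each token attends to at most `window` tokens.
    """
    keys_per_query = num_tokens if window is None else min(window, num_tokens)
    return num_tokens * keys_per_query * bytes_per_value


for n in (1_000, 10_000, 100_000):
    full = attention_score_bytes(n)
    bounded = attention_score_bytes(n, window=512)
    print(f"{n:>7} tokens: full={full / 1e6:,.1f} MB, window-512={bounded / 1e6:,.1f} MB")
```

Real systems differ in the constants: multiple heads and layers multiply the totals, and fused kernels that never materialize the full score matrix reduce peak memory. The quadratic versus linear scaling is the part that carries over.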
Practical Implications
- Robotics & autonomous vehicles – Engineers can now benchmark their perception stacks against a deterministic, multi‑density suite, ensuring that a model’s claimed “real‑world readiness” holds up under low‑resolution LiDAR or edge‑device constraints.
- AR/VR developers – The embodied and egocentric tasks in SpatialBench directly reflect latency‑critical scenarios (hand tracking, indoor navigation). The findings suggest that lightweight bounded‑memory models are a pragmatic choice for on‑device inference.
- Cloud‑edge hybrid pipelines – The benchmark’s memory‑vs‑accuracy trade‑off curves help system architects decide when to offload full‑context attention to the cloud and keep a bounded‑memory fallback on the edge.
- Dataset curation – The strong impact of domain alignment encourages teams to invest in targeted data collection (e.g., warehouse‑specific scans) rather than indiscriminately scaling generic datasets.
- Open‑source baseline – DA‑Next is released under a permissive license, giving developers a ready‑to‑fine‑tune starting point for custom spatial applications.
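The cloud‑versus‑edge decision described above amounts to picking the most accurate model that fits a given memory budget. A minimal sketch of that selection rule follows. The `Candidate` class, the `pick_model` function, and every number in the example are hypothetical placeholders chosen for illustration; none of them are results from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    error: float      # lower is better (e.g. a pose or reconstruction error)
    memory_gb: float  # peak memory at inference


def pick_model(candidates: list[Candidate], budget_gb: float) -> Optional[Candidate]:
    """Return the lowest-error candidate that fits the memory budget, or None."""
    fitting = [c for c in candidates if c.memory_gb <= budget_gb]
    return min(fitting, key=lambda c: c.error, default=None)


# Placeholder numbers for illustration only; NOT results from the paper.
candidates = [
    Candidate("full-context", error=0.10, memory_gb=24.0),
    Candidate("bounded-memory", error=0.11, memory_gb=2.0),
]

edge = pick_model(candidates, budget_gb=4.0)    # small on-device budget
cloud = pick_model(candidates, budget_gb=40.0)  # large server budget
print(edge.name if edge else None, cloud.name if cloud else None)
```

With measured trade‑off curves in place of the placeholders, the same rule can be extended to include latency or energy as additional constraints.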
Limitations & Future Work
- Hardware diversity – Evaluations were performed on a limited set of GPUs/TPUs; performance on ultra‑low‑power ASICs or mobile NPUs remains untested.
- Limited dynamic content – The benchmark includes varied viewpoints, but dynamic objects (e.g., moving people) are not extensively covered, leaving a gap for real‑time interaction scenarios.
- Benchmark expansion – The authors note plans to add more outdoor and aerial datasets, as well as to incorporate multimodal signals (audio, tactile) that are increasingly relevant for embodied AI.
- Model interpretability – Understanding why full‑context attention excels (e.g., specific attention patterns) is left for future analysis, which could inspire more efficient hybrid architectures.
SpatialBench offers a broad, reproducible yardstick for spatial foundation models, giving developers concrete accuracy, latency, memory, and energy data to choose, tune, and deploy the right model for their 3‑D AI challenges.
Authors
- Haosong Peng
- Hao Li
- Jiaqi Chen
- Yuhao Pan
- Runmao Yao
- Yalun Dai
- Fushuo Huo
- Fangzhou Hong
- Zhaoxi Chen
- Haozhao Wang
- Dingwen Zhang
- Ziwei Liu
- Wenchao Xu
Paper Information
- arXiv ID: 2605.27367v1
- Categories: cs.CV
- Published: May 26, 2026