[Paper] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Source: arXiv - 2603.19235v1
Overview
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
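The token-level adaptive gated fusion named in the abstract can be pictured as follows. This is a minimal PyTorch sketch under our own assumptions: the module name, dimensions, and the sigmoid-gated residual form are illustrative, not the authors' released implementation.

```python
# Minimal sketch of a token-level adaptive gated fusion (illustrative only;
# module names and the exact gating form are assumptions, not the paper's code).
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, sem_dim: int, geo_dim: int):
        super().__init__()
        # Project generative (geometric) features into the MLLM token space.
        self.geo_proj = nn.Linear(geo_dim, sem_dim)
        # Per-token gate predicted from the concatenated semantic + geometric features.
        self.gate = nn.Sequential(nn.Linear(sem_dim * 2, sem_dim), nn.Sigmoid())

    def forward(self, sem_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N, sem_dim) semantic tokens from the MLLM's vision pathway
        # geo_tokens: (B, N, geo_dim) spatiotemporal features from the video diffusion model
        geo = self.geo_proj(geo_tokens)
        g = self.gate(torch.cat([sem_tokens, geo], dim=-1))  # (B, N, sem_dim), values in (0, 1)
        # Gated residual: each token decides how much geometric evidence to absorb.
        return sem_tokens + g * geo

# Usage: fuse 256 vision tokens of width 1024 with 768-dim generative features.
fusion = AdaptiveGatedFusion(sem_dim=1024, geo_dim=768)
fused = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 768))
print(fused.shape)  # torch.Size([2, 256, 1024])
```

The gated residual keeps the plug-and-play property described in the abstract: when the gate saturates toward zero, the MLLM's original semantic tokens pass through unchanged.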
Key Contributions
- A paradigm shift away from explicit 3D modalities and geometric scaffolding toward the implicit spatial priors learned by large-scale video generation models.
- VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator.
- A token-level adaptive gated fusion mechanism that integrates spatiotemporal features extracted from intermediate noise levels with the MLLM's semantic representations, adding dense geometric cues without explicit 3D supervision.
- State-of-the-art results across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks.
Methodology
VEGA-3D treats a pre-trained video diffusion model as a Latent World Simulator: input frames are passed through the frozen denoiser, and spatiotemporal features are extracted from intermediate noise levels. These generative features are then merged with the MLLM's semantic representations through the token-level adaptive gated fusion mechanism, enriching the language model with dense geometric cues without any explicit 3D supervision. A sketch of the feature-harvesting step follows; please refer to the full paper for complete architectural and training details.
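The sketch below shows one way to harvest spatiotemporal features from an intermediate noise level of a frozen video diffusion denoiser, using a forward hook. The toy denoiser, the choice of block, and the interpolation-style noise schedule are assumptions made for illustration; they are not the paper's implementation.

```python
# Minimal sketch of extracting features from an intermediate noise level of a
# (frozen) video diffusion denoiser. `ToyDenoiser` is a stand-in nn.Module;
# the real backbone, block choice, and noise schedule are assumptions.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Placeholder for a pre-trained video diffusion backbone (U-Net / DiT)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mid_block = nn.Conv3d(dim, dim, kernel_size=3, padding=1)
        self.out = nn.Conv3d(dim, dim, kernel_size=3, padding=1)

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # `t` is accepted to mimic a denoiser signature; this toy model ignores it.
        h = self.mid_block(z_t)   # intermediate spatiotemporal features
        return self.out(h)        # predicted noise / velocity

denoiser = ToyDenoiser().eval()
captured = {}

def hook(_module, _inputs, output):
    captured["feat"] = output    # grab the mid-block activations

handle = denoiser.mid_block.register_forward_hook(hook)

with torch.no_grad():
    z0 = torch.randn(1, 64, 8, 32, 32)   # clean video latents (B, C, T, H, W)
    t = torch.tensor([0.3])              # intermediate noise level in [0, 1] (assumed)
    noise = torch.randn_like(z0)
    z_t = (1 - t) * z0 + t * noise       # simple interpolation-style forward noising (assumed)
    denoiser(z_t, t)                     # single denoising pass; no sampling loop needed

handle.remove()
geo_features = captured["feat"]          # dense geometric cues handed to the fusion module
print(geo_features.shape)                # torch.Size([1, 64, 8, 32, 32])
```

A single forward pass at a chosen noise level suffices here; no iterative sampling is required, which keeps the generative branch cheap enough to bolt onto an MLLM.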
Practical Implications
Because VEGA-3D injects geometric awareness without explicit 3D supervision or 3D data collection, it offers a scalable route to spatially grounded MLLMs. The reported gains on 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks point to direct relevance for robotics and embodied-agent applications that require physical-world understanding.
Authors
- Xianjin Wu
- Dingkang Liang
- Tianrui Feng
- Kui Xia
- Yumeng Zhang
- Xiaofan Li
- Xiao Tan
- Xiang Bai
Paper Information
- arXiv ID: 2603.19235v1
- Categories: cs.CV, cs.RO
- Published: March 19, 2026