[Paper] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Published: 22 hours ago (March 19, 2026 at 01:59 PM EDT)

2 min read

Source: arXiv

Source: arXiv - 2603.19231v1

Overview

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that OM achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

Key Contributions

This paper presents research in the following areas:

cs.CV

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CV.

Authors

Haitian Li
Haozhe Xie
Junxiang Xu
Beichen Wen
Fangzhou Hong
Ziwei Liu

Paper Information

arXiv ID: 2603.19231v1
Categories: cs.CV
Published: March 19, 2026
PDF: Download PDF

[Paper] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

[Paper] Matryoshka Gaussian Splatting

[Paper] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

[Paper] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer