[Paper] ART: Articulated Reconstruction Transformer

Published: December 16, 2025 at 01:35 PM EST
4 min read
Source: arXiv - 2512.14671v1

Overview

The paper presents ART (Articulated Reconstruction Transformer), a feed‑forward neural network that can rebuild full 3‑D models of articulated objects (e.g., chairs, robots, animals) from only a handful of RGB photos taken in different poses. Unlike prior work that either requires costly per‑object optimization or is limited to a single object class, ART is category‑agnostic and produces physically meaningful parts, textures, and joint parameters that can be dropped straight into simulation or game engines.

Key Contributions

  • Category‑agnostic part‑based reconstruction: Handles any articulated object without per‑category retraining.
  • Transformer‑driven part slot learning: Introduces a novel transformer architecture that converts sparse multi‑state images into a fixed set of learnable “part slots”.
  • Unified decoding of geometry, texture, and articulation: From each slot the model jointly predicts a mesh, UV texture map, and explicit joint parameters (axis, limits, parent‑child hierarchy).
  • Large‑scale per‑part supervision dataset: Curated a diverse synthetic‑plus‑real dataset with ground‑truth part geometry and kinematics, enabling robust training.
  • State‑of‑the‑art performance: Sets new benchmarks on several articulated‑object reconstruction datasets, outperforming both optimization‑based and feed‑forward baselines by a wide margin.

Methodology

  1. Input Representation – The system receives N sparse RGB images of the same object captured in different articulated states (e.g., a chair with the backrest open vs. closed). No depth or masks are required.
  2. Feature Extraction – Each image is passed through a shared CNN backbone (e.g., ResNet‑50) to obtain a set of visual tokens; positional encodings inject the camera pose and the articulation‑state index into these tokens.
  3. Part‑Slot Transformer
    • A cross‑image transformer encoder aggregates tokens from all images, allowing the network to reason about correspondences across poses.
    • The encoder outputs a fixed number K of learnable part slots (similar to object queries in DETR). Each slot is meant to capture one rigid component of the object (e.g., a chair leg).
  4. Joint Decoding Heads – For every slot, three parallel decoders predict (see the sketch after this list):
    • 3‑D geometry – a coarse signed‑distance field (SDF) that is later converted to a mesh via marching cubes.
    • Texture – a UV texture map applied to the mesh.
    • Articulation parameters – joint type, axis, limits, and parent‑child relationship, expressed in a simple kinematic tree.
  5. Training Losses – Supervision includes per‑part SDF loss, texture L1 loss, joint parameter regression loss, and a consistency loss that forces the same part slot to explain the same rigid component across all input poses.
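
For concreteness, the part‑slot transformer and its decoding heads (steps 2–4) can be sketched in PyTorch as below. The ResNet‑50 backbone, slot count, grid resolutions, and head layouts are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of ART's feed-forward core (steps 2-4) under assumed
# dimensions: a shared ResNet-50 backbone, DETR-style learnable part slots,
# and small per-slot heads for a coarse SDF grid, a low-res texture, and
# articulation parameters. Not the authors' implementation.
import torch
import torch.nn as nn
import torchvision


class PartSlotART(nn.Module):
    def __init__(self, num_slots=16, dim=256, max_states=8):
        super().__init__()
        # Shared CNN backbone: ResNet-50 with the classification head removed.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)

        # Conditioning on camera pose (flattened 3x4 extrinsics) and the
        # articulation-state index of each input image.
        self.pose_embed = nn.Linear(12, dim)
        self.state_embed = nn.Embedding(max_states, dim)

        # K learnable part slots, analogous to DETR object queries.
        self.slots = nn.Parameter(torch.randn(num_slots, dim))

        # Cross-image transformer: slots attend to tokens from all input views.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

        # Parallel per-slot heads: coarse 16^3 SDF, 32x32 RGB texture latent,
        # and articulation parameters (axis, pivot, two limits, parent logits).
        self.sdf_head = nn.Linear(dim, 16 ** 3)
        self.texture_head = nn.Linear(dim, 3 * 32 * 32)
        self.joint_head = nn.Linear(dim, 3 + 3 + 2 + num_slots)

    def forward(self, images, poses, state_ids):
        # images: (B, N, 3, H, W); poses: (B, N, 12); state_ids: (B, N) long
        B, N = images.shape[:2]
        feats = self.proj(self.backbone(images.flatten(0, 1)))       # (B*N, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)                    # (B*N, h*w, D)
        cond = self.pose_embed(poses) + self.state_embed(state_ids)  # (B, N, D)
        tokens = tokens + cond.flatten(0, 1).unsqueeze(1)            # add pose/state
        tokens = tokens.reshape(B, -1, tokens.shape[-1])             # pool all views

        slots = self.decoder(self.slots.expand(B, -1, -1), tokens)   # (B, K, D)
        return {
            "sdf": self.sdf_head(slots).view(B, -1, 16, 16, 16),
            "texture": self.texture_head(slots).view(B, -1, 3, 32, 32),
            "articulation": self.joint_head(slots),
        }
```

In the paper the coarse SDF is subsequently meshed with marching cubes; that post‑processing step is omitted from the sketch.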

The whole pipeline is fully feed‑forward; inference runs in a few hundred milliseconds on a modern GPU.
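
As a companion to the sketch above, the step‑5 supervision terms might be combined as follows, assuming slot‑to‑part matching has already been resolved; the loss weights and the exact consistency formulation are illustrative assumptions rather than the paper's objective.

```python
# Hypothetical combination of the four training terms from step 5.
import torch
import torch.nn.functional as F


def art_training_loss(pred, gt, slots_per_state, w=(1.0, 0.5, 1.0, 0.1)):
    """pred / gt: dicts of per-part tensors ('sdf', 'texture', 'joints'),
    already matched slot-to-part. slots_per_state: (S, K, D) slot features
    decoded separately from each of the S articulation states."""
    sdf_loss = F.l1_loss(pred["sdf"], gt["sdf"])            # per-part SDF loss
    tex_loss = F.l1_loss(pred["texture"], gt["texture"])    # texture L1 loss
    joint_loss = F.mse_loss(pred["joints"], gt["joints"])   # joint regression
    # Consistency: the same slot should explain the same rigid component in
    # every input pose, so its features should not drift across states.
    consistency = slots_per_state.var(dim=0, unbiased=False).mean()
    return (w[0] * sdf_loss + w[1] * tex_loss
            + w[2] * joint_loss + w[3] * consistency)
```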

Results & Findings

| Dataset | Metric | ART | Prior Feed‑forward | Prior Optimization |
| --- | --- | --- | --- | --- |
| Articulated ShapeNet (synthetic) | Chamfer‑L2 (mm, lower is better) | 1.8 | 3.4 | 2.9 |
| Real‑world Articulated Objects (captured with a phone) | Pose‑aware IoU (%, higher is better) | 78.2 | 61.5 | 70.1 |
| Simulation Transfer (export to Unity) | Kinematic Consistency (°, lower is better) | 2.1 | 5.8 | 4.3 |
  • Geometric fidelity improves by ~45 % over the best feed‑forward baseline.
  • Texture realism (measured by LPIPS) is on par with ground‑truth textures despite using only RGB inputs.
  • Articulation accuracy: Joint axes and limits are recovered within a few degrees, enabling immediate use in physics simulators.
  • Speed: End‑to‑end inference ≈ 0.25 s per object on an RTX 3080, versus several minutes for optimization‑based pipelines.
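
For reference, Chamfer‑L2 in the table is a symmetric point‑cloud distance; a minimal NumPy sketch of one common convention follows (some benchmarks average squared distances, others their square roots). This is the generic metric definition, not code from the paper.

```python
# Symmetric Chamfer-L2 between two surface point samples (e.g., in mm).
import numpy as np


def chamfer_l2(points_a, points_b):
    """points_a: (N, 3), points_b: (M, 3) points sampled on the two surfaces."""
    d2 = ((points_a[:, None, :] - points_b[None, :, :]) ** 2).sum(-1)  # (N, M)
    # Average nearest-neighbour squared distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```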

Practical Implications

  • Rapid asset creation – Game studios and AR/VR developers can generate fully rigged 3‑D models from a few smartphone photos, cutting manual modeling time dramatically.
  • Robotics simulation – Engineers can capture real hardware (e.g., robot arms, manipulators) and instantly obtain a physics‑ready URDF, facilitating sim‑to‑real transfer.
  • E‑commerce & virtual try‑on – Online retailers can reconstruct products with moving parts (foldable chairs, luggage) for interactive 3‑D previews without costly 3‑D scanning rigs.
  • Digital twins for maintenance – Maintenance platforms can rebuild articulated machinery from field photographs, enabling remote inspection and predictive analysis.

Because the output includes an explicit kinematic tree, the models are plug‑and‑play with existing engines (Unity, Unreal, ROS) without additional retargeting.
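
As an example of this plug‑and‑play use, below is a minimal sketch that writes a predicted kinematic tree to URDF, assuming a hypothetical per‑part record with a mesh path, joint type, axis, limits, and parent index; this record format is an illustration, not ART's actual output schema.

```python
# Hypothetical export of a predicted kinematic tree to URDF.
import xml.etree.ElementTree as ET


def parts_to_urdf(parts, robot_name="art_object"):
    """parts: list of dicts with keys 'name', 'mesh', 'parent' (index or None),
    'joint_type' ('revolute' or 'prismatic'), 'axis' (x, y, z),
    and 'limits' (lower, upper) in radians or metres."""
    robot = ET.Element("robot", name=robot_name)
    for p in parts:
        link = ET.SubElement(robot, "link", name=p["name"])
        geom = ET.SubElement(ET.SubElement(link, "visual"), "geometry")
        ET.SubElement(geom, "mesh", filename=p["mesh"])
    for p in parts:
        if p["parent"] is None:
            continue  # the root link has no incoming joint
        joint = ET.SubElement(robot, "joint",
                              name=f"{p['name']}_joint", type=p["joint_type"])
        ET.SubElement(joint, "parent", link=parts[p["parent"]]["name"])
        ET.SubElement(joint, "child", link=p["name"])
        ET.SubElement(joint, "axis", xyz=" ".join(str(v) for v in p["axis"]))
        ET.SubElement(joint, "limit", lower=str(p["limits"][0]),
                      upper=str(p["limits"][1]), effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")
```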

Limitations & Future Work

  • Dependence on multi‑state images – ART requires at least two distinct poses; a single static view still yields poor articulation recovery.
  • Synthetic bias – Although the training set mixes synthetic and real data, extreme lighting or highly reflective surfaces can degrade texture prediction.
  • Fixed number of part slots – The current design assumes a known upper bound on part count; objects with many tiny components may be merged incorrectly.
  • Future directions the authors suggest: (1) extending the model to handle single‑view inference via learned priors, (2) incorporating depth or multi‑view video streams for higher fidelity, and (3) dynamic slot allocation to adapt to variable part counts.

Authors

  • Zizhang Li
  • Cheng Zhang
  • Zhengqin Li
  • Henry Howard-Jenkins
  • Zhaoyang Lv
  • Chen Geng
  • Jiajun Wu
  • Richard Newcombe
  • Jakob Engel
  • Zhao Dong

Paper Information

  • arXiv ID: 2512.14671v1
  • Categories: cs.CV
  • Published: December 16, 2025