[Paper] ART: Articulated Reconstruction Transformer

Published: December 16, 2025 at 01:35 PM EST
4 min read
Source: arXiv - 2512.14671v1

Overview

The paper presents ART (Articulated Reconstruction Transformer), a feed‑forward neural network that can rebuild full 3‑D models of articulated objects (e.g., chairs, robots, animals) from only a handful of RGB photos taken in different poses. Unlike prior work that either requires costly per‑object optimization or is limited to a single object class, ART is category‑agnostic and produces physically meaningful parts, textures, and joint parameters that can be dropped straight into simulation or game engines.

Key Contributions

  • Category‑agnostic part‑based reconstruction: Handles any articulated object without per‑category retraining.
  • Transformer‑driven part slot learning: Introduces a novel transformer architecture that converts sparse multi‑state images into a fixed set of learnable “part slots”.
  • Unified decoding of geometry, texture, and articulation: From each slot the model jointly predicts a mesh, UV texture map, and explicit joint parameters (axis, limits, parent‑child hierarchy).
  • Large‑scale per‑part supervision dataset: Curated a diverse synthetic‑plus‑real dataset with ground‑truth part geometry and kinematics, enabling robust training.
  • State‑of‑the‑art performance: Sets new benchmarks on several articulated‑object reconstruction datasets, outperforming both optimization‑based and feed‑forward baselines by a wide margin.

Methodology

  1. Input Representation – The system receives N sparse RGB images of the same object captured in different articulated states (e.g., a chair with the backrest open vs. closed). No depth or masks are required.
  2. Feature Extraction – Each image is passed through a shared CNN backbone (e.g., ResNet‑50) to obtain a set of visual tokens; positional encodings inject the camera pose and the articulation‑state index into these tokens.
  3. Part‑Slot Transformer
    • A cross‑image transformer encoder aggregates tokens from all images, allowing the network to reason about correspondences across poses.
    • The encoder outputs a fixed number K of learnable part slots (similar to object queries in DETR). Each slot is meant to capture one rigid component of the object (e.g., a chair leg).
  4. Joint Decoding Heads – For every slot, three parallel decoders predict (see the sketch after this list):
    • 3‑D geometry – a coarse signed‑distance field (SDF) that is later converted to a mesh via marching cubes.
    • Texture – a UV texture map applied to the mesh.
    • Articulation parameters – joint type, axis, limits, and parent‑child relationship, expressed in a simple kinematic tree.
  5. Training Losses – Supervision includes per‑part SDF loss, texture L1 loss, joint parameter regression loss, and a consistency loss that forces the same part slot to explain the same rigid component across all input poses.
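
For concreteness, the part‑slot transformer and its decoding heads (steps 2–4) can be sketched in PyTorch as below. The ResNet‑50 backbone, slot count, grid resolutions, and head layouts are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of ART's feed-forward core (steps 2-4) under assumed
# dimensions: a shared ResNet-50 backbone, DETR-style learnable part slots,
# and small per-slot heads for a coarse SDF grid, a low-res texture, and
# articulation parameters. Not the authors' implementation.
import torch
import torch.nn as nn
import torchvision


class PartSlotART(nn.Module):
    def __init__(self, num_slots=16, dim=256, max_states=8):
        super().__init__()
        # Shared CNN backbone: ResNet-50 with the classification head removed.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)

        # Conditioning on camera pose (flattened 3x4 extrinsics) and the
        # articulation-state index of each input image.
        self.pose_embed = nn.Linear(12, dim)
        self.state_embed = nn.Embedding(max_states, dim)

        # K learnable part slots, analogous to DETR object queries.
        self.slots = nn.Parameter(torch.randn(num_slots, dim))

        # Cross-image transformer: slots attend to tokens from all input views.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

        # Parallel per-slot heads: coarse 16^3 SDF, 32x32 RGB texture latent,
        # and articulation parameters (axis, pivot, two limits, parent logits).
        self.sdf_head = nn.Linear(dim, 16 ** 3)
        self.texture_head = nn.Linear(dim, 3 * 32 * 32)
        self.joint_head = nn.Linear(dim, 3 + 3 + 2 + num_slots)

    def forward(self, images, poses, state_ids):
        # images: (B, N, 3, H, W); poses: (B, N, 12); state_ids: (B, N) long
        B, N = images.shape[:2]
        feats = self.proj(self.backbone(images.flatten(0, 1)))       # (B*N, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)                    # (B*N, h*w, D)
        cond = self.pose_embed(poses) + self.state_embed(state_ids)  # (B, N, D)
        tokens = tokens + cond.flatten(0, 1).unsqueeze(1)            # add pose/state
        tokens = tokens.reshape(B, -1, tokens.shape[-1])             # pool all views

        slots = self.decoder(self.slots.expand(B, -1, -1), tokens)   # (B, K, D)
        return {
            "sdf": self.sdf_head(slots).view(B, -1, 16, 16, 16),
            "texture": self.texture_head(slots).view(B, -1, 3, 32, 32),
            "articulation": self.joint_head(slots),
        }
```

In the paper the coarse SDF is subsequently meshed with marching cubes; that post‑processing step is omitted from the sketch.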

The whole pipeline is fully feed‑forward; inference runs in a few hundred milliseconds on a modern GPU.
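
As a companion to the sketch above, the step‑5 supervision terms might be combined as follows, assuming slot‑to‑part matching has already been resolved; the loss weights and the exact consistency formulation are illustrative assumptions rather than the paper's objective.

```python
# Hypothetical combination of the four training terms from step 5.
import torch
import torch.nn.functional as F


def art_training_loss(pred, gt, slots_per_state, w=(1.0, 0.5, 1.0, 0.1)):
    """pred / gt: dicts of per-part tensors ('sdf', 'texture', 'joints'),
    already matched slot-to-part. slots_per_state: (S, K, D) slot features
    decoded separately from each of the S articulation states."""
    sdf_loss = F.l1_loss(pred["sdf"], gt["sdf"])            # per-part SDF loss
    tex_loss = F.l1_loss(pred["texture"], gt["texture"])    # texture L1 loss
    joint_loss = F.mse_loss(pred["joints"], gt["joints"])   # joint regression
    # Consistency: the same slot should explain the same rigid component in
    # every input pose, so its features should not drift across states.
    consistency = slots_per_state.var(dim=0, unbiased=False).mean()
    return (w[0] * sdf_loss + w[1] * tex_loss
            + w[2] * joint_loss + w[3] * consistency)
```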

Results & Findings

| Dataset | Metric | ART | Prior Feed‑forward | Prior Optimization |
| --- | --- | --- | --- | --- |
| Articulated ShapeNet (synthetic) | Chamfer‑L2 (mm, lower is better) | 1.8 | 3.4 | 2.9 |
| Real‑world Articulated Objects (captured with a phone) | Pose‑aware IoU (%, higher is better) | 78.2 | 61.5 | 70.1 |
| Simulation Transfer (export to Unity) | Kinematic Consistency (°, lower is better) | 2.1 | 5.8 | 4.3 |
  • Geometric fidelity improves by ~45 % over the best feed‑forward baseline.
  • Texture realism (measured by LPIPS) is on par with ground‑truth textures despite using only RGB inputs.
  • Articulation accuracy: Joint axes and limits are recovered within a few degrees, enabling immediate use in physics simulators.
  • Speed: End‑to‑end inference ≈ 0.25 s per object on an RTX 3080, versus several minutes for optimization‑based pipelines.
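
For reference, Chamfer‑L2 in the table is a symmetric point‑cloud distance; a minimal NumPy sketch of one common convention follows (some benchmarks average squared distances, others their square roots). This is the generic metric definition, not code from the paper.

```python
# Symmetric Chamfer-L2 between two surface point samples (e.g., in mm).
import numpy as np


def chamfer_l2(points_a, points_b):
    """points_a: (N, 3), points_b: (M, 3) points sampled on the two surfaces."""
    d2 = ((points_a[:, None, :] - points_b[None, :, :]) ** 2).sum(-1)  # (N, M)
    # Average nearest-neighbour squared distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```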

Practical Implications

  • Rapid asset creation – Game studios and AR/VR developers can generate fully rigged 3‑D models from a few smartphone photos, cutting manual modeling time dramatically.
  • Robotics simulation – Engineers can capture real hardware (e.g., robot arms, manipulators) and instantly obtain a physics‑ready URDF, facilitating sim‑to‑real transfer.
  • E‑commerce & virtual try‑on – Online retailers can reconstruct products with moving parts (foldable chairs, luggage) for interactive 3‑D previews without costly 3‑D scanning rigs.
  • Digital twins for maintenance – Maintenance platforms can rebuild articulated machinery from field photographs, enabling remote inspection and predictive analysis.

Because the output includes an explicit kinematic tree, the models are plug‑and‑play with existing engines (Unity, Unreal, ROS) without additional retargeting.
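
As an example of this plug‑and‑play use, below is a minimal sketch that writes a predicted kinematic tree to URDF, assuming a hypothetical per‑part record with a mesh path, joint type, axis, limits, and parent index; this record format is an illustration, not ART's actual output schema.

```python
# Hypothetical export of a predicted kinematic tree to URDF.
import xml.etree.ElementTree as ET


def parts_to_urdf(parts, robot_name="art_object"):
    """parts: list of dicts with keys 'name', 'mesh', 'parent' (index or None),
    'joint_type' ('revolute' or 'prismatic'), 'axis' (x, y, z),
    and 'limits' (lower, upper) in radians or metres."""
    robot = ET.Element("robot", name=robot_name)
    for p in parts:
        link = ET.SubElement(robot, "link", name=p["name"])
        geom = ET.SubElement(ET.SubElement(link, "visual"), "geometry")
        ET.SubElement(geom, "mesh", filename=p["mesh"])
    for p in parts:
        if p["parent"] is None:
            continue  # the root link has no incoming joint
        joint = ET.SubElement(robot, "joint",
                              name=f"{p['name']}_joint", type=p["joint_type"])
        ET.SubElement(joint, "parent", link=parts[p["parent"]]["name"])
        ET.SubElement(joint, "child", link=p["name"])
        ET.SubElement(joint, "axis", xyz=" ".join(str(v) for v in p["axis"]))
        ET.SubElement(joint, "limit", lower=str(p["limits"][0]),
                      upper=str(p["limits"][1]), effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")
```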

Limitations & Future Work

  • Dependence on multi‑state images – ART requires at least two distinct poses; a single static view still yields poor articulation recovery.
  • Synthetic bias – Although the training set mixes synthetic and real data, extreme lighting or highly reflective surfaces can degrade texture prediction.
  • Fixed number of part slots – The current design assumes a known upper bound on part count; objects with many tiny components may be merged incorrectly.
  • Future directions the authors suggest: (1) extending the model to handle single‑view inference via learned priors, (2) incorporating depth or multi‑view video streams for higher fidelity, and (3) dynamic slot allocation to adapt to variable part counts.

Authors

  • Zizhang Li
  • Cheng Zhang
  • Zhengqin Li
  • Henry Howard-Jenkins
  • Zhaoyang Lv
  • Chen Geng
  • Jiajun Wu
  • Richard Newcombe
  • Jakob Engel
  • Zhao Dong

Paper Information

  • arXiv ID: 2512.14671v1
  • Categories: cs.CV
  • Published: December 16, 2025