[Paper] PAct: Part-Decomposed Single-View Articulated Object Generation

Published: February 16, 2026, 12:45 PM EST
5 min read
Source: arXiv


Overview

The paper PAct: Part‑Decomposed Single‑View Articulated Object Generation tackles a long‑standing bottleneck in 3‑D content creation: turning a single RGB image of a movable object (e.g., a cabinet with doors and drawers) into a fully rigged, articulated 3‑D model. By framing the problem as a part‑centric generative task, the authors achieve fast, feed‑forward synthesis of both geometry and kinematic structure, opening the door to on‑the‑fly asset generation for robotics, AR/VR, and embodied AI.

Key Contributions

  • Part‑aware latent representation: each movable component is encoded as a separate token enriched with part identity and articulation cues.
  • Single‑view conditional generation: the model directly maps one RGB image to a set of 3‑D parts, their spatial relationships, and joint parameters without per‑instance optimization.
  • Unified geometry‑rigging pipeline: geometry, part composition, and kinematic constraints are generated jointly, guaranteeing consistency between visual appearance and motion.
  • Speed‑up over traditional pipelines: inference runs in seconds on a modern GPU, versus tens of minutes to hours for optimization‑based baselines.
  • Strong empirical gains: on benchmark categories (drawers, doors, chairs), PAct improves input‑image fidelity, part segmentation accuracy, and articulation plausibility compared to both optimization and retrieval‑based methods.

Methodology

  1. Input Encoding – A single RGB image is processed by a vision encoder (e.g., a ViT backbone) to produce a global feature vector.
  2. Part Token Initialization – A fixed number of learnable “part tokens” are created; each token is concatenated with a one‑hot part‑type embedding (door, drawer, etc.) and a learnable articulation embedding (joint axis, limits).
  3. Transformer‑based Decoder – The tokens attend to the image features through a cross‑attention transformer. The decoder predicts for each token:
    • A 3‑D shape code (later decoded by a small implicit field or mesh generator).
    • A 6‑DoF pose that places the part relative to a canonical root.
    • Joint parameters (axis, range) that define how the part can move.
  4. Consistency Losses – During training, the model is supervised with:
    • Shape loss (Chamfer distance / occupancy error) against ground‑truth part meshes.
    • Pose loss (L2 distance) to enforce correct assembly.
    • Articulation loss (joint angle consistency) to ensure physically plausible motion.
    • Image reconstruction loss (rendered silhouette vs. input) to keep the output faithful to the original view.
  5. Inference – At test time, the pipeline runs end‑to‑end: image → tokens → part meshes + rig → ready‑to‑use articulated asset.
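The token-to-prediction step above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed dimensions, not the paper's actual architecture: a single cross-attention update lets the part tokens query the image features, and hypothetical linear heads (`W_shape`, `W_pose`, `W_joint`) map each updated token to a shape code, a 6-DoF pose, and joint parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64        # shared feature dimension (assumed)
N_IMG = 196   # image patch features from the vision encoder (e.g. a 14x14 ViT grid)
N_PARTS = 8   # fixed maximum number of part tokens (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, img_feats, Wq, Wk, Wv):
    """Part tokens (queries) attend to image features (keys/values)."""
    q = tokens @ Wq
    k = img_feats @ Wk
    v = img_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v  # residual update of the part tokens

# Learnable part tokens (in the paper these carry part-type and
# articulation embeddings; here they are just random initializations).
part_tokens = rng.normal(size=(N_PARTS, D))
img_feats = rng.normal(size=(N_IMG, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.02 for _ in range(3))

tokens = cross_attention(part_tokens, img_feats, Wq, Wk, Wv)

# Per-token prediction heads (hypothetical output sizes):
W_shape = rng.normal(size=(D, 128)) * 0.02  # latent shape code
W_pose  = rng.normal(size=(D, 7))   * 0.02  # 6-DoF pose (quaternion + translation)
W_joint = rng.normal(size=(D, 5))   * 0.02  # joint axis (3) + motion range (2)

shape_codes = tokens @ W_shape
poses       = tokens @ W_pose
joints      = tokens @ W_joint
```

In a real implementation the decoder would stack several such attention layers and decode each shape code with an implicit-field or mesh head, but the data flow (image features in, per-part predictions out) is the same.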

Results & Findings

| Metric | Retrieval‑based | Optimization‑based | PAct |
|---|---|---|---|
| Image‑to‑mesh IoU ↑ | 0.62 | 0.71 | 0.78 |
| Part segmentation F1 ↑ | 0.68 | 0.80 | 0.86 |
| Joint angle error (°) ↓ | 12.4 | 8.1 | 5.3 |
| Inference time (GPU) ↓ | 0.3 s | 300 s | 1.2 s |

(↑ higher is better, ↓ lower is better)
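To make the metrics concrete, here is a sketch of how two of them could be computed. The exact definitions used in the paper may differ; this assumes a volumetric IoU over boolean occupancy grids and a sign-invariant angular error between predicted and ground-truth joint axes.

```python
import numpy as np

def voxel_iou(pred, gt):
    """Volumetric IoU between two boolean occupancy grids of the same shape."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def mean_joint_angle_error(pred_axes, gt_axes):
    """Mean angular deviation (degrees) between predicted and GT joint axes.

    The absolute value makes the error invariant to axis sign flips,
    since a hinge axis and its negation describe the same joint.
    """
    pred = pred_axes / np.linalg.norm(pred_axes, axis=-1, keepdims=True)
    gt = gt_axes / np.linalg.norm(gt_axes, axis=-1, keepdims=True)
    cos = np.clip(np.abs((pred * gt).sum(-1)), 0.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# Tiny example: overlap of 1 cell out of a union of 3 -> IoU = 1/3.
iou = voxel_iou([1, 1, 0], [0, 1, 1])

# A predicted axis perpendicular to the ground truth -> 90 degrees of error.
err = mean_joint_angle_error(np.array([[1.0, 0.0, 0.0]]),
                             np.array([[0.0, 1.0, 0.0]]))
```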
  • Input consistency: Rendered views of the generated models match the source image significantly better than baselines.
  • Part accuracy: The learned tokens correctly separate doors, drawers, and hinges, even when occluded.
  • Articulation plausibility: Simulated motion respects real‑world joint limits, producing smooth opening/closing motions without self‑intersection.

Qualitative examples show that PAct can reconstruct a kitchen cabinet with three drawers and a door from a single photo, complete with correct hinge axes and drawer sliders ready for physics simulation.
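The predicted joint parameters (axis, range) are exactly what is needed to drive such a part in simulation. As a hedged sketch, the hypothetical `articulate` helper below applies the two joint types mentioned above, a hinge (revolute) and a drawer slider (prismatic), to a part's points, using Rodrigues' rotation formula for the hinge case:

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation about a unit axis by `angle` radians."""
    axis = np.asarray(axis, float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def articulate(points, joint_type, axis, origin, q):
    """Move a part's points by joint value q.

    q is an angle in radians for 'revolute' joints (e.g. a door hinge)
    and a displacement in meters for 'prismatic' joints (e.g. a drawer).
    """
    points = np.asarray(points, float)
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    if joint_type == "revolute":
        R = rotation_about_axis(axis, q)
        return (points - origin) @ R.T + origin  # rotate about the hinge origin
    elif joint_type == "prismatic":
        return points + q * axis                 # slide along the joint axis
    raise ValueError(f"unknown joint type: {joint_type}")

# Open a door 90 degrees about a vertical hinge through the origin.
door = articulate([[1.0, 0.0, 0.0]], "revolute", [0, 0, 1], np.zeros(3), np.pi / 2)

# Pull a drawer out 0.3 m along its slide axis.
drawer = articulate([[0.0, 0.0, 0.0]], "prismatic", [1, 0, 0], np.zeros(3), 0.3)
```

Clamping `q` to the predicted motion range would reproduce the joint-limit behavior the paper reports; a production pipeline would more likely export the rig to a standard format such as URDF and let the physics engine handle this.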

Practical Implications

  • Rapid prototyping for AR/VR – Designers can snap a photo of a real object and instantly obtain a manipulable 3‑D version, accelerating content pipelines for virtual showrooms or game level design.
  • Robotics perception – Embodied agents can generate a task‑specific kinematic model on‑the‑fly, enabling more accurate grasp planning and interaction with previously unseen objects.
  • Simulation‑to‑real transfer – Synthetic training environments can be populated with diverse, realistic articulated assets without manual rigging, improving domain randomization for reinforcement learning.
  • E‑commerce & digital twins – Retailers could auto‑generate interactive 3‑D product models from catalog photos, enhancing customer engagement and inventory digitization.

Because the system runs in a few seconds on a single GPU, it fits comfortably into real‑time pipelines or batch processing jobs without the heavy compute budget of traditional reconstruction methods.

Limitations & Future Work

  • Fixed part count – The current architecture assumes a predetermined maximum number of parts; handling objects with highly variable part counts (e.g., modular furniture) may require dynamic token allocation.
  • Category dependence – Training is performed per‑category (drawers, doors, chairs). Generalizing to arbitrary articulated objects in a single model remains an open challenge.
  • Fine‑grained texture synthesis – The focus is on geometry and kinematics; high‑resolution texture generation is not addressed and may need a separate texture‑inpainting stage.
  • Physical realism of joints – While joint axes are predicted, detailed physical properties (friction, damping) are not modeled, which could affect downstream simulation fidelity.

Future directions include extending the token framework to a hierarchical, variable‑length representation, integrating differentiable physics for joint parameter learning, and coupling the pipeline with texture‑generation networks for photo‑realistic assets.

Authors

  • Qingming Liu
  • Xinyue Yao
  • Shuyuan Zhang
  • Yueci Deng
  • Guiliang Liu
  • Zhen Liu
  • Kui Jia

Paper Information

  • arXiv ID: 2602.14965v1
  • Categories: cs.CV, cs.RO
  • Published: February 16, 2026
