[Paper] PAct: Part-Decomposed Single-View Articulated Object Generation
Source: arXiv - 2602.14965v1
Overview
The paper PAct: Part‑Decomposed Single‑View Articulated Object Generation tackles a long‑standing bottleneck in 3‑D content creation: turning a single RGB image of a movable object (e.g., a cabinet with doors and drawers) into a fully rigged, articulated 3‑D model. By framing the problem as a part‑centric generative task, the authors achieve fast, feed‑forward synthesis of both geometry and kinematic structure, opening the door to on‑the‑fly asset generation for robotics, AR/VR, and embodied AI.
Key Contributions
- Part‑aware latent representation: each movable component is encoded as a separate token enriched with part identity and articulation cues.
- Single‑view conditional generation: the model directly maps one RGB image to a set of 3‑D parts, their spatial relationships, and joint parameters without per‑instance optimization.
- Unified geometry‑rigging pipeline: geometry, part composition, and kinematic constraints are generated jointly, guaranteeing consistency between visual appearance and motion.
- Speed‑up over traditional pipelines: inference runs in seconds on a modern GPU, versus tens of minutes to hours for optimization‑based baselines.
- Strong empirical gains: on benchmark categories (drawers, doors, chairs), PAct improves input‑image fidelity, part segmentation accuracy, and articulation plausibility compared to both optimization and retrieval‑based methods.
Methodology
- Input Encoding – A single RGB image is processed by a vision encoder (e.g., a ViT backbone) to produce a global feature vector.
- Part Token Initialization – A fixed number of learnable “part tokens” are created; each token is concatenated with a one‑hot part‑type embedding (door, drawer, etc.) and a learnable articulation embedding (joint axis, limits).
- Transformer‑based Decoder – The tokens attend to the image features through a cross‑attention transformer. The decoder predicts, for each token:
  - a 3‑D shape code (later decoded by a small implicit field or mesh generator);
  - a 6‑DoF pose that places the part relative to a canonical root;
  - joint parameters (axis, range) that define how the part can move.
- Consistency Losses – During training, the model is supervised with:
  - a shape loss (Chamfer distance / occupancy error) against ground‑truth part meshes;
  - a pose loss (L2 distance) to enforce correct assembly;
  - an articulation loss (joint angle consistency) to ensure physically plausible motion;
  - an image reconstruction loss (rendered silhouette vs. input) to keep the output faithful to the original view.
- Inference – At test time, the pipeline runs end‑to‑end: image → tokens → part meshes + rig → ready‑to‑use articulated asset.
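The token-to-prediction stage above can be sketched in a few lines. Everything here is illustrative, not the authors' implementation: the dimensions, the single-head attention without learned projections, and the random stand-in weights are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_PARTS, D = 4, 32   # hypothetical: max part tokens, feature dimension
IMG_TOKENS = 16        # hypothetical: patch features from the vision encoder

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention
    (no learned projections, for brevity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# 1. Image features from the vision backbone (stand-in: random patch tokens).
image_feats = rng.normal(size=(IMG_TOKENS, D))

# 2. Learnable part tokens, one per potential movable part.
part_tokens = rng.normal(size=(NUM_PARTS, D))

# 3. Part tokens attend to the image features.
updated = cross_attention(part_tokens, image_feats)

# 4. Per-token prediction heads (stand-in linear maps).
W_shape = rng.normal(size=(D, 64))  # 64-dim shape code
W_pose = rng.normal(size=(D, 6))    # 6-DoF pose (translation + rotation)
W_joint = rng.normal(size=(D, 5))   # joint axis (3) + angle limits (2)

shape_codes = updated @ W_shape
poses = updated @ W_pose
joints = updated @ W_joint

print(shape_codes.shape, poses.shape, joints.shape)
# (4, 64) (4, 6) (4, 5)
```

The point of the sketch is the data flow: one query token per part, cross-attention against image features, then separate heads for shape, pose, and joint parameters, so all three outputs stay tied to the same part identity.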
Results & Findings
| Metric | Retrieval‑based | Optimization‑based | PAct |
|---|---|---|---|
| Image‑to‑mesh IoU ↑ | 0.62 | 0.71 | 0.78 |
| Part segmentation F1 ↑ | 0.68 | 0.80 | 0.86 |
| Joint angle error (°) ↓ | 12.4 | 8.1 | 5.3 |
| Inference time on GPU (s) ↓ | 0.3 | 300 | 1.2 |
- Input consistency: Rendered views of the generated models match the source image significantly better than baselines.
- Part accuracy: The learned tokens correctly separate doors, drawers, and hinges, even when occluded.
- Articulation plausibility: Simulated motion respects real‑world joint limits, producing smooth opening/closing motions without self‑intersection.
Qualitative examples show that PAct can reconstruct a kitchen cabinet with three drawers and a door from a single photo, complete with correct hinge axes and drawer sliders ready for physics simulation.
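The joint angle error reported in the table is presumably a mean absolute angular difference between predicted and ground-truth joint angles. A minimal sketch of such a metric follows; the wrap-around handling (so that 359° vs. 1° counts as a 2° error) is our assumption, not a detail stated in the summary.

```python
def mean_joint_angle_error(pred_deg, gt_deg):
    """Mean absolute angular error in degrees (hypothetical metric helper).

    Differences are wrapped into [-180, 180) so that angles near the
    0/360 boundary are compared along the shorter arc.
    """
    errors = []
    for p, g in zip(pred_deg, gt_deg):
        diff = (p - g + 180.0) % 360.0 - 180.0
        errors.append(abs(diff))
    return sum(errors) / len(errors)

pred = [88.0, 1.0, 45.0]
gt = [90.0, 359.0, 40.0]
print(mean_joint_angle_error(pred, gt))  # (2 + 2 + 5) / 3 = 3.0
```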
Practical Implications
- Rapid prototyping for AR/VR – Designers can snap a photo of a real object and instantly obtain a manipulable 3‑D version, accelerating content pipelines for virtual showrooms or game level design.
- Robotics perception – Embodied agents can generate a task‑specific kinematic model on‑the‑fly, enabling more accurate grasp planning and interaction with previously unseen objects.
- Simulation‑to‑real transfer – Synthetic training environments can be populated with diverse, realistic articulated assets without manual rigging, improving domain randomization for reinforcement learning.
- E‑commerce & digital twins – Retailers could auto‑generate interactive 3‑D product models from catalog photos, enhancing customer engagement and inventory digitization.
Because the system runs in a few seconds on a single GPU, it fits comfortably into real‑time pipelines or batch processing jobs without the heavy compute budget of traditional reconstruction methods.
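To illustrate how a downstream consumer (a simulator or an AR app) might drive one of these generated rigs, here is a minimal sketch that swings a part's vertices about a predicted hinge axis using Rodrigues' rotation formula, clamped to the predicted joint range. The function names, the pivot/axis convention, and the default limits are hypothetical; only the use of predicted axis-plus-range joints comes from the paper summary.

```python
import numpy as np

def rotate_about_axis(points, pivot, axis, angle_rad):
    """Rotate points (N, 3) about the line through `pivot` along `axis`
    by `angle_rad`, via Rodrigues' rotation formula."""
    k = axis / np.linalg.norm(axis)
    p = points - pivot
    cos, sin = np.cos(angle_rad), np.sin(angle_rad)
    rotated = (p * cos
               + np.cross(k, p) * sin
               + k * (p @ k)[:, None] * (1 - cos))
    return rotated + pivot

def apply_revolute_joint(points, pivot, axis, angle_rad,
                         limits=(0.0, np.pi / 2)):
    """Apply a revolute joint, clamping to the predicted angle range."""
    angle = float(np.clip(angle_rad, *limits))
    return rotate_about_axis(points, pivot, axis, angle)

# A door corner 1 m from a vertical hinge, swung open 90 degrees.
corner = np.array([[1.0, 0.0, 0.0]])
hinge_pivot = np.zeros(3)
hinge_axis = np.array([0.0, 0.0, 1.0])
print(apply_revolute_joint(corner, hinge_pivot, hinge_axis, np.pi / 2))
# → approximately [[0., 1., 0.]]
```

Because the clamp enforces the predicted limits, requesting an angle beyond the range simply saturates at the limit, which is the behavior one would want when animating generated assets whose joint ranges are model estimates.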
Limitations & Future Work
- Fixed part count – The current architecture assumes a predetermined maximum number of parts; handling objects with highly variable part counts (e.g., modular furniture) may require dynamic token allocation.
- Category dependence – Training is performed per‑category (drawers, doors, chairs). Generalizing to arbitrary articulated objects in a single model remains an open challenge.
- Fine‑grained texture synthesis – The focus is on geometry and kinematics; high‑resolution texture generation is not addressed and may need a separate texture‑inpainting stage.
- Physical realism of joints – While joint axes are predicted, detailed physical properties (friction, damping) are not modeled, which could affect downstream simulation fidelity.
Future directions include extending the token framework to a hierarchical, variable‑length representation, integrating differentiable physics for joint parameter learning, and coupling the pipeline with texture‑generation networks for photo‑realistic assets.
Authors
- Qingming Liu
- Xinyue Yao
- Shuyuan Zhang
- Yueci Deng
- Guiliang Liu
- Zhen Liu
- Kui Jia
Paper Information
- arXiv ID: 2602.14965v1
- Categories: cs.CV, cs.RO
- Published: February 16, 2026