[Paper] PAct: Part-Decomposed Single-View Articulated Object Generation
Source: arXiv - 2602.14965v1
Overview
The paper PAct: Part‑Decomposed Single‑View Articulated Object Generation tackles a long‑standing bottleneck in 3‑D content creation: turning a single RGB image of a movable object (e.g., a cabinet with doors and drawers) into a fully rigged, articulated 3‑D model. By framing the problem as a part‑centric generative task, the authors achieve fast, feed‑forward synthesis of both geometry and kinematic structure, opening the door to on‑the‑fly asset generation for robotics, AR/VR, and embodied AI.
Key Contributions
- Part‑aware latent representation: each movable component is encoded as a separate token enriched with part identity and articulation cues.
- Single‑view conditional generation: the model directly maps one RGB image to a set of 3‑D parts, their spatial relationships, and joint parameters without per‑instance optimization.
- Unified geometry‑rigging pipeline: geometry, part composition, and kinematic constraints are generated jointly, guaranteeing consistency between visual appearance and motion.
- Speed‑up over traditional pipelines: inference runs in seconds on a modern GPU, versus tens of minutes to hours for optimization‑based baselines.
- Strong empirical gains: on benchmark categories (drawers, doors, chairs), PAct improves input‑image fidelity, part segmentation accuracy, and articulation plausibility compared to both optimization and retrieval‑based methods.
Methodology
- Input Encoding – A single RGB image is processed by a vision encoder (e.g., a ViT backbone) to produce a global feature vector.
- Part Token Initialization – A fixed number of learnable “part tokens” are created; each token is concatenated with a one‑hot part‑type embedding (door, drawer, etc.) and a learnable articulation embedding (joint axis, limits).
- Transformer‑based Decoder – The tokens attend to the image features through a cross‑attention transformer. The decoder predicts, for each token:
  - a 3‑D shape code (later decoded by a small implicit field or mesh generator);
  - a 6‑DoF pose that places the part relative to a canonical root;
  - joint parameters (axis, range) that define how the part can move.
- Consistency Losses – During training, the model is supervised with:
  - a shape loss (Chamfer distance / occupancy error) against ground‑truth part meshes;
  - a pose loss (L2 distance) to enforce correct assembly;
  - an articulation loss (joint angle consistency) to ensure physically plausible motion;
  - an image reconstruction loss (rendered silhouette vs. input) to keep the output faithful to the original view.
- Inference – At test time, the pipeline runs end‑to‑end: image → tokens → part meshes + rig → ready‑to‑use articulated asset.
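The token-to-prediction stage above can be sketched in a few lines. Everything here is illustrative, not the authors' implementation: the dimensions, the single-head attention without learned projections, and the random stand-in weights are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_PARTS, D = 4, 32   # hypothetical: max part tokens, feature dimension
IMG_TOKENS = 16        # hypothetical: patch features from the vision encoder

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention
    (no learned projections, for brevity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# 1. Image features from the vision backbone (stand-in: random patch tokens).
image_feats = rng.normal(size=(IMG_TOKENS, D))

# 2. Learnable part tokens, one per potential movable part.
part_tokens = rng.normal(size=(NUM_PARTS, D))

# 3. Part tokens attend to the image features.
updated = cross_attention(part_tokens, image_feats)

# 4. Per-token prediction heads (stand-in linear maps).
W_shape = rng.normal(size=(D, 64))  # 64-dim shape code
W_pose = rng.normal(size=(D, 6))    # 6-DoF pose (translation + rotation)
W_joint = rng.normal(size=(D, 5))   # joint axis (3) + angle limits (2)

shape_codes = updated @ W_shape
poses = updated @ W_pose
joints = updated @ W_joint

print(shape_codes.shape, poses.shape, joints.shape)
# (4, 64) (4, 6) (4, 5)
```

The point of the sketch is the data flow: one query token per part, cross-attention against image features, then separate heads for shape, pose, and joint parameters, so all three outputs stay tied to the same part identity.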
Results & Findings
| Metric | Retrieval‑based | Optimization‑based | PAct |
|---|---|---|---|
| Image‑to‑mesh IoU ↑ | 0.62 | 0.71 | 0.78 |
| Part segmentation F1 ↑ | 0.68 | 0.80 | 0.86 |
| Joint angle error (°) ↓ | 12.4 | 8.1 | 5.3 |
| Inference time on GPU (s) ↓ | 0.3 | 300 | 1.2 |
- Input consistency: Rendered views of the generated models match the source image significantly better than baselines.
- Part accuracy: The learned tokens correctly separate doors, drawers, and hinges, even when occluded.
- Articulation plausibility: Simulated motion respects real‑world joint limits, producing smooth opening/closing motions without self‑intersection.
Qualitative examples show that PAct can reconstruct a kitchen cabinet with three drawers and a door from a single photo, complete with correct hinge axes and drawer sliders ready for physics simulation.
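The joint angle error reported in the table is presumably a mean absolute angular difference between predicted and ground-truth joint angles. A minimal sketch of such a metric follows; the wrap-around handling (so that 359° vs. 1° counts as a 2° error) is our assumption, not a detail stated in the summary.

```python
def mean_joint_angle_error(pred_deg, gt_deg):
    """Mean absolute angular error in degrees (hypothetical metric helper).

    Differences are wrapped into [-180, 180) so that angles near the
    0/360 boundary are compared along the shorter arc.
    """
    errors = []
    for p, g in zip(pred_deg, gt_deg):
        diff = (p - g + 180.0) % 360.0 - 180.0
        errors.append(abs(diff))
    return sum(errors) / len(errors)

pred = [88.0, 1.0, 45.0]
gt = [90.0, 359.0, 40.0]
print(mean_joint_angle_error(pred, gt))  # (2 + 2 + 5) / 3 = 3.0
```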
Practical Implications
- Rapid prototyping for AR/VR – Designers can snap a photo of a real object and instantly obtain a manipulable 3‑D version, accelerating content pipelines for virtual showrooms or game level design.
- Robotics perception – Embodied agents can generate a task‑specific kinematic model on‑the‑fly, enabling more accurate grasp planning and interaction with previously unseen objects.
- Simulation‑to‑real transfer – Synthetic training environments can be populated with diverse, realistic articulated assets without manual rigging, improving domain randomization for reinforcement learning.
- E‑commerce & digital twins – Retailers could auto‑generate interactive 3‑D product models from catalog photos, enhancing customer engagement and inventory digitization.
Because the system runs in a few seconds on a single GPU, it fits comfortably into real‑time pipelines or batch processing jobs without the heavy compute budget of traditional reconstruction methods.
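To illustrate how a downstream consumer (a simulator or an AR app) might drive one of these generated rigs, here is a minimal sketch that swings a part's vertices about a predicted hinge axis using Rodrigues' rotation formula, clamped to the predicted joint range. The function names, the pivot/axis convention, and the default limits are hypothetical; only the use of predicted axis-plus-range joints comes from the paper summary.

```python
import numpy as np

def rotate_about_axis(points, pivot, axis, angle_rad):
    """Rotate points (N, 3) about the line through `pivot` along `axis`
    by `angle_rad`, via Rodrigues' rotation formula."""
    k = axis / np.linalg.norm(axis)
    p = points - pivot
    cos, sin = np.cos(angle_rad), np.sin(angle_rad)
    rotated = (p * cos
               + np.cross(k, p) * sin
               + k * (p @ k)[:, None] * (1 - cos))
    return rotated + pivot

def apply_revolute_joint(points, pivot, axis, angle_rad,
                         limits=(0.0, np.pi / 2)):
    """Apply a revolute joint, clamping to the predicted angle range."""
    angle = float(np.clip(angle_rad, *limits))
    return rotate_about_axis(points, pivot, axis, angle)

# A door corner 1 m from a vertical hinge, swung open 90 degrees.
corner = np.array([[1.0, 0.0, 0.0]])
hinge_pivot = np.zeros(3)
hinge_axis = np.array([0.0, 0.0, 1.0])
print(apply_revolute_joint(corner, hinge_pivot, hinge_axis, np.pi / 2))
# → approximately [[0., 1., 0.]]
```

Because the clamp enforces the predicted limits, requesting an angle beyond the range simply saturates at the limit, which is the behavior one would want when animating generated assets whose joint ranges are model estimates.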
Limitations & Future Work
- Fixed part count – The current architecture assumes a predetermined maximum number of parts; handling objects with highly variable part counts (e.g., modular furniture) may require dynamic token allocation.
- Category dependence – Training is performed per‑category (drawers, doors, chairs). Generalizing to arbitrary articulated objects in a single model remains an open challenge.
- Fine‑grained texture synthesis – The focus is on geometry and kinematics; high‑resolution texture generation is not addressed and may need a separate texture‑inpainting stage.
- Physical realism of joints – While joint axes are predicted, detailed physical properties (friction, damping) are not modeled, which could affect downstream simulation fidelity.
Future directions include extending the token framework to a hierarchical, variable‑length representation, integrating differentiable physics for joint parameter learning, and coupling the pipeline with texture‑generation networks for photo‑realistic assets.
Authors
- Qingming Liu
- Xinyue Yao
- Shuyuan Zhang
- Yueci Deng
- Guiliang Liu
- Zhen Liu
- Kui Jia
Paper Information
- arXiv ID: 2602.14965v1
- Categories: cs.CV, cs.RO
- Published: February 16, 2026