[Paper] Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization
Source: arXiv - 2602.20150v1
Overview
The paper tackles a core bottleneck for robotics and simulation pipelines: turning raw sensor data of a messy, real‑world tabletop into a simulation‑ready scene with accurate 3‑D shapes, poses, and physically plausible contacts. By pairing a differentiable contact model with a linear solver that exploits the sparsity of the problem's Hessian, the authors deliver a system that jointly refines the geometry and placement of several interacting objects, even in heavily cluttered setups.
Key Contributions
- Physics‑aware joint optimization of object shape and pose, rather than treating them as separate steps.
- Introduction of a shape‑differentiable contact model that remains globally differentiable, enabling gradient‑based updates through contact constraints.
- Exploitation of the structured sparsity of the augmented‑Lagrangian Hessian to build a scalable linear solver whose runtime grows modestly with the number of objects.
- An end‑to‑end pipeline that combines:
  - Learning‑based object detection and coarse initialization,
  - Physics‑constrained joint shape‑pose refinement, and
  - Differentiable texture refinement for visual realism.
- Empirical validation on scenes with up to 5 objects (22 convex hull components) showing robust recovery of physically valid and simulation‑ready models.
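The sparsity argument can be illustrated with a small sketch. It assumes, as a simplification, that each convex hull owns one block of variables and that each contact couples exactly two hulls, so the Hessian has nonzero blocks only on the diagonal and at contacting pairs. The function names and the chain‑shaped contact list are hypothetical; only the hull count (22) comes from the summary above.

```python
def hessian_block_pattern(num_hulls, contacts):
    """Return the set of (row, col) block indices that can be nonzero.

    Assumption: one variable block per hull, one coupling per contact pair.
    """
    blocks = {(i, i) for i in range(num_hulls)}  # every hull couples to itself
    for i, j in contacts:
        blocks.add((i, j))
        blocks.add((j, i))  # the Hessian is symmetric
    return blocks


def fill_ratio(num_hulls, contacts):
    """Fraction of blocks that are nonzero, out of num_hulls**2."""
    return len(hessian_block_pattern(num_hulls, contacts)) / num_hulls ** 2


# Hypothetical contact graph: 22 hulls touching their neighbours in a chain.
NUM_HULLS = 22
chain_contacts = [(i, i + 1) for i in range(NUM_HULLS - 1)]

pattern = hessian_block_pattern(NUM_HULLS, chain_contacts)
print(len(pattern), "nonzero blocks out of", NUM_HULLS ** 2)
print(f"fill ratio: {fill_ratio(NUM_HULLS, chain_contacts):.3f}")
```

In this made‑up configuration, the number of nonzero blocks grows with the number of contacts rather than with the square of the number of hulls, which is the property a sparse solver can exploit.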
Methodology
- Initial Guess – A pretrained object detector provides rough bounding boxes and class‑level shape priors (e.g., convex hull templates).
- Differentiable Contact Model – Each object is represented by a set of convex hulls. The contact model computes inter‑object penetration depth and normal forces analytically, and crucially, its gradients are defined everywhere (no “dead zones” at contact).
- Joint Optimization Objective
  - Data term: Align rendered depth/segmentation with the observed sensor data.
  - Shape regularizer: Keep the refined hulls close to the learned priors (prevent degenerate geometry).
  - Physics term: Enforce non‑penetration and static equilibrium using the differentiable contacts.
- Augmented Lagrangian Solver – The objective is solved with an augmented‑Lagrangian method. Because each contact only couples a small subset of hulls, the Hessian matrix is block‑sparse. The authors derive a custom linear‑system solver that leverages this sparsity, yielding near‑linear scaling with the number of objects.
- Texture Refinement – After geometry converges, a differentiable rendering pass updates per‑object texture maps to better match the RGB observation, completing the simulation‑ready asset.
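The physics‑constrained refinement step can be sketched on a toy problem. This is not the paper's contact model or solver: it replaces convex hulls with two 1‑D intervals, uses a plain quadratic data term, and runs gradient descent inside a textbook augmented‑Lagrangian loop for a single non‑penetration constraint. All names and parameter values (`refine`, `rho`, `step`, the observed values) are assumptions chosen for illustration.

```python
# Toy 1-D stand-in: two intervals with centers x1 < x2 and half-widths h1, h2.
# The state z = (x1, x2, h1, h2) holds both pose (centers) and shape (half-widths).

GAP_GRAD = (-1.0, 1.0, -1.0, -1.0)  # gradient of gap(z) with respect to z


def gap(z):
    """Signed distance between the intervals; negative means penetration."""
    x1, x2, h1, h2 = z
    return (x2 - x1) - (h1 + h2)


def refine(obs, rho=10.0, outer_iters=20, inner_iters=200, step=0.02):
    """Minimize 0.5 * ||z - obs||^2 subject to gap(z) >= 0."""
    z = list(obs)
    lam = 0.0  # Lagrange multiplier for the non-penetration constraint
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            # Multiplier estimate: positive while the constraint is active
            # or violated, zero once the intervals are clearly separated.
            force = max(0.0, lam - rho * gap(z))
            # Gradient of the augmented Lagrangian: data term plus contact term.
            grad = [(zi - oi) - force * gi
                    for zi, oi, gi in zip(z, obs, GAP_GRAD)]
            z = [zi - step * g for zi, g in zip(z, grad)]
        lam = max(0.0, lam - rho * gap(z))  # multiplier update
    return z, lam


observed = (0.0, 1.0, 0.6, 0.6)  # hypothetical estimate that overlaps by 0.2
z, lam = refine(observed)
print("refined state:", [round(v, 4) for v in z])
print("remaining gap:", round(gap(z), 6))
```

Because pose and shape sit in the same state vector, the overlap is resolved by moving the centers apart and shrinking the half‑widths at the same time, ending near (-0.05, 1.05, 0.55, 0.55) with the intervals just touching. A sequential shape‑then‑pose scheme would have to attribute the whole correction to one or the other.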
Results & Findings
| Metric | Baseline (separate shape & pose) | Proposed Method |
|---|---|---|
| Pose RMSE (cm) | 2.8 | 1.4 |
| Shape IoU (convex hull) | 0.71 | 0.86 |
| Contact violation (mm) | 3.2 | 0.4 |
| Runtime (per scene) | 45 s | 9 s (5‑object case) |
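To make the size of these differences easier to read, the relative changes can be computed directly from the table. The snippet below only restates the table's numbers; the helper names are illustrative.

```python
# Values copied from the results table: (baseline, proposed).
RESULTS = {
    "pose_rmse_cm": (2.8, 1.4),
    "shape_iou": (0.71, 0.86),
    "contact_violation_mm": (3.2, 0.4),
    "runtime_s": (45.0, 9.0),
}


def relative_change(baseline, proposed):
    """Signed change of `proposed` relative to `baseline` (negative = reduction)."""
    return (proposed - baseline) / baseline


def speedup(baseline, proposed):
    """How many times faster the proposed method is."""
    return baseline / proposed


for name, (base, prop) in RESULTS.items():
    print(f"{name}: {relative_change(base, prop):+.1%}")
print(f"runtime speedup: {speedup(*RESULTS['runtime_s']):.1f}x")
```

Read this way, the table corresponds to a 50% lower pose error, a roughly 21% relative gain in IoU, an 87.5% reduction in contact violation, and a 5x speedup on the 5‑object case.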
- The system consistently produces physically stable configurations with near‑zero inter‑penetration (0.4 mm contact violation in the table above), even when the initial guess is heavily perturbed.
- Visual inspection shows that the refined textures blend seamlessly with the background, making the output ready for physics simulators (e.g., Isaac Gym, MuJoCo).
- Scaling experiments show that runtime grows only modestly as objects are added, consistent with the benefit of the sparse Hessian solver.
Practical Implications
- Robotics simulation pipelines can ingest raw RGB‑D streams from a lab bench and generate accurate, physics‑compliant models within seconds for downstream planning, reinforcement learning, or digital twins.
- Game and AR/VR developers gain a tool to auto‑populate scenes with realistic object meshes and collision shapes directly from scanned environments, cutting manual asset creation time.
- Manufacturing inspection systems can automatically reconstruct part geometries and verify assembly tolerances under physical constraints, enabling smarter quality‑control loops.
- Because the method is gradient‑based, it can be integrated into larger differentiable pipelines (e.g., end‑to‑end policy learning where the perception module is jointly trained with the controller).
Limitations & Future Work
- The current formulation assumes rigid, convex‑hull‑approximated objects; deformable or highly concave items would need additional handling.
- Texture refinement relies on a single RGB view; complex lighting or specular surfaces may limit visual fidelity.
- Real‑world deployment still requires a decent initial detection; extreme occlusions can cause the optimizer to converge to local minima.
- Future directions include extending the contact model to soft contacts, supporting non‑convex primitives, and exploring online (frame‑by‑frame) updates for dynamic scenes.
Authors
- Wei‑Cheng Huang
- Jiaheng Han
- Xiaohan Ye
- Zherong Pan
- Kris Hauser
Paper Information
- arXiv ID: 2602.20150v1
- Categories: cs.RO, cs.CV
- Published: February 23, 2026