[Paper] Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Published: February 23, 2026 at 01:58 PM EST
Source: arXiv - 2602.20150v1

Overview

The paper tackles a core bottleneck for robotics and simulation pipelines: turning raw sensor data of a messy, real‑world tabletop into a simulation‑ready scene with accurate 3‑D shapes, poses, and physically plausible contacts. By combining a differentiable contact model with a linear solver that exploits sparsity, the authors deliver a system that jointly refines the geometry and placement of several interacting objects, even in heavily cluttered setups.

Key Contributions

  • Physics‑aware joint optimization of object shape and pose, rather than treating them as separate steps.
  • Introduction of a shape‑differentiable contact model that remains globally differentiable, enabling gradient‑based updates through contact constraints.
  • Exploitation of the structured sparsity of the augmented‑Lagrangian Hessian to build a scalable linear solver whose runtime grows modestly with the number of objects.
  • An end‑to‑end pipeline that combines:
    1. Learning‑based object detection & coarse initialization,
    2. Physics‑constrained joint shape‑pose refinement, and
    3. Differentiable texture refinement for visual realism.
  • Empirical validation on scenes with up to 5 objects (22 convex hull components) showing robust recovery of physically valid and simulation‑ready models.
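
To make "gradients defined everywhere" concrete, here is a minimal sketch. It is not the paper's contact model: it assumes two spheres as a stand‑in for convex hulls and a softplus‑smoothed penetration depth. The function names and the sharpness parameter `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def smooth_penetration(c1, r1, c2, r2, beta=50.0):
    """Softplus-smoothed penetration depth between two spheres (illustrative)."""
    dist = np.linalg.norm(np.asarray(c2, float) - np.asarray(c1, float))
    depth = (r1 + r2) - dist  # positive when the spheres overlap
    # softplus(beta * depth) / beta is a smooth approximation of max(depth, 0)
    return np.logaddexp(0.0, beta * depth) / beta

def penetration_grad_radius(c1, r1, c2, r2, beta=50.0):
    """Derivative of the smoothed depth w.r.t. r1: sigmoid(beta * depth)."""
    dist = np.linalg.norm(np.asarray(c2, float) - np.asarray(c1, float))
    depth = (r1 + r2) - dist
    # Numerically stable sigmoid written with tanh
    return 0.5 * (1.0 + np.tanh(0.5 * beta * depth))

# Usage: the shape gradient stays nonzero even when the spheres are separated.
separated = penetration_grad_radius([0, 0, 0], 0.1, [0.3, 0, 0], 0.1)
overlapping = penetration_grad_radius([0, 0, 0], 0.1, [0.15, 0, 0], 0.1)
print(separated, overlapping)
```

A hard `max(depth, 0)` penalty would give a zero gradient for separated objects, so an optimizer would get no signal about shape until contact occurs. The smoothed version trades a small bias near contact for a usable gradient. Note that this toy is still non‑differentiable with respect to position when the two centers coincide.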

Methodology

  1. Initial Guess – A pretrained object detector provides rough bounding boxes and class‑level shape priors (e.g., convex hull templates).
  2. Differentiable Contact Model – Each object is represented by a set of convex hulls. The contact model computes inter‑object penetration depth and normal forces analytically, and crucially, its gradients are defined everywhere (no “dead zones” at contact).
  3. Joint Optimization Objective
    • Data term: Align rendered depth/segmentation with the observed sensor data.
    • Shape regularizer: Keep the refined hulls close to the learned priors (prevent degenerate geometry).
    • Physics term: Enforce non‑penetration and static equilibrium using the differentiable contacts.
  4. Augmented Lagrangian Solver – The objective is solved with an augmented‑Lagrangian method. Because each contact only couples a small subset of hulls, the Hessian matrix is block‑sparse. The authors derive a custom linear‑system solver that leverages this sparsity, yielding near‑linear scaling with the number of objects.
  5. Texture Refinement – After geometry converges, a differentiable rendering pass updates per‑object texture maps to better match the RGB observation, completing the simulation‑ready asset.
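
The joint objective and the augmented‑Lagrangian step can be sketched on a toy problem. This is not the authors' solver. It assumes two discs on a line whose positions and radii are optimized together, a quadratic data term and shape regularizer, and a single non‑penetration constraint. Static equilibrium is not modeled, and the inner solve is plain gradient descent. All names and parameter values are hypothetical.

```python
import numpy as np

def solve_toy_scene(x_obs, r_prior, w_shape=1.0, rho=10.0, outer=10, inner=500):
    """Augmented-Lagrangian refinement of z = [x1, x2, r1, r2] (toy sketch)."""
    x_obs = np.asarray(x_obs, dtype=float)
    r_prior = np.asarray(r_prior, dtype=float)
    z_ref = np.concatenate([x_obs, r_prior])  # observations and shape priors
    z = z_ref.copy()                          # coarse initialization
    weights = np.array([1.0, 1.0, w_shape, w_shape])
    # Overlap c(z) = x1 - x2 + r1 + r2 = a @ z; non-penetration means c <= 0.
    a = np.array([1.0, -1.0, 1.0, 1.0])
    lam = 0.0                                 # Lagrange multiplier estimate
    # Step size from an upper bound on the curvature of the inner problem.
    step = 1.0 / (weights.max() + rho * (a @ a))
    for _ in range(outer):
        for _ in range(inner):
            mult = max(0.0, lam + rho * (a @ z))       # active only if violated
            grad = weights * (z - z_ref) + mult * a    # data + shape + physics
            z = z - step * grad
        lam = max(0.0, lam + rho * (a @ z))            # multiplier update
    return z, lam

# Usage: observed centers 0.15 apart, prior radii 0.1 each (0.05 overlap).
z, lam = solve_toy_scene([0.0, 0.15], [0.1, 0.1])
print(z, lam)
```

With equal weights, the 0.05 overlap is resolved by moving each center and shrinking each radius by the same amount (0.0125), which shows why optimizing shape and pose jointly can give a different answer than fixing one and adjusting the other.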

Results & Findings

Metric                  | Baseline (separate shape & pose) | Proposed Method
Pose RMSE (cm)          | 2.8                              | 1.4
Shape IoU (convex hull) | 0.71                             | 0.86
Contact violation (mm)  | 3.2                              | 0.4
Runtime (per scene)     | 45 s                             | 9 s (5‑object case)

  • The system consistently produces physically stable configurations with sub‑millimeter contact violation, even when the initial guess is heavily perturbed.
  • Visual inspection shows that the refined textures closely match the RGB observation, making the output ready for physics simulators (e.g., Isaac Gym, MuJoCo).
  • Scaling experiments show that runtime grows only modestly as objects are added, which is consistent with the benefit expected from the sparse Hessian solver.
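
The scaling argument rests on the contact graph being sparse. The sketch below is an illustration rather than the paper's solver: it assumes a chain of bodies where each touches only its neighbors, and it counts how many Hessian blocks can be nonzero compared with a dense matrix. The function names are hypothetical, and fill‑in during factorization is ignored.

```python
import numpy as np

def hessian_block_mask(n_bodies, contact_pairs):
    """Boolean mask of potentially nonzero Hessian blocks (one block per body pair)."""
    mask = np.eye(n_bodies, dtype=bool)      # each body couples with itself
    for i, j in contact_pairs:
        mask[i, j] = mask[j, i] = True       # a contact couples bodies i and j
    return mask

def chain_contacts(n_bodies):
    """Assumed contact graph: body i touches body i + 1 only."""
    return [(i, i + 1) for i in range(n_bodies - 1)]

for n in (5, 10, 20):
    mask = hessian_block_mask(n, chain_contacts(n))
    # columns: bodies, nonzero blocks (3n - 2 for a chain), dense blocks (n^2)
    print(n, int(mask.sum()), n * n)
# prints:
# 5 13 25
# 10 28 100
# 20 58 400
```

Under this assumption the number of nonzero blocks grows linearly with the number of bodies, while a dense Hessian grows quadratically. Denser contact graphs, such as a tightly packed pile, would narrow that gap.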

Practical Implications

  • Robotics simulation pipelines can ingest raw RGB‑D streams from a lab bench and generate accurate, physics‑compliant models within seconds for downstream planning, reinforcement learning, or digital twins.
  • Game and AR/VR developers gain a tool to auto‑populate scenes with realistic object meshes and collision shapes directly from scanned environments, cutting manual asset creation time.
  • Manufacturing inspection systems can automatically reconstruct part geometries and verify assembly tolerances under physical constraints, enabling smarter quality‑control loops.
  • Because the method is gradient‑based, it can be integrated into larger differentiable pipelines (e.g., end‑to‑end policy learning where the perception module is jointly trained with the controller).

Limitations & Future Work

  • The current formulation assumes rigid, convex‑hull‑approximated objects; deformable or highly concave items would need additional handling.
  • Texture refinement relies on a single RGB view; complex lighting or specular surfaces may limit visual fidelity.
  • Real‑world deployment still requires a decent initial detection; extreme occlusions can cause the optimizer to converge to local minima.
  • Future directions include extending the contact model to soft contacts, supporting non‑convex primitives, and exploring online (frame‑by‑frame) updates for dynamic scenes.

Authors

  • Wei‑Cheng Huang
  • Jiaheng Han
  • Xiaohan Ye
  • Zherong Pan
  • Kris Hauser

Paper Information

  • arXiv ID: 2602.20150v1
  • Categories: cs.RO, cs.CV
  • Published: February 23, 2026