[Paper] SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image
Source: arXiv - 2606.03994v1
Overview
Reconstructing a full 3‑D scene from just one RGB image has long been a dream for robotics, AR/VR, and game developers. SimuScene pushes the field forward by not only generating object meshes but also guaranteeing that the assembled scene stays physically stable when dropped into a physics engine. The authors close the gap between “looks‑good” reconstructions and “simulation‑ready” environments, a step that could dramatically speed up data‑generation pipelines for manipulation and embodied AI.
Key Contributions
- Physics‑in‑the‑loop reconstruction – integrates a physics engine during shape and layout generation, turning collision/penetration errors into quantitative correction signals.
- Gravity‑axis stretching & amodal shape resampling – novel geometric adjustments driven by simulated gravity that fix interpenetrations and floating objects on the fly.
- Compositional pipeline – treats each object independently (shape, pose) yet jointly optimizes them for a globally stable scene.
- State‑of‑the‑art stability & alignment – achieves the highest scores on benchmark metrics for physical stability and geometric accuracy.
- Real‑world validation – demonstrates the reconstructed scenes in humanoid control and robot‑arm manipulation tasks, showing immediate utility for downstream robotics pipelines.
Methodology
- Single‑image object lifting – a pretrained neural “lifter” predicts a coarse 3‑D mesh and pose for every detected object in the image.
- Initial composition – the meshes are placed in a shared coordinate system according to the predicted poses, forming a raw scene that often contains interpenetrations or floating objects.
- Physics diagnostic loop
  - The raw scene is dropped into a lightweight physics engine (e.g., PyBullet) under gravity.
  - The engine reports penetration depth and support failures for each object.
  - These metrics are transformed into gradient‑like correction signals.
- Geometry correction
  - Gravity‑axis stretching: scales objects along the gravity direction to resolve penetrations without altering visual fidelity.
  - Amodal shape resampling: re‑generates the hidden (amodal) part of an object’s mesh when support errors indicate an unrealistic shape.
- Iterative refinement – the physics diagnostic loop and the geometry correction repeat until the physics engine reports a stable, non‑intersecting configuration, yielding a simulation‑ready scene.
The whole pipeline runs end‑to‑end without needing a separate post‑hoc cleanup stage, making the physics engine an active participant rather than an afterthought.
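The diagnose-and-correct loop described above can be illustrated with a deliberately small sketch. It is not the authors' implementation: no physics engine is involved, each object is reduced to a vertical extent along the gravity axis, and the names `Box`, `diagnose`, `stretch_along_gravity`, and `refine` are hypothetical. The only idea it borrows from the method is the loop itself: measure a signed support error, correct the geometry along gravity while leaving the top of the object where it was, and repeat until the scene is within tolerance.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Toy stand-in for an object: its vertical extent along the gravity (z) axis."""
    name: str
    z_min: float      # bottom of the object
    z_max: float      # top of the object (assumed visible, so it is never moved)
    support_z: float  # height of the surface the object should rest on


def diagnose(box: Box) -> float:
    """Signed gap to the support: > 0 means floating, < 0 means penetrating."""
    return box.z_min - box.support_z


def stretch_along_gravity(box: Box, step: float) -> Box:
    """Move the bottom face toward the support while the top face stays fixed.

    This is the same as scaling the object's height about its top face.
    `step` in (0, 1] damps the correction, much like a learning rate.
    """
    gap = diagnose(box)
    return Box(box.name, box.z_min - step * gap, box.z_max, box.support_z)


def refine(boxes, tol=1e-3, step=0.5, max_iters=50):
    """Repeat diagnose -> correct until every gap is within `tol`."""
    for iteration in range(max_iters):
        gaps = [diagnose(b) for b in boxes]
        if all(abs(g) <= tol for g in gaps):
            return boxes, iteration
        boxes = [
            stretch_along_gravity(b, step) if abs(g) > tol else b
            for b, g in zip(boxes, gaps)
        ]
    return boxes, max_iters


# Hypothetical two-object scene on a table at z = 0.
scene = [
    Box("mug", z_min=0.05, z_max=0.15, support_z=0.0),    # floats 5 cm above the table
    Box("book", z_min=-0.02, z_max=0.03, support_z=0.0),  # sunk 2 cm into the table
]
refined, iters = refine(scene)
for b in refined:
    print(f"{b.name}: gap={diagnose(b):+.4f} m, height={b.z_max - b.z_min:.4f} m")
print("iterations:", iters)
```

In a real pipeline the `diagnose` step would presumably be replaced by contact and penetration queries from the physics engine, and the correction would act on full meshes rather than on two scalars per object.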
Results & Findings
| Metric | Prior Art | SimuScene |
|---|---|---|
| Physical stability (percentage of scenes that settle without interpenetration) | 71 % | 92 % |
| Mean 3‑D IoU with ground‑truth meshes | 0.48 | 0.57 |
| Average pose error (degrees) | 12.3° | 7.1° |
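- Stability boost: The physics‑in‑the‑loop approach reduces catastrophic failures (objects sinking or hovering) by more than 20 percentage points compared to post‑hoc correction methods, in line with the rise in stable scenes from 71 % to 92 % in the table above.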
- Geometric fidelity: Even with the extra physics constraints, the reconstructed shapes stay closer to ground truth than baseline lifters.
- Task transfer: In a simulated robot‑arm pick‑and‑place benchmark, policies trained on SimuScene‑generated environments achieved a 15 % higher success rate than those trained on conventional single‑image reconstructions.
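For readers unfamiliar with the two geometric metrics in the table, the sketch below shows one common way to compute them. It is an illustration under simplifying assumptions, not the paper's evaluation code: IoU is computed on axis-aligned boxes rather than on meshes, pose error is taken to be the geodesic angle between two rotation matrices, and the function names are hypothetical.

```python
import math


def box_volume(lo, hi):
    """Volume of an axis-aligned box from its min/max corners (0 if empty)."""
    return math.prod(max(0.0, h - l) for l, h in zip(lo, hi))


def box_iou_3d(a_lo, a_hi, b_lo, b_hi):
    """Intersection over union of two axis-aligned 3-D boxes."""
    inter_lo = [max(x, y) for x, y in zip(a_lo, b_lo)]
    inter_hi = [min(x, y) for x, y in zip(a_hi, b_hi)]
    inter = box_volume(inter_lo, inter_hi)
    union = box_volume(a_lo, a_hi) + box_volume(b_lo, b_hi) - inter
    return inter / union if union > 0 else 0.0


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(r_pred^T r_gt) equals the element-wise product summed over all entries.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees, as a nested list."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# A unit cube against a copy shifted by half its width along x:
# intersection 0.5, union 1.5, so the IoU is one third.
iou = box_iou_3d([0, 0, 0], [1, 1, 1], [0.5, 0, 0], [1.5, 1, 1])

# A prediction rotated 10 degrees about z relative to the ground truth.
err = rotation_error_deg(rot_z(10.0), rot_z(0.0))

print(f"IoU = {iou:.3f}, pose error = {err:.1f} deg")
```

Mesh-level IoU is typically computed on voxelized or sampled shapes, so the box version here only conveys the ratio being measured.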
Practical Implications
- Rapid synthetic data generation – developers can turn a single photo of a tabletop or a room into a physics‑ready simulation, slashing the time and cost of manual 3‑D modeling.
- Robotics simulation pipelines – SimuScene can feed directly into ROS/Gazebo or Unity‑based simulators, enabling more realistic training environments for manipulation, navigation, and human‑robot interaction.
- AR/VR content creation – game studios and AR developers can auto‑generate stable scene assets from concept art or reference photos, reducing manual asset authoring.
- Digital twins for inspection – maintenance bots can reconstruct a workspace from a single snapshot and immediately run safety checks (e.g., ensuring no hidden collisions) before deployment.
Limitations & Future Work
- Dependence on object detection quality – mis‑detections or missing objects propagate errors into the physics loop; improving upstream perception remains critical.
- Simplified material assumptions – the current physics model treats all objects as rigid bodies of uniform density, limiting realism for deformable or articulated items.
- Scalability to cluttered scenes – while effective on tabletop setups, performance degrades with dozens of tightly packed objects; future work will explore hierarchical or sparse simulation strategies.
- Real‑world transfer – the pipeline has been validated mainly in simulation; bridging the gap to noisy real‑world sensor data (e.g., lighting variations, occlusions) is an open challenge.
Overall, SimuScene demonstrates that embedding physics directly into the reconstruction loop is a practical and powerful way to generate simulation‑ready 3‑D scenes from a single image—opening new doors for developers building the next generation of embodied AI systems.
Authors
- Inhee Lee
- Sangwon Baik
- Sungjoo Kim
- Hyeonwoo Kim
- Hyunsoo Cha
- Hanbyul Joo
Paper Information
- arXiv ID: 2606.03994v1
- Categories: cs.CV, cs.RO
- Published: June 2, 2026