[Paper] SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Published: June 2, 2026 at 01:59 PM EDT

Source: arXiv - 2606.03994v1

Overview

Reconstructing a full 3‑D scene from just one RGB image has long been a dream for robotics, AR/VR, and game developers. SimuScene pushes the field forward by not only generating object meshes but also guaranteeing that the assembled scene stays physically stable when dropped into a physics engine. The authors close the gap between “looks‑good” reconstructions and “simulation‑ready” environments, a step that could dramatically speed up data‑generation pipelines for manipulation and embodied AI.

Key Contributions

Physics‑in‑the‑loop reconstruction – integrates a physics engine during shape and layout generation, turning collision/penetration errors into quantitative correction signals.
Gravity‑axis stretching & amodal shape resampling – novel geometric adjustments driven by simulated gravity that fix interpenetrations and floating objects on the fly.
Compositional pipeline – treats each object independently (shape, pose) yet jointly optimizes them for a globally stable scene.
State‑of‑the‑art stability & alignment – achieves the highest scores on benchmark metrics for physical stability and geometric accuracy.
Real‑world validation – demonstrates the reconstructed scenes in humanoid control and robot‑arm manipulation tasks, showing immediate utility for downstream robotics pipelines.

Methodology

1. Single‑image object lifting – a pretrained neural “lifter” predicts a coarse 3‑D mesh and pose for every detected object in the image.
2. Initial composition – the meshes are placed in a shared coordinate system according to the predicted poses, forming a raw scene that often contains interpenetrations or floating objects.
3. Physics diagnostic loop
- The raw scene is dropped into a lightweight physics engine (e.g., PyBullet) under gravity.
- The engine reports penetration depth and support failures for each object.
- These metrics are transformed into gradient‑like correction signals.
4. Geometry correction
- Gravity‑axis stretching: scales objects along the gravity direction to resolve penetrations without altering visual fidelity.
- Amodal shape resampling: re‑generates the hidden (amodal) part of an object’s mesh when support errors indicate an unrealistic shape.
5. Iterative refinement – steps 3‑4 repeat until the physics engine reports a stable, non‑intersecting configuration, yielding a simulation‑ready scene.
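
The diagnose-and-correct cycle can be illustrated with a toy sketch. This is not the authors' implementation: it replaces the physics engine with a hand-written check on a vertical stack of boxes, models only each object's extent along the gravity axis, and assumes that "stretching" means moving an object's bottom face toward its support while keeping its top fixed. The names Box, diagnose, and refine, and the step and tol parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Toy object: only its extent along the gravity (z) axis is modelled."""
    name: str
    z_min: float   # height of the bottom face
    height: float  # extent along the gravity axis

    @property
    def z_max(self) -> float:
        return self.z_min + self.height


def diagnose(stack):
    """Stand-in for the physics engine's report (an assumption, not PyBullet).

    Returns the signed gap between each box and its support, where the
    support is the ground (z = 0) or the box listed just before it.
    gap < 0 means penetration of depth -gap; gap > 0 means the box floats.
    """
    gaps = {}
    support_top = 0.0  # ground plane
    for box in stack:
        gaps[box.name] = box.z_min - support_top
        support_top = box.z_max
    return gaps


def refine(stack, step=0.5, tol=1e-3, max_iters=50):
    """Repeat diagnose + gravity-axis stretch until every gap is within tol.

    Each pass moves a box's bottom face a fraction `step` of the way toward
    its support and changes the height by the same amount, so the top face
    (assumed here to be the visible part) does not move.
    Returns the number of correction passes that were applied.
    """
    for iteration in range(max_iters):
        gaps = diagnose(stack)
        if all(abs(gap) <= tol for gap in gaps.values()):
            return iteration
        for box in stack:
            correction = step * gaps[box.name]
            box.z_min -= correction
            box.height += correction
    raise RuntimeError("scene did not settle within max_iters")


# A table that sinks 2 cm into the ground and a mug floating 1 cm above it.
scene = [Box("table", -0.02, 0.75), Box("mug", 0.74, 0.10)]
passes = refine(scene)
print(passes, diagnose(scene))
```

A real pipeline would obtain the penetration and support diagnostics from the simulator rather than from a one-dimensional gap test, and would also need the amodal resampling step for cases that a simple rescale cannot fix.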

The whole pipeline runs end‑to‑end without a separate post‑hoc cleanup stage, making the physics engine an active participant rather than an afterthought.

Results & Findings

Metric	Prior Art	SimuScene
Physical stability (percentage of scenes that settle without interpenetration)	71 %	92 %
Mean 3‑D IoU with ground‑truth meshes	0.48	0.57
Average pose error (degrees)	12.3°	7.1°
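
This summary does not describe the evaluation protocol, so the following is only a sketch of how two of the tabulated quantities are commonly computed. It assumes that 3‑D IoU is measured on voxelized occupancy and that "pose error" refers to the geodesic angle between predicted and ground-truth rotations. All function names are hypothetical, and the boxes and angles are made-up illustrations, not data from the paper.

```python
import math


def voxel_iou(pred, gt):
    """IoU of two occupancy sets of integer (i, j, k) voxel indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        return 1.0  # both empty: treated as perfect agreement by convention
    return len(pred & gt) / len(union)


def box_voxels(lo, hi):
    """All integer voxels inside the half-open box [lo, hi)."""
    return {
        (i, j, k)
        for i in range(lo[0], hi[0])
        for j in range(lo[1], hi[1])
        for k in range(lo[2], hi[2])
    }


def rotation_z(deg):
    """3x3 rotation matrix for a rotation of `deg` degrees about the z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the element-wise product summed over entries.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


# Two 4x4x4 boxes offset by one voxel share 48 voxels out of a union of 80.
iou = voxel_iou(box_voxels((0, 0, 0), (4, 4, 4)), box_voxels((1, 0, 0), (5, 4, 4)))
err = rotation_error_deg(rotation_z(30.0), rotation_z(37.1))
print(f"IoU = {iou:.2f}, rotation error = {err:.1f} deg")
```

Averaging these per-object values over a test set would give numbers of the kind reported in the table, although the paper may use a different voxel resolution or error definition.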

Stability boost: The physics‑in‑the‑loop approach reduces catastrophic failures (objects sinking or hovering) by > 20 % compared to post‑hoc correction methods.
Geometric fidelity: Even with the extra physics constraints, the reconstructed shapes stay closer to ground truth than baseline lifters.
Task transfer: In a simulated robot‑arm pick‑and‑place benchmark, policies trained on SimuScene‑generated environments achieved a 15 % higher success rate than those trained on conventional single‑image reconstructions.

Practical Implications

Rapid synthetic data generation – developers can turn a single photo of a tabletop or a room into a physics‑ready simulation, slashing the time and cost of manual 3‑D modeling.
Robotics simulation pipelines – SimuScene can feed directly into ROS/Gazebo or Unity‑based simulators, enabling more realistic training environments for manipulation, navigation, and human‑robot interaction.
AR/VR content creation – game studios and AR developers can auto‑generate stable scene assets from concept art or reference photos, reducing manual asset authoring.
Digital twins for inspection – maintenance bots can reconstruct a workspace from a single snapshot and immediately run safety checks (e.g., ensuring no hidden collisions) before deployment.

Limitations & Future Work

Dependence on object detection quality – mis‑detections or missing objects propagate errors into the physics loop; improving upstream perception remains critical.
Simplified material assumptions – the current physics model treats all objects as rigid, uniform density bodies, limiting realism for deformable or articulated items.
Scalability to cluttered scenes – while effective on tabletop setups, performance degrades with dozens of tightly packed objects; future work will explore hierarchical or sparse simulation strategies.
Real‑world transfer – the pipeline has been validated mainly in simulation; bridging the gap to noisy real‑world sensor data (e.g., lighting variations, occlusions) is an open challenge.

Overall, SimuScene demonstrates that embedding physics directly into the reconstruction loop is a practical way to generate simulation‑ready 3‑D scenes from a single image, which should benefit developers building embodied AI systems.

Authors

Inhee Lee
Sangwon Baik
Sungjoo Kim
Hyeonwoo Kim
Hyunsoo Cha
Hanbyul Joo

Paper Information

arXiv ID: 2606.03994v1
Categories: cs.CV, cs.RO
Published: June 2, 2026
