[Paper] REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image
Source: arXiv - 2605.30338v1
Overview
REST3D tackles a long‑standing problem: turning a single RGB photograph into a 3‑D scene that not only looks right but also behaves plausibly under physics. By combining visual cues with a physics‑aware scene representation, the authors’ pipeline produces reconstructions that avoid floating objects, inter‑penetrations, and other stability issues that cripple downstream simulation, VR, and robotics applications.
Key Contributions
- Agentic Physical Scene Understanding – Introduces a scene‑tree that encodes each object’s support relationship (what’s on the floor, what’s on top of what) from a gravity‑centric view.
- Structure‑Guided Initialization – Leverages existing image‑to‑3D models but aligns their outputs to the scene‑tree, giving a physically plausible starting point.
- Physics‑Constrained Refinement – Optimizes object poses with differentiable physics constraints (no inter‑penetration, support, center‑of‑mass stability) while preserving visual fidelity to the input image.
- Comprehensive Evaluation – Demonstrates large reductions in physical errors (floating, penetration) on both synthetic benchmarks and real‑world photo collections, while keeping reconstruction quality competitive.
- End‑to‑End Demo in VR – Shows that the reconstructed scenes can be directly imported into immersive environments for realistic human‑object interaction.
Methodology
Scene‑Tree Construction
- A lightweight neural module parses the input image and predicts a hierarchy: floor → supporting objects → supported objects.
- Each node stores estimated 3‑D pose, size, and a binary “supported‑by” link, providing a strong prior on how objects should be stacked.
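The hierarchy described above could be represented with a small tree data structure. The following is a minimal sketch, not the authors' implementation: the class name `SceneNode`, its fields, and the example objects and dimensions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


# eq=False keeps identity-based comparison, which avoids recursive
# equality checks through the parent/child links.
@dataclass(eq=False)
class SceneNode:
    """One object in a hypothetical scene-tree (illustrative only)."""
    name: str
    position: tuple[float, float, float]  # object center, z is up
    size: tuple[float, float, float]      # axis-aligned extents (x, y, z)
    supported_by: Optional["SceneNode"] = None
    children: list["SceneNode"] = field(default_factory=list)

    def add_child(self, child: "SceneNode") -> "SceneNode":
        """Record that `child` rests on this node."""
        child.supported_by = self
        self.children.append(child)
        return child


def support_chain(node: Optional[SceneNode]) -> list[str]:
    """Names from the given node down to the root support (the floor)."""
    chain = []
    while node is not None:
        chain.append(node.name)
        node = node.supported_by
    return chain


# Example values are made up: floor -> table -> cup.
floor = SceneNode("floor", (0.0, 0.0, 0.0), (5.0, 5.0, 0.0))
table = floor.add_child(SceneNode("table", (1.0, 1.0, 0.4), (1.2, 0.8, 0.8)))
cup = table.add_child(SceneNode("cup", (1.1, 1.0, 0.85), (0.08, 0.08, 0.1)))

print(support_chain(cup))  # ['cup', 'table', 'floor']
```

Storing the link in both directions (parent to children and child to parent) makes it cheap to walk the tree top-down when placing objects and bottom-up when asking what ultimately supports an object.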
Initial 3‑D Guess
- Off‑the‑shelf image‑to‑3D networks (e.g., Pix2Vox, Im3D) generate coarse geometry for every detected object.
- The scene‑tree is used to snap these meshes into a physically plausible arrangement (e.g., placing a cup on a table rather than floating).
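One simple way to picture the snapping step is a vertical translation that puts each object's bottom face on its supporter's top face. The sketch below assumes axis-aligned bounding boxes and a z-up world; the `Box` class, `snap_onto` helper, and all dimensions are hypothetical, and the paper's actual alignment may also adjust rotation and scale.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one reconstructed mesh (z is up)."""
    name: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]

    @property
    def top(self) -> float:
        return self.center[2] + self.size[2] / 2

    @property
    def bottom(self) -> float:
        return self.center[2] - self.size[2] / 2


def snap_onto(child: Box, parent: Box) -> Box:
    """Translate `child` vertically so its bottom touches the parent's top."""
    x, y, _ = child.center
    return replace(child, center=(x, y, parent.top + child.size[2] / 2))


# Hypothetical initial guesses: the table hovers 10 cm above the floor
# and the cup hovers above the table.
floor = Box("floor", (0.0, 0.0, 0.0), (5.0, 5.0, 0.0))
table = Box("table", (1.0, 1.0, 0.5), (1.2, 0.8, 0.8))
cup = Box("cup", (1.1, 1.0, 1.0), (0.08, 0.08, 0.1))

# Snap top-down so each parent is already in place before its children.
table = snap_onto(table, floor)
cup = snap_onto(cup, table)

print("table bottom:", round(table.bottom, 3))
print("cup gap above table:", abs(round(cup.bottom - table.top, 3)))
```

Processing parents before children matters: if the cup were snapped first, moving the table afterwards would reintroduce a gap.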
Physics‑Constrained Optimization
- A differentiable physics engine evaluates three constraints: no inter‑penetration, support stability, and a center of mass whose vertical projection falls inside the convex hull of the supporting contact surfaces.
- An objective combines these physics penalties with a visual consistency term (projected silhouettes should match the original image).
- Gradient‑based refinement nudges object poses until both physics and visual terms are satisfied.
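The interplay between the physics penalties and the visual term can be illustrated with a toy problem. The sketch below is not the paper's differentiable physics engine: it assumes a single object in a 2-D (x, z) slice, replaces the silhouette term with a squared distance to an image-suggested position, and uses hand-written gradients. The function names, weights, and numbers are all illustrative assumptions.

```python
def loss_and_grad(x, z, obs, support, half_height, w_vis=1.0, w_phys=10.0):
    """Toy objective for one object resting on one support.

    obs     -- (x, z) center suggested by the image (stand-in for silhouettes)
    support -- (x_min, x_max, z_top) of the supporting surface
    """
    x_obs, z_obs = obs
    x_min, x_max, z_top = support
    # Signed contact gap: > 0 means floating, < 0 means penetrating.
    gap = (z - half_height) - z_top
    # Signed overhang of the center of mass beyond the support interval
    # (zero while the center of mass is above the support).
    over = max(0.0, x - x_max) - max(0.0, x_min - x)
    visual = (x - x_obs) ** 2 + (z - z_obs) ** 2
    physics = gap ** 2 + over ** 2
    loss = w_vis * visual + w_phys * physics
    grad_x = 2 * w_vis * (x - x_obs) + 2 * w_phys * over
    grad_z = 2 * w_vis * (z - z_obs) + 2 * w_phys * gap
    return loss, (grad_x, grad_z)


def refine(x, z, obs, support, half_height, lr=0.02, steps=200):
    """Plain gradient descent on the toy objective."""
    for _ in range(steps):
        _, (gx, gz) = loss_and_grad(x, z, obs, support, half_height)
        x, z = x - lr * gx, z - lr * gz
    return x, z


# Hypothetical setup: the image suggests a cup that floats 10 cm above a
# table top at z = 0.8 and hangs 30 cm past the table edge at x = 1.0.
obs = (1.3, 0.95)
support = (0.0, 1.0, 0.8)
x0, z0 = obs
x1, z1 = refine(x0, z0, obs, support, half_height=0.05)

loss_before, _ = loss_and_grad(x0, z0, obs, support, 0.05)
loss_after, _ = loss_and_grad(x1, z1, obs, support, 0.05)
print(f"loss {loss_before:.3f} -> {loss_after:.3f}, pose ({x1:.3f}, {z1:.3f})")
```

Because the constraints enter as soft penalties, the result is a compromise: with these weights the object ends up much closer to a supported, contacting pose but not exactly on it. Raising `w_phys` tightens the physical terms at the expense of the visual one, and typically calls for a smaller step size.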
Output
- A fully textured, simulation‑ready scene graph (meshes + rigid‑body transforms) that can be exported to game engines, robotics simulators, or VR platforms.
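To make "meshes + rigid‑body transforms" concrete, here is a minimal sketch of how such a scene graph might be serialized for hand-off to another tool. The JSON layout, field names, and mesh paths are hypothetical and do not correspond to the paper's export format or to any particular engine's schema.

```python
import json


def make_object(name, mesh_path, translation,
                rotation_wxyz=(1.0, 0.0, 0.0, 0.0), supported_by=None):
    """Describe one rigid body: a mesh file plus its pose in world space."""
    return {
        "name": name,
        "mesh": mesh_path,                     # hypothetical file path
        "translation": list(translation),      # meters, z is up
        "rotation_wxyz": list(rotation_wxyz),  # unit quaternion, identity by default
        "supported_by": supported_by,          # name of the supporting object
    }


scene = {
    "up_axis": "z",
    "units": "meters",
    "objects": [
        make_object("table", "meshes/table.obj", (1.0, 1.0, 0.4), supported_by="floor"),
        make_object("cup", "meshes/cup.obj", (1.1, 1.0, 0.85), supported_by="table"),
    ],
}

text = json.dumps(scene, indent=2)
restored = json.loads(text)
print(text)
```

Keeping the support link alongside each transform preserves the scene‑tree in the exported file, so a downstream importer could rebuild the hierarchy rather than only a flat list of posed meshes.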
Results & Findings
| Dataset | Reduction in physical errors vs. baseline | Reconstruction quality (IoU) |
|---|---|---|
| Synthetic (SUN3D‑Phys) | 78 % fewer floating objects, 85 % fewer inter‑penetrations | 0.71 (≈ baseline) |
| Real‑World (COCO‑VR) | 71 % fewer floating objects, 80 % fewer inter‑penetrations | 0.68 (baseline 0.66) |
- Stability in Simulation – When dropped into a physics engine, REST3D scenes remained static in >95 % of trials, compared to <60 % for prior single‑image methods.
- Visual Fidelity – Despite the heavy physics regularization, silhouette overlap and texture alignment stayed on par with state‑of‑the‑art visual reconstruction pipelines.
- User Study – Participants rated VR scenes built from REST3D as “more believable” (4.3/5) than those from competing methods (3.6/5).
Practical Implications
- Rapid Content Creation – Game studios and AR/VR developers can turn concept art or product photos into ready‑to‑use 3‑D assets without manual modeling or physics tweaking.
- Robotics & Simulation – Training environments for manipulation or navigation can be auto‑generated from real‑world images, with far fewer floating or inter‑penetrating objects to destabilize simulated interactions.
- E‑Commerce & Virtual Try‑On – Retailers can generate stable 3‑D product displays from catalog photos, enabling realistic AR previews that don’t suffer from floating or clipping artifacts.
- Digital Twin Construction – Facility managers can quickly digitize a workspace from a single shot, producing a physically plausible twin for safety analysis or layout planning.
Limitations & Future Work
- Dependence on Accurate Object Detection – Mis‑detected or missing objects break the scene‑tree, leading to cascading errors in the refinement stage.
- Simplified Physics Model – The current constraints assume rigid bodies and ignore deformable or articulated objects (e.g., curtains, cables).
- Scalability to Highly Cluttered Scenes – As object count grows, the optimization becomes slower; the authors suggest hierarchical or learned solvers as next steps.
- Generalization to Outdoor Environments – The gravity‑support prior works best indoors; extending the framework to outdoor scenes with uneven terrain is an open challenge.
Overall, REST3D marks a significant stride toward turning everyday photos into physically trustworthy 3‑D worlds, opening doors for faster prototyping, richer VR experiences, and more realistic simulation pipelines.
Authors
- Xiaoxuan Ma
- Jiashun Wang
- Nicolas Ugrinovic
- Yehonathan Litman
- Kris Kitani
Paper Information
- arXiv ID: 2605.30338v1
- Categories: cs.CV
- Published: May 28, 2026