[Paper] REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Published: May 28, 2026 at 01:59 PM EDT

Source: arXiv - 2605.30338v1

Overview

REST3D tackles a long‑standing problem: turning a single RGB photograph into a 3‑D scene that not only looks right but also behaves correctly under physics. By combining visual cues with a physics‑aware scene representation, the pipeline produces reconstructions that avoid floating objects, inter‑penetrations, and other stability issues that hamper downstream simulation, VR, and robotics applications.

Key Contributions

Agentic Physical Scene Understanding – Introduces a scene‑tree that encodes each object’s support relationship (what’s on the floor, what’s on top of what) from a gravity‑centric view.
Structure‑Guided Initialization – Leverages existing image‑to‑3D models but aligns their outputs to the scene‑tree, giving a physically plausible starting point.
Physics‑Constrained Refinement – Optimizes object poses with differentiable physics constraints (no inter‑penetration, support, center‑of‑mass stability) while preserving visual fidelity to the input image.
Comprehensive Evaluation – Demonstrates large reductions in physical errors (floating, penetration) on both synthetic benchmarks and real‑world photo collections, while keeping reconstruction quality competitive.
End‑to‑End Demo in VR – Shows that the reconstructed scenes can be directly imported into immersive environments for realistic human‑object interaction.

Methodology

Scene‑Tree Construction
- A lightweight neural module parses the input image and predicts a hierarchy: floor → supporting objects → supported objects.
- Each node stores estimated 3‑D pose, size, and a binary “supported‑by” link, providing a strong prior on how objects should be stacked.
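A scene‑tree of this kind can be pictured as a small data structure. The sketch below is only an illustration, not the authors' code: the class `SceneNode`, its field names, and the helpers `build_scene_tree` and `support_order` are all hypothetical, and poses are reduced to a position with z pointing up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneNode:
    """One object in the (hypothetical) scene-tree."""
    name: str
    position: Vec3                       # estimated center, z is up (assumption)
    size: Vec3                           # axis-aligned extents
    supported_by: Optional[str] = None   # name of the supporting object; None for the floor
    children: List["SceneNode"] = field(default_factory=list)


def build_scene_tree(nodes: List[SceneNode]) -> SceneNode:
    """Link every node to its supporter and return the single root (the floor)."""
    by_name = {n.name: n for n in nodes}
    roots = []
    for n in nodes:
        if n.supported_by is None:
            roots.append(n)
        else:
            # Raises KeyError if a node names a supporter that was never detected.
            by_name[n.supported_by].children.append(n)
    if len(roots) != 1:
        raise ValueError("expected exactly one root node (the floor)")
    return roots[0]


def support_order(root: SceneNode) -> List[str]:
    """Breadth-first order: floor, then supporting objects, then supported objects."""
    order, queue = [], [root]
    while queue:
        node = queue.pop(0)
        order.append(node.name)
        queue.extend(node.children)
    return order


nodes = [
    SceneNode("cup", (0.2, 0.1, 0.85), (0.08, 0.08, 0.10), supported_by="table"),
    SceneNode("floor", (0.0, 0.0, -0.05), (5.0, 5.0, 0.10)),
    SceneNode("table", (0.0, 0.0, 0.40), (1.2, 0.8, 0.80), supported_by="floor"),
    SceneNode("chair", (1.0, 0.0, 0.45), (0.5, 0.5, 0.90), supported_by="floor"),
]
print(support_order(build_scene_tree(nodes)))
```

Visiting nodes in this order means every object is handled after the object that supports it, which is what the later placement steps rely on.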
Initial 3‑D Guess
- Off‑the‑shelf image‑to‑mesh networks (e.g., Pix2Vox, Im3D) generate coarse geometry for every detected object.
- The scene‑tree is used to snap these meshes into a physically plausible arrangement (e.g., placing a cup on a table rather than floating).
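The snapping step can be illustrated with a minimal sketch. It assumes axis‑aligned objects described only by a vertical center and a height, with z pointing up; the function `snap_to_supports` and the dictionary layout are hypothetical and stand in for whatever alignment the authors actually perform.

```python
from typing import Dict, Optional


def snap_to_supports(
    objects: Dict[str, Dict[str, float]],
    supported_by: Dict[str, Optional[str]],
) -> Dict[str, float]:
    """Return a new vertical center for each object so that its bottom face
    rests exactly on the top face of its supporter.

    objects:      name -> {"center_z": ..., "height": ...}
    supported_by: name -> supporter name, or None for the floor
    Assumes the support links form a tree (no cycles).
    """
    snapped: Dict[str, float] = {}

    def place(name: str) -> float:
        if name in snapped:
            return snapped[name]
        obj = objects[name]
        parent = supported_by.get(name)
        if parent is None:
            z = obj["center_z"]                      # the floor keeps its pose
        else:
            parent_top = place(parent) + objects[parent]["height"] / 2
            z = parent_top + obj["height"] / 2       # bottom face touches parent top
        snapped[name] = z
        return z

    for name in objects:
        place(name)
    return snapped


objects = {
    "floor": {"center_z": -0.05, "height": 0.10},
    "table": {"center_z": 0.55, "height": 0.80},   # initially floating
    "cup": {"center_z": 1.20, "height": 0.10},     # initially floating
}
supported_by = {"floor": None, "table": "floor", "cup": "table"}
print(snap_to_supports(objects, supported_by))
```

Only the vertical coordinate is adjusted here; a fuller treatment would also handle rotation and horizontal placement.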
Physics‑Constrained Optimization
- A differentiable physics engine evaluates three constraints: no inter‑penetration, stable support, and a center of mass that projects inside the convex hull of the supporting surfaces.
- An objective combines these physics penalties with a visual consistency term (projected silhouettes should match the original image).
- Gradient‑based refinement nudges object poses until both physics and visual terms are satisfied.
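The interplay between the physics penalties and the visual term can be sketched on a toy problem. The example below is not the paper's optimizer: it replaces the differentiable physics engine with hand‑derived gradients of quadratic penalties, works in a 2‑D (x, z) slice, and uses made‑up weights. The function `refine_pose` and all of its parameters are hypothetical.

```python
from typing import Tuple


def refine_pose(
    x: float, z: float,
    half_height: float,
    support_top: float,
    support_x_range: Tuple[float, float],
    x_obs: float, z_obs: float,
    w_phys: float = 10.0, w_vis: float = 1.0,
    lr: float = 0.01, steps: int = 2000,
) -> Tuple[float, float]:
    """Gradient descent on
        w_phys * (gap^2 + overhang^2) + w_vis * ((x - x_obs)^2 + (z - z_obs)^2)
    where gap > 0 means floating, gap < 0 means penetration, and overhang is
    how far the center lies outside the supporting interval."""
    x_min, x_max = support_x_range
    for _ in range(steps):
        # Contact term: the object's bottom should coincide with the support top.
        gap = (z - half_height) - support_top
        grad_z = 2 * w_phys * gap + 2 * w_vis * (z - z_obs)
        # Stability term: the center should stay above the support.
        overhang = min(0.0, x - x_min) + max(0.0, x - x_max)
        grad_x = 2 * w_phys * overhang + 2 * w_vis * (x - x_obs)
        x -= lr * grad_x
        z -= lr * grad_z
    return x, z


# A cup that starts floating and hanging past the table edge.
x, z = refine_pose(
    x=1.3, z=1.0, half_height=0.05,
    support_top=0.8, support_x_range=(0.0, 1.0),
    x_obs=1.2, z_obs=0.9,
)
print(round(x, 4), round(z, 4))
```

Because both terms are soft penalties, the result is a compromise: with these weights the cup ends up about 4.5 mm above the table rather than in exact contact. Raising `w_phys` relative to `w_vis` shrinks that residual at the cost of drifting further from the observed pose.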
Output
- A fully textured, simulation‑ready scene graph (meshes + rigid‑body transforms) that can be exported to game engines, robotics simulators, or VR platforms.
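As a rough picture of what a simulation‑ready scene graph might contain, the sketch below serializes meshes and rigid‑body transforms to JSON. The schema, the field names, and the mesh paths are invented for illustration; the source does not specify an export format.

```python
import json
from typing import Any, Dict, List


def export_scene(objects: List[Dict[str, Any]]) -> str:
    """Serialize objects (mesh reference + rigid-body transform + support link)
    into a JSON string using a hypothetical schema."""
    scene = {"up_axis": "z", "objects": []}
    for obj in objects:
        scene["objects"].append({
            "name": obj["name"],
            "mesh": obj["mesh"],
            "transform": {
                "translation": list(obj["translation"]),
                "rotation_xyzw": list(obj["rotation_xyzw"]),  # unit quaternion
            },
            "supported_by": obj.get("supported_by"),
        })
    return json.dumps(scene, indent=2)


scene_json = export_scene([
    {"name": "table", "mesh": "meshes/table.obj",
     "translation": (0.0, 0.0, 0.4), "rotation_xyzw": (0.0, 0.0, 0.0, 1.0),
     "supported_by": "floor"},
    {"name": "cup", "mesh": "meshes/cup.obj",
     "translation": (0.2, 0.1, 0.85), "rotation_xyzw": (0.0, 0.0, 0.0, 1.0),
     "supported_by": "table"},
])
print(scene_json)
```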

Results & Findings

Dataset	Reduction in physical errors vs. baseline	Reconstruction quality (IoU)
Synthetic (SUN3D‑Phys)	78 % fewer floating objects, 85 % fewer inter‑penetrations	0.71 (≈ baseline)
Real‑World (COCO‑VR)	71 % fewer floating objects, 80 % fewer inter‑penetrations	0.68 (baseline 0.66)
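For readers unfamiliar with the IoU column, intersection‑over‑union compares a predicted shape with the ground truth. The summary does not say whether the score is computed on voxels or on silhouettes, so the sketch below simply assumes both shapes are given as sets of occupied grid cells; the function name `iou` is illustrative.

```python
from typing import Iterable, Tuple

Cell = Tuple[int, int, int]


def iou(pred: Iterable[Cell], gt: Iterable[Cell]) -> float:
    """Intersection-over-union of two sets of occupied cells."""
    pred_set, gt_set = set(pred), set(gt)
    union = pred_set | gt_set
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(pred_set & gt_set) / len(union)


pred = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
gt = {(1, 0, 0), (1, 1, 0), (2, 1, 0)}
print(iou(pred, gt))  # 2 shared cells out of 4 total -> 0.5
```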

Stability in Simulation – When dropped into a physics engine, REST3D scenes remained static in >95 % of trials, compared to <60 % for prior single‑image methods.
Visual Fidelity – Despite the heavy physics regularization, silhouette overlap and texture alignment stayed on par with state‑of‑the‑art visual reconstruction pipelines.
User Study – Participants rated VR scenes built from REST3D as “more believable” (4.3/5) than those from competing methods (3.6/5).

Practical Implications

Rapid Content Creation – Game studios and AR/VR developers can turn concept art or product photos into ready‑to‑use 3‑D assets without manual modeling or physics tweaking.
Robotics & Simulation – Training environments for manipulation or navigation can be auto‑generated from real‑world images, so that simulated interactions start from physically stable object placements.
E‑Commerce & Virtual Try‑On – Retailers can generate stable 3‑D product displays from catalog photos, enabling realistic AR previews that don’t suffer from floating or clipping artifacts.
Digital Twin Construction – Facility managers can quickly digitize a workspace from a single shot, producing a physically stable twin for safety analysis or layout planning.

Limitations & Future Work

Dependence on Accurate Object Detection – Mis‑detected or missing objects break the scene‑tree, leading to cascading errors in the refinement stage.
Simplified Physics Model – The current constraints assume rigid bodies and ignore deformable or articulated objects (e.g., curtains, cables).
Scalability to Highly Cluttered Scenes – As object count grows, the optimization becomes slower; the authors suggest hierarchical or learned solvers as next steps.
Generalization to Outdoor Environments – The gravity‑support prior works best indoors; extending the framework to outdoor scenes with uneven terrain is an open challenge.

Overall, REST3D marks a significant stride toward turning everyday photos into physically trustworthy 3‑D worlds, opening doors for faster prototyping, richer VR experiences, and more realistic simulation pipelines.

Authors

Xiaoxuan Ma
Jiashun Wang
Nicolas Ugrinovic
Yehonathan Litman
Kris Kitani

Paper Information

arXiv ID: 2605.30338v1
Categories: cs.CV
Published: May 28, 2026
