[Paper] REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Published: May 28, 2026 at 01:59 PM EDT
Source: arXiv - 2605.30338v1

Overview

REST3D tackles a long‑standing problem: turning a single RGB photograph into a 3‑D scene that not only looks right but also behaves plausibly under physics. By combining visual cues with a physics‑aware scene representation, the pipeline produces reconstructions that avoid floating objects, inter‑penetrations, and other stability issues that hamper downstream simulation, VR, and robotics applications.

Key Contributions

  • Agentic Physical Scene Understanding – Introduces a scene‑tree that encodes each object’s support relationship (what’s on the floor, what’s on top of what) from a gravity‑centric view.
  • Structure‑Guided Initialization – Leverages existing image‑to‑3D models but aligns their outputs to the scene‑tree, giving a physically plausible starting point.
  • Physics‑Constrained Refinement – Optimizes object poses with differentiable physics constraints (no inter‑penetration, support, center‑of‑mass stability) while preserving visual fidelity to the input image.
  • Comprehensive Evaluation – Reports 71–85 % fewer floating and inter‑penetrating objects than the baseline on both synthetic benchmarks and real‑world photo collections, while keeping reconstruction quality competitive.
  • End‑to‑End Demo in VR – Shows that the reconstructed scenes can be directly imported into immersive environments for realistic human‑object interaction.

Methodology

  1. Scene‑Tree Construction

    • A lightweight neural module parses the input image and predicts a hierarchy: floor → supporting objects → supported objects.
    • Each node stores estimated 3‑D pose, size, and a binary “supported‑by” link, providing a strong prior on how objects should be stacked.
  2. Initial 3‑D Guess

    • Off‑the‑shelf image‑to‑mesh networks (e.g., Pix2Vox, Im3D) generate coarse geometry for every detected object.
    • The scene‑tree is used to snap these meshes into a physically plausible arrangement (e.g., placing a cup on a table rather than floating).
  3. Physics‑Constrained Optimization

    • A differentiable physics engine evaluates three constraints: no inter‑penetration, stable support, and a center of mass whose projection along gravity falls inside the convex hull of the supporting contact surfaces.
    • An objective combines these physics penalties with a visual consistency term (projected silhouettes should match the original image).
    • Gradient‑based refinement nudges object poses until both physics and visual terms are satisfied.
  4. Output

    • A fully textured, simulation‑ready scene graph (meshes + rigid‑body transforms) that can be exported to game engines, robotics simulators, or VR platforms.
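
Steps 1 and 2 above can be illustrated with a small sketch. It is not the authors' implementation: the `Node` class, the axis‑aligned box representation, and the `snap_to_support` helper are all hypothetical names chosen for illustration, and the z axis is assumed to point up (against gravity). Each child is shifted only along the gravity axis until its bottom face rests on its supporter's top face.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One object in a hypothetical scene-tree, stored as an axis-aligned box."""
    name: str
    center: List[float]   # [x, y, z], z points up (assumption)
    size: List[float]     # full extents along x, y, z
    children: List["Node"] = field(default_factory=list)

    def top(self) -> float:
        return self.center[2] + self.size[2] / 2.0

    def bottom(self) -> float:
        return self.center[2] - self.size[2] / 2.0

    def add(self, child: "Node") -> "Node":
        """Record a 'supported-by' link: `child` rests on `self`."""
        self.children.append(child)
        return child


def snap_to_support(node: Node) -> None:
    """Walk the tree top-down and rest each child on its supporter's top face."""
    for child in node.children:
        # Move only along gravity so the horizontal layout is left untouched.
        child.center[2] += node.top() - child.bottom()
        # The child is now placed, so its own children can be snapped onto it.
        snap_to_support(child)


# Coarse initial poses: the table floats 0.1 above the floor, the cup 0.1 above the table.
floor = Node("floor", [0.0, 0.0, -0.05], [4.0, 4.0, 0.1])
table = floor.add(Node("table", [1.0, 0.5, 0.45], [1.2, 0.8, 0.7]))
cup = table.add(Node("cup", [1.1, 0.5, 0.95], [0.08, 0.08, 0.1]))

snap_to_support(floor)
print(round(table.bottom(), 3), round(cup.bottom(), 3))
```

Processing supporters before the objects they carry means a correction to the table automatically propagates to the cup, which is the main benefit of storing support as a tree rather than as independent object poses.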

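The refinement in step 3 can be sketched as plain gradient descent on a toy objective. This is an assumption‑laden simplification, not the paper's loss: objects are reduced to heights along the gravity axis, the physics term is a quadratic penalty on the signed support gap (positive for floating, negative for penetrating), the "visual" term simply penalizes moving away from the initial estimate, and the gradient is derived by hand instead of coming from a differentiable physics engine. The names `support_gaps` and `refine` and all weights are hypothetical.

```python
def support_gaps(z, heights, parents, floor_top=0.0):
    """Signed vertical gap between each object's bottom and its supporter's top.

    Positive means floating, negative means penetrating, zero means resting.
    """
    gaps = []
    for i, p in enumerate(parents):
        support_top = floor_top if p is None else z[p] + heights[p] / 2.0
        gaps.append((z[i] - heights[i] / 2.0) - support_top)
    return gaps


def refine(z_init, heights, parents, w_phys=10.0, w_vis=1.0, lr=0.02, steps=500):
    """Minimize w_vis * sum((z - z_init)^2) + w_phys * sum(gap^2) by gradient descent."""
    z = list(z_init)
    for _ in range(steps):
        # Gradient of the stand-in visual term: stay close to the initial estimate.
        grad = [2.0 * w_vis * (zi - z0) for zi, z0 in zip(z, z_init)]
        # Gradient of the physics term: each gap depends on the object (+1)
        # and, if it has one, on its supporter (-1).
        for i, (p, gap) in enumerate(zip(parents, support_gaps(z, heights, parents))):
            grad[i] += 2.0 * w_phys * gap
            if p is not None:
                grad[p] -= 2.0 * w_phys * gap
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Table (index 0) on the floor, cup (index 1) on the table; both start 0.1 too high.
heights = [0.7, 0.1]
parents = [None, 0]
z_init = [0.45, 0.95]

z_refined = refine(z_init, heights, parents)
gaps_before = support_gaps(z_init, heights, parents)
gaps_after = support_gaps(z_refined, heights, parents)
print([round(g, 3) for g in gaps_before], [round(g, 3) for g in gaps_after])
```

With these weights the gaps shrink from 0.1 to roughly 0.02 rather than to zero, because a soft penalty trades physical error against the visual term. Raising `w_phys` tightens the contact but also requires a smaller `lr` to keep the descent stable.
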
Results & Findings

| Dataset | Physical Error ↓ | Reconstruction Quality (IoU) |
|---|---|---|
| Synthetic (SUN3D‑Phys) | ‑78 % floating objects, ‑85 % inter‑penetrations vs. baseline | 0.71 (≈ baseline) |
| Real‑World (COCO‑VR) | ‑71 % floating, ‑80 % penetration | 0.68 (baseline 0.66) |

  • Stability in Simulation – When dropped into a physics engine, REST3D scenes remained static in >95 % of trials, compared to <60 % for prior single‑image methods.
  • Visual Fidelity – Despite the heavy physics regularization, silhouette overlap and texture alignment stayed on par with state‑of‑the‑art visual reconstruction pipelines.
  • User Study – Participants rated VR scenes built from REST3D as “more believable” (4.3/5) than those from competing methods (3.6/5).
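
The error categories reported above (floating, penetration, static stability) can be approximated with simple geometric checks. The sketch below is a static test under stated assumptions, not the paper's evaluation protocol, which drops scenes into a physics engine: `classify_contact` labels a signed vertical gap using a hypothetical tolerance, and `com_is_supported` tests whether the center of mass, projected onto the horizontal plane, lies inside the convex hull of the contact points.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def com_is_supported(com_xy, contact_points):
    """True if the projected center of mass lies inside the support polygon."""
    hull = convex_hull(contact_points)
    if len(hull) < 3:
        return False  # a point or line contact cannot balance a rigid body here
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], com_xy) >= 0 for i in range(n))


def classify_contact(gap, tol=1e-3):
    """Label a signed vertical gap; `tol` is an illustrative tolerance in scene units."""
    if gap > tol:
        return "floating"
    if gap < -tol:
        return "penetrating"
    return "resting"


# A box whose contact patch with the table is the unit square.
contacts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
print(com_is_supported((0.5, 0.5), contacts))   # centered over the patch
print(com_is_supported((1.2, 0.5), contacts))   # overhanging the edge
print(classify_contact(0.05), classify_contact(-0.02), classify_contact(0.0))
```

A check like this only captures static balance under gravity. It says nothing about friction, stacking order effects, or dynamics, which is why a simulation‑based test such as the one reported above is the stronger measure.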

Practical Implications

  • Rapid Content Creation – Game studios and AR/VR developers can turn concept art or product photos into ready‑to‑use 3‑D assets without manual modeling or physics tweaking.
  • Robotics & Simulation – Training environments for manipulation or navigation can be auto‑generated from real‑world images, so that simulated interactions start from physically stable configurations rather than from floating or inter‑penetrating objects.
  • E‑Commerce & Virtual Try‑On – Retailers can generate stable 3‑D product displays from catalog photos, enabling realistic AR previews that don’t suffer from floating or clipping artifacts.
  • Digital Twin Construction – Facility managers can quickly digitize a workspace from a single shot, producing a physics‑accurate twin for safety analysis or layout planning.

Limitations & Future Work

  • Dependence on Accurate Object Detection – Mis‑detected or missing objects break the scene‑tree, leading to cascading errors in the refinement stage.
  • Simplified Physics Model – The current constraints assume rigid bodies and ignore deformable or articulated objects (e.g., curtains, cables).
  • Scalability to Highly Cluttered Scenes – As object count grows, the optimization becomes slower; the authors suggest hierarchical or learned solvers as next steps.
  • Generalization to Outdoor Environments – The gravity‑support prior works best indoors; extending the framework to outdoor scenes with uneven terrain is an open challenge.

Overall, REST3D marks a significant stride toward turning everyday photos into physically trustworthy 3‑D worlds, opening doors for faster prototyping, richer VR experiences, and more realistic simulation pipelines.

Authors

  • Xiaoxuan Ma
  • Jiashun Wang
  • Nicolas Ugrinovic
  • Yehonathan Litman
  • Kris Kitani

Paper Information

  • arXiv ID: 2605.30338v1
  • Categories: cs.CV
  • Published: May 28, 2026