[Paper] Self-Evolving 3D Scene Generation from a Single Image

Published: December 9, 2025 at 01:44 PM EST
4 min read

Source: arXiv - 2512.08905v1

Overview

EvoScene tackles the long‑standing problem of turning a single 2‑D photograph into a complete, textured 3‑D scene. By combining existing 3‑D geometry generators with video diffusion models, the framework iteratively refines both shape and appearance without any additional training. The result is a ready‑to‑use 3‑D mesh that preserves structural fidelity and view‑consistent textures, something that remains elusive for most object‑centric pipelines.

Key Contributions

  • Self‑evolving pipeline that alternates between 2‑D and 3‑D domains, progressively improving a scene from a single image.
  • Hybrid use of complementary models: a 3‑D generator supplies coarse geometry, while a video diffusion model injects rich visual details and fills unseen regions.
  • Three‑stage iterative process (Spatial Prior Initialization → Visual‑guided Mesh Generation → Spatial‑guided Novel View Generation) that converges to a stable, high‑quality mesh.
  • Training‑free: the system works out‑of‑the‑box with pre‑trained models, removing the need for costly scene‑specific data collection.
  • Demonstrated superiority over strong baselines on diverse indoor and outdoor scenes, with measurable gains in geometric stability, texture consistency, and completeness.

Methodology

  1. Spatial Prior Initialization – The input photo is fed to a pre‑trained 3‑D generation model (e.g., a NeRF‑style or voxel‑based network) to obtain an initial coarse mesh and depth map. This gives a rough layout of walls, floors, and large objects.
  2. Visual‑guided 3‑D Scene Mesh Generation – The coarse mesh is rendered from multiple viewpoints and passed to a video diffusion model (a 2‑D generative model trained on sequential frames). The diffusion model refines each view’s texture, adds missing details, and predicts plausible content for occluded areas. The refined images are then re‑projected back onto the mesh, updating vertex colors and textures.
  3. Spatial‑guided Novel View Generation – Using the enriched mesh as a spatial prior, the video diffusion model synthesizes novel viewpoints that were not present in the original photo. These novel views are fed back into the 3‑D generator to further correct geometry (e.g., fixing thin structures or correcting depth errors).
  4. Iterative Loop – Steps 2 and 3 repeat for a few cycles until changes fall below a threshold, yielding a stable, high‑resolution mesh with consistent textures across all angles.
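
To make the alternation between the 2‑D and 3‑D domains concrete, here is a minimal sketch of the loop in Python. The objects and callables (geom_model, video_diffusion, renderer, change_fn) are hypothetical stand-ins for illustration, not the authors' actual code:

```python
# Minimal sketch of the EvoScene-style 2-D/3-D alternation described above.
# geom_model, video_diffusion, renderer, and change_fn are duck-typed stand-ins,
# not the paper's actual API.

def evolve_scene(image, geom_model, video_diffusion, renderer, change_fn,
                 max_iters=3, tol=1e-3, num_views=8):
    # Step 1: Spatial Prior Initialization - coarse mesh and depth map from one image
    # (the depth map is part of the spatial prior; kept here for clarity).
    mesh, depth = geom_model.initialize(image)

    for _ in range(max_iters):
        # Step 2: Visual-guided Mesh Generation - render the current mesh from several
        # viewpoints, refine the frames with the video diffusion model, then re-project
        # the refined frames onto the mesh as updated vertex colors and textures.
        views = renderer.render(mesh, num_views=num_views)
        refined = video_diffusion.refine(views, reference=image)
        new_mesh = renderer.bake_textures(mesh, refined)

        # Step 3: Spatial-guided Novel View Generation - synthesize viewpoints absent
        # from the input photo, conditioned on the enriched mesh, and feed them back
        # to the 3-D generator to correct geometry (thin structures, depth errors).
        novel_views = video_diffusion.synthesize_novel(new_mesh)
        new_mesh = geom_model.update_geometry(new_mesh, novel_views)

        # Step 4: Iterative Loop - stop once the per-iteration change falls below tol.
        converged = change_fn(mesh, new_mesh) < tol
        mesh = new_mesh
        if converged:
            break

    return mesh  # textured scene mesh, exportable to OBJ/GLTF
```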

The whole pipeline is modular: any off‑the‑shelf 3‑D generator and any video diffusion model can be swapped in, making the system adaptable to future model upgrades.
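
One way to picture that modularity is as a pair of small interfaces that any 3‑D generator or video diffusion model must satisfy to be slotted into the loop above. The protocols below are an assumption for illustration, not the paper's code:

```python
from typing import Any, Protocol, Sequence, Tuple

Mesh = Any    # placeholder types; a real pipeline would use concrete mesh/image classes
Image = Any

class GeometryGenerator(Protocol):
    """Minimal surface an off-the-shelf 3-D generator would need to expose."""
    def initialize(self, image: Image) -> Tuple[Mesh, Image]: ...
    def update_geometry(self, mesh: Mesh, views: Sequence[Image]) -> Mesh: ...

class VideoDiffusion(Protocol):
    """Minimal surface a pre-trained video diffusion model would need to expose."""
    def refine(self, views: Sequence[Image], reference: Image) -> Sequence[Image]: ...
    def synthesize_novel(self, mesh: Mesh) -> Sequence[Image]: ...
```

Swapping in a newer 3‑D generator or diffusion model would then amount to writing a thin adapter that exposes these methods.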

Results & Findings

  • Geometric Stability: Compared to baseline single‑image NeRF and object‑centric diffusion pipelines, EvoScene reduces average depth error by ~30% on benchmark indoor scenes.
  • Texture Consistency: Across 360° rotations, the generated textures maintain color and pattern continuity, scoring roughly 25% lower on a perceptual distance metric (i.e., higher perceptual similarity) than competing methods.
  • Unseen‑Region Completion: The video diffusion component hallucinates plausible geometry and texture for occluded areas (e.g., the back wall of a room), achieving a higher structural similarity index (SSIM) against ground‑truth 3‑D scans.
  • Runtime: A full reconstruction (including the 3‑iteration loop) finishes in roughly 8–12 minutes on a single RTX 4090, which is practical for many content‑creation pipelines.
  • Output: The final product is a standard OBJ/GLTF mesh with UV‑mapped textures, ready for import into game engines, AR/VR platforms, or CAD tools.
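
As a quick illustration of consuming such an output downstream, the snippet below loads a GLB scene, inspects it, and re-exports it. The filename is a placeholder and trimesh is just one common choice of Python mesh library:

```python
# Minimal sketch of consuming such an output downstream; "scene.glb" is a
# placeholder filename and trimesh is just one common Python mesh library.
import trimesh

scene = trimesh.load("scene.glb")              # GLTF/GLB scene with UV-mapped textures
print(scene.bounds)                            # overall axis-aligned bounding box
for name, geom in scene.geometry.items():      # per-object meshes inside the scene
    print(name, len(geom.vertices), len(geom.faces))
scene.export("scene_roundtrip.glb")            # re-export, e.g. after edits
```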

Practical Implications

  • Rapid Prototyping for Games & VR: Designers can generate entire room or outdoor layouts from a single reference photo, drastically cutting asset‑creation time.
  • E‑Commerce & Interior Design: A single product or room photo can be turned into an interactive 3‑D model for virtual try‑ons or layout planning.
  • Robotics & Simulation: Autonomous systems can bootstrap environment maps from a single camera snapshot, improving simulation fidelity without extensive scanning.
  • Content‑Creation Tools: Integration into 3‑D creation tools and engines (Blender, Unity, Unreal) as a "single‑image import" feature, allowing artists to focus on higher‑level design rather than low‑level modeling.
  • Low‑Cost Digitization: Small studios or hobbyists without multi‑view capture rigs can still produce high‑quality 3‑D assets, democratizing 3‑D content production.

Limitations & Future Work

  • Dependence on Pre‑trained Model Quality: The pipeline inherits biases and failure modes from the underlying 3‑D generator and video diffusion model (e.g., struggles with highly reflective or transparent surfaces).
  • Scale of Scenes: Extremely large outdoor environments still pose memory and resolution challenges; the current implementation works best on scenes that fit within a few meters of depth.
  • Iterative Convergence: While three iterations suffice for most cases, some complex topologies may require more loops, increasing compute time.
  • Future Directions: Authors suggest integrating depth‑aware diffusion models, exploring hierarchical scene decomposition (room‑level → object‑level), and extending the framework to handle dynamic scenes or multi‑modal inputs (e.g., depth sensors).

Authors

  • Kaizhi Zheng
  • Yue Fan
  • Jing Gu
  • Zishuo Xu
  • Xuehai He
  • Xin Eric Wang

Paper Information

  • arXiv ID: 2512.08905v1
  • Categories: cs.CV
  • Published: December 9, 2025