[Paper] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Source: arXiv - 2512.25073v1
Overview
GaMO (Geometry‑aware Multi‑view Outpainting) tackles a core pain point in 3‑D reconstruction: building accurate models from only a handful of camera views. By “outpainting” the existing images (extending their field of view rather than synthesizing entirely new viewpoints), the method keeps geometry consistent while substantially widening scene coverage. The authors show that this zero‑shot diffusion approach outperforms prior diffusion‑based pipelines in both quality and speed, making sparse‑view reconstruction far more practical for real‑world projects.
Key Contributions
- Multi‑view outpainting: Expands each input image’s observable area rather than generating new camera poses, preserving geometric relationships across views.
- Geometry‑aware denoising: Introduces a diffusion denoiser that conditions on depth and camera geometry, reducing cross‑view inconsistencies.
- Zero‑shot operation: No task‑specific training required; the framework works directly with pretrained diffusion models.
- Speed boost: Achieves ~25× faster inference than state‑of‑the‑art diffusion‑based reconstruction pipelines (under 10 min for typical scenes).
- State‑of‑the‑art results: Sets new PSNR and LPIPS benchmarks on Replica and ScanNet++ for 3, 6, and 9 input views.
Methodology
- Input preprocessing – The sparse set of RGB‑D images is projected into a shared 3‑D coordinate system using known camera poses.
- Outpainting mask generation – For each view, a peripheral mask defines the region to be hallucinated (the “outpainted” area); see the sketch after this list.
- Multi‑view conditioning – The diffusion model receives not only the masked RGB image but also depth maps and a coarse geometry proxy (e.g., a voxel grid or point cloud) that encodes the scene’s shape.
- Geometry‑aware denoising – During each diffusion step, the denoiser is guided by the geometry proxy, ensuring that the newly generated pixels align with the underlying 3‑D structure and with neighboring views.
- Fusion & reconstruction – Outpainted images are re‑projected into the global coordinate frame and merged via volumetric TSDF fusion (sketched below) to produce the final mesh or point cloud.
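The following is a minimal sketch of the first three steps under stated assumptions, not the authors' implementation: it back‑projects a depth map into the shared world frame with a known pose, pads each view with a peripheral outpainting mask, and hands the padded image to an off‑the‑shelf Stable Diffusion 2 inpainting pipeline from diffusers. The helper names (backproject_to_world, peripheral_mask_and_pad), file path, prompt, padding width, and the choice of the SD‑2 inpainting checkpoint are all illustrative; crucially, the paper's geometry‑aware conditioning on depth and the geometry proxy is not reproduced here.

```python
# Minimal sketch of input projection, mask generation, and diffusion outpainting.
# Geometry-aware denoising (the paper's contribution) is NOT included.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def backproject_to_world(depth, K, cam_to_world):
    """Lift a depth map (H, W) to 3-D world points using 3x3 intrinsics K and
    a known 4x4 camera-to-world pose (step 1: shared coordinate system)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)                     # camera-space points
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return (cam_to_world @ pts_h)[:3].T                       # (H*W, 3) world points

def peripheral_mask_and_pad(rgb, pad=128):
    """Pad the image on all sides and return (padded RGB, mask); the mask is
    white over the new border region to be outpainted (step 2)."""
    h, w, _ = rgb.shape
    padded = np.zeros((h + 2 * pad, w + 2 * pad, 3), dtype=rgb.dtype)
    padded[pad:pad + h, pad:pad + w] = rgb
    mask = np.full(padded.shape[:2], 255, dtype=np.uint8)
    mask[pad:pad + h, pad:pad + w] = 0                        # keep original pixels
    return padded, mask

# Off-the-shelf pretrained backbone; zero-shot, no task-specific training.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

rgb = np.array(Image.open("view_000.png").convert("RGB"))    # placeholder path
padded, mask = peripheral_mask_and_pad(rgb)
outpainted = pipe(
    prompt="an indoor room, consistent lighting",             # illustrative prompt
    image=Image.fromarray(padded).resize((512, 512)),
    mask_image=Image.fromarray(mask).resize((512, 512)),
).images[0]
```

In the actual method, this denoising step is additionally guided by depth maps and the geometry proxy at every diffusion step, which is what enforces cross‑view consistency.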
All steps run on a single GPU; the diffusion backbone is a standard pretrained Stable Diffusion‑2 model, so developers can swap in alternative diffusion backbones without retraining.
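Continuing the sketch, the fusion step can be illustrated with any standard TSDF pipeline; the snippet below uses Open3D's ScalableTSDFVolume as a stand‑in. It is a generic illustration, not the paper's code: fuse_views is a hypothetical helper, the intrinsics and voxel size are placeholder values, and it assumes each outpainted view comes with a depth map and camera pose.

```python
# Generic TSDF fusion of (outpainted) RGB-D views with Open3D, a stand-in
# for the fusion & reconstruction step, not the paper's exact pipeline.
import numpy as np
import open3d as o3d

def fuse_views(rgbs, depths, poses, intrinsic, voxel_size=0.01):
    """rgbs: HxWx3 uint8 arrays; depths: HxW float32 arrays in meters;
    poses: 4x4 camera-to-world matrices; intrinsic: PinholeCameraIntrinsic."""
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_size,
        sdf_trunc=0.04,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
    )
    for rgb, depth, pose in zip(rgbs, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(rgb),
            o3d.geometry.Image(depth),
            depth_scale=1.0,           # depth already in meters
            depth_trunc=6.0,           # drop far-away (likely noisy) depth
            convert_rgb_to_intensity=False,
        )
        # Open3D expects world-to-camera extrinsics, hence the inverse pose.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))
    return volume.extract_triangle_mesh()

# Placeholder intrinsics for 640x480 views.
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 320.0, 240.0)
# mesh = fuse_views(outpainted_rgbs, outpainted_depths, poses, intrinsic)
```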
Results & Findings
- Quantitative gains: On the Replica dataset, GaMO improves PSNR by 1.8 dB and reduces LPIPS by 0.07 compared to the previous best diffusion method when only 3 views are available (both metrics are recapped after this list). Similar margins appear on ScanNet++ across the 6‑ and 9‑view setups.
- Coverage: Outpainting expands the observable scene area by ~30 % beyond the convex hull of the original cameras, eliminating the “blind‑spot” artifacts common in sparse‑view pipelines.
- Geometric consistency: Visual inspection shows far fewer stitching seams and depth discontinuities between adjacent outpainted views, thanks to the geometry‑aware denoiser.
- Speed: End‑to‑end processing for a typical indoor scene (≈2 M voxels) finishes in 8 min, versus >3 h for the closest diffusion baseline.
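For context on the metrics quoted above (these are the standard definitions, not something specific to the paper): PSNR measures pixel‑wise fidelity of a rendered view against ground truth on a log scale, while LPIPS is a learned perceptual distance where lower is better. A 1.8 dB PSNR gain corresponds to roughly a 34 % reduction in mean squared error.

```latex
% Standard PSNR definition for 8-bit images (MAX = 255).
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2
```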
Practical Implications
- Rapid prototyping – Developers can now generate decent 3‑D reconstructions from a few handheld phone captures, enabling quick AR/VR content creation without a dense capture rig.
- Robotics & navigation – Sparse LiDAR or RGB‑D sensors on drones or autonomous vehicles can be complemented with outpainting to fill occluded regions, improving map completeness on the fly.
- Cost‑effective scanning – Companies offering 3‑D scanning services can reduce the number of required passes, cutting labor and equipment wear while still delivering high‑fidelity models.
- Plug‑and‑play integration – Because GaMO works zero‑shot with off‑the‑shelf diffusion models, it can be wrapped into existing pipelines (e.g., Unity, Unreal, Open3D) with minimal code changes.
Limitations & Future Work
- Reliance on accurate depth – The geometry proxy assumes reasonably correct depth; noisy depth sensors can degrade outpainting quality.
- Outdoor scalability – Experiments focus on indoor datasets; handling large‑scale outdoor scenes with varying illumination remains an open challenge.
- Model size – While faster than prior diffusion methods, the approach still depends on a heavyweight diffusion backbone, which may be prohibitive for edge devices.
- Future directions suggested by the authors include lightweight diffusion adapters, better handling of dynamic objects, and extending the outpainting concept to multimodal inputs (e.g., semantic masks).
Authors
- Yi‑Chuan Huang
- Hao‑Jen Chien
- Chin‑Yang Lin
- Ying‑Huan Chen
- Yu‑Lun Liu
Paper Information
- arXiv ID: 2512.25073v1
- Categories: cs.CV
- Published: December 31, 2025