[Paper] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Source: arXiv - 2512.16913v1
Overview
The paper introduces Depth Any Panoramas (DAP), a foundation model that can predict accurate metric depth from 360° panoramic images across a wide range of indoor and outdoor scenes. By combining a massive, diverse training set with clever pseudo‑labeling and a geometry‑aware network design, DAP achieves strong zero‑shot performance on several public benchmarks, making panoramic depth estimation far more reliable for real‑world applications.
Key Contributions
- Large‑scale, heterogeneous training corpus – merges public datasets, high‑fidelity UE5‑generated synthetic panoramas, text‑to‑image generated scenes, and millions of real web panoramas.
- Three‑stage pseudo‑label curation pipeline – automatically refines noisy depth hints from unlabeled images, reducing the domain gap between synthetic/real and indoor/outdoor data.
- Plug‑and‑play range‑mask head – dynamically isolates near, mid, and far depth ranges, allowing the backbone to focus on the most informative regions.
- Sharpness‑centric and geometry‑centric loss functions – encourage crisp depth edges and enforce multi‑view geometric consistency across the equirectangular projection.
- Zero‑shot generalization – without any fine‑tuning, DAP outperforms or matches specialized models on Stanford2D3D, Matterport3D, Deep360, and other benchmarks.
Methodology
Data Construction
- Synthetic data: Rendered panoramic RGB‑D pairs in Unreal Engine 5 (UE5) with physically based lighting and diverse layouts.
- Text‑to‑image augmentation: Prompted diffusion models (e.g., Stable Diffusion) to generate novel panoramic scenes, then paired them with depth estimated by a strong monocular depth network.
- Web‑scale real panoramas: Crawled millions of 360° images from public sources (e.g., Flickr, Google Street View).
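To make the text‑to‑image branch concrete, here is a minimal sketch of generating a scene with a diffusion model and pseudo‑labeling it with an off‑the‑shelf monocular depth network. The specific checkpoints (Stable Diffusion v1.5, MiDaS DPT_Large), the 2:1 aspect ratio, and the prompt are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of the text-to-image augmentation branch (assumed checkpoints and prompt,
# not the exact models used in the paper).
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (assumed checkpoint choice).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# Off-the-shelf monocular depth network that supplies the pseudo-depth label.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_tf = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

def generate_rgbd_pair(prompt: str):
    """Return (RGB image, pseudo-depth map) for one generated scene."""
    # 2:1 aspect ratio approximates an equirectangular panorama (assumption).
    image = pipe(prompt, width=1024, height=512).images[0]   # PIL.Image, RGB
    rgb = np.array(image)                                    # H x W x 3
    with torch.no_grad():
        pred = midas(midas_tf(rgb).to(device))               # relative depth, 1 x h x w
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()
    return rgb, depth

rgb, depth = generate_rgbd_pair(
    "equirectangular 360 panorama of a sunlit modern living room"
)
```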
Pseudo‑Label Curation
- Stage 1 – Coarse filtering: Discard images with obvious depth inconsistencies (e.g., extreme blur, missing horizon).
- Stage 2 – Multi‑model consensus: Run several off‑the‑shelf depth estimators; keep only depth values where predictions agree within a tolerance.
- Stage 3 – Geometry refinement: Apply a multi‑view consistency check using the known equirectangular geometry to smooth and correct outliers, producing a reliable “pseudo‑ground‑truth” map.
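The sketch below illustrates the three‑stage idea in simplified form. The depth estimators are passed in as generic callables, and the geometry‑refinement stage is approximated with a wrap‑around neighborhood consistency check along the azimuth axis; both are assumptions standing in for the paper's actual multi‑view procedure.

```python
# Simplified three-stage pseudo-label curation (illustrative, not the paper's code).
import cv2
import numpy as np
from scipy.ndimage import median_filter

def curate_pseudo_label(pano_rgb, estimators, blur_threshold=50.0, tol=0.15):
    # Stage 1: coarse filtering -- reject blurry or degenerate panoramas.
    gray = cv2.cvtColor(pano_rgb, cv2.COLOR_RGB2GRAY)
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold:
        return None

    # Stage 2: multi-model consensus -- keep pixels where the off-the-shelf
    # estimators agree within a relative tolerance.
    preds = np.stack([est(pano_rgb) for est in estimators])   # (K, H, W)
    median = np.median(preds, axis=0)
    spread = np.abs(preds - median).max(axis=0) / (median + 1e-6)
    consensus = spread < tol

    # Stage 3: geometry refinement (simplified stand-in) -- smooth the depth with
    # wrap-around filtering (the equirectangular image wraps at 360 degrees) and
    # drop pixels that disagree with their smoothed neighborhood.
    smoothed = median_filter(median, size=5, mode="wrap")
    stable = np.abs(smoothed - median) / (median + 1e-6) < tol

    pseudo_gt = np.where(consensus & stable, median, np.nan)  # NaN = unlabeled
    return pseudo_gt
```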
Model Architecture
- Backbone: DINOv3‑Large (a vision transformer pre‑trained on massive image collections) provides strong general visual features.
- Range‑Mask Head: A lightweight module that predicts a soft mask separating depth ranges; the mask gates the backbone features before the final depth regression.
- Losses:
- Sharpness‑centric loss penalizes blurry depth edges, preserving object boundaries.
- Geometry‑centric loss enforces that depth values obey the spherical projection constraints (e.g., consistent depth along great‑circle arcs).
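A minimal PyTorch sketch of the range‑mask gating and a gradient‑matching sharpness term is shown below; module names, channel counts, and the exact loss form are illustrative assumptions, not the released implementation.

```python
# Illustrative range-mask gating head and sharpness term (assumed design details).
import torch
import torch.nn as nn

class RangeMaskHead(nn.Module):
    """Predict soft near/mid/far masks and gate the backbone features with them."""
    def __init__(self, channels: int, num_ranges: int = 3):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, num_ranges, kernel_size=1)
        self.range_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_ranges)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        masks = torch.softmax(self.mask_conv(feats), dim=1)   # (B, R, H, W)
        gated = sum(
            masks[:, r : r + 1] * conv(feats)                 # per-range branch
            for r, conv in enumerate(self.range_convs)
        )
        return gated                                          # same shape as feats

def sharpness_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Penalize blurry predicted edges by matching depth gradients to the label."""
    dpx = pred[..., :, 1:] - pred[..., :, :-1]
    dpy = pred[..., 1:, :] - pred[..., :-1, :]
    dgx = gt[..., :, 1:] - gt[..., :, :-1]
    dgy = gt[..., 1:, :] - gt[..., :-1, :]
    return (dpx - dgx).abs().mean() + (dpy - dgy).abs().mean()
```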
Training & Inference
- Trained end‑to‑end on the curated dataset with mixed synthetic/real batches.
- At inference, the range‑mask head automatically adapts to the scene’s distance distribution, requiring no extra parameters or post‑processing.
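For the mixed synthetic/real batches, a standard PyTorch sampling pattern such as the following would suffice; the two dataset objects and the 50/50 sampling ratio are assumptions for illustration.

```python
# Mixed synthetic/real batching sketch (assumed datasets and mixing ratio).
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

mixed = ConcatDataset([synthetic_ds, real_ds])   # each returns (rgb, depth) pairs

# Weight samples so each domain contributes roughly half of every batch,
# regardless of how many images each source holds.
weights = torch.cat([
    torch.full((len(synthetic_ds),), 0.5 / len(synthetic_ds)),
    torch.full((len(real_ds),), 0.5 / len(real_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler, num_workers=8)

for rgb, depth in loader:   # mixed synthetic/real batches
    ...                     # forward pass + the losses described above
```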
Results & Findings
| Benchmark | RMSE (↓) | Relative Improvement vs. Prior SOTA |
|---|---|---|
| Stanford2D3D (indoor) | 0.12 m | +15 % |
| Matterport3D (indoor) | 0.14 m | +12 % |
| Deep360 (outdoor) | 0.18 m | +18 % |
| Zero‑shot on unseen datasets (e.g., SUN360) | 0.21 m | — (baseline models degrade >30 %) |
- Robustness to distance: The range‑mask head dramatically reduces error spikes for far‑away objects, a common failure mode in prior panoramic depth models.
- Sharp edge preservation: Qualitative examples show clean depth discontinuities around walls, furniture, and foliage, thanks to the sharpness‑centric loss.
- Zero‑shot capability: Without any fine‑tuning, DAP maintains high accuracy on completely new panoramas, indicating strong generalization from the diverse training set.
Practical Implications
- VR/AR content creation – Developers can automatically generate metric depth maps for 360° assets, enabling realistic occlusion, lighting, and physics interactions without manual labeling.
- Robotics & autonomous navigation – Mobile robots equipped with a single panoramic camera can obtain reliable depth for SLAM or obstacle avoidance in both indoor warehouses and outdoor sites.
- Spatial analytics & mapping – Real‑estate, tourism, and GIS platforms can enrich panoramic tours with depth‑aware measurements (room dimensions, floor plans) at scale; see the unprojection sketch after this list.
- Content‑aware compression – Depth maps can guide variable‑bitrate encoding, allocating more bits to near objects while compressing distant background more aggressively.
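For the depth‑aware measurement use case, converting an equirectangular metric depth map into a 3D point cloud only requires the spherical unprojection sketched below. The longitude/latitude pixel convention used here is a common choice and an assumption, not taken from the paper.

```python
# Equirectangular depth -> 3D point cloud (assumed pixel-to-angle convention).
import numpy as np

def equirect_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth in metres -> (H*W, 3) XYZ points."""
    h, w = depth.shape
    # Pixel centres -> spherical angles.
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi    # azimuth in [-pi, pi)
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi    # elevation, top to bottom
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions on the sphere, scaled by metric depth.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    pts = depth[..., None] * np.stack([x, y, z], axis=-1)
    return pts.reshape(-1, 3)

# Example: a rough room width as the horizontal extent of the cloud.
# points = equirect_depth_to_points(depth_map)
# width_m = points[:, 0].max() - points[:, 0].min()
```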
Limitations & Future Work
- Residual domain gap: Although the pseudo‑label pipeline mitigates it, extreme lighting conditions (e.g., night‑time street panoramas) still cause occasional depth drift.
- Computational cost: The DINOv3‑Large backbone is heavyweight for edge devices; a distilled version or a lightweight transformer could broaden deployment.
- Dynamic scenes: The current model assumes static geometry; moving objects (people, vehicles) can produce inconsistent depth estimates. Future work may integrate temporal cues or motion segmentation.
Overall, DAP marks a significant step toward universal, high‑quality depth perception from panoramic imagery, opening up new possibilities for developers building immersive and spatially aware applications.
Authors
- Xin Lin
- Meixi Song
- Dizhe Zhang
- Wenxuan Lu
- Haodong Li
- Bo Du
- Ming‑Hsuan Yang
- Truong Nguyen
- Lu Qi
Paper Information
- arXiv ID: 2512.16913v1
- Categories: cs.CV
- Published: December 18, 2025