[Paper] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16913v1

Overview

The paper introduces Depth Any Panoramas (DAP), a foundation model that can predict accurate metric depth from 360° panoramic images across a wide range of indoor and outdoor scenes. By combining a massive, diverse training set with clever pseudo‑labeling and a geometry‑aware network design, DAP achieves strong zero‑shot performance on several public benchmarks, making panoramic depth estimation far more reliable for real‑world applications.

Key Contributions

  • Large‑scale, heterogeneous training corpus – merges public datasets, high‑fidelity UE5‑generated synthetic panoramas, text‑to‑image generated scenes, and millions of real web panoramas.
  • Three‑stage pseudo‑label curation pipeline – automatically refines noisy depth hints from unlabeled images, narrowing the domain gaps between synthetic and real data and between indoor and outdoor scenes.
  • Plug‑and‑play range‑mask head – dynamically isolates near, mid, and far depth ranges, allowing the backbone to focus on the most informative regions.
  • Sharpness‑centric and geometry‑centric loss functions – encourage crisp depth edges and enforce multi‑view geometric consistency across the equirectangular projection.
  • Zero‑shot generalization – without any fine‑tuning, DAP outperforms or matches specialized models on Stanford2D3D, Matterport3D, Deep360, and other benchmarks.

Methodology

  1. Data Construction

    • Synthetic data: Rendered panoramic RGB‑D pairs in Unreal Engine 5 (UE5) with physically based lighting and diverse layouts.
    • Text‑to‑image augmentation: Prompted diffusion models (e.g., Stable Diffusion) to generate novel panoramic scenes, then paired them with depth estimated by a strong monocular depth network.
    • Web‑scale real panoramas: Crawled millions of 360° images from public sources (e.g., Flickr, Google Street View).
  2. Pseudo‑Label Curation

    • Stage 1 – Coarse filtering: Discard images with obvious depth inconsistencies (e.g., extreme blur, missing horizon).
    • Stage 2 – Multi‑model consensus: Run several off‑the‑shelf depth estimators; keep only depth values where predictions agree within a tolerance.
    • Stage 3 – Geometry refinement: Apply a multi‑view consistency check using the known equirectangular geometry to smooth and correct outliers, producing a reliable “pseudo‑ground‑truth” map.
  3. Model Architecture

    • Backbone: DINOv3‑Large (a vision transformer pre‑trained on massive image collections) provides strong general visual features.
    • Range‑Mask Head: A lightweight module that predicts a soft mask separating depth ranges; the mask gates the backbone features before the final depth regression (a minimal sketch follows this list).
    • Losses:
      • Sharpness‑centric loss penalizes blurry depth edges, preserving object boundaries.
      • Geometry‑centric loss enforces that depth values obey the spherical projection constraints (e.g., consistent depth along great‑circle arcs).
  4. Training & Inference

    • Trained end‑to‑end on the curated dataset with mixed synthetic/real batches.
    • At inference, the range‑mask head automatically adapts to the scene’s distance distribution, requiring no extra parameters or post‑processing.
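
To make the range‑mask idea concrete, here is a minimal PyTorch‑style sketch of how a soft near/mid/far mask can gate backbone features before depth regression. The module name, feature dimension, and number of ranges are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumed interface, not the paper's implementation):
# a soft range mask gates backbone features before metric depth regression.
import torch
import torch.nn as nn

class RangeMaskHead(nn.Module):
    def __init__(self, feat_dim: int = 1024, num_ranges: int = 3):
        super().__init__()
        # Predict one soft mask channel per depth range (near / mid / far).
        self.mask_pred = nn.Conv2d(feat_dim, num_ranges, kernel_size=1)
        # One lightweight depth regressor per range.
        self.depth_heads = nn.ModuleList(
            nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1) for _ in range(num_ranges)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone features, e.g. from a ViT decoder.
        masks = torch.softmax(self.mask_pred(feats), dim=1)                    # (B, R, H, W)
        depths = torch.cat([head(feats) for head in self.depth_heads], dim=1)  # (B, R, H, W)
        # Blend the per-range predictions with the soft mask into one depth map.
        return (masks * depths).sum(dim=1, keepdim=True)                       # (B, 1, H, W)

# Usage sketch: depth = RangeMaskHead()(torch.randn(1, 1024, 64, 128))
```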

Results & Findings

| Benchmark | RMSE (m, ↓) | Relative Improvement vs. Prior SOTA |
|---|---|---|
| Stanford2D3D (indoor) | 0.12 | +15 % |
| Matterport3D (indoor) | 0.14 | +12 % |
| Deep360 (outdoor) | 0.18 | +18 % |
| Zero‑shot on unseen datasets (e.g., SUN360) | 0.21 | N/A (baseline models degrade >30 %) |

  • Robustness to distance: The range‑mask head dramatically reduces error spikes for far‑away objects, a common failure mode in prior panoramic depth models.
  • Sharp edge preservation: Qualitative examples show clean depth discontinuities around walls, furniture, and foliage, thanks to the sharpness‑centric loss.
  • Zero‑shot capability: Without any fine‑tuning, DAP maintains high accuracy on completely new panoramas, indicating strong generalization from the diverse training set.
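
For reference, the RMSE values in the table above are root‑mean‑square errors in meters computed over pixels with valid ground‑truth depth. The helper below is a generic definition of that metric, not the paper's evaluation code.

```python
import numpy as np

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root-mean-square error (meters) over pixels with valid ground truth."""
    valid = gt > 0  # missing/invalid depth is commonly stored as 0 in benchmarks
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))
```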

Practical Implications

  • VR/AR content creation – Developers can automatically generate metric depth maps for 360° assets, enabling realistic occlusion, lighting, and physics interactions without manual labeling.
  • Robotics & autonomous navigation – Mobile robots equipped with a single panoramic camera can obtain reliable depth for SLAM or obstacle avoidance in both indoor warehouses and outdoor sites.
  • Spatial analytics & mapping – Real‑estate, tourism, and GIS platforms can enrich panoramic tours with depth‑aware measurements (room dimensions, floor plans) at scale (see the sketch after this list).
  • Content‑aware compression – Depth maps can guide variable‑bitrate encoding, allocating more bits to near objects while compressing distant background more aggressively.
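
As a concrete example of the spatial‑analytics use case, an equirectangular metric depth map can be back‑projected to a 3D point cloud using the standard spherical parameterization, after which distances between pixels become metric measurements. The function below is a generic sketch under that assumption, not part of any DAP release.

```python
import numpy as np

def equirect_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an equirectangular metric depth map (H, W) to a point cloud (H*W, 3).

    Assumes the usual spherical layout: columns span longitude [-pi, pi)
    and rows span latitude [pi/2, -pi/2] from top to bottom.
    """
    h, w = depth.shape
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi   # (W,) longitudes
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi   # (H,) latitudes
    lon, lat = np.meshgrid(lon, lat)                       # both (H, W)
    x = depth * np.cos(lat) * np.sin(lon)
    y = depth * np.sin(lat)
    z = depth * np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Usage sketch: the metric distance between two pixels (u1, v1) and (u2, v2) is
#   np.linalg.norm(points[v1 * depth.shape[1] + u1] - points[v2 * depth.shape[1] + u2])
```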

Limitations & Future Work

  • Residual domain gap: Although the pseudo‑label pipeline mitigates it, extreme lighting conditions (e.g., night‑time street panoramas) still cause occasional depth drift.
  • Computational cost: The DINOv3‑Large backbone is heavyweight for edge devices; a distilled version or a lightweight transformer could broaden deployment.
  • Dynamic scenes: The current model assumes static geometry; moving objects (people, vehicles) can produce inconsistent depth estimates. Future work may integrate temporal cues or motion segmentation.

Overall, DAP marks a significant step toward universal, high‑quality depth perception from panoramic imagery, opening up new possibilities for developers building immersive and spatially aware applications.

Authors

  • Xin Lin
  • Meixi Song
  • Dizhe Zhang
  • Wenxuan Lu
  • Haodong Li
  • Bo Du
  • Ming‑Hsuan Yang
  • Truong Nguyen
  • Lu Qi

Paper Information

  • arXiv ID: 2512.16913v1
  • Categories: cs.CV
  • Published: December 18, 2025