[Paper] Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Published: December 19, 2025 at 01:59 PM EST
3 min read

Source: arXiv - 2512.17908v1

Overview

Monocular depth estimation has made huge strides thanks to foundation models like Depth Anything V2 (DA‑V2), but these models still stumble on out‑of‑distribution, real‑world photos. The new Re‑Depth Anything framework tackles this gap by refining depth predictions at test time, without any ground‑truth labels, using the generative power of large‑scale 2‑D diffusion models. In essence, it “re‑lights” the image according to the predicted geometry and uses the resulting shading cues to self‑supervise the depth map.

Key Contributions

  • Test‑time self‑supervision that improves a frozen depth foundation model on any new image, no extra data required.
  • Diffusion‑based re‑lighting: leverages Score Distillation Sampling (SDS) to generate realistic shading from the predicted depth, turning classic shape‑from‑shading into a generative signal.
  • Targeted optimization strategy: freezes the encoder and updates only the intermediate latent embeddings and the decoder, preventing collapse and keeping the original model’s knowledge intact (a minimal set‑up sketch follows this list).
  • Domain‑agnostic refinement: works across diverse benchmarks (indoor, outdoor, synthetic) and consistently lifts both quantitative depth error metrics and visual realism.
  • Open‑source pipeline that can be dropped onto any existing monocular depth model, making it immediately usable by developers.
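
The targeted optimization strategy can be sketched in a few lines of PyTorch. This is a minimal sketch under the assumption that the depth model exposes separate `encoder` and `decoder` submodules and that the per‑image latent embeddings are plain tensors; none of these names come from the paper or the released code.

```python
# Minimal sketch of the targeted optimization set-up. Module and variable
# names (`encoder`, `decoder`, `latent_embeddings`) are illustrative
# assumptions, not taken from DA-V2 or the paper's code.
import torch

def build_refinement_optimizer(depth_model, latent_embeddings, lr=1e-4):
    """Freeze the encoder; optimize only decoder weights and latent embeddings."""
    # Freeze the pretrained encoder so the foundation model's features stay intact.
    for p in depth_model.encoder.parameters():
        p.requires_grad = False

    # Decoder weights remain trainable.
    decoder_params = list(depth_model.decoder.parameters())

    # Per-image latent embeddings are optimized directly as leaf tensors.
    for z in latent_embeddings:
        z.requires_grad_(True)

    return torch.optim.Adam(
        [{"params": decoder_params}, {"params": latent_embeddings}],
        lr=lr,
    )
```

Because only the decoder and the per‑image latents receive gradients, the adaptation cannot drift far from what the frozen encoder already encodes, which is what the paper credits with preventing collapse.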

Methodology

  1. Initial Depth Prediction – Run the input image through a pre‑trained DA‑V2 model to obtain a coarse depth map.
  2. Depth‑Conditioned Re‑lighting – Feed the depth map (as a geometry prior) into a large 2‑D diffusion model (e.g., Stable Diffusion). Using Score Distillation Sampling, the diffusion model synthesizes a “re‑lit” version of the original image that respects the predicted geometry.
  3. Self‑Supervised Loss – Compare the re‑lit synthesis with the original photograph. The discrepancy provides a photometric‑style loss that captures shading inconsistencies, effectively a shape‑from‑shading cue.
  4. Targeted Fine‑Tuning – Instead of back‑propagating through the whole depth network, the encoder is frozen. Only the latent embeddings (mid‑level features) and the decoder weights are updated, allowing the model to adjust its depth output while preserving learned visual features.
  5. Iterative Refinement – The process repeats for a few optimization steps, progressively tightening the alignment between the re‑lit image and the input and yielding a sharper, more accurate depth map (a schematic sketch of the full loop follows this list).
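
The loop below is a schematic sketch of steps 2–5, not the authors’ implementation: `diffusion_prior.relight` and `diffusion_prior.sds_loss` are hypothetical placeholders for the depth‑conditioned re‑lighting and the Score Distillation Sampling guidance of the frozen diffusion model, and the L1 photometric term stands in for the paper’s shading‑consistency loss.

```python
import torch
import torch.nn.functional as F

def refine_depth(image, depth_model, latents, diffusion_prior, optimizer,
                 num_steps=50):
    """Schematic test-time refinement loop (steps 2-5).

    `diffusion_prior` is a hypothetical wrapper around a frozen 2-D diffusion
    model exposing:
      - relight(image, depth): a depth-conditioned re-lit version of the image
      - sds_loss(relit, depth): a scalar Score-Distillation-style guidance term
    Neither name comes from the paper; they only mark where the diffusion
    model plugs in.
    """
    for _ in range(num_steps):
        # Decode the current depth estimate from the trainable latents.
        depth = depth_model.decoder(latents)

        # Step 2: depth-conditioned re-lighting by the frozen diffusion model.
        relit = diffusion_prior.relight(image, depth)

        # Step 3: photometric-style loss against the original photograph,
        # standing in for the paper's shading-consistency cue, plus the
        # SDS-style guidance term.
        loss = F.l1_loss(relit, image) + diffusion_prior.sds_loss(relit, depth)

        # Step 4: gradients flow only into the decoder and the latents
        # (the encoder was frozen when the optimizer was built).
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Step 5: after a few iterations, return the refined depth map.
    with torch.no_grad():
        return depth_model.decoder(latents)
```

In this reading, `latents` would be the intermediate features extracted once from the frozen encoder for the input image, and `optimizer` the one built in the earlier sketch.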

Results & Findings

| Benchmark | Baseline (DA‑V2) | Re‑Depth Anything | Δ (Improvement) |
| --- | --- | --- | --- |
| NYU‑Depth V2 (indoor) | RMSE 0.38 m | RMSE 0.31 m | −18% |
| KITTI (outdoor) | RMSE 4.2 m | RMSE 3.5 m | −17% |
| ETH3D (mixed) | RMSE 0.45 m | RMSE 0.38 m | −16% |

  • Quantitative gains: Across all tested datasets, the method reduces standard depth error metrics (RMSE, MAE) by roughly 15–20%; see the sketch after this list for how these metrics are computed.
  • Qualitative gains: Visual inspection shows clearer edge delineation, better handling of thin structures (e.g., poles, chair legs), and more plausible depth gradients in challenging lighting conditions.
  • Speed: The test‑time refinement adds ~2–3 seconds per image on a single RTX 3090, which is acceptable for offline processing or batch pipelines.
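
For reference, the RMSE and MAE figures quoted above follow the standard definitions for metric depth, computed over pixels with valid ground truth. The snippet below is a generic sketch of those definitions, not code from the paper.

```python
import numpy as np

def depth_errors(pred, gt, valid_mask=None):
    """RMSE and MAE in metres over pixels with valid ground-truth depth."""
    if valid_mask is None:
        valid_mask = gt > 0  # common convention: zero marks missing depth
    diff = pred[valid_mask] - gt[valid_mask]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae
```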

Practical Implications

  • Plug‑and‑play improvement: Developers can boost any existing monocular depth service (AR/VR, robotics, 3D reconstruction) without retraining or collecting new labeled data.
  • Robustness to domain shift: Applications that encounter diverse lighting or scene styles—e.g., autonomous drones, indoor navigation for service robots, or photo‑editing tools—benefit from the self‑supervised adaptation.
  • Enhanced downstream tasks: Better depth maps improve point‑cloud generation, occlusion handling in rendering, and scene‑aware effects (relighting, background replacement).
  • Low‑cost data augmentation: The re‑lighting pipeline can be repurposed to synthesize realistic shading variations for training other vision models, effectively turning a depth refinement step into a data‑generation engine.

Limitations & Future Work

  • Computation overhead: Although modest, the iterative diffusion‑based refinement is still slower than a single forward pass, limiting real‑time use cases.
  • Dependence on diffusion quality: The method inherits any biases or failure modes of the underlying diffusion model (e.g., hallucinating textures in ambiguous regions).
  • Single‑image focus: Extending the approach to video streams would require temporal consistency mechanisms to avoid flickering.
  • Future directions suggested by the authors include:
    1. Integrating faster diffusion samplers or lightweight generative priors.
    2. Exploring multi‑frame self‑supervision for video depth.
    3. Jointly learning a lightweight re‑lighting module that can be distilled into a real‑time network.

Authors

  • Ananta R. Bhattarai
  • Helge Rhodin

Paper Information

  • arXiv ID: 2512.17908v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: December 19, 2025