[Paper] Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation
Source: arXiv - 2606.02552v1
Overview
Depth‑estimation networks have become remarkably good at predicting dense 3‑D geometry from a single image, yet they still suffer from a notorious “flying‑point” artifact: spurious points appear in the empty space between foreground and background objects, especially around sharp edges. The paper “Modeling Depth Ambiguity: A Mixture‑Density Representation for Flying‑Point‑Free Depth Estimation” shows that this problem stems from the common practice of forcing each pixel to output a single depth value. By allowing a pixel to express multiple plausible depths together with their probabilities, the authors eliminate most flying points while keeping inference speed essentially unchanged.
Key Contributions
- Mixture‑Density Depth Representation (MDA): Introduces a probabilistic mixture model per pixel that can output several depth hypotheses and associated weights.
- Boundary‑aware Decoding: At object edges, different mixture components latch onto the foreground and background surfaces, preventing the network from collapsing to an unrealistic intermediate depth.
- Unified Treatment of Transparency & Sky: Extends the mixture model to handle transparent objects (multiple depth layers) and sky regions (a dedicated “infinite‑depth” component).
- Broad Compatibility: Demonstrates that MDA can be plugged into a variety of backbone architectures (ResNet, Swin‑Transformer, etc.) with negligible runtime overhead.
- Extensive Empirical Validation: Shows consistent reductions in flying‑point artifacts across standard benchmarks (NYU‑Depth V2, KITTI) and under severe input blur.
Methodology
- Mixture‑Density Head:
  - The network’s final head predicts K depth candidates $\{d_k\}_{k=1}^{K}$ and a softmax weight vector $\{w_k\}_{k=1}^{K}$ for each pixel.
  - The loss is a weighted combination of per‑component regression losses and a KL‑divergence term that encourages the weight distribution to reflect true ambiguity (e.g., two‑peak distribution at a boundary).
- Training Strategy:
  - Ground‑truth depth maps are used to generate soft target distributions: pixels on clean surfaces receive a single‑peak target, while pixels straddling a depth discontinuity receive a bimodal target derived from the two nearest surface depths.
  - A small auxiliary edge detector guides the model to allocate multiple components where depth gradients are high.
- Inference & Decoding:
  - For each pixel, the component with the highest weight is selected as the final depth (hard‑max).
  - In transparent or sky regions, a dedicated component is trained to output a sentinel value (e.g., “∞”) that is later rendered as a sky mask or layered depth.
- Implementation Details:
  - The mixture head adds < 2 % extra FLOPs compared with a vanilla single‑depth head.
  - The approach is framework‑agnostic and can be attached to any encoder‑decoder depth network.
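The hard‑max decoding step can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `hard_max_decode`, the `sky_index` argument, and the choice of `np.inf` as the sentinel are assumptions made for illustration, and the inputs stand in for whatever a trained mixture head would produce.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the component axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def hard_max_decode(depths, logits, sky_index=None):
    """Pick, per pixel, the depth of the highest-weight component.

    depths:    (H, W, K) candidate depths d_k
    logits:    (H, W, K) unnormalized scores; softmax turns them into w_k
    sky_index: hypothetical index of a dedicated sky component (assumption)
    """
    weights = softmax(logits, axis=-1)
    best = weights.argmax(axis=-1)  # (H, W) index of the winning component
    depth = np.take_along_axis(depths, best[..., None], axis=-1)[..., 0]
    if sky_index is None:
        return depth, np.zeros(best.shape, dtype=bool)
    sky_mask = best == sky_index
    # Pixels won by the sky component get the sentinel instead of a finite depth.
    return np.where(sky_mask, np.inf, depth), sky_mask

# One row of three pixels with K = 3: a clean surface, a boundary pixel, sky.
depths = np.array([[[2.0, 8.0, 0.0],
                    [2.0, 8.0, 0.0],
                    [2.0, 8.0, 0.0]]])
logits = np.array([[[4.0, 0.0, -2.0],
                    [0.2, 0.5, -2.0],
                    [-1.0, 0.0, 5.0]]])
depth_map, sky_mask = hard_max_decode(depths, logits, sky_index=2)
```

In this toy input the boundary pixel commits to the 8 m candidate rather than a value between 2 m and 8 m, which is the behavior the summary credits for removing flying points.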
Results & Findings
| Dataset | Baseline (single depth) | MDA (K=3) | Flying‑point reduction |
|---|---|---|---|
| NYU‑Depth V2 (RMSE) | 0.58 m | 0.53 m | ~78 % fewer out‑of‑surface points |
| KITTI (Abs Rel) | 0.098 | 0.092 | ~85 % fewer boundary spikes |
| Synthetic blur test | 0.71 m | 0.64 m | Flying points virtually eliminated |
- Boundary Reconstruction: Edge‑wise depth error drops by ~30 % compared with the baseline, visibly sharpening object silhouettes.
- Transparent Objects: On a custom transparent‑object benchmark, MDA correctly predicts two depth layers for > 90 % of transparent pixels, whereas the baseline collapses to a single erroneous depth.
- Sky Handling: The dedicated sky component cleanly separates infinite‑depth sky from finite‑depth scene elements, removing the “floating” skyline artifacts common in prior work.
- Runtime: Adding the mixture head increases inference time by ~1 ms on a 1080 Ti GPU (≈0.5 % overhead).
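The flying‑point reductions reported above depend on how an out‑of‑surface point is counted, and this summary does not give the paper's definition. As a purely hypothetical stand‑in, the sketch below flags a pixel as flying when its predicted depth is farther than `tol` from every ground‑truth depth in a small neighborhood. The function name, `radius`, and `tol` are all assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def flying_point_mask(pred, gt, radius=1, tol=0.1):
    """Flag pixels whose prediction lies on no nearby ground-truth surface.

    Hypothetical criterion (not taken from the paper): a pixel is "flying"
    if |pred - gt| > tol for every ground-truth value in the
    (2 * radius + 1)^2 window centered on it.
    """
    padded = np.pad(gt, radius, mode="edge")
    win = 2 * radius + 1
    windows = sliding_window_view(padded, (win, win))  # (H, W, win, win)
    nearest = np.abs(windows - pred[..., None, None]).min(axis=(-2, -1))
    return nearest > tol

# A single scanline crossing a 2 m -> 8 m depth edge.
gt       = np.array([[2.0, 2.0, 2.0, 8.0, 8.0, 8.0]])
baseline = np.array([[2.0, 2.0, 3.5, 6.5, 8.0, 8.0]])  # smeared across the edge
mixture  = np.array([[2.0, 2.0, 2.0, 8.0, 8.0, 8.0]])  # commits to one surface

n_baseline = int(flying_point_mask(baseline, gt).sum())
n_mixture = int(flying_point_mask(mixture, gt).sum())
```

Under this criterion a prediction that snaps to the wrong side of an edge is not counted as flying, since it still lies on a real surface. Only values floating between surfaces are flagged.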
Practical Implications
- AR/VR & Mixed Reality: More reliable depth at object edges means fewer visual glitches when compositing virtual objects into real scenes, especially for hand‑held devices that often capture blurry frames.
- Robotics & Autonomous Driving: Cleaner depth maps improve obstacle detection and path planning near thin structures (e.g., poles, fences) where flying points previously caused false positives.
- 3‑D Reconstruction & Photogrammetry: Accurate boundary depth reduces the need for post‑processing (e.g., edge‑aware smoothing), speeding up pipeline throughput for scanning and modeling.
- Transparent‑Object Perception: The multi‑layer output can be directly fed into downstream tasks like transparent‑object segmentation or material classification without extra heuristics.
- Minimal Integration Cost: Since MDA is a drop‑in head, existing depth‑estimation codebases can adopt it with a few lines of code and a modest GPU memory increase.
Limitations & Future Work
- Component Count Sensitivity: Choosing the right number of mixture components (K) is still heuristic; too few limits ambiguity modeling, while too many adds unnecessary memory.
- Training Data Requirements: The method relies on accurate edge annotations or synthetic depth discontinuity cues; performance may degrade on datasets lacking sharp boundary supervision.
- Hard‑max Decoding: Selecting the highest‑weight component discards the probabilistic richness of the mixture; future work could explore soft‑fusion or uncertainty‑aware downstream usage.
- Extension to Video: Temporal consistency of mixture weights across frames is not addressed; integrating a recurrent or optical‑flow‑guided module could further stabilize depth in video streams.
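The trade‑off behind the hard‑max limitation can be made concrete for a single boundary pixel. The sketch below compares hard‑max with a naive weighted average, which is only one possible reading of "soft fusion", and computes the entropy of the weights as one candidate uncertainty signal. The helper names and the numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def decode_pixel(depths, weights):
    """Return (hard_max_depth, weighted_mean_depth) for one pixel."""
    depths = np.asarray(depths, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize defensively
    hard = float(depths[np.argmax(weights)])   # depth of the dominant component
    soft = float(np.dot(weights, depths))      # naive expectation over components
    return hard, soft

def weight_entropy(weights):
    """Shannon entropy (nats) of the mixture weights; 0 means no ambiguity."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    w = w[w > 0]                               # skip zero weights (0 * log 0 := 0)
    return float(-(w * np.log(w)).sum())

# Boundary pixel: foreground at 2 m, background at 8 m, nearly tied weights.
hard, soft = decode_pixel([2.0, 8.0], [0.55, 0.45])
# hard is 2.0; soft is about 4.7, a depth that lies on neither surface.
ambiguity = weight_entropy([0.55, 0.45])
```

A plain expectation therefore reintroduces the intermediate depth that the mixture was meant to avoid. A downstream consumer could instead keep the hard‑max depth and carry the entropy alongside it as a per‑pixel confidence.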
Overall, the mixture‑density representation offers a simple yet powerful fix to a long‑standing artifact in monocular depth estimation, opening the door for more robust 3‑D perception in real‑world applications.
Authors
- Siyuan Bian
- Congrong Xu
- Jun Gao
Paper Information
- arXiv ID: 2606.02552v1
- Categories: cs.CV, cs.AI
- Published: June 1, 2026