[Paper] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation
Source: arXiv - 2602.11066v1
Overview
The paper introduces PuriLight, a new self‑supervised framework for monocular depth estimation that stays tiny and fast while delivering high‑quality depth maps. By weaving together three novel modules, the authors show that you don’t have to sacrifice detail for efficiency—a common pain point for on‑device computer‑vision applications such as AR, robotics, and autonomous navigation.
Key Contributions
- Three‑stage lightweight architecture that balances speed and structural precision.
- Shuffle‑Dilation Convolution (SDC): a compact block that captures local context with dilated kernels and channel shuffling, reducing parameters compared to standard convolutions.
- Rotation‑Adaptive Kernel Attention (RAKA): a hierarchical attention mechanism that dynamically re‑weights features based on learned rotation‑aware kernels, boosting representation power without heavy compute.
- Deep Frequency Signal Purification (DFSP): a global‑frequency‑domain filter that cleans up noisy feature maps, improving depth continuity and edge sharpness.
- State‑of‑the‑art results on standard self‑supervised depth benchmarks (KITTI, Make3D) with roughly 30 % fewer parameters and about 2× faster inference than competing lightweight models.
Methodology
- Input & Self‑Supervision – The network receives a single RGB frame and learns depth by minimizing a photometric reprojection loss between consecutive video frames, a standard self‑supervised signal that removes the need for ground‑truth depth maps.
- Stage 1 – Local Feature Extraction (SDC)
  - A channel‑shuffle operation mixes information across channels, then dilated convolutions expand the receptive field without extra parameters.
  - Result: rich local texture and edge cues captured in a lightweight footprint.
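The channel‑shuffle operation itself is simple and parameter‑free: reshape the channel axis into groups, transpose, and flatten, as popularized by ShuffleNet. A minimal NumPy sketch (the dilated convolutions that follow it in SDC are omitted here):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels of an NCHW tensor: split C into `groups`,
    transpose the group and per-group axes, then flatten back to C.
    This interleaves channels so information mixes across groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))
```

With 8 channels and 2 groups, channel order [0..7] becomes [0, 4, 1, 5, 2, 6, 3, 7], so each group's outputs feed every group downstream.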
- Stage 2 – Hierarchical Feature Enhancement (RAKA)
  - Builds a pyramid of feature maps at multiple scales.
  - For each scale, a rotation‑adaptive kernel is learned; attention weights are computed by correlating these kernels with the feature map, letting the network focus on orientation‑consistent structures (e.g., road edges, building facades).
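The correlate‑then‑reweight idea can be illustrated with a toy channel‑attention sketch. Everything here is a hypothetical stand‑in: the `kernels` matrix plays the role of RAKA's learned rotation‑adaptive kernels, and pooling plus softmax is one plausible way to turn kernel/feature correlations into attention weights; the paper's actual mechanism is not specified in this summary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kernel_attention(features, kernels):
    """Toy sketch: pool a (C, H, W) feature map to a channel descriptor,
    score it against each learned kernel (rows of a (C, C) matrix, a
    hypothetical stand-in for rotation-adaptive kernels), and reweight
    channels by the softmax-normalized scores."""
    desc = features.mean(axis=(1, 2))            # (C,) global average pool
    scores = kernels @ desc                      # (C,) kernel/feature correlation
    weights = softmax(scores)                    # attention over channels
    return features * weights[:, None, None]     # reweighted feature map
```

Because the weights depend on the input's pooled statistics, channels whose kernels correlate strongly with the current features are amplified at inference time, with no per‑pixel attention maps to store.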
- Stage 3 – Global Purification (DFSP)
  - Transforms the feature map into the frequency domain via a fast Fourier transform.
  - A learned frequency mask suppresses high‑frequency noise while preserving structural frequencies; the map is then transformed back.
  - This step sharpens depth discontinuities and reduces the speckle artifacts common in lightweight models.
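The FFT‑filter‑inverse pipeline looks like this in NumPy. Note the simplification: a fixed low‑pass box mask stands in for the paper's *learned* frequency mask, purely to make the mechanics concrete.

```python
import numpy as np

def frequency_purify(feat, keep_ratio=0.25):
    """Low-pass sketch of frequency-domain purification: FFT a 2-D feature
    map, keep only a centered box of low frequencies (a fixed mask standing
    in for the learned mask), then inverse-transform back to the spatial
    domain."""
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))   # DC moved to the center
    mask = np.zeros((h, w))
    ch, cw = h // 2, w // 2
    kh, kw = int(h * keep_ratio), int(w * keep_ratio)
    mask[ch - kh:ch + kh + 1, cw - kw:cw + kw + 1] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
```

A constant map passes through unchanged (the DC component is kept), while checkerboard‑like high‑frequency noise is removed, which mirrors the speckle suppression described above.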
- Depth Decoder – A lightweight up‑sampling decoder reconstructs the dense depth map from the purified features, trained with the usual scale‑invariant loss and smoothness regularization.
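The smoothness regularization mentioned above is, in most self‑supervised depth pipelines, an edge‑aware penalty on disparity gradients, downweighted where the image itself has strong gradients. A NumPy sketch of that common formulation (the paper's exact regularizer may differ):

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Edge-aware smoothness: penalize disparity gradients, attenuated by
    exp(-|image gradient|) so depth is allowed to change at image edges.
    disp is (H, W); image is (H, W, 3)."""
    dx_d = np.abs(np.diff(disp, axis=1))
    dy_d = np.abs(np.diff(disp, axis=0))
    dx_i = np.abs(np.diff(image, axis=1)).mean(axis=-1)  # mean over RGB
    dy_i = np.abs(np.diff(image, axis=0)).mean(axis=-1)
    return float((dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean())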
Results & Findings
| Dataset | Params (M) | FLOPs (G) | Abs Rel ↓ | δ<1.25 ↑ |
|---|---|---|---|---|
| KITTI (self‑supervised) | 1.8 | 2.1 | 0.098 | 0.89 |
| Make3D | 1.9 | 2.3 | 0.112 | 0.85 |
- Accuracy: PuriLight matches or exceeds the best published lightweight methods (e.g., MobileDepth, FastDepth) while using ~30 % fewer parameters.
- Speed: On a mid‑range mobile GPU (Qualcomm Adreno 640), inference runs at ≈45 fps (full‑resolution 640×192), enabling real‑time depth for AR/VR.
- Ablation studies confirm each module’s contribution: removing DFSP degrades edge sharpness by ~12 %; swapping SDC for standard convolutions adds ~0.5 M parameters with negligible gain.
Practical Implications
- On‑device AR/VR – Real‑time depth maps can be generated on smartphones or head‑mounted displays without draining battery or requiring a cloud backend.
- Robotics & Drones – Lightweight depth estimation enables obstacle avoidance and navigation on compute‑constrained platforms (e.g., Raspberry Pi, Jetson Nano).
- Autonomous Driving Edge Nodes – The low‑latency pipeline can complement LiDAR or radar by providing dense scene geometry where sensor coverage is sparse.
- Developer Friendly – The authors release clean PyTorch code and a pre‑trained model, making it easy to plug into existing perception stacks or to fine‑tune on domain‑specific video data.
Limitations & Future Work
- Domain Generalization – While self‑supervised training reduces dataset bias, the model still struggles with extreme lighting (night scenes) and highly reflective surfaces.
- Resolution Trade‑off – The current design targets 640×192 inputs; scaling to higher resolutions incurs a linear increase in FLOPs, which may require additional pruning or quantization.
- Future Directions – The authors suggest exploring dynamic kernel generation for RAKA to handle unseen rotations, and learnable frequency masks that adapt per scene for even better purification.
Authors
- Yujie Chen
- Li Zhang
- Xiaomeng Chu
- Tian Zhang
Paper Information
- arXiv ID: 2602.11066v1
- Categories: cs.CV
- Published: February 11, 2026