[Paper] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation
Source: arXiv - 2602.11066v1
Overview
The paper introduces PuriLight, a new self‑supervised framework for monocular depth estimation that stays tiny and fast while delivering high‑quality depth maps. By weaving together three novel modules, the authors show that you don’t have to sacrifice detail for efficiency—a common pain point for on‑device computer‑vision applications such as AR, robotics, and autonomous navigation.
Key Contributions
- Three‑stage lightweight architecture that balances speed and structural precision.
- Shuffle‑Dilation Convolution (SDC): a compact block that captures local context with dilated kernels and channel shuffling, reducing parameters compared to standard convolutions.
- Rotation‑Adaptive Kernel Attention (RAKA): a hierarchical attention mechanism that dynamically re‑weights features based on learned rotation‑aware kernels, boosting representation power without heavy compute.
- Deep Frequency Signal Purification (DFSP): a global‑frequency‑domain filter that cleans up noisy feature maps, improving depth continuity and edge sharpness.
- State‑of‑the‑art results on standard self‑supervised depth benchmarks (KITTI, Make3D) with roughly 30 % fewer parameters and about 2× faster inference than competing lightweight models.
Methodology
- Input & Self‑Supervision – The network receives a single RGB frame and learns depth by minimizing a photometric reprojection loss between consecutive video frames, a standard self‑supervised signal that removes the need for ground‑truth depth maps.
- Stage 1 – Local Feature Extraction (SDC)
  - A channel‑shuffle operation mixes information across channels, then dilated convolutions expand the receptive field without extra parameters.
  - Result: rich local texture and edge cues captured in a lightweight footprint.
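The channel‑shuffle operation itself is simple and parameter‑free: reshape the channel axis into groups, transpose, and flatten, as popularized by ShuffleNet. A minimal NumPy sketch (the dilated convolutions that follow it in SDC are omitted here):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels of an NCHW tensor: split C into `groups`,
    transpose the group and per-group axes, then flatten back to C.
    This interleaves channels so information mixes across groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))
```

With 8 channels and 2 groups, channel order [0..7] becomes [0, 4, 1, 5, 2, 6, 3, 7], so each group's outputs feed every group downstream.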
- Stage 2 – Hierarchical Feature Enhancement (RAKA)
  - Builds a pyramid of feature maps at multiple scales.
  - For each scale, a rotation‑adaptive kernel is learned; attention weights are computed by correlating these kernels with the feature map, letting the network focus on orientation‑consistent structures (e.g., road edges, building facades).
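The correlate‑then‑reweight idea can be illustrated with a toy channel‑attention sketch. Everything here is a hypothetical stand‑in: the `kernels` matrix plays the role of RAKA's learned rotation‑adaptive kernels, and pooling plus softmax is one plausible way to turn kernel/feature correlations into attention weights; the paper's actual mechanism is not specified in this summary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kernel_attention(features, kernels):
    """Toy sketch: pool a (C, H, W) feature map to a channel descriptor,
    score it against each learned kernel (rows of a (C, C) matrix, a
    hypothetical stand-in for rotation-adaptive kernels), and reweight
    channels by the softmax-normalized scores."""
    desc = features.mean(axis=(1, 2))            # (C,) global average pool
    scores = kernels @ desc                      # (C,) kernel/feature correlation
    weights = softmax(scores)                    # attention over channels
    return features * weights[:, None, None]     # reweighted feature map
```

Because the weights depend on the input's pooled statistics, channels whose kernels correlate strongly with the current features are amplified at inference time, with no per‑pixel attention maps to store.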
- Stage 3 – Global Purification (DFSP)
  - Transforms the feature map into the frequency domain via a fast Fourier transform.
  - A learned frequency mask suppresses high‑frequency noise while preserving structural frequencies; the map is then transformed back.
  - This step sharpens depth discontinuities and reduces the speckle artifacts common in lightweight models.
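The FFT‑filter‑inverse pipeline looks like this in NumPy. Note the simplification: a fixed low‑pass box mask stands in for the paper's *learned* frequency mask, purely to make the mechanics concrete.

```python
import numpy as np

def frequency_purify(feat, keep_ratio=0.25):
    """Low-pass sketch of frequency-domain purification: FFT a 2-D feature
    map, keep only a centered box of low frequencies (a fixed mask standing
    in for the learned mask), then inverse-transform back to the spatial
    domain."""
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))   # DC moved to the center
    mask = np.zeros((h, w))
    ch, cw = h // 2, w // 2
    kh, kw = int(h * keep_ratio), int(w * keep_ratio)
    mask[ch - kh:ch + kh + 1, cw - kw:cw + kw + 1] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
```

A constant map passes through unchanged (the DC component is kept), while checkerboard‑like high‑frequency noise is removed, which mirrors the speckle suppression described above.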
- Depth Decoder – A lightweight up‑sampling decoder reconstructs the dense depth map from the purified features, trained with the usual scale‑invariant loss and smoothness regularization.
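The smoothness regularization mentioned above is, in most self‑supervised depth pipelines, an edge‑aware penalty on disparity gradients, downweighted where the image itself has strong gradients. A NumPy sketch of that common formulation (the paper's exact regularizer may differ):

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Edge-aware smoothness: penalize disparity gradients, attenuated by
    exp(-|image gradient|) so depth is allowed to change at image edges.
    disp is (H, W); image is (H, W, 3)."""
    dx_d = np.abs(np.diff(disp, axis=1))
    dy_d = np.abs(np.diff(disp, axis=0))
    dx_i = np.abs(np.diff(image, axis=1)).mean(axis=-1)  # mean over RGB
    dy_i = np.abs(np.diff(image, axis=0)).mean(axis=-1)
    return float((dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean())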
Results & Findings
| Dataset | Params (M) | FLOPs (G) | Abs Rel ↓ | δ<1.25 ↑ |
|---|---|---|---|---|
| KITTI (self‑supervised) | 1.8 | 2.1 | 0.098 | 0.89 |
| Make3D | 1.9 | 2.3 | 0.112 | 0.85 |
- Accuracy: PuriLight matches or exceeds the best published lightweight methods (e.g., MobileDepth, FastDepth) while using ~30 % fewer parameters.
- Speed: On a mid‑range mobile GPU (Qualcomm Adreno 640), inference runs at ≈45 fps (full‑resolution 640×192), enabling real‑time depth for AR/VR.
- Ablation studies confirm each module’s contribution: removing DFSP degrades edge sharpness by ~12 %; swapping SDC for standard convolutions adds ~0.5 M parameters with negligible gain.
Practical Implications
- On‑device AR/VR – Real‑time depth maps can be generated on smartphones or head‑mounted displays without draining battery or requiring a cloud backend.
- Robotics & Drones – Lightweight depth estimation enables obstacle avoidance and navigation on compute‑constrained platforms (e.g., Raspberry Pi, Jetson Nano).
- Autonomous Driving Edge Nodes – The low‑latency pipeline can complement LiDAR or radar by providing dense scene geometry where sensor coverage is sparse.
- Developer Friendly – The authors release clean PyTorch code and a pre‑trained model, making it easy to plug into existing perception stacks or to fine‑tune on domain‑specific video data.
Limitations & Future Work
- Domain Generalization – While self‑supervised training reduces dataset bias, the model still struggles with extreme lighting (night scenes) and highly reflective surfaces.
- Resolution Trade‑off – The current design targets 640×192 inputs; scaling to higher resolutions incurs a linear increase in FLOPs, which may require additional pruning or quantization.
- Future Directions – The authors suggest exploring dynamic kernel generation for RAKA to handle unseen rotations, and learnable frequency masks that adapt per scene for even better purification.
Authors
- Yujie Chen
- Li Zhang
- Xiaomeng Chu
- Tian Zhang
Paper Information
- arXiv ID: 2602.11066v1
- Categories: cs.CV
- Published: February 11, 2026