[Paper] 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
Source: arXiv - 2601.02309v1
Overview
The paper introduces 360DVO, the first deep-learning-driven visual odometry (VO) system built for monocular 360° cameras. By learning distortion-aware features and integrating them into a differentiable bundle-adjustment pipeline, the authors achieve far greater robustness and accuracy than classical handcrafted-feature or photometric methods, especially under aggressive motion and challenging lighting.
Key Contributions
- Distortion‑Aware Spherical Feature Extractor (DAS‑Feat) – a CNN that learns to produce sparse, distortion‑resistant feature patches directly on equirectangular 360° images.
- Omnidirectional Differentiable Bundle Adjustment (ODBA) – a novel, end‑to‑end trainable pose‑estimation module that optimizes camera motion using the learned spherical features.
- Real‑world OVO benchmark – a newly collected dataset of handheld and vehicle‑mounted 360° sequences with ground‑truth poses, filling a gap in realistic evaluation resources.
- State-of-the-art performance – on both the new benchmark and existing synthetic suites (TartanAir V2, 360VO), 360DVO improves robustness by ≈50 % and reduces trajectory error by ≈37.5 % relative to the strongest baselines (360VO, OpenVSLAM).
Methodology
- Input preprocessing – Raw equirectangular frames are fed to a lightweight CNN. Unlike standard planar feature networks, DAS-Feat incorporates a spherical distortion map that tells the network how pixel density varies with latitude, allowing it to focus on regions that remain informative after projection (a minimal sketch of such a latitude-weight map follows this list).
- Sparse feature selection – The network outputs a set of keypoint locations and associated descriptors. Because the features are learned on distortion-aware inputs, they become robust to the stretching that occurs near the poles of a 360° image (a top-k keypoint-selection sketch is also shown after this list).
- Omnidirectional Bundle Adjustment – The selected features from consecutive frames are matched, and the resulting correspondences are fed into ODBA. This module formulates the classic bundle-adjustment cost (re-projection error) on the unit sphere and differentiates it with respect to the camera pose; a spherical re-projection residual refined with a few Gauss-Newton steps is sketched after this list. The whole pipeline (DAS-Feat + ODBA) can be trained end-to-end using a combination of a supervised pose loss and self-supervised photometric consistency.
- Training & inference – The model is first pre‑trained on synthetic 360° datasets (where perfect ground‑truth is cheap) and then fine‑tuned on the new real‑world benchmark to close the domain gap. At runtime, only the forward pass of DAS‑Feat and a few Gauss‑Newton iterations of ODBA are required, keeping the system real‑time on a modern GPU.
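The paper does not spell out the exact form of the spherical distortion map, but a common and simple choice for equirectangular input is a per-row cos(latitude) weight, i.e., the solid-angle footprint of each pixel. The NumPy sketch below shows that assumed formulation; the function name and the idea of feeding the map as an extra input channel are illustrative, not taken from the paper.

```python
import numpy as np

def latitude_distortion_map(height: int, width: int) -> np.ndarray:
    """Per-pixel solid-angle weight for an equirectangular frame.

    Row v of an H x W equirectangular image corresponds to latitude
    phi = pi * (0.5 - (v + 0.5) / H). Every pixel spans the same angular
    extent, but its solid angle shrinks by cos(phi) toward the poles, so
    cos(phi) is a natural per-row "how stretched is this region" signal
    that a distortion-aware feature network could take as an extra channel.
    (Assumed formulation; the paper's exact map may differ.)
    """
    v = np.arange(height)
    phi = np.pi * (0.5 - (v + 0.5) / height)          # latitude of each row
    weight = np.cos(phi)                               # 1 at equator, ~0 at poles
    return np.repeat(weight[:, None], width, axis=1)   # tile across columns

# Example: the 1024 x 2048 resolution used in the runtime experiments.
dist_map = latitude_distortion_map(1024, 2048)         # shape (1024, 2048)
```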
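Likewise, the paper only states that DAS-Feat outputs sparse keypoints and descriptors. One standard way to realize that, assumed here rather than specified by the authors, is to take the top-k responses of a learned score map, optionally down-weighted by the distortion map above, and sample descriptors at those locations:

```python
import numpy as np

def select_keypoints(score_map: np.ndarray,
                     descriptors: np.ndarray,
                     distortion_map: np.ndarray,
                     k: int = 500):
    """Pick the k strongest keypoints from a learned score map.

    score_map:      (H, W) per-pixel keypoint score from the feature network.
    descriptors:    (H, W, D) dense descriptor volume.
    distortion_map: (H, W) cos-latitude weight; down-weights polar regions.
    Returns (k, 2) integer pixel coordinates (row, col) and their (k, D) descriptors.
    (Hypothetical selection scheme, not the paper's exact procedure.)
    """
    weighted = score_map * distortion_map                 # suppress stretched poles
    flat_idx = np.argpartition(weighted.ravel(), -k)[-k:] # indices of top-k scores
    rows, cols = np.unravel_index(flat_idx, score_map.shape)
    keypoints = np.stack([rows, cols], axis=1)
    return keypoints, descriptors[rows, cols]
```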
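Finally, to make the unit-sphere bundle-adjustment cost concrete, the sketch below converts equirectangular pixels to bearing vectors, forms a smooth re-projection residual on the sphere, and refines a 6-DoF pose with a few Gauss-Newton steps using finite-difference Jacobians. ODBA itself is differentiable and presumably uses analytic Jacobians inside the network; the angle conventions, function names, and numerical Jacobian here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pixel_to_bearing(u, v, width, height):
    """Equirectangular pixel -> unit bearing vector on the sphere.

    Assumed convention: longitude lam in [-pi, pi) across columns, latitude
    phi in (-pi/2, pi/2) down the rows, camera axes x-right, y-down, z-forward.
    """
    lam = 2.0 * np.pi * (u + 0.5) / width - np.pi
    phi = np.pi * (0.5 - (v + 0.5) / height)
    return np.array([np.cos(phi) * np.sin(lam),
                     -np.sin(phi),
                     np.cos(phi) * np.cos(lam)])

def spherical_residual(point_w, R_cw, t_cw, bearing_obs):
    """Unit-sphere re-projection residual: predicted minus observed bearing.
    A smooth 3-vector surrogate for the angular error, convenient for
    least-squares optimization."""
    p_c = R_cw @ point_w + t_cw                 # world point in camera frame
    return p_c / np.linalg.norm(p_c) - bearing_obs

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def refine_pose(points_w, bearings_obs, R, t, iters=3, eps=1e-6):
    """A few Gauss-Newton iterations on a 6-DoF pose (rotation + translation),
    using finite-difference Jacobians of the stacked spherical residuals."""
    for _ in range(iters):
        def stacked(xi):
            Ri = so3_exp(xi[:3]) @ R            # left-multiplied rotation update
            ti = t + xi[3:]
            return np.concatenate([spherical_residual(X, Ri, ti, b)
                                   for X, b in zip(points_w, bearings_obs)])
        r0 = stacked(np.zeros(6))
        J = np.zeros((r0.size, 6))
        for j in range(6):                      # numerical Jacobian, column by column
            d = np.zeros(6)
            d[j] = eps
            J[:, j] = (stacked(d) - r0) / eps
        dx = np.linalg.lstsq(J, -r0, rcond=None)[0]
        R, t = so3_exp(dx[:3]) @ R, t + dx[3:]
    return R, t
```

For small inter-frame motion and reasonable feature matches, a handful of such iterations converges, which is in line with the paper's statement that only a few Gauss-Newton iterations of ODBA are needed at runtime.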
Results & Findings
| Dataset | Trajectory RMSE (% of trajectory length) | Error reduction vs. 360VO | Error reduction vs. OpenVSLAM |
|---|---|---|---|
| Real-world OVO benchmark | 0.42 % | 37.5 % | 45 % |
| TartanAir V2 | 0.38 % | 35 % | 40 % |
| 360VO (synthetic) | 0.45 % | 30 % | 38 % |
- Robustness gain: In sequences with rapid rotations (> 300 °/s) or strong illumination changes, the failure rate drops from ~22 % (baseline) to < 10 %.
- Feature quality: Visualizations show DAS‑Feat concentrates points around texture‑rich regions (e.g., building edges) while avoiding the heavily stretched polar caps.
- Runtime: On an RTX 3080, the full pipeline runs at ~30 fps for 1024 × 2048 equirectangular frames, comparable to classic VO pipelines that rely on CPU‑only feature extraction.
Practical Implications
- Robotics & autonomous navigation – 360° cameras are cheap and provide full situational awareness. 360DVO enables reliable pose tracking without expensive LiDAR, making it attractive for indoor drones, warehouse robots, or low‑cost delivery bots.
- AR/VR content creation – Accurate camera trajectories are essential for stitching 360° video or generating spatial audio. The learned features stay stable even when the operator swings the camera quickly, reducing post‑processing drift.
- Mapping & inspection – For handheld or vehicle‑mounted inspection rigs (e.g., pipelines, construction sites), 360DVO can deliver continuous odometry where GPS is unavailable, feeding directly into SLAM back‑ends.
- Edge deployment – Because the feature extractor is lightweight and the bundle‑adjustment step is a few matrix solves, the system can be ported to embedded GPUs (Jetson, i.MX) for on‑device navigation without cloud reliance.
Limitations & Future Work
- Domain sensitivity – Although fine-tuning mitigates it, the model still struggles in extreme weather (rain, fog) that severely reduces contrast in the 360° images.
- Scale ambiguity – As with any monocular VO, absolute scale must be supplied (e.g., from an IMU or known object size). Integrating inertial data could close this gap.
- Sparse feature reliance – Very texture‑less environments (e.g., long corridors) still cause feature starvation; future work may explore dense, learned photometric losses in tandem with DAS‑Feat.
- Benchmark breadth – The new real‑world dataset focuses on urban and indoor scenes; expanding to outdoor, high‑speed vehicular scenarios would further validate the approach.
360DVO marks a significant step toward making 360° visual odometry practical for real‑world applications, marrying the flexibility of deep feature learning with the rigor of classic bundle adjustment.
Authors
- Xiaopeng Guo
- Yinzhe Xu
- Huajian Huang
- Sai‑Kit Yeung
Paper Information
- arXiv ID: 2601.02309v1
- Categories: cs.CV
- Published: January 5, 2026