[Paper] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Source: arXiv - 2512.25008v1
Overview
FoundationSLAM introduces a fully learning-based monocular dense SLAM pipeline that couples optical-flow-style matching with explicit geometric reasoning. By building on depth foundation models trained on massive image collections, the system delivers accurate camera tracking and high-fidelity dense maps in real time, closing a long-standing gap between data-driven matching and classic multi-view geometry.
Key Contributions
- Hybrid Flow Network: A novel neural architecture that produces geometry‑aware correspondences, allowing depth and pose to be inferred consistently across keyframes.
- Bi‑Consistent Bundle Adjustment (BA) Layer: A differentiable BA module that jointly refines keyframe poses and per‑pixel depths under multi‑view constraints, enforcing global consistency during inference.
- Reliability‑Aware Refinement: A dynamic mechanism that classifies flow predictions into reliable vs. uncertain regions and adapts the update step accordingly, creating a closed feedback loop between matching and optimization.
- Real‑time Performance: End‑to‑end system runs at ~18 FPS on a single RTX‑3080, making dense SLAM viable for on‑device robotics and AR/VR workloads.
- Strong Generalization: Demonstrated superior trajectory accuracy and dense reconstruction quality on several benchmark datasets (e.g., TUM‑RGBD, ScanNet, EuRoC) without dataset‑specific fine‑tuning.
Methodology
- Foundation Depth Backbone – The pipeline starts with a pre-trained depth foundation model (e.g., MiDaS-large) that provides an initial dense depth prior for each incoming frame (first sketch after this list).
- Hybrid Flow Network – The network takes the current RGB frame, the depth prior, and the previous keyframe as inputs and predicts a hybrid flow field that is explicitly conditioned on depth. This yields correspondences that respect scene geometry rather than pure photometric similarity (second sketch below).
- Bi-Consistent Bundle Adjustment Layer – The predicted correspondences feed into a differentiable BA module that simultaneously optimizes the camera pose of the new keyframe and refines the dense depth map. Multi-view reprojection errors are minimized across all active keyframes, ensuring global consistency (third sketch below).
- Reliability-Aware Refinement – After BA, each pixel's flow residual is examined. Pixels with low residuals are marked reliable and kept unchanged; high-residual pixels are treated as uncertain and are re-estimated by the flow network in a second pass. This loop repeats until convergence or until a fixed iteration budget is exhausted (fourth sketch below).
- Map Fusion & Output – Refined depth maps are fused into a global TSDF (truncated signed distance function) volume, producing a dense 3-D reconstruction that can be queried for downstream tasks such as collision checking and scene understanding (fifth sketch below).
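The five sketches below illustrate each pipeline stage in turn. They are minimal PyTorch sketches under stated assumptions, not the authors' released code. First, the depth-prior stage: the paper only names MiDaS-large as an example backbone, so this sketch uses the public MiDaS torch.hub entry points as a stand-in; note that MiDaS outputs relative (affine-invariant) inverse depth rather than metric depth.

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the DPT-Large MiDaS variant and its matching image transform.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

def depth_prior(rgb: np.ndarray) -> torch.Tensor:
    """Return a relative inverse-depth map for one RGB frame (HWC, uint8)."""
    batch = transform(rgb).to(device)            # (1, 3, H', W') normalized input
    with torch.no_grad():
        inv_depth = midas(batch)                 # (1, H', W') relative inverse depth
    # Resize the prediction back to the original image resolution.
    inv_depth = torch.nn.functional.interpolate(
        inv_depth.unsqueeze(1), size=rgb.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()
    return inv_depth                             # (H, W)
```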
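Second, a sketch of what a depth-conditioned ("hybrid") flow network could look like. The paper does not publish the architecture, so the encoders, feature widths, and the HybridFlowNet name here are illustrative assumptions; the point is only that the flow head sees depth features alongside image features.

```python
import torch
import torch.nn as nn

class HybridFlowNet(nn.Module):
    """Illustrative depth-conditioned flow predictor (not the authors' architecture)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared image encoder applied to both frames (weights tied).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Encoder for the monocular depth prior of the current frame.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # The flow head sees concatenated image features of both frames plus
        # depth features, so correspondences are conditioned on geometry.
        self.flow_head = nn.Sequential(
            nn.Conv2d(3 * feat_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 3, padding=1),             # (u, v) flow at 1/4 resolution
        )

    def forward(self, frame, keyframe, depth_prior):
        f_cur = self.image_encoder(frame)
        f_key = self.image_encoder(keyframe)
        f_dep = self.depth_encoder(depth_prior)
        return self.flow_head(torch.cat([f_cur, f_key, f_dep], dim=1))
```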
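Third, the bi-consistent BA layer. The paper describes a differentiable module that minimizes multi-view reprojection error over keyframe poses and per-pixel depths; the sketch below writes out that residual for a single keyframe pair and refines a 6-DoF pose plus per-point inverse depth with a few first-order (Adam) steps, as a stand-in for the Gauss-Newton-style solve a production BA layer would use. The helper names (skew, so3_exp, reproject, ba_refine) are illustrative.

```python
import torch

def skew(v: torch.Tensor) -> torch.Tensor:
    """3x3 skew-symmetric matrix of a 3-vector (kept differentiable)."""
    zero = torch.zeros((), dtype=v.dtype, device=v.device)
    return torch.stack([
        torch.stack([zero, -v[2],  v[1]]),
        torch.stack([v[2],  zero, -v[0]]),
        torch.stack([-v[1], v[0],  zero]),
    ])

def so3_exp(w: torch.Tensor) -> torch.Tensor:
    """Rodrigues formula: axis-angle vector -> rotation matrix."""
    theta = w.norm().clamp_min(1e-8)
    W = skew(w)
    I = torch.eye(3, dtype=w.dtype, device=w.device)
    return I + torch.sin(theta) / theta * W + (1 - torch.cos(theta)) / theta ** 2 * (W @ W)

def reproject(uv, inv_depth, w, t, K, K_inv):
    """Backproject keyframe pixels, apply the relative pose, project into the target frame."""
    pix_h = torch.cat([uv, torch.ones_like(uv[:, :1])], dim=-1)       # (N, 3) homogeneous pixels
    pts = (K_inv @ pix_h.T).T / inv_depth.unsqueeze(-1)               # (N, 3) 3-D points
    pts = (so3_exp(w) @ pts.T).T + t                                  # rigid transform
    proj = (K @ pts.T).T
    return proj[:, :2] / proj[:, 2:3]

def ba_refine(uv, matches, inv_depth0, K, iters=50, lr=1e-2):
    """Jointly refine a relative pose (w, t) and per-point inverse depth so that
    reprojected keyframe pixels land on their flow-predicted matches."""
    w = torch.zeros(3, requires_grad=True)
    t = torch.zeros(3, requires_grad=True)
    log_idepth = inv_depth0.clamp_min(1e-6).log().detach().clone().requires_grad_(True)
    K_inv = torch.inverse(K)
    opt = torch.optim.Adam([w, t, log_idepth], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        residual = reproject(uv, log_idepth.exp(), w, t, K, K_inv) - matches
        residual.pow(2).sum(dim=-1).mean().backward()
        opt.step()
    with torch.no_grad():
        residual = reproject(uv, log_idepth.exp(), w, t, K, K_inv) - matches
    return w.detach(), t.detach(), log_idepth.exp().detach(), residual
```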
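Fourth, the reliability-aware refinement loop, reusing the hypothetical flow_net and ba_refine helpers from the previous sketches. The residual threshold, the iteration budget, and the way the refined depth is written back into the prior are assumptions made for illustration; the flow field is assumed to be upsampled to full resolution.

```python
import torch

def lookup(flow, uv):
    """Nearest-neighbour lookup of a dense (1, 2, H, W) flow field at pixel coords (N, 2)."""
    u, v = uv[:, 0].long(), uv[:, 1].long()
    return flow[0, :, v, u].T                                         # (N, 2)

def refine_with_reliability(flow_net, frame, keyframe, depth_prior,
                            uv, inv_depth, K, max_rounds=3, thresh_px=1.5):
    """Alternate between flow prediction and BA until residuals are small."""
    flow = flow_net(frame, keyframe, depth_prior)
    for _ in range(max_rounds):
        matches = uv + lookup(flow, uv)                               # flow-predicted targets
        w, t, inv_depth, residual = ba_refine(uv, matches, inv_depth, K)
        # Low-residual pixels are trusted as-is; the rest are flagged as uncertain.
        unreliable = residual.norm(dim=-1) > thresh_px
        if not unreliable.any():
            break
        # Feed the BA-refined inverse depth back into the prior so the second
        # flow pass is conditioned on the updated geometry (the feedback loop).
        depth_prior = depth_prior.clone()
        depth_prior[0, 0, uv[:, 1].long(), uv[:, 0].long()] = inv_depth
        flow = flow_net(frame, keyframe, depth_prior)
    return w, t, inv_depth
```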
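Fifth, map fusion. The paper does not specify its TSDF backend, so this sketch uses Open3D's off-the-shelf ScalableTSDFVolume as a stand-in; the voxel size, truncation band, and depth truncation are illustrative values.

```python
import numpy as np
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,                       # 1 cm voxels (illustrative)
    sdf_trunc=0.04,                          # 4 cm truncation band (illustrative)
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
)

def fuse_keyframe(volume, rgb, depth, K, T_cam_to_world):
    """Integrate one refined keyframe (RGB + metric depth in meters) into the TSDF."""
    h, w = depth.shape
    intrinsic = o3d.camera.PinholeCameraIntrinsic(
        w, h, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(rgb.astype(np.uint8)),
        o3d.geometry.Image(depth.astype(np.float32)),
        depth_scale=1.0, depth_trunc=5.0, convert_rgb_to_intensity=False)
    # Open3D expects the world-to-camera extrinsic.
    volume.integrate(rgbd, intrinsic, np.linalg.inv(T_cam_to_world))

# After fusing all keyframes, extract a mesh for downstream queries:
# mesh = volume.extract_triangle_mesh()
```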
Results & Findings
| Dataset | Trajectory RMSE (m) | Dense Reconstruction F‑score | FPS |
|---|---|---|---|
| TUM‑RGBD (fr1/desk) | 0.018 (↓ 32% vs. prior flow‑SLAM) | 0.84 (↑ 9%) | 18 |
| ScanNet (scene‑018) | 0.025 (↓ 28%) | 0.81 (↑ 11%) | 18 |
| EuRoC MAV (V1_01) | 0.034 (↓ 30%) | 0.78 (↑ 10%) | 18 |
- Trajectory accuracy improves consistently across indoor and semi‑outdoor sequences, confirming that geometry‑aware flow reduces drift.
- Dense map quality (F‑score against ground‑truth meshes) surpasses prior learning‑based SLAM systems that rely solely on optical flow or depth prediction.
- Real‑time capability is maintained thanks to the lightweight hybrid flow network and the efficient, GPU‑accelerated BA layer.
- Generalization tests on unseen environments (e.g., handheld video, drone footage) show only minor performance drops, indicating that the foundation depth prior successfully transfers across domains.
Practical Implications
- Robotics & Drones – Developers can integrate FoundationSLAM into navigation stacks to obtain both accurate pose estimates and dense obstacle maps from a single monocular camera, reducing hardware cost and payload.
- AR/VR Experiences – Real‑time dense reconstruction enables on‑device scene meshing for occlusion handling, physics interaction, and persistent world anchors without needing depth sensors.
- 3‑D Scanning Apps – Mobile developers can deliver high‑quality mesh capture using only the phone’s RGB camera, leveraging the pre‑trained depth backbone that already runs efficiently on mobile GPUs.
- Cross‑modal Perception – The reliability‑aware loop provides a natural hook for fusing other modalities (e.g., IMU, LiDAR) by feeding their confidence scores into the refinement stage, opening paths to hybrid sensor fusion pipelines (a minimal illustrative sketch follows this list).
- Open‑source Potential – Because the core components (Hybrid Flow Net, differentiable BA) are built in PyTorch/CUDA, the system can be extended or pruned for edge devices, encouraging community contributions and domain‑specific customizations.
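As an illustration of the cross-modal hook mentioned above, the snippet below shows one plausible way to blend an external confidence map (e.g., from an IMU-propagated motion prior) with the flow-residual reliability signal before the refinement pass; the weighting scheme and names are assumptions, not part of the paper.

```python
import torch

def fuse_confidence(residual, ext_confidence, thresh_px=1.5, alpha=0.5):
    """Blend flow-residual reliability with an external sensor confidence map.

    residual:       (N, 2) reprojection residuals from the BA layer (pixels)
    ext_confidence: (N,) confidence in [0, 1] from another modality (e.g., IMU)
    Returns a boolean mask of pixels to treat as reliable in the refinement pass.
    """
    flow_conf = (-residual.norm(dim=-1) / thresh_px).exp()   # approaches 1 near zero residual
    fused = alpha * flow_conf + (1.0 - alpha) * ext_confidence
    return fused > 0.5
```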
Limitations & Future Work
- Depth Prior Dependency – The quality of the initial foundation depth model still caps the ultimate reconstruction fidelity; extreme lighting or reflective surfaces can still produce outliers.
- Memory Footprint – Maintaining dense depth for multiple active keyframes and a TSDF volume consumes several gigabytes of GPU memory, which may be prohibitive for low‑end embedded platforms.
- Dynamic Scenes – The current formulation assumes static geometry; moving objects are treated as unreliable regions but are not explicitly modeled, limiting performance in highly dynamic environments.
- Future Directions – Authors suggest (i) integrating learned motion segmentation to handle dynamics, (ii) exploring lightweight depth backbones for mobile deployment, and (iii) extending the BA layer to jointly optimize learned scene semantics alongside geometry.
Authors
- Yuchen Wu
- Jiahe Li
- Fabio Tosi
- Matteo Poggi
- Jin Zheng
- Xiao Bai
Paper Information
- arXiv ID: 2512.25008v1
- Categories: cs.CV
- Published: December 31, 2025