[Paper] YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos
Source: arXiv - 2512.09903v1
Overview
The paper introduces YOPO‑Nav, a visual‑navigation system that lets a robot replay human‑demonstrated routes using only a single pass of video footage. By compressing the environment into a network of lightweight 3D Gaussian‑Splatting (3DGS) models, the method sidesteps the heavyweight mapping and planning pipelines that dominate traditional robotics, making large‑scale navigation feasible on modest hardware.
Key Contributions
- One‑Pass Spatial Encoding – Converts raw exploration videos into a compact graph of local 3DGS representations, eliminating the need for dense metric maps.
- Hierarchical Navigation Stack – Combines a Visual Place Recognition (VPR) front‑end for coarse localization with 3DGS‑based pose refinement for precise action prediction.
- YOPO‑Campus Dataset – A new 4‑hour, 6 km egocentric video collection with synchronized robot control commands, released for reproducible research.
- Real‑World Validation – Demonstrates image‑goal navigation on a Clearpath Jackal robot, outperforming several recent visual‑navigation baselines.
- Open‑Source Release – Code and dataset will be publicly available, lowering the barrier for future work on video‑based navigation and scene representation.
Methodology
- Data Ingestion – A single exploratory video (e.g., a human‑teleoperated run) is split into short overlapping clips.
- Local 3DGS Construction – Each clip is processed by a splatting‑based reconstruction pipeline that fits a set of 3D Gaussians to the observed geometry and appearance, yielding a compact “splat” model.
- Graph Assembly – The local models are linked according to temporal adjacency, forming a directed graph where nodes store pose, visual descriptors, and the 3DGS parameters (a rough data‑structure sketch follows this list).
- Navigation Pipeline
  - Coarse Localization (VPR): Given the current camera frame, a lightweight CNN‑based place‑recognition module retrieves the most similar graph node.
  - Fine Pose Alignment: The retrieved node’s 3DGS is rendered from the robot’s estimated pose; an optimization aligns the live image with the rendering, producing a refined pose estimate.
  - Action Prediction: A small feed‑forward network consumes the refined pose and goal‑node information to output velocity commands that steer the robot back along the demonstrated trajectory.
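The summary does not specify how graph nodes are stored, so the following is only a rough illustration of the kind of structure the description implies: each node bundles a keyframe pose, a VPR descriptor, and the fitted Gaussian parameters, with directed edges following temporal adjacency. All names here (GraphNode, build_graph, and the clip field names) are assumptions for this sketch, not the authors' API.

```python
# Illustrative sketch of the 3DGS graph described above. GraphNode, build_graph,
# and the clip field names are assumptions for this sketch, not the authors' API.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class GraphNode:
    node_id: int
    pose: np.ndarray        # 4x4 camera-to-world pose of the clip's keyframe
    descriptor: np.ndarray  # global VPR descriptor of the keyframe
    gaussians: dict         # fitted 3DGS parameters (means, covariances, colors, opacities)
    neighbors: list = field(default_factory=list)  # outgoing edges (temporal adjacency)


def build_graph(clips):
    """Link per-clip local 3DGS models into a directed graph by temporal order."""
    nodes = []
    for i, clip in enumerate(clips):
        node = GraphNode(
            node_id=i,
            pose=clip["keyframe_pose"],
            descriptor=clip["keyframe_descriptor"],
            gaussians=clip["gaussians"],
        )
        if nodes:
            nodes[-1].neighbors.append(i)  # edge from the previous clip to this one
        nodes.append(node)
    return nodes
```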
The whole stack runs on a single GPU‑equipped onboard computer, with the 3DGS graph occupying only a few megabytes per kilometer of path.
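To make the coarse‑to‑fine decision step concrete, here is a minimal, hypothetical per‑frame routine. The callables encode, refine_pose, and predict_velocity stand in for the VPR encoder, the 3DGS render‑and‑compare alignment, and the action network, none of which are given at code level in the paper.

```python
# Hypothetical per-frame decision step for the coarse-to-fine pipeline above.
# `encode`, `refine_pose`, and `predict_velocity` are injected placeholders for the
# VPR encoder, the 3DGS render-and-compare alignment, and the action network.
import numpy as np


def retrieve_node(descriptor, graph):
    """Coarse localization: return the node whose VPR descriptor best matches the query."""
    scores = [float(node.descriptor @ descriptor) for node in graph]
    return graph[int(np.argmax(scores))]


def navigation_step(frame, graph, goal_id, prev_pose, encode, refine_pose, predict_velocity):
    """One decision: VPR retrieval, 3DGS pose refinement, then velocity prediction."""
    node = retrieve_node(encode(frame), graph)                 # coarse localization (VPR)
    pose = refine_pose(frame, node.gaussians, init=prev_pose)  # fine alignment to the rendered splat
    v, w = predict_velocity(pose, graph[goal_id].pose)         # linear/angular command toward the goal
    return (v, w), pose, node.node_id == goal_id               # command, refined pose, goal reached?
```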
Results & Findings
| Metric (Image‑Goal Nav) | YOPO‑Nav | Baseline A (VPR‑Only) | Baseline B (NeRF‑Nav) |
|---|---|---|---|
| Success Rate (%) | 87 | 62 | 71 |
| SPL (Success weighted by Path Length) | 0.73 | 0.48 | 0.55 |
| Avg. Latency per Decision (ms) | 38 | 45 | 62 |
- Higher success across all test routes, especially in visually repetitive corridors where pure VPR struggles.
- Lower latency thanks to the lightweight 3DGS representation (orders of magnitude smaller than full NeRFs).
- Robustness to lighting changes: the Gaussian splats capture both geometry and appearance, allowing reliable alignment even when illumination varies between the demonstration video and the test run.
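For readers unfamiliar with the SPL column: the paper does not restate the formula, but SPL is conventionally the per‑episode success indicator weighted by the ratio of shortest‑path length to the larger of shortest and actual path length, averaged over episodes. The sketch below assumes that standard definition.

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Mean over episodes of S_i * l_i / max(p_i, l_i): success weighted by path efficiency."""
    terms = [s * l / max(p, l) for s, l, p in zip(successes, shortest_lengths, actual_lengths)]
    return sum(terms) / len(terms)


# Example: three episodes; the first succeeds with a 20% detour, the second succeeds
# on the shortest path, the third fails.
print(round(spl([1, 1, 0], [10.0, 20.0, 15.0], [12.0, 20.0, 30.0]), 2))  # 0.61
```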
Practical Implications
- Fast Deployment in New Sites – A single human walkthrough is enough to bootstrap navigation, removing the need for labor‑intensive SLAM mapping.
- Edge‑Friendly Robotics – The compact graph fits comfortably in RAM on commodity embedded platforms, enabling autonomous delivery, inspection, or security robots to operate in large indoor/outdoor spaces.
- Retrofitting Existing Video Archives – Companies with fleets of dash‑cam footage can repurpose that data to create navigation graphs without re‑collecting sensor data.
- Simplified Maintenance – When the environment changes (e.g., furniture moved), a new pass updates only the affected graph nodes, avoiding a full rebuild.
Limitations & Future Work
- Static‑Scene Assumption – YOPO‑Nav assumes the underlying geometry stays roughly constant; dynamic obstacles are handled only by reactive controllers, not by the map itself.
- Dependence on Good Visual Overlap – Very sparse or highly occluded video passes can produce disconnected graph segments, limiting coverage.
- Scalability to Multi‑Floor Buildings – Current graph linking is linear in time; future work will explore hierarchical clustering and cross‑floor shortcuts.
- Learning‑Based Action Module – The control predictor is simple; integrating reinforcement‑learning fine‑tuning could improve agility in cluttered spaces.
Overall, YOPO‑Nav opens a pragmatic path toward “video‑first” robot navigation, turning everyday footage into actionable maps that are both lightweight and accurate enough for real‑world deployment.
Authors
- Ryan Meegan
- Adam D’Souza
- Bryan Bo Cao
- Shubham Jain
- Kristin Dana
Paper Information
- arXiv ID: 2512.09903v1
- Categories: cs.RO, cs.CV
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09903v1