[Paper] YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

Published: December 10, 2025 at 01:32 PM EST
4 min read
Source: arXiv - 2512.09903v1

Overview

The paper introduces YOPO‑Nav, a visual‑navigation system that lets a robot retrace human‑demonstrated routes from a single pass of video footage. By compressing the environment into a graph of lightweight local 3D Gaussian‑Splatting (3DGS) models, the method sidesteps the heavyweight mapping and planning pipelines of traditional robotics stacks, making large‑scale navigation feasible on modest hardware.

Key Contributions

  • One‑Pass Spatial Encoding – Converts raw exploration videos into a compact graph of local 3DGS representations, eliminating the need for dense metric maps.
  • Hierarchical Navigation Stack – Combines a Visual Place Recognition (VPR) front‑end for coarse localization with 3DGS‑based pose refinement for precise action prediction.
  • YOPO‑Campus Dataset – A new 4‑hour, 6 km egocentric video collection with synchronized robot control commands, released for reproducible research.
  • Real‑World Validation – Demonstrates image‑goal navigation on a Clearpath Jackal robot, outperforming several recent visual‑navigation baselines.
  • Open‑Source Release – Code and dataset will be publicly available, lowering the barrier for future work on video‑based navigation and scene representation.

Methodology

  1. Data Ingestion – A single exploratory video (e.g., a human‑teleoperated run) is split into short overlapping clips.
  2. Local 3DGS Construction – Each clip is optimized into a set of 3D Gaussians that capture the observed geometry and appearance, yielding a compact “splat” model.
  3. Graph Assembly – The local models are linked according to temporal adjacency, forming a directed graph whose nodes store pose, visual descriptors, and the 3DGS parameters (see the data‑structure sketch after this list).
  4. Navigation Pipeline (the coarse‑to‑fine loop is sketched in code after this section)
    • Coarse Localization (VPR): Given the current camera frame, a lightweight CNN‑based place‑recognition module retrieves the most similar graph node.
    • Fine Pose Alignment: The retrieved node’s 3DGS is rendered from the robot’s estimated pose; an optimization aligns the live image with the rendering, producing a refined pose estimate.
    • Action Prediction: A small feed‑forward network consumes the refined pose and goal node information to output velocity commands that steer the robot back along the demonstrated trajectory.
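
The summary above doesn't fix concrete data structures, but a minimal sketch of the graph from steps 2–3 might look like the following. All class and field names here are illustrative assumptions, not taken from the paper or its released code:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SplatModel:
    """Parameters of one local 3D Gaussian Splatting model (illustrative layout)."""
    means: np.ndarray        # (N, 3) Gaussian centers
    covariances: np.ndarray  # (N, 3, 3) anisotropic covariances
    colors: np.ndarray       # (N, 3) RGB (real systems typically store SH coefficients)
    opacities: np.ndarray    # (N,) per-Gaussian opacity

@dataclass
class GraphNode:
    node_id: int
    pose: np.ndarray                    # 4x4 camera-to-world transform of the clip's keyframe
    descriptor: np.ndarray              # global image descriptor used by the VPR front-end
    splat: SplatModel                   # compact local 3DGS reconstruction of the clip
    neighbors: list = field(default_factory=list)  # node_ids reachable along the demonstrated route

def assemble_graph(nodes):
    """Link per-clip nodes in temporal order to form a directed route graph (step 3)."""
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.neighbors.append(nxt.node_id)   # directed edge following temporal adjacency
    return {n.node_id: n for n in nodes}
```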

The whole stack runs on a single GPU‑equipped onboard computer, with the 3DGS graph occupying only a few megabytes per kilometer of path.
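
As a rough illustration of the coarse‑to‑fine loop in step 4, one control cycle could be organized as below. `render_splat`, `refine_pose`, and `predict_velocity` are hypothetical stand‑ins for the 3DGS renderer, the render‑and‑compare alignment, and the learned action head; their names and signatures are assumptions made for this sketch, not the authors' API.

```python
import numpy as np

def retrieve_node(frame_descriptor, graph):
    """Coarse localization: nearest-neighbor search over node descriptors (VPR front-end)."""
    return max(graph.values(),
               key=lambda n: float(np.dot(n.descriptor, frame_descriptor)))  # similarity score

def navigation_step(frame, frame_descriptor, pose_guess, goal_id, graph,
                    render_splat, refine_pose, predict_velocity):
    """One retrieve -> align -> act cycle; the three callables are injected stand-ins."""
    node = retrieve_node(frame_descriptor, graph)              # coarse localization via VPR
    rendering = render_splat(node.splat, pose_guess)           # render the node's 3DGS at the pose guess
    refined_pose = refine_pose(frame, rendering, pose_guess)   # align the live image against the rendering
    v, omega = predict_velocity(refined_pose, node, graph[goal_id])  # feed-forward action prediction
    return refined_pose, (v, omega)                            # velocity command steers back onto the route
```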

Results & Findings

| Metric (Image‑Goal Nav) | YOPO‑Nav | Baseline A (VPR‑Only) | Baseline B (NeRF‑Nav) |
| --- | --- | --- | --- |
| Success Rate (%) | 87 | 62 | 71 |
| SPL (Success weighted by Path Length) | 0.73 | 0.48 | 0.55 |
| Avg. Latency per Decision (ms) | 38 | 45 | 62 |
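
SPL here is presumably the standard success‑weighted‑by‑path‑length metric; its usual definition is

```latex
\mathrm{SPL} \;=\; \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\,\ell_i)}
```

where S_i marks success on episode i, ℓ_i is the shortest‑path length from start to goal, and p_i is the path length the agent actually traveled.
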
  • Higher success across all test routes, especially in visually repetitive corridors where pure VPR struggles.
  • Lower latency thanks to the lightweight 3DGS representation (orders of magnitude smaller than full NeRFs).
  • Robustness to lighting changes: the Gaussian splats capture both geometry and appearance, allowing reliable alignment even when illumination varies between the demonstration video and the test run.

Practical Implications

  • Fast Deployment in New Sites – A single human walkthrough is enough to bootstrap navigation, removing the need for labor‑intensive SLAM mapping.
  • Edge‑Friendly Robotics – The compact graph fits comfortably in RAM on commodity embedded platforms, enabling autonomous delivery, inspection, or security robots to operate in large indoor/outdoor spaces.
  • Retrofitting Existing Video Archives – Companies with fleets of dash‑cam footage can repurpose that data to create navigation graphs without re‑collecting sensor data.
  • Simplified Maintenance – When the environment changes (e.g., furniture moved), a new pass updates only the affected graph nodes, avoiding a full rebuild (a minimal sketch of such a node‑level update follows this list).
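
The mechanism behind this incremental update isn't spelled out here; purely as an illustration, a node‑level replacement over the hypothetical graph structure sketched earlier could look like this:

```python
def update_segment(graph, stale_ids, new_nodes, attach_to=None):
    """Illustrative incremental update: splice re-captured nodes over a changed stretch of route.

    Only the stale nodes are dropped and the freshly built nodes inserted;
    the rest of the graph (and its splat models) is left untouched.
    """
    for nid in stale_ids:
        graph.pop(nid, None)                              # discard nodes from the changed region
    for prev, nxt in zip(new_nodes, new_nodes[1:]):
        prev.neighbors.append(nxt.node_id)                # chain the new nodes temporally
    graph.update({n.node_id: n for n in new_nodes})
    if attach_to is not None and new_nodes:
        graph[attach_to].neighbors.append(new_nodes[0].node_id)  # reconnect to the untouched route
    return graph
```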

Limitations & Future Work

  • Static‑Scene Assumption – YOPO‑Nav assumes the underlying geometry stays roughly constant; dynamic obstacles are handled only by reactive controllers, not by the map itself.
  • Dependence on Good Visual Overlap – Very sparse or highly occluded video passes can produce disconnected graph segments, limiting coverage.
  • Scalability to Multi‑Floor Buildings – Current graph linking is linear in time; future work will explore hierarchical clustering and cross‑floor shortcuts.
  • Learning‑Based Action Module – The control predictor is simple; integrating reinforcement‑learning fine‑tuning could improve agility in cluttered spaces.

Overall, YOPO‑Nav opens a pragmatic path toward “video‑first” robot navigation, turning everyday footage into actionable maps that are both lightweight and accurate enough for real‑world deployment.

Authors

  • Ryan Meegan
  • Adam D’Souza
  • Bryan Bo Cao
  • Shubham Jain
  • Kristin Dana

Paper Information

  • arXiv ID: 2512.09903v1
  • Categories: cs.RO, cs.CV
  • Published: December 10, 2025