[Paper] YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

Published: December 10, 2025 at 01:32 PM EST
4 min read
Source: arXiv - 2512.09903v1

Overview

The paper introduces YOPO‑Nav, a visual‑navigation system that lets a robot retrace human‑demonstrated routes from a single pass of video footage. By compressing the environment into a graph of lightweight local 3D Gaussian‑Splatting (3DGS) models, the method sidesteps the heavyweight mapping and planning pipelines of traditional robotics stacks, making large‑scale navigation feasible on modest hardware.

Key Contributions

  • One‑Pass Spatial Encoding – Converts raw exploration videos into a compact graph of local 3DGS representations, eliminating the need for dense metric maps.
  • Hierarchical Navigation Stack – Combines a Visual Place Recognition (VPR) front‑end for coarse localization with 3DGS‑based pose refinement for precise action prediction.
  • YOPO‑Campus Dataset – A new 4‑hour, 6 km egocentric video collection with synchronized robot control commands, released for reproducible research.
  • Real‑World Validation – Demonstrates image‑goal navigation on a Clearpath Jackal robot, outperforming several recent visual‑navigation baselines.
  • Open‑Source Release – Code and dataset will be publicly available, lowering the barrier for future work on video‑based navigation and scene representation.

Methodology

  1. Data Ingestion – A single exploratory video (e.g., a human‑teleoperated run) is split into short overlapping clips.
  2. Local 3DGS Construction – Each clip is optimized into a set of 3D Gaussians that capture the observed geometry and appearance, yielding a compact “splat” model.
  3. Graph Assembly – The local models are linked according to temporal adjacency, forming a directed graph whose nodes store pose, visual descriptors, and the 3DGS parameters (see the data‑structure sketch after this list).
  4. Navigation Pipeline (the coarse‑to‑fine loop is sketched in code after this section)
    • Coarse Localization (VPR): Given the current camera frame, a lightweight CNN‑based place‑recognition module retrieves the most similar graph node.
    • Fine Pose Alignment: The retrieved node’s 3DGS is rendered from the robot’s estimated pose; an optimization aligns the live image with the rendering, producing a refined pose estimate.
    • Action Prediction: A small feed‑forward network consumes the refined pose and goal node information to output velocity commands that steer the robot back along the demonstrated trajectory.
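
The summary above doesn't fix concrete data structures, but a minimal sketch of the graph from steps 2–3 might look like the following. All class and field names here are illustrative assumptions, not taken from the paper or its released code:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SplatModel:
    """Parameters of one local 3D Gaussian Splatting model (illustrative layout)."""
    means: np.ndarray        # (N, 3) Gaussian centers
    covariances: np.ndarray  # (N, 3, 3) anisotropic covariances
    colors: np.ndarray       # (N, 3) RGB (real systems typically store SH coefficients)
    opacities: np.ndarray    # (N,) per-Gaussian opacity

@dataclass
class GraphNode:
    node_id: int
    pose: np.ndarray                    # 4x4 camera-to-world transform of the clip's keyframe
    descriptor: np.ndarray              # global image descriptor used by the VPR front-end
    splat: SplatModel                   # compact local 3DGS reconstruction of the clip
    neighbors: list = field(default_factory=list)  # node_ids reachable along the demonstrated route

def assemble_graph(nodes):
    """Link per-clip nodes in temporal order to form a directed route graph (step 3)."""
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.neighbors.append(nxt.node_id)   # directed edge following temporal adjacency
    return {n.node_id: n for n in nodes}
```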

The whole stack runs on a single GPU‑equipped onboard computer, with the 3DGS graph occupying only a few megabytes per kilometer of path.
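
As a rough illustration of the coarse‑to‑fine loop in step 4, one control cycle could be organized as below. `render_splat`, `refine_pose`, and `predict_velocity` are hypothetical stand‑ins for the 3DGS renderer, the render‑and‑compare alignment, and the learned action head; their names and signatures are assumptions made for this sketch, not the authors' API.

```python
import numpy as np

def retrieve_node(frame_descriptor, graph):
    """Coarse localization: nearest-neighbor search over node descriptors (VPR front-end)."""
    return max(graph.values(),
               key=lambda n: float(np.dot(n.descriptor, frame_descriptor)))  # similarity score

def navigation_step(frame, frame_descriptor, pose_guess, goal_id, graph,
                    render_splat, refine_pose, predict_velocity):
    """One retrieve -> align -> act cycle; the three callables are injected stand-ins."""
    node = retrieve_node(frame_descriptor, graph)              # coarse localization via VPR
    rendering = render_splat(node.splat, pose_guess)           # render the node's 3DGS at the pose guess
    refined_pose = refine_pose(frame, rendering, pose_guess)   # align the live image against the rendering
    v, omega = predict_velocity(refined_pose, node, graph[goal_id])  # feed-forward action prediction
    return refined_pose, (v, omega)                            # velocity command steers back onto the route
```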

Results & Findings

| Metric (Image‑Goal Nav) | YOPO‑Nav | Baseline A (VPR‑Only) | Baseline B (NeRF‑Nav) |
| --- | --- | --- | --- |
| Success Rate (%) | 87 | 62 | 71 |
| SPL (Success weighted by Path Length) | 0.73 | 0.48 | 0.55 |
| Avg. Latency per Decision (ms) | 38 | 45 | 62 |
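
SPL here is presumably the standard success‑weighted‑by‑path‑length metric; its usual definition is

```latex
\mathrm{SPL} \;=\; \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\,\ell_i)}
```

where S_i marks success on episode i, ℓ_i is the shortest‑path length from start to goal, and p_i is the path length the agent actually traveled.
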
  • Higher success across all test routes, especially in visually repetitive corridors where pure VPR struggles.
  • Lower latency thanks to the lightweight 3DGS representation (orders of magnitude smaller than full NeRFs).
  • Robustness to lighting changes: the Gaussian splats capture both geometry and appearance, allowing reliable alignment even when illumination varies between the demonstration video and the test run.

Practical Implications

  • Fast Deployment in New Sites – A single human walkthrough is enough to bootstrap navigation, removing the need for labor‑intensive SLAM mapping.
  • Edge‑Friendly Robotics – The compact graph fits comfortably in RAM on commodity embedded platforms, enabling autonomous delivery, inspection, or security robots to operate in large indoor/outdoor spaces.
  • Retrofitting Existing Video Archives – Companies with fleets of dash‑cam footage can repurpose that data to create navigation graphs without re‑collecting sensor data.
  • Simplified Maintenance – When the environment changes (e.g., furniture moved), a new pass updates only the affected graph nodes, avoiding a full rebuild (a minimal sketch of such a node‑level update follows this list).
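
The mechanism behind this incremental update isn't spelled out here; purely as an illustration, a node‑level replacement over the hypothetical graph structure sketched earlier could look like this:

```python
def update_segment(graph, stale_ids, new_nodes, attach_to=None):
    """Illustrative incremental update: splice re-captured nodes over a changed stretch of route.

    Only the stale nodes are dropped and the freshly built nodes inserted;
    the rest of the graph (and its splat models) is left untouched.
    """
    for nid in stale_ids:
        graph.pop(nid, None)                              # discard nodes from the changed region
    for prev, nxt in zip(new_nodes, new_nodes[1:]):
        prev.neighbors.append(nxt.node_id)                # chain the new nodes temporally
    graph.update({n.node_id: n for n in new_nodes})
    if attach_to is not None and new_nodes:
        graph[attach_to].neighbors.append(new_nodes[0].node_id)  # reconnect to the untouched route
    return graph
```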

Limitations & Future Work

  • Static‑Scene Assumption – YOPO‑Nav assumes the underlying geometry stays roughly constant; dynamic obstacles are handled only by reactive controllers, not by the map itself.
  • Dependence on Good Visual Overlap – Very sparse or highly occluded video passes can produce disconnected graph segments, limiting coverage.
  • Scalability to Multi‑Floor Buildings – Current graph linking is linear in time; future work will explore hierarchical clustering and cross‑floor shortcuts.
  • Learning‑Based Action Module – The control predictor is simple; integrating reinforcement‑learning fine‑tuning could improve agility in cluttered spaces.

Overall, YOPO‑Nav opens a pragmatic path toward “video‑first” robot navigation, turning everyday footage into actionable maps that are both lightweight and accurate enough for real‑world deployment.

Authors

  • Ryan Meegan
  • Adam D’Souza
  • Bryan Bo Cao
  • Shubham Jain
  • Kristin Dana

Paper Information

  • arXiv ID: 2512.09903v1
  • Categories: cs.RO, cs.CV
  • Published: December 10, 2025