[Paper] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Published: December 11, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.10956v1

Overview

The paper introduces StereoWalker, a robot‑navigation foundation model that combines stereo camera input with explicit mid‑level vision (depth estimation and dense pixel tracking). By leveraging these richer visual cues, the authors demonstrate that navigation in dynamic, unstructured urban environments can be learned with far less data and higher accuracy than existing monocular‑only approaches.

Key Contributions

  • Stereo‑augmented navigation model: Extends end‑to‑end navigation foundation models to ingest synchronized left‑right images, eliminating the depth‑scale ambiguity inherent to monocular vision.
  • Mid‑level vision integration: Incorporates off‑the‑shelf depth and dense‑tracking modules as explicit inputs, providing geometric and motion priors to the policy network.
  • Large‑scale stereo navigation dataset: Curates a new dataset of Internet‑sourced stereo video clips with automatically generated action labels, released for community use.
  • Data‑efficiency breakthrough: Shows that StereoWalker reaches state‑of‑the‑art performance with only 1.5 % of the training data required by prior monocular models.
  • Empirical superiority: With the full dataset, StereoWalker outperforms the current best monocular navigation baselines across multiple dynamic‑scene benchmarks.

Methodology

1. Data Collection & Annotation

  • Harvested thousands of publicly available stereo video sequences (e.g., from YouTube 3‑D content).
  • Applied a heuristic controller (e.g., visual‑odometry‑based waypoint following) to generate pseudo‑ground‑truth navigation actions, yielding a self‑supervised training signal (see the sketch below).
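
The summary describes this controller only at a high level. As one concrete reading, the sketch below derives per‑frame (linear, angular) velocity labels by differencing visual‑odometry poses; the function and the pose‑differencing scheme are assumptions, not the authors' implementation:

```python
import numpy as np

def actions_from_odometry(positions, yaws, dt):
    """Derive pseudo-ground-truth (linear, angular) velocity labels from a
    visual-odometry trajectory. positions: (N, 2) planar camera positions,
    yaws: (N,) headings in radians, dt: time between frames (seconds)."""
    # Linear velocity: displacement between consecutive poses over dt.
    deltas = np.diff(positions, axis=0)             # (N-1, 2)
    linear_v = np.linalg.norm(deltas, axis=1) / dt  # (N-1,)

    # Angular velocity: heading change wrapped to [-pi, pi), over dt.
    dyaw = np.diff(yaws)
    dyaw = (dyaw + np.pi) % (2 * np.pi) - np.pi
    angular_v = dyaw / dt                           # (N-1,)

    # One (v, omega) action label per frame transition.
    return np.stack([linear_v, angular_v], axis=1)  # (N-1, 2)
```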

2. Mid‑Level Vision Modules

  • Depth Estimation: A pre‑trained stereo disparity network (e.g., RAFT‑Stereo) produces per‑pixel depth maps.
  • Dense Pixel Tracking: A modern optical‑flow model (e.g., RAFT) supplies pixel‑wise motion vectors across frames.
  • Both outputs are concatenated with the raw left‑right RGB frames, forming a multi‑channel observation tensor (a minimal sketch follows).
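
A minimal sketch of this concatenation, assuming channel‑first tensors and a 9‑channel layout (3 + 3 RGB, 1 depth, 2 flow); the exact layout is not specified in the summary:

```python
import torch

def build_observation(left_rgb, right_rgb, depth, flow):
    """Stack raw stereo RGB with mid-level cues into one observation tensor.
    Assumed shapes (channel-first): left_rgb/right_rgb (3, H, W),
    depth (1, H, W), flow (2, H, W)."""
    return torch.cat([left_rgb, right_rgb, depth, flow], dim=0)  # (9, H, W)

# Usage with dummy data standing in for real network outputs.
H, W = 240, 320
obs = build_observation(torch.rand(3, H, W), torch.rand(3, H, W),
                        torch.rand(1, H, W), torch.rand(2, H, W))
assert obs.shape == (9, H, W)
```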

3. Policy Architecture

  • A convolutional encoder processes the stacked observation, extracting a compact latent representation.
  • A recurrent core (GRU) captures temporal dependencies, which are crucial for handling dynamic obstacles.
  • A lightweight MLP head maps the hidden state to continuous control commands (linear & angular velocity). The full pipeline is sketched below.
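
Putting the three components together, a minimal PyTorch sketch might look as follows; layer widths and kernel sizes are illustrative guesses, since the summary does not specify them:

```python
import torch
import torch.nn as nn

class StereoNavPolicy(nn.Module):
    """Minimal sketch of the described encoder -> GRU -> MLP pipeline."""
    def __init__(self, in_channels=9, latent_dim=256, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(               # convolutional encoder
            nn.Conv2d(in_channels, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim), nn.ReLU(),
        )
        self.gru = nn.GRU(latent_dim, hidden_dim, batch_first=True)  # temporal core
        self.head = nn.Sequential(                  # control head: (v, omega)
            nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 2),
        )

    def forward(self, obs_seq, h0=None):
        # obs_seq: (B, T, C, H, W) sequence of observation tensors
        B, T = obs_seq.shape[:2]
        z = self.encoder(obs_seq.flatten(0, 1)).view(B, T, -1)  # per-frame latents
        out, hT = self.gru(z, h0)
        return self.head(out), hT  # (B, T, 2) linear & angular velocity
```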

4. Training Regime

  • Supervised imitation learning using the generated action labels.
  • Curriculum learning: start with static scenes, progressively introduce more dynamic traffic and pedestrians.
  • Data augmentation (random cropping, illumination jitter) to improve robustness; a single imitation‑learning step is sketched below.
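
A minimal sketch of one imitation‑learning update, using the policy sketched above; the L2 behavior‑cloning loss is an assumption, as the summary names supervised imitation learning but not the exact objective:

```python
import torch
import torch.nn.functional as F

def imitation_step(policy, optimizer, obs_seq, action_labels):
    """One supervised update: regress the pseudo-labeled velocity commands.
    obs_seq: (B, T, C, H, W); action_labels: (B, T, 2)."""
    pred_actions, _ = policy(obs_seq)               # (B, T, 2) predicted (v, omega)
    loss = F.mse_loss(pred_actions, action_labels)  # behavior-cloning (L2) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4);
# the curriculum then amounts to scheduling which episodes (static first,
# then increasingly dynamic) populate each batch.
```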

5. Evaluation

  • Benchmarked on two urban navigation simulators (CARLA‑Dynamic and Habitat‑Urban) with moving agents and varying lighting.
  • Metrics: success rate (reaching the goal), collision rate, trajectory efficiency, and sample efficiency (performance vs. training‑data size). A sketch of how these could be computed follows.
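
These metrics can be aggregated per episode roughly as below; the episode schema and the SPL‑style efficiency ratio are assumptions, since the summary does not give exact definitions:

```python
import numpy as np

def aggregate_metrics(episodes):
    """episodes: list of dicts with 'reached_goal' (bool), 'collided' (bool),
    'path_length' and 'shortest_path' (floats). Schema is illustrative."""
    success = np.mean([e["reached_goal"] for e in episodes])
    collision = np.mean([e["collided"] for e in episodes])
    # Trajectory efficiency: optimal-to-traveled path ratio on successful
    # episodes (SPL-style; the paper's exact definition is not given here).
    ratios = [e["shortest_path"] / max(e["path_length"], e["shortest_path"])
              for e in episodes if e["reached_goal"]]
    efficiency = float(np.mean(ratios)) if ratios else float("nan")
    return {"success_rate": success, "collision_rate": collision,
            "trajectory_efficiency": efficiency}
```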

Results & Findings

Setting                        Success Rate   Collision Rate   Data Used
StereoWalker (full data)       92 %           4 %              100 %
Mono‑only NFM (baseline)       84 %           9 %              100 %
StereoWalker (1.5 % data)      89 %           5 %              1.5 %
StereoWalker (no mid‑level)    78 %           12 %             100 %

  • Stereo input alone already beats the monocular baselines, confirming that resolving the depth‑scale ambiguity is a major factor.
  • Adding depth + flow yields the biggest jump, especially in crowded scenes where motion cues help predict pedestrian trajectories.
  • Sample efficiency: With only 1.5 % of the data, StereoWalker matches the full‑data performance of the monocular state‑of‑the‑art model, highlighting the value of explicit geometric priors.

Practical Implications

  • Reduced data collection costs: Developers can train competent navigation policies with a fraction of the video data traditionally required, lowering storage and annotation overhead.
  • Hardware feasibility: Stereo cameras are now inexpensive and widely supported on mobile robots and autonomous vehicles; integrating them yields immediate performance gains without redesigning the entire perception stack.
  • Modular system design: By treating depth and flow as plug‑and‑play modules, existing robotics pipelines can adopt StereoWalker without retraining low‑level perception networks.
  • Improved safety in dynamic environments: Explicit motion understanding helps anticipate moving obstacles, a critical requirement for delivery robots, warehouse AGVs, and last‑mile autonomous vehicles.
  • Open dataset & benchmark: The released stereo navigation dataset provides a new standard for evaluating future navigation foundation models, encouraging community‑driven progress.

Limitations & Future Work

  • Reliance on calibrated stereo rigs: Misalignment or baseline drift can degrade depth quality; the paper assumes well‑calibrated hardware.
  • Synthetic action labels: The pseudo‑ground‑truth actions are generated by a heuristic controller, which may not capture expert human strategies; real‑world demonstrations could further improve policy quality.
  • Domain gap: Training on Internet stereo videos (often indoor or cinematic) may not fully represent the sensor noise and lighting conditions of real urban deployments.
  • Scalability to higher‑level reasoning: The current model focuses on low‑level control; extending it to incorporate semantic maps or long‑term planning remains an open challenge.

Authors

  • Wentao Zhou
  • Xuweiyi Chen
  • Vignesh Rajagopal
  • Jeffrey Chen
  • Rohan Chandra
  • Zezhou Cheng

Paper Information

  • arXiv ID: 2512.10956v1
  • Categories: cs.CV
  • Published: December 11, 2025