[Paper] CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Published: April 21, 2026 (01:59 PM EDT)
5 min read
Source: arXiv

Overview

CityRAG introduces a new way to generate long, photorealistic video footage that is spatially grounded in a real‑world city. By tying a generative model to massive collections of geo‑registered imagery (e.g., street‑view panoramas, satellite maps, traffic cams), the system can synthesize minutes‑long, 3‑D‑consistent video that respects the actual layout of streets, buildings, and landmarks, while still allowing flexible control over weather, lighting, and dynamic objects. This bridges the gap between "creative" video synthesis and the need for realistic, navigable environments in autonomous driving, robotics, and virtual‑city simulation.

Key Contributions

  • Spatial grounding via geo‑registered context: CityRAG ingests large, unaligned datasets (street‑view, satellite, aerial) to anchor generated frames to a specific physical location.
  • Temporal disentanglement of scene vs. transient attributes: The model learns to separate permanent geometry (roads, buildings) from changeable factors (weather, traffic, time of day).
  • Long‑duration, loop‑closed video generation: Demonstrates coherent generation over thousands of frames, maintaining consistent lighting and weather, and supporting loop closure for navigation.
  • Trajectory‑driven navigation: Users can specify arbitrary camera paths (e.g., driving routes) and the model renders a video that faithfully follows the underlying city map.
  • Training on unaligned data: No need for synchronized video streams; the system leverages existing, loosely timed geo‑tagged imagery, dramatically reducing data collection overhead.

Methodology

  1. Data Backbone: CityRAG builds a multi‑modal database of geo‑registered assets:

    • Static maps (satellite orthophotos, GIS road graphs) provide the immutable layout.
    • Dynamic imagery (street‑view panoramas, dash‑cam clips) supplies appearance cues under various conditions.
  2. Scene Encoder: A transformer‑based encoder ingests the static map and extracts a spatial embedding for each 3‑D coordinate. This embedding acts as a “scene fingerprint” that remains constant across time.

  3. Attribute Decoder: A separate diffusion‑style decoder receives the spatial embedding plus a condition vector (weather, time of day, traffic density). Because training data are temporally unaligned, the decoder learns to apply the condition vector only to transient visual aspects, leaving the underlying geometry untouched.

  4. Trajectory Conditioning: Users provide a sequence of GPS waypoints or a parametric path. The system samples the corresponding spatial embeddings along the path and feeds them to the decoder frame‑by‑frame, stitching the outputs into a smooth video.

  5. Loop Closure & Consistency: A self‑supervised loss penalizes drift in the latent space when the camera returns to a previously visited location, encouraging the model to produce identical frames for the same spatial coordinate regardless of when they are rendered.
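The loop‑closure idea in step 5 can be sketched as a simple pairwise penalty: whenever two frames are rendered at (approximately) the same ground‑plane coordinate, the distance between their latents is penalized. This is a minimal NumPy illustration, not the paper's implementation — the function name, latent shapes, and distance tolerance are all assumptions.

```python
import numpy as np

def loop_closure_loss(latents, coords, tol=1.0):
    """Penalize latent drift between frames rendered at the same place.

    latents: (T, D) array of per-frame latent vectors
    coords:  (T, 2) array of per-frame ground-plane coordinates (e.g., metres)
    tol:     distance below which two frames count as "the same location"
    """
    T = len(coords)
    loss, pairs = 0.0, 0
    for i in range(T):
        for j in range(i + 1, T):
            # Only frame pairs that revisit the same coordinate contribute.
            if np.linalg.norm(coords[i] - coords[j]) < tol:
                loss += float(np.sum((latents[i] - latents[j]) ** 2))
                pairs += 1
    # Average over matched pairs; zero if the trajectory never loops.
    return loss / max(pairs, 1)
```

Minimizing this loss pushes the decoder toward producing identical latents — and hence identical frames — for a given spatial coordinate, regardless of when in the sequence it is visited.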

Results & Findings

  • Coherent minutes‑long videos: CityRAG generated videos up to 5 minutes (≈ 9 000 frames) without noticeable flicker or geometry distortion.
  • Weather & lighting persistence: When conditioned on “rainy night,” the model maintained rain streaks, wet surfaces, and low‑light shading consistently across the entire sequence.
  • Loop closure success: In a test where the virtual camera completed a city block loop, the start and end frames matched pixel‑wise within 2 % error, confirming spatial grounding.
  • Complex trajectory handling: The model navigated sharp turns, elevation changes, and occlusions (e.g., passing under bridges) while preserving the correct perspective and depth cues.
  • Quantitative metrics: Compared to baseline text‑to‑video diffusion models, CityRAG improved structural similarity (SSIM) to ground‑truth street‑view footage by 18 % and reduced temporal inconsistency (measured by optical‑flow variance) by 27 %.
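The SSIM comparison in the last bullet can be illustrated with a single‑window SSIM over whole images, using the standard Wang et al. constants. The paper's evaluation presumably uses the usual windowed variant, so this is only a simplified sketch of what the metric measures.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM between two grayscale images in [0, data_range]."""
    C1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    C2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2)
    )
```

Identical images score 1.0; structural or luminance differences pull the score below 1, which is why SSIM against ground‑truth street‑view footage serves as a proxy for spatial fidelity.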

Practical Implications

  • Autonomous‑vehicle simulation: Engineers can generate endless, photorealistic driving scenarios for perception stack testing without manually building 3‑D assets or capturing new footage.
  • Robotics & SLAM research: Spatially grounded video provides a cheap source of synthetic yet realistic data for training and evaluating localization and mapping algorithms.
  • Urban planning & VR tourism: Planners can preview how a proposed street redesign would look under different weather conditions, while VR platforms can stream “live” city tours without storing massive video files.
  • Data augmentation: Existing datasets (e.g., Waymo Open Dataset) can be expanded with synthetic variations (night, fog, heavy traffic) that remain faithful to the original map geometry, improving model robustness.
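The data‑augmentation use case amounts to sweeping the condition vector over a fixed trajectory. A hypothetical sketch of such a sweep — the attribute names and values here mirror the transient factors the paper lists (weather, time of day, traffic density) but are not its actual schema:

```python
from itertools import product

# Hypothetical condition attributes; the real conditioning interface
# of CityRAG may differ.
WEATHER = ["clear", "rain", "fog"]
TIME_OF_DAY = ["day", "dusk", "night"]
TRAFFIC = ["light", "heavy"]

def condition_grid():
    """Enumerate every combination of transient conditions for one trajectory."""
    return [
        {"weather": w, "time_of_day": t, "traffic": d}
        for w, t, d in product(WEATHER, TIME_OF_DAY, TRAFFIC)
    ]
```

Rendering the same route once per condition vector yields many appearance variants of a single scene while the underlying map geometry stays fixed.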

Limitations & Future Work

  • Resolution ceiling: Current experiments are limited to 512 × 512 pixels; scaling to 4K for high‑fidelity simulation will require more efficient diffusion architectures.
  • Dynamic object realism: While weather and lighting are well modeled, moving agents (cars, pedestrians) are generated as static textures; integrating physics‑based agents remains an open challenge.
  • Geographic bias: The model performs best in regions densely covered by geo‑registered imagery (e.g., North America, Europe). Extending to under‑represented cities will need better data collection pipelines.
  • Real‑time inference: Generation still takes several seconds per frame; future work will explore latent‑space caching and GPU‑accelerated diffusion to enable interactive navigation.

CityRAG marks a significant step toward bridging generative video models and real‑world spatial fidelity, opening new avenues for developers building next‑generation simulation, training, and immersive experiences.

Authors

  • Gene Chou
  • Charles Herrmann
  • Kyle Genova
  • Boyang Deng
  • Songyou Peng
  • Bharath Hariharan
  • Jason Y. Zhang
  • Noah Snavely
  • Philipp Henzler

Paper Information

  • arXiv ID: 2604.19741v1
  • Categories: cs.CV
  • Published: April 21, 2026
