[Paper] Generative Scenario Rollouts for End-to-End Autonomous Driving

Published: January 16, 2026 at 12:59 PM EST
4 min read
Source: arXiv - 2601.11475v1

Overview

The paper introduces Generative Scenario Rollouts (GeRo), a plug‑and‑play framework that extends vision‑language‑action (VLA) models from pure imitation learning into generative, language‑conditioned planners for autonomous driving. By letting the model imagine future traffic scenes and answer “what‑if” questions, GeRo achieves more reliable long‑horizon decisions while keeping the reasoning traceable through natural‑language descriptions.

Key Contributions

  • Joint planning & scene generation: Trains a VLA model to encode both ego‑vehicle and surrounding agents into latent tokens that can be used for action prediction and for autoregressive generation of future scenes.
  • Language‑grounded rollouts: Introduces a rollout‑consistency loss that aligns generated latent tokens with textual scenario descriptions, reducing drift over long horizons.
  • Plug‑and‑play architecture: GeRo can be attached to existing VLA backbones without redesigning the perception or control stacks.
  • Reinforcement‑learning integration: Combines generative rollouts with RL fine‑tuning, yielding state‑of‑the‑art performance on the Bench2Drive benchmark (+15.7 driving score, +26.2% success rate).
  • Zero‑shot robustness: Demonstrates that language‑conditioned reasoning improves performance on unseen traffic configurations and weather conditions.

Methodology

  1. Tokenization of dynamics:

    • Multi‑camera images and vehicle state are processed by a vision encoder.
    • A language encoder ingests a textual description of the current scenario (e.g., “a pedestrian is crossing the crosswalk”).
    • Both modalities are fused into a shared latent token space that represents the state of every agent.
  2. Multi‑task supervision (a loss sketch follows this list):

    • Planning loss – predicts the ego’s next control command.
    • Motion loss – predicts short‑term trajectories for surrounding agents.
    • Language alignment loss – forces the latent tokens to be predictable from the scenario description, enabling later text‑conditioned generation.
  3. Autoregressive rollout (see the rollout sketch after this list):

    • Starting from the current latent tokens, GeRo samples the next token set conditioned on a scenario prompt (e.g., “the traffic light turns red”).
    • The newly generated tokens are fed back into the model to produce the next step, repeating for a desired horizon.
  4. Rollout‑consistency loss:

    • During training, the model is asked to reconstruct ground‑truth future tokens or pseudo‑labels generated by a teacher network.
    • This loss penalizes divergence between the generated rollout and the reference, keeping the language‑action alignment stable over many steps.
  5. Reinforcement learning fine‑tuning:

    • The generative rollouts serve as a simulator for policy improvement.
    • Standard RL objectives (e.g., collision avoidance, lane‑keeping) are optimized on top of the pretrained VLA+GeRo stack.
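
To make steps 1–2 concrete, here is a minimal PyTorch‑style sketch of the three supervision terms. Every name in it (vision_enc, lang_enc, fuse, plan_head, motion_head, text_head, and the batch keys) is a hypothetical placeholder chosen for illustration, not the authors' released code.

```python
# Hypothetical sketch of GeRo-style multi-task supervision (steps 1-2).
# Module names, batch keys, and loss choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def training_step(batch, vision_enc, lang_enc, fuse,
                  plan_head, motion_head, text_head):
    # Step 1: tokenize dynamics - fuse camera/ego features with the scenario
    # text into a shared latent token space covering every agent.
    vis_tokens = vision_enc(batch["images"], batch["ego_state"])
    txt_tokens = lang_enc(batch["scenario_text"])   # e.g. "a pedestrian is crossing"
    latent = fuse(vis_tokens, txt_tokens)           # [B, num_agents, D]

    # Planning loss: predict the ego's next control command.
    plan_loss = F.l1_loss(plan_head(latent), batch["ego_action"])
    # Motion loss: predict short-term trajectories of surrounding agents.
    motion_loss = F.l1_loss(motion_head(latent), batch["agent_trajectories"])
    # Language-alignment loss: latent tokens should be predictable from the
    # scenario description alone, enabling later text-conditioned generation.
    align_loss = F.mse_loss(text_head(txt_tokens), latent.detach())

    return plan_loss + motion_loss + align_loss
```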
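
Steps 3–4 boil down to a sampling loop plus a drift penalty. The sketch below assumes a model exposing a hypothetical next_tokens(latent, prompt) call; reference_tokens stands for the ground‑truth or teacher pseudo‑labels mentioned in step 4.

```python
# Illustrative rollout loop and rollout-consistency loss (steps 3-4).
# `model.next_tokens` and the tensor shapes are assumptions, not the paper's API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_rollout(model, latent, prompt_tokens, horizon):
    """Autoregressively imagine `horizon` future scenes under a scenario prompt."""
    states = [latent]
    for _ in range(horizon):
        latent = model.next_tokens(latent, prompt_tokens)  # sample next token set
        states.append(latent)                              # feed back for the next step
    return torch.stack(states, dim=1)                      # [B, horizon + 1, num_agents, D]

def rollout_consistency_loss(model, latent, prompt_tokens, reference_tokens):
    """Penalize divergence between generated tokens and reference future tokens."""
    total, steps = 0.0, reference_tokens.shape[1]
    for t in range(steps):
        latent = model.next_tokens(latent, prompt_tokens)
        total = total + F.mse_loss(latent, reference_tokens[:, t])
    return total / steps
```

For step 5, a loop like generate_rollout can stand in for a simulator: the RL fine‑tuning stage rewards the policy's actions (collision avoidance, lane keeping, and similar objectives) inside these imagined futures rather than in a hand‑built environment.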

Results & Findings

| Metric | Baseline VLA | VLA + GeRo (open‑loop) | VLA + GeRo (closed‑loop) |
|---|---|---|---|
| Driving score (Bench2Drive) | 62.3 | 78.0 (+15.7) | 84.5 |
| Success rate (complete route) | 48% | 74% (+26.2) | 81% |
| Zero‑shot performance (new weather) | 55% | 70% | 76% |

  • Temporal consistency: Generated rollouts stay coherent for up to 10 s of simulated driving, far longer than previous VLA rollouts that collapse after a few seconds.
  • Interpretability: The model can output natural‑language explanations for its actions (e.g., “I slow down because the pedestrian is about to cross”), which were validated by human evaluators as accurate 82 % of the time.
  • RL synergy: Adding RL on top of GeRo improves closed‑loop safety metrics (collision rate ↓ 34 %) without sacrificing the generative capabilities.

Practical Implications

  • Safer simulation‑in‑the‑loop testing: Developers can use GeRo to generate realistic, language‑guided traffic scenarios on‑the‑fly, reducing the need for hand‑crafted test maps.
  • Explainable autonomous agents: The natural‑language responses give engineers and regulators a readable audit trail for why a particular maneuver was chosen.
  • Rapid prototyping of new policies: Because GeRo works as a plug‑in, existing perception‑planning stacks can be upgraded to support long‑horizon reasoning with minimal code changes (see the sketch after this list).
  • Zero‑shot adaptation: Fleet operators can issue high‑level textual updates (“treat school zones as high‑risk”) and have the model instantly adjust its behavior without retraining the perception layers.
  • Multi‑agent coordination: The generative rollout can be extended to predict cooperative maneuvers (e.g., merging) by conditioning on joint scenario descriptions, opening doors to V2X‑enabled planning.
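
As a rough illustration of the plug‑and‑play claim, the sketch below wraps an existing VLA backbone with a separate rollout head. GeRoWrapper, rollout_head, and the call signatures are hypothetical stand‑ins, not the paper's actual interface.

```python
# Hypothetical wrapper showing how a GeRo-style rollout head could be attached
# to an existing VLA backbone without touching its perception or control stacks.
import torch

class GeRoWrapper(torch.nn.Module):
    def __init__(self, vla_backbone, rollout_head):
        super().__init__()
        self.backbone = vla_backbone      # existing stack, reused as-is
        self.rollout_head = rollout_head  # new: autoregressive scene generator

    def forward(self, obs, scenario_text, horizon=0):
        # Unchanged VLA behavior: latent agent tokens plus the planned ego action.
        latent, action = self.backbone(obs, scenario_text)
        if horizon == 0:
            return action                 # plain imitation-style planning
        futures = [latent]
        for _ in range(horizon):          # optional language-conditioned rollout
            latent = self.rollout_head(latent, scenario_text)
            futures.append(latent)
        return action, futures
```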

Limitations & Future Work

  • Scalability of token length: Autoregressive generation becomes computationally heavy for horizons beyond ~15 s; future work could explore hierarchical rollouts or diffusion‑based generation.
  • Reliance on high‑quality language annotations: The current training set uses curated scenario captions; scaling to noisy, crowd‑sourced descriptions may require robust language grounding techniques.
  • Domain gap to real‑world sensor noise: Bench2Drive is a simulated benchmark; transferring GeRo to real‑world fleets will need additional domain‑adaptation strategies (e.g., self‑supervised fine‑tuning).
  • Multi‑modal extensions: Incorporating lidar or radar tokens could improve robustness in adverse weather, a direction the authors plan to investigate.

Bottom line: GeRo shows that treating an autonomous driving model as a generative, language‑conditioned reasoning engine can boost safety, interpretability, and adaptability—qualities that are increasingly demanded by developers building the next generation of self‑driving systems.

Authors

  • Rajeev Yasarla
  • Deepti Hegde
  • Shizhong Han
  • Hsin-Pai Cheng
  • Yunxiao Shi
  • Meysam Sadeghigooghari
  • Shweta Mahajan
  • Apratim Bhattacharyya
  • Litian Liu
  • Risheek Garrepalli
  • Thomas Svantesson
  • Fatih Porikli
  • Hong Cai

Paper Information

  • arXiv ID: 2601.11475v1
  • Categories: cs.CV
  • Published: January 16, 2026