[Paper] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies
Source: arXiv - 2512.01993v1
Overview
The paper introduces Rollouts as Demonstrations (RoaD), a lightweight technique that turns an autonomous‑driving policy’s own closed‑loop trajectories into extra training data. By mixing these self‑generated rollouts with a modest amount of expert guidance, RoaD dramatically reduces the covariate‑shift problem that plagues standard behavior‑cloning pipelines, delivering safer, more reliable driving without the massive data or compute budgets of reinforcement learning.
Key Contributions
- Closed‑loop supervised fine‑tuning without heavy RL: RoaD uses the policy’s own rollouts as pseudo‑demonstrations, sidestepping the need for expensive reward engineering or massive on‑policy data collection.
- Expert‑biased rollout generation: A lightweight expert controller nudges the policy during rollout generation, ensuring the resulting trajectories stay within the distribution of high‑quality driving behavior.
- Data efficiency: Achieves comparable or superior performance to prior closed‑loop supervised fine‑tuning (CL‑SFT) methods while using orders of magnitude less data than typical RL approaches.
- Broad applicability: Works for both modular pipelines (e.g., perception‑planning‑control stacks) and end‑to‑end neural driving models, demonstrated on two distinct simulators.
- Significant safety gains: In the high‑fidelity AlpaSim benchmark, RoaD boosts the overall driving score by 41 % and cuts collision rates by 54 %.
Methodology
- Start from a base policy trained via conventional open‑loop behavior cloning on human driving logs.
- Generate closed‑loop rollouts: Run the base policy in simulation, but intermittently inject a simple expert controller (e.g., a rule‑based planner) that gently corrects the vehicle’s trajectory toward safe, goal‑directed behavior. This hybrid execution yields realistic trajectories that still reflect the policy’s own decision‑making quirks.
- Treat rollouts as demonstrations: Record the state‑action pairs from these hybrid runs and add them to the original supervised dataset.
- Fine‑tune the policy: Perform a standard supervised learning step on the augmented dataset, letting the network learn to correct the errors it previously made when acting closed‑loop.
- Iterate (optional): The process can be repeated, progressively refining the policy as it becomes more competent at staying on safe trajectories.
The key insight is that the policy’s own mistakes become valuable training signals when they are “rescued” by the expert, providing a curriculum that gradually pushes the model toward robust closed‑loop performance.
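A minimal sketch of this loop is shown below; it is an illustration, not the authors' implementation. It assumes a simulator exposing reset()/step() hooks that return the next observation and a done flag, a policy that maps an observation tensor to an action tensor, and a simple imitation (MSE) loss; `env`, `policy`, `expert`, and the deviation-triggered action blend are hypothetical stand-ins for the paper's expert-biased rollout generation.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset


def collect_expert_biased_rollouts(env, policy, expert, episodes,
                                   blend=0.5, threshold=0.5):
    """Run the policy closed-loop and intermittently nudge its actions toward
    a lightweight rule-based expert so trajectories stay close to safe,
    goal-directed behavior. Executed (state, action) pairs are returned as
    pseudo-demonstrations."""
    states, actions = [], []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32)
            with torch.no_grad():
                a_policy = policy(obs_t)
            a_expert = torch.as_tensor(expert(obs), dtype=torch.float32)
            # Intermittent expert injection: only correct when the policy
            # drifts too far from the expert's suggestion.
            if torch.linalg.norm(a_policy - a_expert) > threshold:
                a_exec = (1 - blend) * a_policy + blend * a_expert
            else:
                a_exec = a_policy
            states.append(obs_t)
            actions.append(a_exec)  # the corrected action becomes the label
            obs, done = env.step(a_exec.numpy())
    return TensorDataset(torch.stack(states), torch.stack(actions))


def finetune(policy, bc_dataset, rollout_dataset, epochs=3, lr=1e-4):
    """Plain supervised fine-tuning on the original behavior-cloning data
    augmented with the self-generated, expert-biased rollouts."""
    loader = DataLoader(ConcatDataset([bc_dataset, rollout_dataset]),
                        batch_size=256, shuffle=True)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a in loader:
            loss = nn.functional.mse_loss(policy(s), a)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

In the iterated variant, the two functions alternate for a few rounds, optionally lowering `blend` and `threshold` as the policy improves; real driving models would swap the MSE loss for whatever trajectory or waypoint loss the base policy was trained with.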
Results & Findings
| Benchmark | Baseline (BC) | Prior CL‑SFT | RoaD (this work) |
|---|---|---|---|
| WOSAC (large‑scale traffic sim) | – | Reference performance | Matches or exceeds prior CL‑SFT with far fewer fine‑tuning samples |
| AlpaSim (high‑fidelity end‑to‑end) | Driving score: 0.62, Collisions: 0.18 | – | Score: 0.88 (+41 %), Collisions: 0.08 (‑54 %) |
- Data efficiency: RoaD required roughly 1/10th the amount of fine‑tuning data that prior CL‑SFT needed to reach similar safety metrics.
- Training time: Because the method stays within the supervised learning regime, fine‑tuning converged in a few epochs on a single GPU, whereas RL typically requires days on multi‑GPU clusters.
- Generalization: The policy retained its ability to handle diverse traffic scenarios, indicating that the expert‑biased rollouts did not over‑fit to a narrow set of situations.
Practical Implications
- Faster iteration cycles: Development teams can improve a driving stack’s closed‑loop robustness with a few hours of simulation and a modest compute budget, dramatically shortening the validation loop.
- Lower data collection costs: Instead of gathering massive on‑vehicle logs or running costly RL simulations, engineers can reuse existing behavior‑cloning datasets and augment them with cheap, rule‑based expert rollouts.
- Safety certification aid: The method produces demonstrable, human‑readable trajectories that can be inspected for compliance, easing the path toward regulatory approval.
- Plug‑and‑play for existing pipelines: RoaD works with any differentiable policy (CNNs, transformers, modular controllers), making it a drop‑in fine‑tuning step for both legacy and cutting‑edge autonomous‑driving stacks (see the sketch after this list).
- Potential for continuous learning: Vehicles could periodically generate expert‑biased rollouts on‑fleet (e.g., in a shadow mode) and upload them for remote fine‑tuning, enabling data‑efficient, lifelong improvement.
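To make the drop‑in claim concrete, here is a hypothetical call site inside an existing training pipeline, reusing the sketch from the Methodology section; `sim`, `base_policy`, `rule_based_expert`, and `bc_dataset` are assumed to already exist in the host stack.

```python
# Hypothetical integration: one rollout-collection call plus one supervised
# fine-tuning call, reusing the Methodology sketch above.
rollouts = collect_expert_biased_rollouts(sim, base_policy, rule_based_expert,
                                           episodes=200)
base_policy = finetune(base_policy, bc_dataset, rollouts, epochs=3)
```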
Limitations & Future Work
- Reliance on a reasonable expert: The quality of the pseudo‑demonstrations hinges on the expert controller’s ability to keep trajectories safe yet realistic; a poorly designed expert could bias the policy toward suboptimal behavior.
- Simulation‑to‑real gap: While results are promising in high‑fidelity simulators, transferring the gains to real‑world driving may require additional domain‑adaptation steps.
- Scalability to extreme edge cases: Rare, safety‑critical scenarios (e.g., a pedestrian suddenly darting into the road) may still be under‑represented in the generated rollouts, suggesting a hybrid approach with targeted scenario generation.
- Future directions: The authors propose exploring adaptive expert weighting (more guidance when the policy is uncertain), integrating uncertainty estimation to focus rollout generation on high‑risk states, and extending RoaD to multi‑agent coordination tasks beyond single‑vehicle driving.
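As one hypothetical reading of adaptive expert weighting (an illustration of the proposed direction, not something the paper implements), the `blend` factor from the Methodology sketch could be driven by an uncertainty estimate, e.g. the action variance of a small policy ensemble:

```python
def adaptive_blend(uncertainty, low=0.1, high=0.8):
    """Hypothetical adaptive expert weighting: more expert guidance when the
    policy is uncertain, less when it is confident. `uncertainty` is assumed
    to be normalized to [0, 1], e.g. from an ensemble's action variance."""
    u = min(max(float(uncertainty), 0.0), 1.0)
    return low + (high - low) * u
```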
Authors
- Guillermo Garcia‑Cobo
- Maximilian Igl
- Peter Karkus
- Zhejun Zhang
- Michael Watson
- Yuxiao Chen
- Boris Ivanovic
- Marco Pavone
Paper Information
- arXiv ID: 2512.01993v1
- Categories: cs.RO, cs.AI, cs.CV, cs.LG
- Published: December 1, 2025
- PDF: https://arxiv.org/pdf/2512.01993v1