[Paper] An LLM-driven Scenario Generation Pipeline Using an Extended Scenic DSL for Autonomous Driving Safety Validation

Published: February 24, 2026 at 02:44 AM EST
4 min read

Source: arXiv - 2602.20644v1

Overview

A new research pipeline shows how to turn messy, real‑world crash reports—text descriptions plus hand‑drawn sketches—into fully executable autonomous‑driving test scenarios. By coupling GPT‑4o mini with an extended version of the Scenic domain‑specific language (DSL), the authors automate the extraction of high‑level semantics and the generation of reliable simulation inputs, dramatically easing the validation workload for autonomous‑driving systems (ADS).

Key Contributions

  • LLM‑augmented parsing – Uses GPT‑4o mini to interpret multimodal crash reports (text + sketches) and produce a structured, probabilistic Scenic representation.
  • Extended Scenic DSL – Introduces new constructs for road‑network attributes, traffic‑rule “oracles,” and stochastic actor trajectories, bridging the gap between natural‑language intent and low‑level simulator commands.
  • Two‑stage pipeline – Separates semantic understanding (LLM) from concrete scenario rendering (Scenic → CARLA), reducing error propagation compared with end‑to‑end text‑to‑scenario methods.
  • Comprehensive evaluation – Validated on NHTSA CIREN crash cases, achieving near‑perfect extraction accuracy (100 % for environment/network, >97 % for oracles and trajectories).
  • Scalable stress testing – Generated 2,000 variations of each scenario; all triggered the intended traffic‑rule violations when run with the Autoware stack in CARLA.
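To make the first contribution concrete, here is a minimal sketch of what the LLM parser's structured output might look like. The field names, classes, and values below are illustrative assumptions, not the paper's actual schema; the real pipeline emits an extended Scenic representation rather than Python objects.

```python
from dataclasses import dataclass, field

@dataclass
class ActorSpec:
    role: str            # e.g. "ego" or "adversary" (hypothetical labels)
    lane: str            # lane identifier on the road network
    speed_mps: tuple     # (low, high) range, sampled later for variants
    heading_deg: tuple   # (low, high) range of initial headings

@dataclass
class ScenarioSpec:
    weather: str         # e.g. "clear", "rain"
    road_layout: str     # e.g. "two-lane rural road"
    oracle: str          # name of the rule violation to detect
    actors: list = field(default_factory=list)

# Hypothetical extraction result for a head-on drift crash report
spec = ScenarioSpec(
    weather="clear",
    road_layout="two-lane rural road",
    oracle="crosses_opposite_lane",
    actors=[
        ActorSpec("ego", "lane_0", (10.0, 15.0), (-2.0, 2.0)),
        ActorSpec("adversary", "lane_1", (12.0, 20.0), (175.0, 185.0)),
    ],
)
```

The key point is that uncertain quantities (speeds, headings) are kept as ranges rather than point values, which is what later enables large-scale variation sampling.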

Methodology

  1. Data Ingestion – Each crash report is fed to GPT‑4o mini together with its accompanying sketch. The model is prompted to identify key entities (road layout, weather, vehicle states) and to express uncertainties probabilistically.
  2. Intermediate Representation – The extracted semantics are encoded in an Extended Scenic DSL. This DSL adds:
    • RoadNetwork objects with lane markings, traffic‑light locations, and legal maneuvers.
    • Oracle predicates that capture the safety violation (e.g., “vehicle crosses opposite lane”).
    • Stochastic actor definitions that model the range of possible speeds, headings, and reaction times observed in the real crash.
  3. Scenario Synthesis – A Scenic interpreter translates the DSL script into concrete simulation assets for the CARLA simulator (maps, vehicle models, sensor suites).
  4. Execution & Verification – The generated scenario runs with the open‑source Autoware driving stack. Sensors feed into Autoware, which then attempts to navigate the scene. A post‑run validator checks whether the predefined oracle condition was met.
  5. Variation Generation – By sampling the probabilistic parameters in the Scenic script, thousands of realistic variants are automatically produced, enabling large‑scale safety testing.
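The five steps above can be sketched as a two-stage skeleton: a semantic-extraction front end (the LLM) feeding a deterministic rendering and verification back end. Every function name and stubbed body here is an assumption for illustration; the real pipeline calls GPT‑4o mini and a Scenic interpreter targeting CARLA.

```python
def extract_semantics(report_text, sketch):
    """Stage 1 (stubbed): prompt an LLM with the multimodal crash report
    and return structured entities with probabilistic parameters."""
    return {
        "weather": "clear",
        "oracle": "crosses_opposite_lane",
        "ego_speed_mps": ("uniform", 10.0, 15.0),
    }

def render_scenic(semantics):
    """Stage 2: deterministically translate the semantics into a
    Scenic-style script (shown as a string for illustration)."""
    lo, hi = semantics["ego_speed_mps"][1:]
    return f"ego = new Car with speed Range({lo}, {hi})"

def run_and_verify(script):
    """Stages 3-4 (stubbed): synthesize the scenario in the simulator,
    drive it with the ADS under test, and check the oracle condition."""
    return True  # whether the predefined violation was observed

semantics = extract_semantics("Vehicle drifted across the center line...", sketch=None)
script = render_scenic(semantics)
violation_detected = run_and_verify(script)
```

Separating the noisy LLM stage from the deterministic rendering stage is what lets the intermediate script be inspected and audited before any simulation runs.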

Results & Findings

Accuracy vs. human ground truth:

  • Environmental & road‑network attributes: 100 %
  • Oracle (rule‑violation) extraction: 97 %
  • Actor trajectory extraction: 98 %

When executed in CARLA with Autoware, every generated variant reproduced the target violation (e.g., opposite‑lane crossing, red‑light run). The pipeline is both legally grounded (the DSL captures the same regulatory language used in crash reports) and verifiable: the intermediate Scenic script can be inspected and audited before simulation.
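As a rough illustration of the post-run oracle check, the predicate below flags an opposite-lane crossing from a trajectory log. The log format `(x, y, lane_id)` and the predicate name are assumptions for this sketch, not the paper's actual validator API.

```python
def crosses_opposite_lane(trajectory, ego_lanes=frozenset({"lane_0"})):
    """Return True if the ego vehicle ever occupied a lane outside its
    legal set of lanes at any logged timestep."""
    return any(lane not in ego_lanes for _, _, lane in trajectory)

# Hypothetical per-timestep poses logged by the simulator
log = [(0.0, 0.0, "lane_0"), (5.0, 0.2, "lane_0"), (10.0, 1.8, "lane_1")]
print(crosses_opposite_lane(log))  # True: the ego entered lane_1
```

Because the oracle is a simple predicate over logged state, it can be re-run offline against any recorded simulation, which is what makes the verification step auditable.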

Practical Implications

  • Accelerated Safety Validation – Engineers can ingest existing crash databases and instantly obtain a library of realistic test cases, cutting weeks of manual scenario authoring.
  • Regulatory Alignment – Because the DSL mirrors legal descriptions of traffic rules, generated scenarios can be directly referenced in compliance reports or safety cases.
  • Stress‑Testing at Scale – The probabilistic DSL enables systematic exploration of “what‑if” variations (different weather, driver reaction times) without hand‑crafting each case.
  • Toolchain Integration – The pipeline plugs into existing simulation stacks (CARLA, LGSVL) and open‑source autonomy stacks (Autoware, Apollo), making adoption straightforward for developers.
  • Reduced Human Error – By delegating the noisy text‑to‑semantic translation to an LLM and keeping a deterministic Scenic rendering step, the approach mitigates the misinterpretations that plagued earlier end‑to‑end generators like ScenicNL or LCTGen.
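The stress-testing idea can be sketched in a few lines: sample the probabilistic parameters of one scenario spec to mint thousands of concrete variants. Parameter names and ranges here are illustrative assumptions; the real pipeline samples from distributions declared in the Scenic script.

```python
import random

def sample_variant(rng):
    """Draw one concrete parameterization from hypothetical ranges."""
    return {
        "ego_speed_mps": rng.uniform(10.0, 15.0),
        "reaction_time_s": rng.uniform(0.5, 2.0),
        "weather": rng.choice(["clear", "rain", "fog"]),
    }

rng = random.Random(42)  # seeded for reproducible test suites
variants = [sample_variant(rng) for _ in range(2000)]
```

A fixed seed makes the generated suite reproducible, so a regression in the ADS can be replayed against the exact same 2,000 variants.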

Limitations & Future Work

  • LLM Hallucinations – Although accuracy was high on the evaluated CIREN set, the system still depends on the LLM’s reliability; rare mis‑extractions could propagate into unsafe test scenarios.
  • Sketch Interpretation – The current pipeline treats sketches as auxiliary cues; a more robust vision‑based parser could capture finer geometric details.
  • Domain Generalization – Validation was limited to US‑centric crash reports; extending to other jurisdictions with different traffic rules may require DSL extensions.
  • Closed‑Loop Testing – The study focused on triggering rule violations; future work could incorporate adaptive adversarial actors that react to the ADS in real time.

Bottom line: By marrying a powerful LLM with a probabilistic Scenic DSL, this work offers a practical, scalable route for developers to transform legacy crash data into high‑fidelity, verifiable simulation scenarios—an essential step toward safer, more trustworthy autonomous vehicles.

Authors

  • Fida Khandaker Safa
  • Yupeng Jiang
  • Xi Zheng

Paper Information

  • arXiv ID: 2602.20644v1
  • Categories: cs.SE
  • Published: February 24, 2026