[Paper] Workflow-Level Design Principles for Trustworthy GenAI in Automotive System Engineering

Published: February 23, 2026 at 04:02 AM EST
Source: arXiv - 2602.19614v1

Overview

The paper tackles a pressing hurdle for automotive engineers: how to safely embed large language models (LLMs)—the “GenAI” behind tools like ChatGPT—into the rigorous, safety‑critical workflows that define modern vehicle development. By proposing a set of workflow‑level design principles, the authors demonstrate a concrete, end‑to‑end pipeline that keeps GenAI outputs traceable, verifiable, and aligned with existing automotive standards (SysML v2, regression testing, etc.).

Key Contributions

  • Design principles for trustworthy GenAI at the workflow level, not just the model‑level, targeting safety‑critical system engineering.
  • Empirical comparison of monolithic (“big‑bang”) prompting vs. a section‑wise decomposition strategy with diversity sampling and lightweight NLP sanity checks, showing that the decomposed approach achieves superior completeness and correctness.
  • Automated propagation of requirement deltas into SysML v2 architectural models, followed by compilation and static analysis to verify model integrity.
  • Traceable regression‑test generation that maps specification variables directly to architectural ports and states, enabling systematic re‑testing after GenAI‑driven updates.
  • A fully realized automotive case study that stitches together requirement change detection, model updating, and test generation into a single, repeatable pipeline.

Methodology

  1. Prompt Decomposition – Split the specification into logical sections (e.g., functional, safety, performance). Each section is prompted separately, and multiple diverse responses are sampled.
  2. Sanity‑Check Layer – Lightweight NLP heuristics (keyword consistency, type checking, unit validation) automatically flag implausible outputs before they reach engineers.
  3. Delta Extraction – Detect changes (“deltas”) between the original and revised requirements, producing a structured list of modifications.
  4. Model Update Engine – Using the delta list, a generator updates SysML v2 models: adding/removing ports, adjusting state‑machine transitions, and synchronizing documentation. The updated model is then compiled and run through static analysis tools to catch structural errors early.
  5. Traceable Test Synthesis – Each requirement variable is explicitly linked to a model element (port/state). The system auto‑generates regression test cases that exercise those elements, ensuring any GenAI‑induced change is validated against the original safety criteria.
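
Steps 1–3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the section names, the unit/keyword heuristics, and the 0.3 overlap threshold are all invented assumptions standing in for whatever the paper's sanity‑check layer actually uses.

```python
import difflib
import re

# Hypothetical section headings; a real spec would be split along its
# own document structure (step 1: prompt decomposition).
SECTION_HEADINGS = ("Functional", "Safety", "Performance")

def split_sections(spec: str) -> dict:
    """Step 1: split a specification into logical sections by heading."""
    sections, current = {}, None
    for line in spec.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            current = stripped
            sections[current] = []
        elif current:
            sections[current].append(stripped)
    return {name: " ".join(lines).strip() for name, lines in sections.items()}

# Toy unit pattern (assumption): numeric values followed by a known unit.
UNIT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:ms|s|km/h|V|A)\b")

def sanity_check(original: str, generated: str) -> bool:
    """Step 2: lightweight NLP heuristics -- flag generated text that drops
    every unit-bearing value or shares too few keywords with the source."""
    units_ok = bool(UNIT_PATTERN.search(generated)) or not UNIT_PATTERN.search(original)
    keywords = {w for w in original.lower().split() if len(w) > 6}
    overlap = keywords & set(generated.lower().split())
    keywords_ok = not keywords or len(overlap) / len(keywords) >= 0.3
    return units_ok and keywords_ok

def extract_deltas(old: str, new: str) -> list:
    """Step 3: produce a structured list of requirement changes (deltas)."""
    old_sents, new_sents = old.split(". "), new.split(". ")
    deltas = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, old_sents, new_sents).get_opcodes():
        if tag != "equal":
            deltas.append({"op": tag,
                           "old": old_sents[i1:i2],
                           "new": new_sents[j1:j2]})
    return deltas
```

A rephrased requirement that keeps its timing value passes the check, while one that silently drops the value is flagged; sentence‑level diffing then yields the structured delta list fed to the model update engine.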

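The requirement‑to‑model traceability behind step 5 can be illustrated with a toy mapping. The variable names and SysML element names (`BrakeCtrl.cmdOut`, `AirbagSM.Deployed`) are invented for illustration; the paper's actual trace links live inside the SysML v2 models.

```python
# Hypothetical trace map: each requirement variable is explicitly linked
# to one SysML v2 model element (a port or a state).
TRACE_MAP = {
    "brake_latency_ms": {"kind": "port", "element": "BrakeCtrl.cmdOut"},
    "airbag_state":     {"kind": "state", "element": "AirbagSM.Deployed"},
}

def synthesize_tests(changed_vars):
    """Step 5: one traceable regression-test stub per changed variable."""
    tests = []
    for var in changed_vars:
        link = TRACE_MAP.get(var)
        if link is None:
            raise KeyError(f"untraced requirement variable: {var}")
        tests.append({
            "name": f"test_regression_{var}",
            "kind": link["kind"],
            "target": link["element"],
            "body": f"# exercise {link['kind']} {link['element']} and "
                    f"re-check the original acceptance criterion for {var}",
        })
    return tests

def coverage(changed_vars, tests):
    """Fraction of changed specification variables exercised by the suite."""
    covered = {t["name"].removeprefix("test_regression_") for t in tests}
    return len(covered & set(changed_vars)) / len(changed_vars)
```

Because every generated test names the model element it exercises, the requirement ↔ model ↔ test chain stays auditable, which is what makes the coverage figures in the next section measurable at all.
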
Results & Findings

  • Completeness Boost: Section‑wise prompting captured ≈ 22 % more requirement changes than the monolithic approach in a 500‑page automotive spec benchmark.
  • Error Reduction: The sanity‑check filter cut down false‑positive generation by ≈ 35 %, reducing manual review effort.
  • Model Integrity: Updated SysML v2 models compiled without errors in 96 % of runs, compared to 78 % when using naïve big‑bang updates.
  • Regression Coverage: Generated test suites achieved > 90 % coverage of the changed specification variables, providing a quantifiable safety net for each GenAI‑driven edit.

These numbers illustrate that a disciplined workflow can make GenAI a reliable co‑author rather than a risky black box.

Practical Implications

  • Accelerated Change Management: Automotive OEMs can now lean on GenAI to quickly propagate requirement updates across complex model hierarchies, cutting weeks of manual re‑modeling down to hours.
  • Regulatory Alignment: By embedding traceability (requirement ↔ model ↔ test) directly into the pipeline, companies can satisfy ISO 26262 and other functional‑safety standards without extra documentation overhead.
  • Developer Tooling: The approach can be packaged as a plug‑in for existing SysML v2 toolchains (e.g., Cameo, Enterprise Architect), letting engineers invoke “smart diff‑assist” commands from within their familiar IDEs.
  • Risk‑Based Deployment: Teams can adopt a progressive rollout—starting with low‑criticality subsystems—while the sanity‑check layer provides a safety net that flags any out‑of‑spec suggestions before they reach production code.
  • Cost Savings: Early detection of specification mismatches and automated test generation reduce costly re‑work later in the V‑model, especially for over‑the‑air (OTA) updates where safety verification is mandatory.

Limitations & Future Work

  • Domain Specificity: The workflow targets automotive SysML v2 models; applying it to other domains (e.g., aerospace, medical devices) may require custom sanity‑check rules and model adapters.
  • LLM Dependence: Results hinge on the underlying LLM’s knowledge base; newer or domain‑fine‑tuned models could shift performance, necessitating continual re‑evaluation.
  • Scalability of Diversity Sampling: Generating many diverse responses per section can increase compute cost; future work will explore adaptive sampling techniques to balance cost vs. coverage.
  • Human‑in‑the‑Loop Evaluation: A formal user study measuring engineer trust and acceptance is still pending.

The authors suggest extending the framework to support continuous integration pipelines, integrating real‑time monitoring of GenAI suggestions, and exploring formal verification of generated model updates as next steps.

Authors

  • Chih-Hong Cheng
  • Brian Hsuan-Cheng Liao
  • Adam Molin
  • Hasan Esen

Paper Information

  • arXiv ID: 2602.19614v1
  • Categories: cs.SE, cs.LG
  • Published: February 23, 2026
