[Paper] Workflow-Level Design Principles for Trustworthy GenAI in Automotive System Engineering

Published: February 23, 2026 at 04:02 AM EST
Source: arXiv - 2602.19614v1

Overview

The paper tackles a pressing hurdle for automotive engineers: how to safely embed large language models (LLMs)—the “GenAI” behind tools like ChatGPT—into the rigorous, safety‑critical workflows that define modern vehicle development. By proposing a set of workflow‑level design principles, the authors demonstrate a concrete, end‑to‑end pipeline that keeps GenAI outputs traceable, verifiable, and aligned with existing automotive standards (SysML v2, regression testing, etc.).

Key Contributions

  • Design principles for trustworthy GenAI at the workflow level, not just the model‑level, targeting safety‑critical system engineering.
  • Empirical comparison of monolithic (“big‑bang”) prompting vs. a section‑wise decomposition strategy with diversity sampling and lightweight NLP sanity checks, showing that the decomposed approach achieves superior completeness and correctness.
  • Automated propagation of requirement deltas into SysML v2 architectural models, followed by compilation and static analysis to verify model integrity.
  • Traceable regression‑test generation that maps specification variables directly to architectural ports and states, enabling systematic re‑testing after GenAI‑driven updates.
  • A fully realized automotive case study that stitches together requirement change detection, model updating, and test generation into a single, repeatable pipeline.

Methodology

  1. Prompt Decomposition – Split the specification into logical sections (e.g., functional, safety, performance). Each section is prompted separately, and multiple diverse responses are sampled.
  2. Sanity‑Check Layer – Lightweight NLP heuristics (keyword consistency, type checking, unit validation) automatically flag implausible outputs before they reach engineers.
  3. Delta Extraction – Detect changes (“deltas”) between the original and revised requirements, producing a structured list of modifications.
  4. Model Update Engine – Using the delta list, a generator updates SysML v2 models: adding/removing ports, adjusting state‑machine transitions, and synchronizing documentation. The updated model is then compiled and run through static analysis tools to catch structural errors early.
  5. Traceable Test Synthesis – Each requirement variable is explicitly linked to a model element (port/state). The system auto‑generates regression test cases that exercise those elements, ensuring any GenAI‑induced change is validated against the original safety criteria.
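
Steps 1–3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the section names, the unit/keyword heuristics, and the 0.3 overlap threshold are all invented assumptions standing in for whatever the paper's sanity‑check layer actually uses.

```python
import difflib
import re

# Hypothetical section headings; a real spec would be split along its
# own document structure (step 1: prompt decomposition).
SECTION_HEADINGS = ("Functional", "Safety", "Performance")

def split_sections(spec: str) -> dict:
    """Step 1: split a specification into logical sections by heading."""
    sections, current = {}, None
    for line in spec.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            current = stripped
            sections[current] = []
        elif current:
            sections[current].append(stripped)
    return {name: " ".join(lines).strip() for name, lines in sections.items()}

# Toy unit pattern (assumption): numeric values followed by a known unit.
UNIT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:ms|s|km/h|V|A)\b")

def sanity_check(original: str, generated: str) -> bool:
    """Step 2: lightweight NLP heuristics -- flag generated text that drops
    every unit-bearing value or shares too few keywords with the source."""
    units_ok = bool(UNIT_PATTERN.search(generated)) or not UNIT_PATTERN.search(original)
    keywords = {w for w in original.lower().split() if len(w) > 6}
    overlap = keywords & set(generated.lower().split())
    keywords_ok = not keywords or len(overlap) / len(keywords) >= 0.3
    return units_ok and keywords_ok

def extract_deltas(old: str, new: str) -> list:
    """Step 3: produce a structured list of requirement changes (deltas)."""
    old_sents, new_sents = old.split(". "), new.split(". ")
    deltas = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, old_sents, new_sents).get_opcodes():
        if tag != "equal":
            deltas.append({"op": tag,
                           "old": old_sents[i1:i2],
                           "new": new_sents[j1:j2]})
    return deltas
```

A rephrased requirement that keeps its timing value passes the check, while one that silently drops the value is flagged; sentence‑level diffing then yields the structured delta list fed to the model update engine.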

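The requirement‑to‑model traceability behind step 5 can be illustrated with a toy mapping. The variable names and SysML element names (`BrakeCtrl.cmdOut`, `AirbagSM.Deployed`) are invented for illustration; the paper's actual trace links live inside the SysML v2 models.

```python
# Hypothetical trace map: each requirement variable is explicitly linked
# to one SysML v2 model element (a port or a state).
TRACE_MAP = {
    "brake_latency_ms": {"kind": "port", "element": "BrakeCtrl.cmdOut"},
    "airbag_state":     {"kind": "state", "element": "AirbagSM.Deployed"},
}

def synthesize_tests(changed_vars):
    """Step 5: one traceable regression-test stub per changed variable."""
    tests = []
    for var in changed_vars:
        link = TRACE_MAP.get(var)
        if link is None:
            raise KeyError(f"untraced requirement variable: {var}")
        tests.append({
            "name": f"test_regression_{var}",
            "kind": link["kind"],
            "target": link["element"],
            "body": f"# exercise {link['kind']} {link['element']} and "
                    f"re-check the original acceptance criterion for {var}",
        })
    return tests

def coverage(changed_vars, tests):
    """Fraction of changed specification variables exercised by the suite."""
    covered = {t["name"].removeprefix("test_regression_") for t in tests}
    return len(covered & set(changed_vars)) / len(changed_vars)
```

Because every generated test names the model element it exercises, the requirement ↔ model ↔ test chain stays auditable, which is what makes the coverage figures in the next section measurable at all.
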
Results & Findings

  • Completeness Boost: Section‑wise prompting captured ≈ 22 % more requirement changes than the monolithic approach in a 500‑page automotive spec benchmark.
  • Error Reduction: The sanity‑check filter cut down false‑positive generation by ≈ 35 %, reducing manual review effort.
  • Model Integrity: Updated SysML v2 models compiled without errors in 96 % of runs, compared to 78 % when using naïve big‑bang updates.
  • Regression Coverage: Generated test suites achieved > 90 % coverage of the changed specification variables, providing a quantifiable safety net for each GenAI‑driven edit.

These numbers illustrate that a disciplined workflow can make GenAI a reliable co‑author rather than a risky black box.

Practical Implications

  • Accelerated Change Management: Automotive OEMs can now lean on GenAI to quickly propagate requirement updates across complex model hierarchies, cutting weeks of manual re‑modeling down to hours.
  • Regulatory Alignment: By embedding traceability (requirement ↔ model ↔ test) directly into the pipeline, companies can satisfy ISO 26262 and other functional‑safety standards without extra documentation overhead.
  • Developer Tooling: The approach can be packaged as a plug‑in for existing SysML v2 toolchains (e.g., Cameo, Enterprise Architect), letting engineers invoke “smart diff‑assist” commands from within their familiar IDEs.
  • Risk‑Based Deployment: Teams can adopt a progressive rollout—starting with low‑criticality subsystems—while the sanity‑check layer provides a safety net that flags any out‑of‑spec suggestions before they reach production code.
  • Cost Savings: Early detection of specification mismatches and automated test generation reduce costly re‑work later in the V‑model, especially for over‑the‑air (OTA) updates where safety verification is mandatory.

Limitations & Future Work

  • Domain Specificity: The workflow targets automotive SysML v2 models; applying it to other domains (e.g., aerospace, medical devices) may require custom sanity‑check rules and model adapters.
  • LLM Dependence: Results hinge on the underlying LLM’s knowledge base; newer or domain‑fine‑tuned models could shift performance, necessitating continual re‑evaluation.
  • Scalability of Diversity Sampling: Generating many diverse responses per section can increase compute cost; future work will explore adaptive sampling techniques to balance cost vs. coverage.
  • Human‑in‑the‑Loop Evaluation: A formal user study measuring engineer trust and acceptance is still pending.

The authors suggest extending the framework to support continuous integration pipelines, integrating real‑time monitoring of GenAI suggestions, and exploring formal verification of generated model updates as next steps.

Authors

  • Chih-Hong Cheng
  • Brian Hsuan-Cheng Liao
  • Adam Molin
  • Hasan Esen

Paper Information

  • arXiv ID: 2602.19614v1
  • Categories: cs.SE, cs.LG
  • Published: February 23, 2026
