[Paper] Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study

Published: March 10, 2026 at 04:10 AM EDT
Source: arXiv - 2603.09335v1

Overview

The paper investigates whether ChatGPT—without any exposure to proprietary requirement documents—can be used to generate synthetic system requirement specifications (SSyRSs) that feel realistic to domain experts. By producing 300 specifications across ten industry domains and evaluating them with both automated checks and a survey of 87 professionals, the authors show that large language models can approximate real‑world requirements, but still need human vetting.

Key Contributions

  • Systematic Prompt‑Engineering Pipeline – a repeatable workflow that combines prompt patterns, LLM‑based self‑assessment, and iterative refinement to produce SSyRSs.
  • Large‑Scale Generation Study – creation of 300 synthetic specifications spanning ten distinct industry sectors.
  • Cross‑Model Validation – use of multiple LLMs to automatically flag inconsistencies and hallucinations in the generated texts.
  • Empirical Expert Evaluation – a survey of 87 practitioners revealing that 62 % of the synthetic specs were judged “realistic.”
  • Critical Insight on LLM Limitations – identification of contradictory statements and subtle quality gaps that automated checks missed, underscoring the need for human review.

Methodology

  1. Domain Selection & Prompt Design – The authors chose ten representative industries (e.g., automotive, healthcare) and crafted a set of prompt templates that ask ChatGPT to produce a full requirement specification for a fictional system.
  2. Iterative Generation Loop – Each prompt is sent to ChatGPT; the output is then fed back into the model with a “self‑assessment” prompt asking it to rate completeness, consistency, and adherence to typical requirement‑writing conventions.
  3. Cross‑Model Checks – A second LLM (e.g., GPT‑4 or Claude) reviews the same text, flagging contradictions, ambiguous phrasing, or missing non‑functional requirements.
  4. Human Expert Survey – The resulting 300 specifications are anonymized and presented to 87 industry professionals via an online questionnaire. Participants rate realism, clarity, and usefulness on Likert scales and provide free‑form comments.
  5. Data Analysis – Quantitative scores are aggregated, and qualitative feedback is coded to surface common failure modes (e.g., hallucinated stakeholder names, inconsistent functional/non‑functional requirements).
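The generate, self-assess, refine loop in steps 2 and 3 can be sketched as follows. This is a minimal illustration of the control flow, not the authors' actual pipeline: `call_llm` is a hypothetical stand-in for any chat-completion API and is stubbed here so the loop runs end to end.

```python
# Minimal sketch of the generate -> self-assess -> refine loop.
# `call_llm` is a hypothetical placeholder for a real LLM client call;
# it is stubbed so the control flow can be exercised without an API key.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., an OpenAI or Anthropic client)."""
    if "Rate the following specification" in prompt:
        return "4"  # stubbed self-assessment score on a 1-5 scale
    return "1. The system shall ..."  # stubbed specification text

def generate_spec(domain: str, max_iterations: int = 3, threshold: int = 4) -> str:
    """Iteratively generate a synthetic spec until the self-assessment passes."""
    spec = call_llm(
        f"Write a full system requirement specification for a fictional {domain} system."
    )
    for _ in range(max_iterations):
        score = int(call_llm(
            "Rate the following specification for completeness, consistency, and "
            f"adherence to requirement-writing conventions on a 1-5 scale. "
            f"Answer with a single digit.\n\n{spec}"
        ))
        if score >= threshold:
            break  # self-assessment passed; stop refining
        spec = call_llm(f"Improve this specification:\n\n{spec}")
    return spec
```

In the paper's setup, the second review pass (step 3) goes to a different model rather than back to the generator, which makes the check less prone to the generator rubber-stamping its own output.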

Results & Findings

  • Realism Rating: 62 % of the synthetic specs received a “realistic” rating (≥4 on a 5‑point scale).
  • Consistency Issues: Automated cross‑model checks caught 18 % of contradictions, but experts identified an additional 12 % that the models missed.
  • Domain Variability: Specs for highly regulated domains (healthcare, aerospace) were judged less realistic than those for less formal sectors (e‑commerce, entertainment).
  • Prompt Refinement Impact: Iterating the prompt template twice improved overall realism scores by roughly 9 %, demonstrating the value of systematic prompt tuning.
  • Hallucination Patterns: The most common hallucinations involved invented standards (e.g., “ISO 12345”) and fictitious stakeholder roles, which reduced trustworthiness.
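Invented standards of the kind noted above are easy to surface mechanically: every standard-like identifier in a spec can be extracted and queued for human verification. The sketch below is an illustrative heuristic, not the paper's cross-model check, and the regex pattern is an assumption covering common forms such as "ISO 12345" or "IEC 61508".

```python
import re

# Heuristic flagger for cited standards: extract every identifier that looks
# like a standard reference so a human reviewer can verify it actually exists.
# The pattern is illustrative and covers a few common prefixes only.
STANDARD_RE = re.compile(r"\b(?:ISO|IEC|IEEE|DO)[- ]?\d{3,5}\b")

def flag_cited_standards(spec_text: str) -> list[str]:
    """Return a sorted, de-duplicated list of standard-like citations in a spec."""
    return sorted(set(STANDARD_RE.findall(spec_text)))
```

A reviewer (or a lookup against a curated list of real standards) can then confirm or reject each flagged citation, which directly targets the most common hallucination pattern the study found.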

Practical Implications

  • Rapid Prototyping of Test Data: Teams can bootstrap requirement‑driven testing, traceability analysis, or NLP tool benchmarking without waiting for legally cleared real documents.
  • Training Data for Requirement‑Mining Models: Synthetic specs can augment scarce corpora for supervised learning in requirement classification, inconsistency detection, or automated traceability.
  • Scenario Planning & Education: Product managers and students can experiment with “what‑if” requirement sets for new domains, facilitating workshops and design‑thinking sessions.
  • Cost‑Effective Mock‑ups for Tool Vendors: Companies building requirement‑management platforms can generate realistic‑looking demo data to showcase UI/UX features to prospects.
  • Cautionary Note: Because hallucinations remain, any downstream automation (e.g., automatic test case generation) should incorporate a human validation step or a secondary LLM audit.
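The cautionary note above amounts to a gate in front of any downstream automation: a spec should pass an automated audit, and if the audit flags anything, a human must sign off before the spec is used. A minimal sketch of such a gate, with an illustrative keyword heuristic standing in for the secondary LLM audit:

```python
from dataclasses import dataclass, field

@dataclass
class AuditResult:
    """Outcome of an automated audit; empty `issues` means the spec is clean."""
    issues: list[str] = field(default_factory=list)

    @property
    def clean(self) -> bool:
        return not self.issues

def secondary_audit(spec_text: str) -> AuditResult:
    """Stand-in for a second-LLM audit; here a trivial placeholder check."""
    issues = []
    if "TBD" in spec_text:
        issues.append("unresolved placeholder")
    return AuditResult(issues)

def release_for_automation(spec_text: str, human_approved: bool = False) -> bool:
    """Allow downstream use only if the audit is clean or a human signed off."""
    return secondary_audit(spec_text).clean or human_approved
```

The point of the structure is that a flagged spec can never flow silently into test-case generation or traceability tooling; it is held until a reviewer explicitly approves it.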

Limitations & Future Work

  • Reliance on Prompt Engineering: The quality of SSyRSs hinges on manually crafted prompts; scaling to dozens of domains may require automated prompt synthesis.
  • Limited Expert Pool: The survey involved 87 participants, primarily from Europe and North America; broader cultural and regulatory contexts could affect realism judgments.
  • Static LLM Version: Results are tied to the specific ChatGPT model used; newer or fine‑tuned models might reduce hallucinations but were not examined.
  • Future Directions: The authors suggest (1) integrating domain‑specific knowledge bases to curb invented standards, (2) exploring few‑shot fine‑tuning on a small set of real requirements, and (3) building a continuous feedback loop where expert corrections are fed back into the prompt‑generation pipeline.

Authors

  • Alex R. Mattukat
  • Florian M. Braun
  • Horst Lichter

Paper Information

  • arXiv ID: 2603.09335v1
  • Categories: cs.SE
  • Published: March 10, 2026