[Paper] Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study

Published: March 10, 2026 at 04:10 AM EDT
Source: arXiv - 2603.09335v1

Overview

The paper investigates whether ChatGPT—without any exposure to proprietary requirement documents—can be used to generate synthetic system requirement specifications (SSyRSs) that feel realistic to domain experts. By producing 300 specifications across ten industry domains and evaluating them with both automated checks and a survey of 87 professionals, the authors show that large language models can approximate real‑world requirements, but still need human vetting.

Key Contributions

  • Systematic Prompt‑Engineering Pipeline – a repeatable workflow that combines prompt patterns, LLM‑based self‑assessment, and iterative refinement to produce SSyRSs.
  • Large‑Scale Generation Study – creation of 300 synthetic specifications spanning ten distinct industry sectors.
  • Cross‑Model Validation – use of multiple LLMs to automatically flag inconsistencies and hallucinations in the generated texts.
  • Empirical Expert Evaluation – a survey of 87 practitioners revealing that 62 % of the synthetic specs were judged “realistic.”
  • Critical Insight on LLM Limitations – identification of contradictory statements and subtle quality gaps that automated checks missed, underscoring the need for human review.

Methodology

  1. Domain Selection & Prompt Design – The authors chose ten representative industries (e.g., automotive, healthcare) and crafted a set of prompt templates that ask ChatGPT to produce a full requirement specification for a fictional system.
  2. Iterative Generation Loop – Each prompt is sent to ChatGPT; the output is then fed back into the model with a “self‑assessment” prompt asking it to rate completeness, consistency, and adherence to typical requirement‑writing conventions.
  3. Cross‑Model Checks – A second LLM (e.g., GPT‑4 or Claude) reviews the same text, flagging contradictions, ambiguous phrasing, or missing non‑functional requirements.
  4. Human Expert Survey – The resulting 300 specifications are anonymized and presented to 87 industry professionals via an online questionnaire. Participants rate realism, clarity, and usefulness on Likert scales and provide free‑form comments.
  5. Data Analysis – Quantitative scores are aggregated, and qualitative feedback is coded to surface common failure modes (e.g., hallucinated stakeholder names, inconsistent functional/non‑functional requirements).
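The generate, self-assess, refine loop in steps 2 and 3 can be sketched as follows. This is a minimal illustration of the control flow, not the authors' actual pipeline: `call_llm` is a hypothetical stand-in for any chat-completion API and is stubbed here so the loop runs end to end.

```python
# Minimal sketch of the generate -> self-assess -> refine loop.
# `call_llm` is a hypothetical placeholder for a real LLM client call;
# it is stubbed so the control flow can be exercised without an API key.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., an OpenAI or Anthropic client)."""
    if "Rate the following specification" in prompt:
        return "4"  # stubbed self-assessment score on a 1-5 scale
    return "1. The system shall ..."  # stubbed specification text

def generate_spec(domain: str, max_iterations: int = 3, threshold: int = 4) -> str:
    """Iteratively generate a synthetic spec until the self-assessment passes."""
    spec = call_llm(
        f"Write a full system requirement specification for a fictional {domain} system."
    )
    for _ in range(max_iterations):
        score = int(call_llm(
            "Rate the following specification for completeness, consistency, and "
            f"adherence to requirement-writing conventions on a 1-5 scale. "
            f"Answer with a single digit.\n\n{spec}"
        ))
        if score >= threshold:
            break  # self-assessment passed; stop refining
        spec = call_llm(f"Improve this specification:\n\n{spec}")
    return spec
```

In the paper's setup, the second review pass (step 3) goes to a different model rather than back to the generator, which makes the check less prone to the generator rubber-stamping its own output.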

Results & Findings

  • Realism Rating: 62 % of the synthetic specs received a “realistic” rating (≥4 on a 5‑point scale).
  • Consistency Issues: Automated cross‑model checks caught 18 % of contradictions, but experts identified an additional 12 % that the models missed.
  • Domain Variability: Specs for highly regulated domains (healthcare, aerospace) were judged less realistic than those for less formal sectors (e‑commerce, entertainment).
  • Prompt Refinement Impact: Iterating the prompt template twice improved overall realism scores by roughly 9 %, demonstrating the value of systematic prompt tuning.
  • Hallucination Patterns: The most common hallucinations involved invented standards (e.g., “ISO 12345”) and fictitious stakeholder roles, which reduced trustworthiness.
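Invented standards of the kind noted above are easy to surface mechanically: every standard-like identifier in a spec can be extracted and queued for human verification. The sketch below is an illustrative heuristic, not the paper's cross-model check, and the regex pattern is an assumption covering common forms such as "ISO 12345" or "IEC 61508".

```python
import re

# Heuristic flagger for cited standards: extract every identifier that looks
# like a standard reference so a human reviewer can verify it actually exists.
# The pattern is illustrative and covers a few common prefixes only.
STANDARD_RE = re.compile(r"\b(?:ISO|IEC|IEEE|DO)[- ]?\d{3,5}\b")

def flag_cited_standards(spec_text: str) -> list[str]:
    """Return a sorted, de-duplicated list of standard-like citations in a spec."""
    return sorted(set(STANDARD_RE.findall(spec_text)))
```

A reviewer (or a lookup against a curated list of real standards) can then confirm or reject each flagged citation, which directly targets the most common hallucination pattern the study found.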

Practical Implications

  • Rapid Prototyping of Test Data: Teams can bootstrap requirement‑driven testing, traceability analysis, or NLP tool benchmarking without waiting for legally cleared real documents.
  • Training Data for Requirement‑Mining Models: Synthetic specs can augment scarce corpora for supervised learning in requirement classification, inconsistency detection, or automated traceability.
  • Scenario Planning & Education: Product managers and students can experiment with “what‑if” requirement sets for new domains, facilitating workshops and design‑thinking sessions.
  • Cost‑Effective Mock‑ups for Tool Vendors: Companies building requirement‑management platforms can generate realistic‑looking demo data to showcase UI/UX features to prospects.
  • Cautionary Note: Because hallucinations remain, any downstream automation (e.g., automatic test case generation) should incorporate a human validation step or a secondary LLM audit.
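The cautionary note above amounts to a gate in front of any downstream automation: a spec should pass an automated audit, and if the audit flags anything, a human must sign off before the spec is used. A minimal sketch of such a gate, with an illustrative keyword heuristic standing in for the secondary LLM audit:

```python
from dataclasses import dataclass, field

@dataclass
class AuditResult:
    """Outcome of an automated audit; empty `issues` means the spec is clean."""
    issues: list[str] = field(default_factory=list)

    @property
    def clean(self) -> bool:
        return not self.issues

def secondary_audit(spec_text: str) -> AuditResult:
    """Stand-in for a second-LLM audit; here a trivial placeholder check."""
    issues = []
    if "TBD" in spec_text:
        issues.append("unresolved placeholder")
    return AuditResult(issues)

def release_for_automation(spec_text: str, human_approved: bool = False) -> bool:
    """Allow downstream use only if the audit is clean or a human signed off."""
    return secondary_audit(spec_text).clean or human_approved
```

The point of the structure is that a flagged spec can never flow silently into test-case generation or traceability tooling; it is held until a reviewer explicitly approves it.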

Limitations & Future Work

  • Reliance on Prompt Engineering: The quality of SSyRSs hinges on manually crafted prompts; scaling to dozens of domains may require automated prompt synthesis.
  • Limited Expert Pool: The survey involved 87 participants, primarily from Europe and North America; broader cultural and regulatory contexts could affect realism judgments.
  • Static LLM Version: Results are tied to the specific ChatGPT model used; newer or fine‑tuned models might reduce hallucinations but were not examined.
  • Future Directions: The authors suggest (1) integrating domain‑specific knowledge bases to curb invented standards, (2) exploring few‑shot fine‑tuning on a small set of real requirements, and (3) building a continuous feedback loop where expert corrections are fed back into the prompt‑generation pipeline.

Authors

  • Alex R. Mattukat
  • Florian M. Braun
  • Horst Lichter

Paper Information

  • arXiv ID: 2603.09335v1
  • Categories: cs.SE
  • Published: March 10, 2026