[Paper] Behaviour Driven Development Scenario Generation with Large Language Models
Source: arXiv - 2603.04729v1
Overview
The paper evaluates how well three state‑of‑the‑art large language models (LLMs)—GPT‑4, Claude 3, and Gemini—can automatically generate Behaviour‑Driven Development (BDD) scenarios from software requirements. Using a 500‑item dataset of real user stories paired with hand‑crafted BDD scenarios, the authors identify which prompting strategies and model settings yield the most developer‑friendly output.
Key Contributions
- Large‑scale empirical dataset: 500 paired user stories, requirement descriptions, and BDD scenarios collected from four proprietary products.
- Cross‑model comparison: Systematic evaluation of GPT‑4, Claude 3, and Gemini using multiple quantitative and qualitative metrics.
- Multidimensional evaluation framework: Combines text‑level similarity, semantic similarity, LLM‑based scoring, and human expert ratings.
- Prompt engineering insights: Identifies model‑specific prompting strategies (zero‑shot, chain‑of‑thought, few‑shot) that maximize scenario quality.
- Parameter tuning guidelines: Demonstrates that temperature = 0 and top_p = 1.0 consistently produce the best BDD outputs across models.
- Correlation analysis: Shows that LLM‑based evaluators (especially DeepSeek) align more closely with human judgments than traditional similarity metrics.
Methodology
- Dataset construction – The authors extracted 500 real‑world user stories and their corresponding BDD scenarios from four in‑house software products. Each entry includes a short user story, a detailed requirement description, and a manually written BDD scenario.
- Prompt design – For each LLM they experimented with three prompting styles:
  - Zero‑shot: a single instruction without examples.
  - Chain‑of‑thought: a step‑by‑step reasoning prompt.
  - Few‑shot: a few example pairs supplied in the prompt.
  The optimal style differed per model (GPT‑4 → zero‑shot, Claude 3 → chain‑of‑thought, Gemini → few‑shot).
- Generation settings – Temperature was set to 0 (deterministic output) and top_p to 1.0 for all runs, based on preliminary sweeps.
- Evaluation – Four complementary lenses were applied:
  - Text similarity (BLEU, ROUGE) against the reference scenario.
  - Semantic similarity (sentence‑BERT cosine similarity).
  - LLM‑based scoring using separate evaluator models (e.g., DeepSeek).
  - Human expert assessment, in which BDD practitioners rated relevance, completeness, and readability.
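The prompting and generation setup above can be sketched in Python. The template wording and request‑payload shape below are illustrative assumptions, not the paper's exact prompts; only the per‑model style assignments and the temperature/top_p settings come from the paper.

```python
# Sketch of model-specific prompt construction with the paper's
# deterministic generation settings (temperature = 0, top_p = 1.0).
# Template text and payload shape are hypothetical.

FEW_SHOT_EXAMPLE = (
    "User story: As a user, I want to reset my password.\n"
    "Scenario: Password reset\n"
    "  Given a registered user on the login page\n"
    "  When they request a password reset\n"
    "  Then a reset link is sent to their email\n"
)

PROMPT_STYLES = {
    # GPT-4 performed best with a plain instruction, no examples.
    "zero_shot": "Write a Gherkin BDD scenario for this requirement:\n{req}",
    # Claude 3 performed best with step-by-step reasoning first.
    "chain_of_thought": (
        "First list the actors, preconditions, actions, and expected "
        "outcomes in this requirement, then write a Gherkin BDD scenario.\n{req}"
    ),
    # Gemini performed best with example pairs supplied in the prompt.
    "few_shot": "Example:\n" + FEW_SHOT_EXAMPLE + "\nNow do the same for:\n{req}",
}

def build_request(style: str, requirement: str) -> dict:
    """Assemble a chat-completion style payload using the paper's settings."""
    return {
        "messages": [{"role": "user",
                      "content": PROMPT_STYLES[style].format(req=requirement)}],
        "temperature": 0.0,  # deterministic output, per the paper's sweeps
        "top_p": 1.0,
    }
```

Pairing each model with its best style (e.g. `build_request("few_shot", req)` for Gemini) reproduces the configuration the study found optimal.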
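The semantic‑similarity lens reduces to cosine similarity between embedding vectors. A minimal stdlib sketch, assuming the embeddings themselves come from a sentence‑BERT model (not shown here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors,
    e.g. sentence-BERT encodings of a generated and a reference scenario."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```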
Results & Findings
| Model | Best Prompt | Text/Semantic Scores | Human Rating | LLM‑Evaluator Score |
|---|---|---|---|---|
| GPT‑4 | Zero‑shot | Highest BLEU/ROUGE | 3.8 / 5 | 4.0 / 5 |
| Claude 3 | Chain‑of‑thought | Slightly lower BLEU/ROUGE | 4.4 / 5 (top) | 4.5 / 5 |
| Gemini | Few‑shot | Mid‑range similarity | 4.0 / 5 | 4.2 / 5 |
- Claude 3 consistently earned the highest human and LLM‑evaluator scores, even though its raw text similarity lagged behind GPT‑4.
- DeepSeek’s evaluator scores correlated more strongly with human judgments (ρ ≈ 0.78) than BLEU/ROUGE (ρ ≈ 0.45).
- Input quality matters: A detailed requirement description alone yields scenarios comparable to those generated from the user story combined with the description; the terse user story alone produces poor BDD output.
- Parameter impact: Temperature = 0 and top_p = 1.0 reduced hallucinations and improved consistency across all models.
Practical Implications
- Accelerated BDD authoring – Teams can seed a BDD suite by feeding detailed requirement docs to Claude 3 (or GPT‑4 with zero‑shot prompting) and obtain high‑quality scenarios, cutting manual writing time by up to 60 % in pilot studies.
- Prompt‑template libraries – The model‑specific prompting recipes can be packaged as reusable templates in CI/CD pipelines or IDE extensions, enabling “one‑click” scenario generation.
- Quality control via LLM evaluators – Deploying a secondary LLM (e.g., DeepSeek) as an automated reviewer can flag low‑quality scenarios before they enter the test suite, reducing false positives in BDD test runs.
- Tooling integration – The findings map cleanly onto existing BDD frameworks (Cucumber, Behave). Generated Gherkin files can be directly imported, allowing rapid iteration on acceptance criteria.
- Cost‑effective scaling – Since deterministic settings (temp = 0) avoid the need for multiple sampling, organizations can keep API usage low while maintaining output quality.
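Before handing a generated scenario to an LLM reviewer (or importing it into Cucumber/Behave), a cheap structural lint can reject obviously malformed output. This is a sketch of such a gate, not something the paper implements; the function name and checks are illustrative:

```python
import re

REQUIRED_KEYWORDS = ("Given", "When", "Then")

def lint_gherkin(scenario: str) -> list[str]:
    """Return a list of structural problems in a generated Gherkin scenario.
    Intended as a cheap pre-filter before an LLM-based quality review."""
    problems = []
    if not re.search(r"^\s*Scenario(?: Outline)?:", scenario, re.MULTILINE):
        problems.append("missing 'Scenario:' header")
    for kw in REQUIRED_KEYWORDS:
        if not re.search(rf"^\s*{kw}\b", scenario, re.MULTILINE):
            problems.append(f"missing '{kw}' step")
    return problems
```

Scenarios that pass the lint can be written to a `.feature` file and run as-is; those that fail can be regenerated or escalated to the LLM evaluator.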
Limitations & Future Work
- Domain coverage – The dataset stems from four proprietary products; results may differ for domains with highly specialized vocabularies (e.g., embedded systems, finance).
- Human evaluation size – Only a limited pool of BDD experts participated, which could bias the subjective scores.
- Model updates – As LLMs evolve rapidly, the relative rankings could shift; continuous benchmarking is needed.
- Future directions suggested by the authors include: expanding the dataset to open‑source projects, exploring multimodal prompts (e.g., diagrams), and integrating reinforcement learning from human feedback to fine‑tune LLMs specifically for BDD scenario generation.
Authors
- Amila Rathnayake
- Mojtaba Shahin
- Golnoush Abaei
Paper Information
- arXiv ID: 2603.04729v1
- Categories: cs.SE
- Published: March 5, 2026