[Paper] Behaviour Driven Development Scenario Generation with Large Language Models
Source: arXiv - 2603.04729v1
Overview
The paper evaluates how well three state‑of‑the‑art large language models (LLMs)—GPT‑4, Claude 3, and Gemini—can automatically generate Behaviour‑Driven Development (BDD) scenarios from software requirements. Using a 500‑item dataset of real user stories paired with hand‑crafted BDD scenarios, the authors identify which prompting strategies and model settings yield the most developer‑friendly output.
Key Contributions
- Large‑scale empirical dataset: 500 paired user stories, requirement descriptions, and BDD scenarios collected from four proprietary products.
- Cross‑model comparison: Systematic evaluation of GPT‑4, Claude 3, and Gemini using multiple quantitative and qualitative metrics.
- Multidimensional evaluation framework: Combines text‑level similarity, semantic similarity, LLM‑based scoring, and human expert ratings.
- Prompt engineering insights: Identifies model‑specific prompting strategies (zero‑shot, chain‑of‑thought, few‑shot) that maximize scenario quality.
- Parameter tuning guidelines: Demonstrates that temperature = 0 and top_p = 1.0 consistently produce the best BDD outputs across models.
- Correlation analysis: Shows that LLM‑based evaluators (especially DeepSeek) align more closely with human judgments than traditional similarity metrics.
Methodology
- Dataset construction – The authors extracted 500 real‑world user stories and their corresponding BDD scenarios from four in‑house software products. Each entry includes a short user story, a detailed requirement description, and a manually written BDD scenario.
- Prompt design – For each LLM they experimented with three prompting styles:
  - Zero‑shot: a single instruction without examples.
  - Chain‑of‑thought: a step‑by‑step reasoning prompt.
  - Few‑shot: a few example pairs supplied in the prompt.
  The optimal style differed per model (GPT‑4 → zero‑shot, Claude 3 → chain‑of‑thought, Gemini → few‑shot).
- Generation settings – Temperature was set to 0 (deterministic output) and top_p to 1.0 for all runs, based on preliminary sweeps.
- Evaluation – Four complementary lenses were applied:
  - Text similarity (BLEU, ROUGE) against the reference scenario.
  - Semantic similarity (sentence‑BERT cosine similarity).
  - LLM‑based scoring using separate evaluator models (e.g., DeepSeek).
  - Human expert assessment, in which BDD practitioners rated relevance, completeness, and readability.
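The prompting and generation setup above can be sketched in Python. The template wording and request‑payload shape below are illustrative assumptions, not the paper's exact prompts; only the per‑model style assignments and the temperature/top_p settings come from the paper.

```python
# Sketch of model-specific prompt construction with the paper's
# deterministic generation settings (temperature = 0, top_p = 1.0).
# Template text and payload shape are hypothetical.

FEW_SHOT_EXAMPLE = (
    "User story: As a user, I want to reset my password.\n"
    "Scenario: Password reset\n"
    "  Given a registered user on the login page\n"
    "  When they request a password reset\n"
    "  Then a reset link is sent to their email\n"
)

PROMPT_STYLES = {
    # GPT-4 performed best with a plain instruction, no examples.
    "zero_shot": "Write a Gherkin BDD scenario for this requirement:\n{req}",
    # Claude 3 performed best with step-by-step reasoning first.
    "chain_of_thought": (
        "First list the actors, preconditions, actions, and expected "
        "outcomes in this requirement, then write a Gherkin BDD scenario.\n{req}"
    ),
    # Gemini performed best with example pairs supplied in the prompt.
    "few_shot": "Example:\n" + FEW_SHOT_EXAMPLE + "\nNow do the same for:\n{req}",
}

def build_request(style: str, requirement: str) -> dict:
    """Assemble a chat-completion style payload using the paper's settings."""
    return {
        "messages": [{"role": "user",
                      "content": PROMPT_STYLES[style].format(req=requirement)}],
        "temperature": 0.0,  # deterministic output, per the paper's sweeps
        "top_p": 1.0,
    }
```

Pairing each model with its best style (e.g. `build_request("few_shot", req)` for Gemini) reproduces the configuration the study found optimal.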
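The semantic‑similarity lens reduces to cosine similarity between embedding vectors. A minimal stdlib sketch, assuming the embeddings themselves come from a sentence‑BERT model (not shown here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors,
    e.g. sentence-BERT encodings of a generated and a reference scenario."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```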
Results & Findings
| Model | Best Prompt | Text/Semantic Scores | Human Rating | LLM‑Evaluator Score |
|---|---|---|---|---|
| GPT‑4 | Zero‑shot | Highest BLEU/ROUGE | 3.8 / 5 | 4.0 / 5 |
| Claude 3 | Chain‑of‑thought | Slightly lower BLEU/ROUGE | 4.4 / 5 (top) | 4.5 / 5 |
| Gemini | Few‑shot | Mid‑range similarity | 4.0 / 5 | 4.2 / 5 |
- Claude 3 consistently earned the highest human and LLM‑evaluator scores, even though its raw text similarity lagged behind GPT‑4.
- DeepSeek’s evaluator scores correlated more strongly with human judgments (ρ ≈ 0.78) than BLEU/ROUGE (ρ ≈ 0.45).
- Input quality matters: A detailed requirement description alone yields scenarios comparable to those generated from the user story combined with the description; the terse user story alone produces poor BDD output.
- Parameter impact: Temperature = 0 and top_p = 1.0 reduced hallucinations and improved consistency across all models.
Practical Implications
- Accelerated BDD authoring – Teams can seed a BDD suite by feeding detailed requirement docs to Claude 3 (or GPT‑4 with zero‑shot prompting) and obtain high‑quality scenarios, cutting manual writing time by up to 60 % in pilot studies.
- Prompt‑template libraries – The model‑specific prompting recipes can be packaged as reusable templates in CI/CD pipelines or IDE extensions, enabling “one‑click” scenario generation.
- Quality control via LLM evaluators – Deploying a secondary LLM (e.g., DeepSeek) as an automated reviewer can flag low‑quality scenarios before they enter the test suite, reducing false positives in BDD test runs.
- Tooling integration – The findings map cleanly onto existing BDD frameworks (Cucumber, Behave). Generated Gherkin files can be directly imported, allowing rapid iteration on acceptance criteria.
- Cost‑effective scaling – Since deterministic settings (temp = 0) avoid the need for multiple sampling, organizations can keep API usage low while maintaining output quality.
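Before handing a generated scenario to an LLM reviewer (or importing it into Cucumber/Behave), a cheap structural lint can reject obviously malformed output. This is a sketch of such a gate, not something the paper implements; the function name and checks are illustrative:

```python
import re

REQUIRED_KEYWORDS = ("Given", "When", "Then")

def lint_gherkin(scenario: str) -> list[str]:
    """Return a list of structural problems in a generated Gherkin scenario.
    Intended as a cheap pre-filter before an LLM-based quality review."""
    problems = []
    if not re.search(r"^\s*Scenario(?: Outline)?:", scenario, re.MULTILINE):
        problems.append("missing 'Scenario:' header")
    for kw in REQUIRED_KEYWORDS:
        if not re.search(rf"^\s*{kw}\b", scenario, re.MULTILINE):
            problems.append(f"missing '{kw}' step")
    return problems
```

Scenarios that pass the lint can be written to a `.feature` file and run as-is; those that fail can be regenerated or escalated to the LLM evaluator.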
Limitations & Future Work
- Domain coverage – The dataset stems from four proprietary products; results may differ for domains with highly specialized vocabularies (e.g., embedded systems, finance).
- Human evaluation size – Only a limited pool of BDD experts participated, which could bias the subjective scores.
- Model updates – As LLMs evolve rapidly, the relative rankings could shift; continuous benchmarking is needed.
- Future directions suggested by the authors include: expanding the dataset to open‑source projects, exploring multimodal prompts (e.g., diagrams), and integrating reinforcement learning from human feedback to fine‑tune LLMs specifically for BDD scenario generation.
Authors
- Amila Rathnayake
- Mojtaba Shahin
- Golnoush Abaei
Paper Information
- arXiv ID: 2603.04729v1
- Categories: cs.SE
- Published: March 5, 2026