[Paper] Behaviour Driven Development Scenario Generation with Large Language Models

Published: March 4, 2026 at 09:05 PM EST
4 min read
Source: arXiv


Overview

The paper evaluates how three state‑of‑the‑art large language models (LLMs)—GPT‑4, Claude 3, and Gemini—can automatically generate Behaviour‑Driven Development (BDD) scenarios from software requirements. By building a 500‑item dataset of real user stories and their hand‑crafted BDD scenarios, the authors show which prompting strategies and model settings yield the most developer‑friendly output.

Key Contributions

  • Large‑scale empirical dataset: 500 paired user stories, requirement descriptions, and BDD scenarios collected from four proprietary products.
  • Cross‑model comparison: Systematic evaluation of GPT‑4, Claude 3, and Gemini using multiple quantitative and qualitative metrics.
  • Multidimensional evaluation framework: Combines text‑level similarity, semantic similarity, LLM‑based scoring, and human expert ratings.
  • Prompt engineering insights: Identifies model‑specific prompting strategies (zero‑shot, chain‑of‑thought, few‑shot) that maximize scenario quality.
  • Parameter tuning guidelines: Demonstrates that temperature = 0 and top_p = 1.0 consistently produce the best BDD outputs across models.
  • Correlation analysis: Shows that LLM‑based evaluators (especially DeepSeek) align more closely with human judgments than traditional similarity metrics.

Methodology

  1. Dataset construction – The authors extracted 500 real‑world user stories and their corresponding BDD scenarios from four in‑house software products. Each entry includes a short user story, a detailed requirement description, and a manually written BDD scenario.
  2. Prompt design – For each LLM they experimented with three prompting styles:
    • Zero‑shot: a single instruction without examples.
    • Chain‑of‑thought: a step‑by‑step reasoning prompt.
    • Few‑shot: a few example pairs supplied in the prompt.
      The optimal style differed per model (GPT‑4 → zero‑shot, Claude 3 → chain‑of‑thought, Gemini → few‑shot).
  3. Generation settings – Temperature was set to 0 (deterministic output) and top_p to 1.0 for all runs, based on preliminary sweeps.
  4. Evaluation – Four complementary lenses were applied:
    • Text similarity (BLEU, ROUGE) against the reference scenario.
    • Semantic similarity (sentence‑BERT cosine similarity).
    • LLM‑based scoring using separate evaluator models (e.g., DeepSeek).
    • Human expert assessment where BDD practitioners rated relevance, completeness, and readability.
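The prompting styles and decoding settings above can be sketched as a small request builder. The prompt wording, model names, and the `build_request` helper are illustrative assumptions for this summary, not the paper's exact artifacts; only the style-to-model mapping and the temperature/top_p values come from the paper.

```python
# Prompt templates for the three styles evaluated in the paper.
# The exact wording here is a hypothetical reconstruction.
ZERO_SHOT = (
    "Write a Gherkin BDD scenario (Given/When/Then) for this requirement:\n"
    "{requirement}"
)

CHAIN_OF_THOUGHT = (
    "Think step by step: identify the actor, the precondition, the action, "
    "and the expected outcome. Then write a Gherkin BDD scenario for:\n"
    "{requirement}"
)

FEW_SHOT = (
    "Example requirement: {example_req}\n"
    "Example scenario:\n{example_scenario}\n\n"
    "Now write a Gherkin BDD scenario for:\n{requirement}"
)

# Best-performing style per model, as reported in the paper.
BEST_STYLE = {
    "gpt-4": ZERO_SHOT,
    "claude-3": CHAIN_OF_THOUGHT,
    "gemini": FEW_SHOT,
}

def build_request(model: str, requirement: str, **examples) -> dict:
    """Assemble a deterministic generation request for the given model."""
    prompt = BEST_STYLE[model].format(requirement=requirement, **examples)
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0,   # deterministic decoding, per the paper's sweeps
        "top_p": 1.0,
    }
```

A request dict like this would then be passed to the provider's API client; the deterministic settings mean a single call per requirement suffices, with no need to sample and rank multiple candidates.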

Results & Findings

| Model | Best Prompt | Text/Semantic Scores | Human Rating | LLM‑Evaluator Score |
|---|---|---|---|---|
| GPT‑4 | Zero‑shot | Highest BLEU/ROUGE | 3.8 / 5 | 4.0 / 5 |
| Claude 3 | Chain‑of‑thought | Slightly lower BLEU/ROUGE | 4.4 / 5 (top) | 4.5 / 5 |
| Gemini | Few‑shot | Mid‑range similarity | 4.0 / 5 | 4.2 / 5 |
  • Claude 3 consistently earned the highest human and LLM‑evaluator scores, even though its raw text similarity lagged behind GPT‑4.
  • DeepSeek’s evaluator scores correlated more strongly with human judgments (ρ ≈ 0.78) than BLEU/ROUGE (ρ ≈ 0.45).
  • Input quality matters: A detailed requirement description alone yields scenarios comparable to those generated from the user story and description combined; the terse user story alone leads to poor BDD output.
  • Parameter impact: Temperature = 0 and top_p = 1.0 reduced hallucinations and improved consistency across all models.

Practical Implications

  • Accelerated BDD authoring – Teams can seed a BDD suite by feeding detailed requirement docs to Claude 3 (or GPT‑4 with zero‑shot) and obtain high‑quality scenarios, cutting manual writing time by up to 60 % in pilot studies.
  • Prompt‑template libraries – The model‑specific prompting recipes can be packaged as reusable templates in CI/CD pipelines or IDE extensions, enabling “one‑click” scenario generation.
  • Quality control via LLM evaluators – Deploying a secondary LLM (e.g., DeepSeek) as an automated reviewer can flag low‑quality scenarios before they enter the test suite, reducing false positives in BDD test runs.
  • Tooling integration – The findings map cleanly onto existing BDD frameworks (Cucumber, Behave). Generated Gherkin files can be directly imported, allowing rapid iteration on acceptance criteria.
  • Cost‑effective scaling – Since deterministic settings (temp = 0) avoid the need for multiple sampling, organizations can keep API usage low while maintaining output quality.
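The quality-control idea above can be sketched as a small CI gate. `score_scenario` stands in for a call to a secondary evaluator model such as DeepSeek; the heuristic body, the `gate` helper, and the 3.5 threshold are assumptions for illustration, not part of the paper.

```python
THRESHOLD = 3.5  # minimum evaluator score (out of 5) to accept a scenario

def score_scenario(scenario: str) -> float:
    """Placeholder for an evaluator-LLM API call returning a 1-5 score."""
    # Real usage would send the scenario plus its requirement to the
    # evaluator model; a trivial structural check stands in here.
    keywords = ("Given", "When", "Then")
    return 5.0 if all(k in scenario for k in keywords) else 1.0

def gate(scenarios: list[str]) -> tuple[list[str], list[str]]:
    """Split generated scenarios into accepted and flagged-for-review."""
    accepted, flagged = [], []
    for s in scenarios:
        (accepted if score_scenario(s) >= THRESHOLD else flagged).append(s)
    return accepted, flagged
```

Accepted scenarios would flow into the Cucumber or Behave suite as `.feature` files, while flagged ones are routed to a human reviewer instead of silently entering the test run.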

Limitations & Future Work

  • Domain coverage – The dataset stems from four proprietary products; results may differ for domains with highly specialized vocabularies (e.g., embedded systems, finance).
  • Human evaluation size – Only a limited pool of BDD experts participated, which could bias the subjective scores.
  • Model updates – As LLMs evolve rapidly, the relative rankings could shift; continuous benchmarking is needed.
  • Future directions suggested by the authors include: expanding the dataset to open‑source projects, exploring multimodal prompts (e.g., diagrams), and integrating reinforcement learning from human feedback to fine‑tune LLMs specifically for BDD scenario generation.

Authors

  • Amila Rathnayake
  • Mojtaba Shahin
  • Golnoush Abaei

Paper Information

  • arXiv ID: 2603.04729v1
  • Categories: cs.SE
  • Published: March 5, 2026
