[Paper] DeepQuali: Initial results of a study on the use of large language models for assessing the quality of user stories

Published: February 9, 2026
4 min read
Source: arXiv - 2602.08887v1

Overview

The paper “DeepQuali: Initial results of a study on the use of large language models for assessing the quality of user stories” investigates whether a state‑of‑the‑art LLM (GPT‑4o) can reliably evaluate and improve the quality of agile user stories. By comparing the model’s judgments with those of seasoned requirements engineers in two small companies, the authors show that LLM‑driven quality checks can be both accurate and actionable—opening a new avenue for AI‑assisted requirements engineering.

Key Contributions

  • DeepQuali prototype: an end‑to‑end pipeline that feeds user stories to GPT‑4o, obtains a structured quality rating (e.g., completeness, testability, clarity) and a natural‑language explanation, and suggests concrete improvements.
  • Empirical validation: a field study with real‑world projects where DeepQuali’s assessments were benchmarked against expert evaluations.
  • User‑acceptance insights: qualitative feedback from requirements engineers on the usefulness, trustworthiness, and workflow integration of the LLM‑based approach.
  • Evidence that explicit quality models + explanatory feedback boost acceptance: the study demonstrates that when the LLM references a known quality framework (e.g., INVEST) and explains its reasoning, practitioners are more likely to trust its output.
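The pipeline described above can be sketched as a prompt builder plus a validator for the model's structured reply. This is a minimal illustration, not the paper's implementation: the sub-aspect names, the JSON reply format, and the `parse_rating` helper are assumptions, and the GPT-4o call itself is replaced by a canned response.

```python
import json

# INVEST criteria used for the structured rating. The paper breaks each
# criterion into finer sub-aspects; those are not reproduced here.
CRITERIA = ["independent", "negotiable", "valuable", "estimable", "small", "testable"]

def build_prompt(story: str) -> str:
    """Assemble a rating prompt in the spirit of the DeepQuali pipeline."""
    return (
        "Rate the following user story on each INVEST criterion from 1-5, "
        "justify the ratings briefly, and suggest one concrete improvement.\n"
        f"Return JSON with keys: {', '.join(CRITERIA)}, justification, suggestion.\n\n"
        f"User story: {story}"
    )

def parse_rating(raw_json: str) -> dict:
    """Validate the model's JSON reply and clamp each rating to the 1-5 scale."""
    data = json.loads(raw_json)
    for criterion in CRITERIA:
        data[criterion] = max(1, min(5, int(data[criterion])))
    return data

# Canned reply standing in for an actual GPT-4o call (no network access here).
reply = ('{"independent": 4, "negotiable": 3, "valuable": 5, "estimable": 4, '
         '"small": 2, "testable": 6, "justification": "...", '
         '"suggestion": "Split into smaller stories."}')
rating = parse_rating(reply)
print(rating["testable"])  # out-of-range 6 is clamped to 5
```

Validating and clamping the reply is the interesting part in practice: LLM output is not guaranteed to respect the requested schema, so a production pipeline would sit a check like this between the model and any downstream tooling.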

Methodology

  1. Quality Model Definition – The authors adopted the widely‑used INVEST criteria (Independent, Negotiable, Valuable, Estimable, Small, Testable) and broke each criterion into measurable sub‑aspects.
  2. Prompt Engineering – Carefully crafted prompts were designed to (a) request a numeric rating (1‑5) for each sub‑aspect, (b) ask for a concise justification, and (c) solicit improvement suggestions.
  3. Data Collection – Two small software firms supplied a total of 120 user stories from ongoing agile projects.
  4. Expert Baseline – Three senior requirements engineers independently rated the same stories using the same quality model, providing a “gold‑standard” reference.
  5. Comparison & Analysis – Agreement between LLM and experts was measured with Cohen’s κ for categorical ratings and Pearson’s r for overall scores. Follow‑up walkthrough sessions captured qualitative reactions and acceptance levels.
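The agreement measures in step 5 are standard and easy to reproduce. The sketch below implements Cohen's κ and Pearson's r from their textbook definitions on toy data; the rating lists are invented for illustration and are not the study's data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items: observed
    agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy per-story scores: LLM vs. expert consensus (illustrative only).
llm    = [4, 3, 5, 2, 4, 3, 5, 1]
expert = [4, 3, 4, 2, 5, 3, 5, 2]
print(round(cohens_kappa(llm, expert), 2))  # 0.52
print(round(pearson_r(llm, expert), 2))     # 0.89
```

Using both measures mirrors the study's design: κ treats the 1-5 ratings as categories and rewards exact matches, while r captures whether the LLM and experts rank stories similarly even when their absolute scores differ, which is why aggregate correlation (0.78) can be high while per-criterion κ (≈ 0.45) is only moderate.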

Results & Findings

  • High overall agreement: the LLM’s aggregate quality scores matched the expert consensus with a Pearson correlation of 0.78.
  • Strong alignment on explanations: experts rated the LLM’s rationales as “clear and helpful” in 84% of cases.
  • Variability on fine‑grained ratings: for individual sub‑criteria (e.g., “testability”), agreement dropped to κ ≈ 0.45, indicating that personal experience influences nuanced judgments.
  • Positive perceived usefulness: 71% of participants said DeepQuali would be “useful in daily work” if integrated into their tooling.
  • Workflow friction: The main criticism was the lack of seamless integration with existing backlog management tools (e.g., Jira, Azure Boards).

Practical Implications

  • Instant quality gate – Teams can run DeepQuali as a pre‑commit check on new user stories, catching vague or non‑testable items before they enter sprint planning.
  • Learning aid for junior analysts – The model’s explanations double as on‑the‑fly coaching, helping newcomers internalize good story‑writing practices.
  • Reduced review overhead – By surfacing the most problematic stories automatically, human reviewers can focus their limited time on high‑impact items.
  • Toolchain integration roadmap – Embedding the LLM as a plugin for popular issue‑trackers (via REST APIs) would turn the prototype into a production‑ready assistant.
  • Scalable quality governance – Organizations with distributed teams can enforce a consistent quality baseline without needing a central expert panel for every backlog.
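A pre-commit quality gate does not have to start with an LLM call. The sketch below is a cheap heuristic lint that could run before (or instead of) a full DeepQuali assessment; the template checks and the 60-word threshold are assumptions, not anything the paper prescribes.

```python
import re

def quick_story_lint(story: str) -> list[str]:
    """Heuristic pre-check for user stories: flags stories missing the
    common 'As a ... I want ... so that ...' parts, or that look too big.
    A stand-in for a full LLM assessment, suitable as a pre-commit hook."""
    issues = []
    if not re.search(r"\bas an?\b", story, re.IGNORECASE):
        issues.append("no role ('As a ...')")
    if not re.search(r"\bi want\b", story, re.IGNORECASE):
        issues.append("no goal ('I want ...')")
    if not re.search(r"\bso that\b", story, re.IGNORECASE):
        issues.append("no benefit ('so that ...')")
    if len(story.split()) > 60:
        issues.append("very long; may violate the 'Small' criterion")
    return issues

good = ("As a shop owner I want to export monthly sales "
        "so that I can file taxes faster.")
print(quick_story_lint(good))                        # []
print(quick_story_lint("Implement export button."))  # three issues flagged
```

Stories that pass such a lint would still go to the LLM for the nuanced INVEST rating; stories that fail can be bounced back to the author immediately, keeping the expensive model call for items that are at least structurally complete.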

Limitations & Future Work

  • Sample size & domain diversity – The study involved only two small companies and a limited set of domains; broader validation across enterprise‑scale projects is needed.
  • Prompt sensitivity – Results depend heavily on prompt phrasing; systematic prompt‑optimization techniques were not explored.
  • Explainability depth – While explanations were helpful, they remain surface‑level; future work could integrate traceability to the underlying quality model.
  • Integration & automation – The authors plan to develop native plugins for Jira/Azure Boards and to evaluate continuous‑integration pipelines that automatically flag low‑quality stories.

Bottom line: DeepQuali shows that LLMs like GPT‑4o can become trustworthy allies in the often‑overlooked realm of requirements quality, offering both rapid assessments and actionable guidance—provided they’re woven into the developers’ everyday workflow.

Authors

  • Adam Trendowicz
  • Daniel Seifert
  • Andreas Jedlitschka
  • Marcus Ciolkowski
  • Anton Strahilov

Paper Information

  • arXiv ID: 2602.08887v1
  • Categories: cs.SE, cs.AI
  • Published: February 9, 2026