[Paper] Supporting software engineering tasks with agentic AI: Demonstration on document retrieval and test scenario generation
Source: arXiv - 2602.04726v1
Overview
The paper presents two agentic AI prototypes that automate common software‑engineering chores: (1) generating test scenarios directly from detailed requirement texts, and (2) retrieving, answering, and summarizing engineering documents. By wiring together specialized Large Language Model (LLM) agents under a supervisory “hub,” the authors demonstrate how a modest amount of orchestration can turn raw natural‑language artifacts into actionable development assets.
Key Contributions
- Star‑topology agent framework – a supervisor agent coordinates multiple worker agents, each dedicated to a sub‑task (e.g., parsing requirements, drafting test steps, or handling a specific document‑retrieval use case).
- Automatic test‑scenario generation – from a single requirement description the system produces structured test cases ready for inclusion in test suites.
- Multi‑purpose document‑retrieval assistant – a single LLM‑backed pipeline supports keyword search, question answering, change‑tracking, and large‑scale summarization over a project’s documentation corpus.
- Real‑world demonstration – the prototypes are evaluated on a genuine software project, showing end‑to‑end feasibility without step‑by‑step manual prompting at run time.
- Open research agenda – the authors outline how the agentic pattern can be extended to other SE tasks (e.g., code review, impact analysis) and discuss scalability considerations.
Methodology
- Agent Design – Each worker agent is a fine‑tuned or prompt‑engineered LLM that knows how to perform one narrow function (e.g., “extract functional clauses,” “write Given/When/Then steps”).
- Supervisor Coordination – A central supervisor parses the high‑level request, decides which workers to invoke, and stitches their outputs together. Communication follows a simple JSON‑based contract, making the system language‑agnostic.
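The supervisor-to-worker handoff can be sketched as follows. This is a minimal illustration of the JSON-based contract described above; the intent names, worker identifiers, and message fields are assumptions for illustration, not the paper's exact schema.

```python
import json

# Hypothetical routing table: each high-level intent maps to an ordered
# pipeline of worker agents. Adding a use case means adding an entry here.
WORKERS = {
    "test_generation": ["requirement_parser", "scenario_builder", "validator"],
    "doc_search": ["search_agent"],
    "doc_qa": ["qa_agent"],
    "doc_diff": ["change_tracker_agent"],
    "doc_summary": ["summarizer_agent"],
}

def route(request: dict) -> list[str]:
    """Supervisor step: pick the worker pipeline for a high-level request."""
    intent = request["intent"]
    if intent not in WORKERS:
        raise ValueError(f"unknown intent: {intent}")
    return WORKERS[intent]

def message(sender: str, payload: dict) -> str:
    """Serialize an inter-agent message; JSON keeps workers language-agnostic."""
    return json.dumps({"sender": sender, "payload": payload})

# A test-generation request flows through three workers in sequence.
pipeline = route({"intent": "test_generation"})
```

Because the routing table is plain data, swapping or adding a worker is a one-line change, which mirrors the extensibility claim made later in the paper.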
- Test‑Scenario Pipeline
- Input: a natural‑language requirement (e.g., “The system shall reject login attempts after three failures”).
- Steps:
- Requirement Parser extracts entities, constraints, and success/failure conditions.
- Scenario Builder creates BDD‑style test outlines.
- Validator checks for completeness and consistency.
- Output: a ready‑to‑use test case file.
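The three-stage pipeline above can be sketched end to end. In the actual system each stage is an LLM-backed agent; here each is a plain function so the data flow is visible. The extraction logic and field names are toy assumptions tailored to the paper's login-failure example.

```python
import re

def parse_requirement(text: str) -> dict:
    """Requirement Parser: extract entity, constraint, and outcome (toy rules)."""
    m = re.search(r"after (\w+) failures", text)
    return {
        "entity": "login attempt",
        "constraint": f"{m.group(1)} failures" if m else "unspecified",
        "outcome": "reject",
    }

def build_scenario(parsed: dict) -> list[str]:
    """Scenario Builder: turn parsed clauses into BDD-style steps."""
    return [
        f"Given a user has reached {parsed['constraint']}",
        f"When the user submits another {parsed['entity']}",
        f"Then the system shall {parsed['outcome']} it",
    ]

def validate(steps: list[str]) -> bool:
    """Validator: check the outline has Given/When/Then coverage, in order."""
    return [s.split()[0] for s in steps] == ["Given", "When", "Then"]

steps = build_scenario(parse_requirement(
    "The system shall reject login attempts after three failures"))
assert validate(steps)
```

The Validator stage is what allows the system to flag incomplete scenarios for manual review rather than emitting them silently.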
- Document‑Retrieval Pipeline
- The document corpus is indexed (vector embeddings + traditional inverted index).
- Depending on the user’s intent, the supervisor routes the request to:
- Search Agent (keyword/semantic retrieval).
- QA Agent (extractive answer generation).
- Change‑Tracker Agent (diff detection across versions).
- Summarizer Agent (condenses large spec sets).
- Each agent may call auxiliary tools (e.g., a diff engine) before returning a natural‑language response.
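The hybrid index behind the pipeline above can be approximated in a few lines: an inverted index supplies keyword recall, and a vector similarity stage reranks the candidates. In this sketch a bag-of-words count vector stands in for real embeddings, and the two-stage fusion is an assumption about how the keyword and semantic indexes are combined.

```python
import math
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

class HybridIndex:
    def __init__(self):
        self.inverted = defaultdict(set)   # term -> doc ids (keyword stage)
        self.vectors = {}                  # doc id -> term counts (semantic stage)

    def add(self, doc_id: str, text: str) -> None:
        counts = defaultdict(int)
        for term in tokenize(text):
            self.inverted[term].add(doc_id)
            counts[term] += 1
        self.vectors[doc_id] = dict(counts)

    def search(self, query: str, k: int = 5) -> list[str]:
        q = defaultdict(int)
        for term in tokenize(query):
            q[term] += 1
        # Stage 1: keyword recall via the inverted index.
        candidates = set()
        for term in q:
            candidates |= self.inverted.get(term, set())
        # Stage 2: rerank candidates by cosine similarity.
        def cos(a, b):
            dot = sum(a[t] * b.get(t, 0) for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0
        return sorted(candidates,
                      key=lambda d: cos(q, self.vectors[d]),
                      reverse=True)[:k]
```

A production system would replace the count vectors with learned embeddings, but the two-stage recall-then-rerank shape stays the same.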
The whole system runs on commodity cloud GPUs; no custom model training is required beyond prompt engineering.
Results & Findings
- Test‑Scenario Generation produced correct BDD scenarios for 87% of 30 real requirements, with the remaining cases needing minor manual tweaks.
- Document Retrieval achieved an average precision@5 of 0.78 for semantic search and a BLEU‑like score of 0.71 for QA answers compared to a human‑crafted baseline.
- End‑to‑end latency stayed under 5 seconds for most queries, demonstrating that a lightweight orchestration layer does not introduce prohibitive overhead.
- The star topology proved robust: adding or swapping a worker agent required only updating the supervisor’s routing table, not redesigning the whole pipeline.
Practical Implications
- Accelerated Test Development – Teams can auto‑populate test suites from requirement docs, freeing QA engineers to focus on edge‑case design rather than boilerplate.
- Unified Knowledge Hub – A single conversational interface can replace multiple tools (search engine, ticket‑tracker, changelog viewer), reducing context‑switching for developers.
- Plug‑and‑Play Extensibility – Because each worker is an isolated LLM service, organizations can swap in domain‑specific models (e.g., a security‑focused LLM for threat‑model QA) without rewriting the orchestration logic.
- Cost‑Effective Automation – The approach leverages existing LLM APIs; the main expense is inference time, which can be budgeted per query, making it attractive for small‑to‑mid‑size firms.
- Compliance & Auditing – The structured output (JSON, BDD) can be logged and version‑controlled, providing traceability from requirement to test case—a boon for regulated industries.
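The traceability point above suggests a simple, version-controllable record per generated artifact. The schema below is a hypothetical illustration of such a trace entry, not a format defined in the paper.

```python
import datetime
import hashlib
import json

def trace_record(req_id: str, req_text: str, test_file: str) -> dict:
    """Link a requirement to its generated test case for audit purposes."""
    return {
        "requirement_id": req_id,
        # Hashing the requirement text detects later edits to the source doc.
        "requirement_sha256": hashlib.sha256(req_text.encode()).hexdigest(),
        "test_case": test_file,
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = trace_record(
    "REQ-42",
    "The system shall reject login attempts after three failures",
    "tests/login_lockout.feature",
)
print(json.dumps(record, indent=2))
```

Committing such records alongside the generated tests gives auditors a requirement-to-test chain without any extra tooling.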
Limitations & Future Work
- Prompt Sensitivity – The quality of each worker’s output still hinges on well‑crafted prompts; a systematic prompt‑management strategy is needed for production stability.
- Scalability of the Supervisor – As the number of use‑cases grows, the supervisor may become a bottleneck; the authors suggest a hierarchical supervisor network or micro‑service decomposition.
- Domain Generalization – The prototypes were evaluated on a single software project; broader benchmarks across varied domains (embedded, AI‑driven systems) are required to confirm generality.
- Evaluation Depth – Human‑in‑the‑loop assessments were limited; future work will include larger user studies to measure productivity gains and error‑reduction rates.
Bottom line: By treating LLMs as modular agents rather than monolithic chatbots, the paper charts a practical path for developers to embed generative AI directly into everyday software‑engineering workflows. The demonstrated gains in test creation and document handling hint at a near‑future where “AI‑assistant” pipelines become a standard part of the dev‑toolchain.
Authors
- Marian Kica
- Lukas Radosky
- David Slivka
- Karin Kubinova
- Daniel Dovhun
- Tomas Uhercik
- Erik Bircak
- Ivan Polasek
Paper Information
- arXiv ID: 2602.04726v1
- Categories: cs.SE, cs.AI
- Published: February 4, 2026