[Paper] Supporting software engineering tasks with agentic AI: Demonstration on document retrieval and test scenario generation
Source: arXiv - 2602.04726v1
Overview
The paper presents two agentic AI prototypes that automate common software‑engineering chores: (1) generating test scenarios directly from detailed requirement texts, and (2) retrieving, answering, and summarizing engineering documents. By wiring together specialized Large Language Model (LLM) agents under a supervisory “hub,” the authors demonstrate how a modest amount of orchestration can turn raw natural‑language artifacts into actionable development assets.
Key Contributions
- Star‑topology agent framework – a supervisor agent coordinates multiple worker agents, each dedicated to a sub‑task (e.g., parsing requirements, drafting test steps, or handling a specific document‑retrieval use case).
- Automatic test‑scenario generation – from a single requirement description the system produces structured test cases ready for inclusion in test suites.
- Multi‑purpose document‑retrieval assistant – a single LLM‑backed pipeline supports keyword search, question answering, change‑tracking, and large‑scale summarization over a project’s documentation corpus.
- Real‑world demonstration – the prototypes are evaluated on a genuine software project, showing end‑to‑end feasibility without step‑by‑step manual prompting at run time.
- Open research agenda – the authors outline how the agentic pattern can be extended to other SE tasks (e.g., code review, impact analysis) and discuss scalability considerations.
Methodology
- Agent Design – Each worker agent is a fine‑tuned or prompt‑engineered LLM that knows how to perform one narrow function (e.g., “extract functional clauses,” “write Given/When/Then steps”).
- Supervisor Coordination – A central supervisor parses the high‑level request, decides which workers to invoke, and stitches their outputs together. Communication follows a simple JSON‑based contract, making the system language‑agnostic.
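The supervisor-to-worker handoff can be sketched as follows. This is a minimal illustration of the JSON-based contract described above; the intent names, worker identifiers, and message fields are assumptions for illustration, not the paper's exact schema.

```python
import json

# Hypothetical routing table: each high-level intent maps to an ordered
# pipeline of worker agents. Adding a use case means adding an entry here.
WORKERS = {
    "test_generation": ["requirement_parser", "scenario_builder", "validator"],
    "doc_search": ["search_agent"],
    "doc_qa": ["qa_agent"],
    "doc_diff": ["change_tracker_agent"],
    "doc_summary": ["summarizer_agent"],
}

def route(request: dict) -> list[str]:
    """Supervisor step: pick the worker pipeline for a high-level request."""
    intent = request["intent"]
    if intent not in WORKERS:
        raise ValueError(f"unknown intent: {intent}")
    return WORKERS[intent]

def message(sender: str, payload: dict) -> str:
    """Serialize an inter-agent message; JSON keeps workers language-agnostic."""
    return json.dumps({"sender": sender, "payload": payload})

# A test-generation request flows through three workers in sequence.
pipeline = route({"intent": "test_generation"})
```

Because the routing table is plain data, swapping or adding a worker is a one-line change, which mirrors the extensibility claim made later in the paper.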
- Test‑Scenario Pipeline
- Input: a natural‑language requirement (e.g., “The system shall reject login attempts after three failures”).
- Steps:
- Requirement Parser extracts entities, constraints, and success/failure conditions.
- Scenario Builder creates BDD‑style test outlines.
- Validator checks for completeness and consistency.
- Output: a ready‑to‑use test case file.
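The three-stage pipeline above can be sketched end to end. In the actual system each stage is an LLM-backed agent; here each is a plain function so the data flow is visible. The extraction logic and field names are toy assumptions tailored to the paper's login-failure example.

```python
import re

def parse_requirement(text: str) -> dict:
    """Requirement Parser: extract entity, constraint, and outcome (toy rules)."""
    m = re.search(r"after (\w+) failures", text)
    return {
        "entity": "login attempt",
        "constraint": f"{m.group(1)} failures" if m else "unspecified",
        "outcome": "reject",
    }

def build_scenario(parsed: dict) -> list[str]:
    """Scenario Builder: turn parsed clauses into BDD-style steps."""
    return [
        f"Given a user has reached {parsed['constraint']}",
        f"When the user submits another {parsed['entity']}",
        f"Then the system shall {parsed['outcome']} it",
    ]

def validate(steps: list[str]) -> bool:
    """Validator: check the outline has Given/When/Then coverage, in order."""
    return [s.split()[0] for s in steps] == ["Given", "When", "Then"]

steps = build_scenario(parse_requirement(
    "The system shall reject login attempts after three failures"))
assert validate(steps)
```

The Validator stage is what allows the system to flag incomplete scenarios for manual review rather than emitting them silently.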
- Document‑Retrieval Pipeline
- The document corpus is indexed (vector embeddings + traditional inverted index).
- Depending on the user’s intent, the supervisor routes the request to:
- Search Agent (keyword/semantic retrieval).
- QA Agent (extractive answer generation).
- Change‑Tracker Agent (diff detection across versions).
- Summarizer Agent (condenses large spec sets).
- Each agent may call auxiliary tools (e.g., a diff engine) before returning a natural‑language response.
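The hybrid index behind the pipeline above can be approximated in a few lines: an inverted index supplies keyword recall, and a vector similarity stage reranks the candidates. In this sketch a bag-of-words count vector stands in for real embeddings, and the two-stage fusion is an assumption about how the keyword and semantic indexes are combined.

```python
import math
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

class HybridIndex:
    def __init__(self):
        self.inverted = defaultdict(set)   # term -> doc ids (keyword stage)
        self.vectors = {}                  # doc id -> term counts (semantic stage)

    def add(self, doc_id: str, text: str) -> None:
        counts = defaultdict(int)
        for term in tokenize(text):
            self.inverted[term].add(doc_id)
            counts[term] += 1
        self.vectors[doc_id] = dict(counts)

    def search(self, query: str, k: int = 5) -> list[str]:
        q = defaultdict(int)
        for term in tokenize(query):
            q[term] += 1
        # Stage 1: keyword recall via the inverted index.
        candidates = set()
        for term in q:
            candidates |= self.inverted.get(term, set())
        # Stage 2: rerank candidates by cosine similarity.
        def cos(a, b):
            dot = sum(a[t] * b.get(t, 0) for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0
        return sorted(candidates,
                      key=lambda d: cos(q, self.vectors[d]),
                      reverse=True)[:k]
```

A production system would replace the count vectors with learned embeddings, but the two-stage recall-then-rerank shape stays the same.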
The whole system runs on commodity cloud GPUs; no custom model training is required beyond prompt engineering.
Results & Findings
- Test‑Scenario Generation produced correct BDD scenarios for 87% of 30 real requirements, with the remaining cases needing minor manual tweaks.
- Document Retrieval achieved an average precision@5 of 0.78 for semantic search and a BLEU‑like score of 0.71 for QA answers compared to a human‑crafted baseline.
- End‑to‑end latency stayed under 5 seconds for most queries, demonstrating that a lightweight orchestration layer does not introduce prohibitive overhead.
- The star topology proved robust: adding or swapping a worker agent required only updating the supervisor’s routing table, not redesigning the whole pipeline.
Practical Implications
- Accelerated Test Development – Teams can auto‑populate test suites from requirement docs, freeing QA engineers to focus on edge‑case design rather than boilerplate.
- Unified Knowledge Hub – A single conversational interface can replace multiple tools (search engine, ticket‑tracker, changelog viewer), reducing context‑switching for developers.
- Plug‑and‑Play Extensibility – Because each worker is an isolated LLM service, organizations can swap in domain‑specific models (e.g., a security‑focused LLM for threat‑model QA) without rewriting the orchestration logic.
- Cost‑Effective Automation – The approach leverages existing LLM APIs; the main expense is inference time, which can be budgeted per query, making it attractive for small‑to‑mid‑size firms.
- Compliance & Auditing – The structured output (JSON, BDD) can be logged and version‑controlled, providing traceability from requirement to test case—a boon for regulated industries.
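The traceability point above suggests a simple, version-controllable record per generated artifact. The schema below is a hypothetical illustration of such a trace entry, not a format defined in the paper.

```python
import datetime
import hashlib
import json

def trace_record(req_id: str, req_text: str, test_file: str) -> dict:
    """Link a requirement to its generated test case for audit purposes."""
    return {
        "requirement_id": req_id,
        # Hashing the requirement text detects later edits to the source doc.
        "requirement_sha256": hashlib.sha256(req_text.encode()).hexdigest(),
        "test_case": test_file,
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = trace_record(
    "REQ-42",
    "The system shall reject login attempts after three failures",
    "tests/login_lockout.feature",
)
print(json.dumps(record, indent=2))
```

Committing such records alongside the generated tests gives auditors a requirement-to-test chain without any extra tooling.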
Limitations & Future Work
- Prompt Sensitivity – The quality of each worker’s output still hinges on well‑crafted prompts; a systematic prompt‑management strategy is needed for production stability.
- Scalability of the Supervisor – As the number of use‑cases grows, the supervisor may become a bottleneck; the authors suggest a hierarchical supervisor network or micro‑service decomposition.
- Domain Generalization – The prototypes were evaluated on a single software project; broader benchmarks across varied domains (embedded, AI‑driven systems) are required to confirm generality.
- Evaluation Depth – Human‑in‑the‑loop assessments were limited; future work will include larger user studies to measure productivity gains and error‑reduction rates.
Bottom line: By treating LLMs as modular agents rather than monolithic chatbots, the paper charts a practical path for developers to embed generative AI directly into everyday software‑engineering workflows. The demonstrated gains in test creation and document handling hint at a near‑future where “AI‑assistant” pipelines become a standard part of the dev‑toolchain.
Authors
- Marian Kica
- Lukas Radosky
- David Slivka
- Karin Kubinova
- Daniel Dovhun
- Tomas Uhercik
- Erik Bircak
- Ivan Polasek
Paper Information
- arXiv ID: 2602.04726v1
- Categories: cs.SE, cs.AI
- Published: February 4, 2026