[Paper] Supporting software engineering tasks with agentic AI: Demonstration on document retrieval and test scenario generation

Published: February 4, 2026 at 11:33 AM EST
4 min read
Source: arXiv


Overview

The paper presents two agentic AI prototypes that automate common software‑engineering chores: (1) generating test scenarios directly from detailed requirement texts, and (2) retrieving, answering, and summarizing engineering documents. By wiring together specialized Large Language Model (LLM) agents under a supervisory “hub,” the authors demonstrate how a modest amount of orchestration can turn raw natural‑language artifacts into actionable development assets.

Key Contributions

  • Star‑topology agent framework – a supervisor agent coordinates multiple worker agents, each dedicated to a sub‑task (e.g., parsing requirements, drafting test steps, or handling a specific document‑retrieval use case).
  • Automatic test‑scenario generation – from a single requirement description the system produces structured test cases ready for inclusion in test suites.
  • Multi‑purpose document‑retrieval assistant – a single LLM‑backed pipeline supports keyword search, question answering, change‑tracking, and large‑scale summarization over a project’s documentation corpus.
  • Real‑world demonstration – the prototypes are evaluated on a genuine software project, showing end‑to‑end feasibility without hand‑crafted prompts for each step.
  • Open research agenda – the authors outline how the agentic pattern can be extended to other SE tasks (e.g., code review, impact analysis) and discuss scalability considerations.
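The star topology above can be sketched as a supervisor holding a routing table from user intents to worker agents. The class and intent names below are illustrative stand-ins, not an API published in the paper; the point is that adding or swapping a worker touches only the routing table, as the authors report.

```python
# Minimal sketch of the star-topology framework: a supervisor routes each
# request to exactly one worker agent through a routing table.
# All names here are hypothetical; the paper does not publish an API.

class Supervisor:
    def __init__(self):
        # Adding or swapping a worker only updates this table.
        self.routing_table = {}

    def register(self, intent, worker):
        self.routing_table[intent] = worker

    def handle(self, intent, payload):
        worker = self.routing_table.get(intent)
        if worker is None:
            raise ValueError(f"no worker registered for intent: {intent}")
        return worker(payload)

supervisor = Supervisor()
# In the paper each worker is an LLM agent; lambdas stand in for them here.
supervisor.register("search", lambda q: f"search results for {q!r}")
supervisor.register("summarize", lambda doc: f"summary of {len(doc)} chars")

print(supervisor.handle("search", "login requirements"))
```

Because workers are looked up by intent rather than hard-wired, the supervisor never needs to know how a worker does its job, which is what makes the reported plug-and-play extensibility possible.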

Methodology

  1. Agent Design – Each worker agent is a fine‑tuned or prompt‑engineered LLM that knows how to perform one narrow function (e.g., “extract functional clauses,” “write Given/When/Then steps”).
  2. Supervisor Coordination – A central supervisor parses the high‑level request, decides which workers to invoke, and stitches their outputs together. Communication follows a simple JSON‑based contract, making the system language‑agnostic.
  3. Test‑Scenario Pipeline
    • Input: a natural‑language requirement (e.g., “The system shall reject login attempts after three failures”).
    • Steps:
      1. Requirement Parser extracts entities, constraints, and success/failure conditions.
      2. Scenario Builder creates BDD‑style test outlines.
      3. Validator checks for completeness and consistency.
    • Output: a ready‑to‑use test case file.
  4. Document‑Retrieval Pipeline
    • The document corpus is indexed (vector embeddings + traditional inverted index).
    • Depending on the user’s intent, the supervisor routes the request to:
      • Search Agent (keyword/semantic retrieval).
      • QA Agent (extractive answer generation).
      • Change‑Tracker Agent (diff detection across versions).
      • Summarizer Agent (condenses large spec sets).
    • Each agent may call auxiliary tools (e.g., a diff engine) before returning a natural‑language response.
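The test-scenario pipeline above can be illustrated as three chained stages. In the paper each stage is an LLM agent; the stub functions below (and the naive split on "shall") are toy stand-ins used only to show the data flow and the dictionary-shaped contract between stages.

```python
# Hypothetical three-stage pipeline (parser -> builder -> validator).
# Each function stands in for an LLM agent from the paper.

def parse_requirement(text):
    # Requirement Parser stand-in: split a "shall" requirement into
    # an actor and an expected behavior. A real agent would use an LLM.
    actor, _, behavior = text.partition(" shall ")
    return {"actor": actor.strip(), "behavior": behavior.rstrip(". ").strip()}

def build_scenario(parsed):
    # Scenario Builder stand-in: emit BDD-style Given/When/Then steps.
    return {
        "given": f"{parsed['actor']} is running",
        "when": "the triggering condition occurs",
        "then": f"{parsed['actor']} must {parsed['behavior']}",
    }

def validate_scenario(scenario):
    # Validator stand-in: completeness check only.
    missing = [k for k in ("given", "when", "then") if not scenario.get(k)]
    if missing:
        raise ValueError(f"incomplete scenario, missing: {missing}")
    return scenario

req = "The system shall reject login attempts after three failures."
case = validate_scenario(build_scenario(parse_requirement(req)))
for step, text in case.items():
    print(f"{step.capitalize()}: {text}")
```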

The whole system runs on commodity cloud GPUs; no custom model training is required beyond prompt engineering.
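The hybrid index behind the document-retrieval pipeline (vector embeddings plus a traditional inverted index) can be sketched as a two-stage search: keyword candidates first, then semantic ranking. The bag-of-words vectors below are a toy stand-in for learned embeddings, and the corpus is invented for illustration.

```python
# Sketch of a hybrid index: an inverted index narrows candidates by
# keyword, then a (toy) vector similarity ranks them. Real deployments
# would use learned embeddings instead of bag-of-words counts.

import math
from collections import Counter, defaultdict

docs = {
    "spec-auth": "login attempts are rejected after three failures",
    "spec-logging": "all login attempts are written to the audit log",
}

# Inverted index: token -> set of document ids containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        inverted[token].add(doc_id)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, k=2):
    q_vec = Counter(query.split())
    # Keyword stage: any document sharing at least one query token.
    candidates = set().union(*(inverted.get(t, set()) for t in q_vec))
    # Semantic stage: rank candidates by vector similarity.
    ranked = sorted(candidates,
                    key=lambda d: cosine(q_vec, Counter(docs[d].split())),
                    reverse=True)
    return ranked[:k]

print(search("rejected login failures"))
```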

Results & Findings

  • Test‑Scenario Generation produced correct BDD scenarios for 87% of 30 real requirements, with the remaining cases needing minor manual tweaks.
  • Document Retrieval achieved an average precision@5 of 0.78 for semantic search and a BLEU‑like score of 0.71 for QA answers compared to a human‑crafted baseline.
  • End‑to‑end latency stayed under 5 seconds for most queries, demonstrating that a lightweight orchestration layer does not introduce prohibitive overhead.
  • The star topology proved robust: adding or swapping a worker agent required only updating the supervisor’s routing table, not redesigning the whole pipeline.
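For readers unfamiliar with the precision@5 metric cited above: it is the fraction of the top five retrieved documents judged relevant, averaged over queries. The query data below is made-up illustration, not the paper's evaluation set.

```python
# precision@k: fraction of the top-k retrieved documents that are relevant,
# averaged over queries. The labels below are invented for illustration.

def precision_at_k(retrieved, relevant, k=5):
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

# Two hypothetical queries: (top-5 result list, set of relevant doc ids).
queries = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d2", "d3", "d4"}),  # 4/5 hit
    (["d6", "d7", "d8", "d9", "d0"], {"d6", "d7", "d8"}),        # 3/5 hit
]
scores = [precision_at_k(results, gold) for results, gold in queries]
print(sum(scores) / len(scores))  # mean precision@5 over the two queries
```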

Practical Implications

  • Accelerated Test Development – Teams can auto‑populate test suites from requirement docs, freeing QA engineers to focus on edge‑case design rather than boilerplate.
  • Unified Knowledge Hub – A single conversational interface can replace multiple tools (search engine, ticket‑tracker, changelog viewer), reducing context‑switching for developers.
  • Plug‑and‑Play Extensibility – Because each worker is an isolated LLM service, organizations can swap in domain‑specific models (e.g., a security‑focused LLM for threat‑model QA) without rewriting the orchestration logic.
  • Cost‑Effective Automation – The approach leverages existing LLM APIs; the main expense is inference time, which can be budgeted per query, making it attractive for small‑to‑mid‑size firms.
  • Compliance & Auditing – The structured output (JSON, BDD) can be logged and version‑controlled, providing traceability from requirement to test case—a boon for regulated industries.
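The traceability point above can be made concrete: each generated test case is serialized alongside the requirement it came from, yielding an artifact that can be committed to version control. The field names and the requirement identifier below are hypothetical, not a format defined in the paper.

```python
# Illustration of requirement-to-test traceability: a generated scenario
# serialized as JSON with its originating requirement id attached.
# Field names and "REQ-042" are hypothetical examples.

import json

test_case = {
    "requirement_id": "REQ-042",
    "scenario": {
        "given": "a user has failed to log in twice",
        "when": "the third login attempt fails",
        "then": "the system rejects further login attempts",
    },
    "generated_by": "scenario-builder-agent",
}

# sort_keys gives a stable serialization, which keeps diffs clean
# when the file is tracked in version control.
record = json.dumps(test_case, indent=2, sort_keys=True)
print(record)
```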

Limitations & Future Work

  • Prompt Sensitivity – The quality of each worker’s output still hinges on well‑crafted prompts; a systematic prompt‑management strategy is needed for production stability.
  • Scalability of the Supervisor – As the number of use‑cases grows, the supervisor may become a bottleneck; the authors suggest a hierarchical supervisor network or micro‑service decomposition.
  • Domain Generalization – The prototypes were evaluated on a single software project; broader benchmarks across varied domains (embedded, AI‑driven systems) are required to confirm generality.
  • Evaluation Depth – Human‑in‑the‑loop assessments were limited; future work will include larger user studies to measure productivity gains and error‑reduction rates.

Bottom line: By treating LLMs as modular agents rather than monolithic chatbots, the paper charts a practical path for developers to embed generative AI directly into everyday software‑engineering workflows. The demonstrated gains in test creation and document handling hint at a near‑future where “AI‑assistant” pipelines become a standard part of the dev‑toolchain.

Authors

  • Marian Kica
  • Lukas Radosky
  • David Slivka
  • Karin Kubinova
  • Daniel Dovhun
  • Tomas Uhercik
  • Erik Bircak
  • Ivan Polasek

Paper Information

  • arXiv ID: 2602.04726v1
  • Categories: cs.SE, cs.AI
  • Published: February 4, 2026