[Paper] LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Published: November 25, 2025 at 10:33 AM EST

Source: arXiv - 2511.20403v2

Overview

The paper presents AgoneTest, a framework that lets developers and researchers evaluate how well large language models (LLMs) can automatically generate Java unit tests. Instead of inventing a new test‑generation algorithm, the authors focus on providing a reproducible, end‑to‑end pipeline—including a curated dataset and advanced quality metrics—so the community can fairly compare different LLMs and prompting techniques under realistic development conditions.

Key Contributions

  • AgoneTest framework: a plug‑and‑play evaluation harness that runs LLM‑generated test code through compilation, execution, coverage analysis, mutation testing, and test‑smell detection.
  • Classes2Test dataset: a curated collection of Java classes paired with their human‑written JUnit test suites, serving as a common benchmark for generation experiments.
  • Comprehensive metrics suite: beyond line/branch coverage, the authors integrate mutation score (defect detection power) and test‑smell analysis to assess test quality rather than just quantity.
  • Empirical study: systematic comparison of several LLMs (e.g., GPT‑4, Claude, LLaMA‑2) and prompting strategies (plain description vs. few‑shot examples) on the dataset.
  • Open‑source release: all code, dataset, and evaluation scripts are publicly available, encouraging reproducibility and community extensions.

Methodology

  1. Dataset preparation – The authors extracted 1,200 Java classes from open‑source projects and paired each with its existing JUnit test class, forming the Classes2Test benchmark.
  2. Prompt design – For each target class, they crafted multiple prompts: a minimal description, a detailed specification, and a few‑shot version that includes a small example of a class‑test pair.
  3. Test generation – The selected LLMs receive the prompts and output candidate test files.
  4. Automated pipeline – AgoneTest compiles the generated tests, runs them against the original code, and collects the following (a minimal sketch of this step appears after the list):
    • Compilation success rate
    • Code coverage (line/branch)
    • Mutation score (using PIT) to gauge defect detection
    • Test‑smell detection (e.g., flaky tests, duplicated asserts) via SonarQube rules
  5. Comparison – Results are aggregated and compared against the human‑written baseline and across prompting strategies.
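
To make step 4 of the pipeline concrete, here is a minimal Java sketch of what the compile-and-run portion could look like, using the standard javax.tools compiler API and the JUnit Platform Launcher. This is an illustration under stated assumptions, not the authors' actual implementation: the class name GeneratedTestRunner is hypothetical, class loading of the compiled test is omitted, and coverage (JaCoCo), mutation testing (PIT), and smell detection would be layered on top as separate tool invocations.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;
import org.junit.platform.launcher.listeners.TestExecutionSummary;

import static org.junit.platform.engine.discovery.DiscoverySelectors.selectClass;
import static org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder.request;

/** Hypothetical sketch of a compile-and-run step; not the paper's actual code. */
public class GeneratedTestRunner {

    /** Compiles one LLM-generated test file; returns false on compilation failure. */
    static boolean compiles(String testFilePath, String classpath, String outputDir) {
        // Requires a JDK at runtime; getSystemJavaCompiler() returns null on a plain JRE.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        // A non-zero exit code means the generated test did not compile.
        int exitCode = compiler.run(null, null, null,
                "-cp", classpath, "-d", outputDir, testFilePath);
        return exitCode == 0;
    }

    /** Runs an already-loaded test class on the JUnit Platform and returns pass/fail counts. */
    static TestExecutionSummary run(Class<?> testClass) {
        LauncherDiscoveryRequest discoveryRequest = request()
                .selectors(selectClass(testClass))
                .build();
        Launcher launcher = LauncherFactory.create();
        SummaryGeneratingListener listener = new SummaryGeneratingListener();
        launcher.execute(discoveryRequest, listener);
        return listener.getSummary();
    }
}
```

In a full harness, these per-class outcomes would then be aggregated into the compilation-success, coverage, and mutation metrics reported in the Results section.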

Results & Findings

| Metric (on compilable tests)     | Human baseline | Best LLM (GPT‑4, few‑shot) |
|----------------------------------|----------------|----------------------------|
| Line coverage                    | 78 %           | 81 %                       |
| Branch coverage                  | 65 %           | 68 %                       |
| Mutation score                   | 52 %           | 55 %                       |
| Compilation success rate         | 100 %          | 71 %                       |
| Test‑smell density (per 100 LOC) | 3.2            | 3.8 (slightly higher)      |
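
For readers less familiar with mutation testing: the mutation score reported by PIT is the standard ratio of seeded faults (mutants) that the test suite detects, i.e.

```latex
\text{mutation score} \;=\; \frac{\text{killed mutants}}{\text{generated mutants}} \times 100\%
```
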
  • Coverage & defect detection: When the generated tests compile, they often surpass human tests in raw coverage and mutation score.
  • Prompt impact: Few‑shot prompts consistently outperform plain descriptions, boosting both compilation rates and quality metrics.
  • Compilation bottleneck: A sizable fraction of LLM‑generated tests fail to compile, highlighting the need for better syntax‑aware prompting or post‑processing.
  • Test‑smell trade‑off: LLM tests tend to contain more minor smells (e.g., duplicated assertions; an illustrative snippet follows this list), suggesting that raw generated tests still need refinement.
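
As an illustration of the kind of smell flagged in the last point, a duplicated assertion in a generated JUnit 5 test looks roughly like this (an invented example, not taken from the paper's data):

```java
import java.util.Stack;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class StackTest {

    @Test
    void pushIncreasesSize() {
        Stack<Integer> stack = new Stack<>();
        stack.push(1);
        // Duplicated assertion smell: the same check is repeated verbatim,
        // adding noise without increasing fault-detection power.
        assertEquals(1, stack.size());
        assertEquals(1, stack.size());
    }
}
```

Such repetition does not help the suite detect more faults; it only inflates the smell density reported in the table above.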

Practical Implications

  • Accelerated test scaffolding: Teams can use LLMs to draft initial test suites, especially for legacy code lacking coverage, then manually polish the failing or smelly parts.
  • Prompt engineering as a skill: The study shows that a few well‑chosen examples dramatically improve outcomes, encouraging developers to treat prompt design as part of the testing workflow (a sketch of such a few‑shot prompt follows this list).
  • Continuous integration (CI) hooks: AgoneTest can be integrated into CI pipelines to automatically generate regression tests when new classes are added, providing a safety net while developers write production code.
  • Benchmarking new models: Because the framework is model‑agnostic, organizations can plug in their own proprietary LLMs and obtain comparable metrics before committing to a commercial solution.
  • Education & onboarding: New hires can see instantly generated test examples for unfamiliar codebases, shortening the learning curve.
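
To illustrate the prompt-engineering point above, the sketch below assembles a few-shot prompt from one example class/test pair plus the target class. The template wording and the FewShotPromptBuilder class are illustrative assumptions; the paper's actual prompt templates may differ.

```java
/** Hypothetical few-shot prompt assembly (requires Java 15+ text blocks). */
public class FewShotPromptBuilder {

    static String buildPrompt(String exampleClass, String exampleTest, String targetClass) {
        // One worked class/test pair is shown before the target class,
        // mirroring the paper's "few-shot" prompting strategy.
        return """
               You are an expert Java developer. Write a JUnit 5 test class for the target class.

               Example class:
               %s

               Example JUnit test:
               %s

               Target class:
               %s

               Return only a compilable JUnit 5 test class.
               """.formatted(exampleClass, exampleTest, targetClass);
    }
}
```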

Limitations & Future Work

  • Compilation failures: Over a quarter of generated tests do not compile, limiting the practical utility of raw LLM output.
  • Dataset scope: Classes2Test focuses on medium‑sized, well‑documented open‑source projects; results may differ for highly domain‑specific or poorly documented code.
  • Metric granularity: Mutation testing captures defect detection but does not assess test readability or maintainability, which remain open challenges.
  • Future directions suggested by the authors include: integrating syntax‑aware post‑processors (e.g., static analysis fixers), expanding the benchmark to other languages and testing frameworks, and exploring reinforcement‑learning‑based prompting to reduce test smells automatically.

Authors

  • Andrea Lops
  • Fedelucio Narducci
  • Azzurra Ragone
  • Michelantonio Trizio
  • Claudio Bartolini

Paper Information

  • arXiv ID: 2511.20403v2
  • Categories: cs.SE, cs.AI
  • Published: November 25, 2025