[Paper] LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Published: November 25, 2025 at 10:33 AM EST

Source: arXiv - 2511.20403v2

Overview

The paper presents AgoneTest, a framework that lets developers and researchers evaluate how well large language models (LLMs) can automatically generate Java unit tests. Instead of inventing a new test‑generation algorithm, the authors focus on providing a reproducible, end‑to‑end pipeline—including a curated dataset and advanced quality metrics—so the community can fairly compare different LLMs and prompting techniques under realistic development conditions.

Key Contributions

  • AgoneTest framework: a plug‑and‑play evaluation harness that runs LLM‑generated test code through compilation, execution, coverage analysis, mutation testing, and test‑smell detection.
  • Classes2Test dataset: a curated collection of Java classes paired with their human‑written JUnit test suites, serving as a common benchmark for generation experiments.
  • Comprehensive metrics suite: beyond line/branch coverage, the authors integrate mutation score (defect detection power) and test‑smell analysis to assess test quality rather than just quantity.
  • Empirical study: systematic comparison of several LLMs (e.g., GPT‑4, Claude, LLaMA‑2) and prompting strategies (plain description vs. few‑shot examples) on the dataset.
  • Open‑source release: all code, dataset, and evaluation scripts are publicly available, encouraging reproducibility and community extensions.

Methodology

  1. Dataset preparation – The authors extracted 1,200 Java classes from open‑source projects and paired each with its existing JUnit test class, forming the Classes2Test benchmark.
  2. Prompt design – For each target class, they crafted multiple prompts: a minimal description, a detailed specification, and a few‑shot version that includes a small example of a class‑test pair.
  3. Test generation – The selected LLMs receive the prompts and output candidate test files.
  4. Automated pipeline – AgoneTest compiles the generated tests, runs them against the original code, and collects the following (a minimal sketch of this step appears after the list):
    • Compilation success rate
    • Code coverage (line/branch)
    • Mutation score (using PIT) to gauge defect detection
    • Test‑smell detection (e.g., flaky tests, duplicated asserts) via SonarQube rules
  5. Comparison – Results are aggregated and compared against the human‑written baseline and across prompting strategies.
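
To make step 4 of the pipeline concrete, here is a minimal Java sketch of what the compile-and-run portion could look like, using the standard javax.tools compiler API and the JUnit Platform Launcher. This is an illustration under stated assumptions, not the authors' actual implementation: the class name GeneratedTestRunner is hypothetical, class loading of the compiled test is omitted, and coverage (JaCoCo), mutation testing (PIT), and smell detection would be layered on top as separate tool invocations.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;
import org.junit.platform.launcher.listeners.TestExecutionSummary;

import static org.junit.platform.engine.discovery.DiscoverySelectors.selectClass;
import static org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder.request;

/** Hypothetical sketch of a compile-and-run step; not the paper's actual code. */
public class GeneratedTestRunner {

    /** Compiles one LLM-generated test file; returns false on compilation failure. */
    static boolean compiles(String testFilePath, String classpath, String outputDir) {
        // Requires a JDK at runtime; getSystemJavaCompiler() returns null on a plain JRE.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        // A non-zero exit code means the generated test did not compile.
        int exitCode = compiler.run(null, null, null,
                "-cp", classpath, "-d", outputDir, testFilePath);
        return exitCode == 0;
    }

    /** Runs an already-loaded test class on the JUnit Platform and returns pass/fail counts. */
    static TestExecutionSummary run(Class<?> testClass) {
        LauncherDiscoveryRequest discoveryRequest = request()
                .selectors(selectClass(testClass))
                .build();
        Launcher launcher = LauncherFactory.create();
        SummaryGeneratingListener listener = new SummaryGeneratingListener();
        launcher.execute(discoveryRequest, listener);
        return listener.getSummary();
    }
}
```

In a full harness, these per-class outcomes would then be aggregated into the compilation-success, coverage, and mutation metrics reported in the Results section.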

Results & Findings

| Metric (on compilable tests)     | Human baseline | Best LLM (GPT‑4, few‑shot) |
|----------------------------------|----------------|----------------------------|
| Line coverage                    | 78 %           | 81 %                       |
| Branch coverage                  | 65 %           | 68 %                       |
| Mutation score                   | 52 %           | 55 %                       |
| Compilation success rate         | 100 %          | 71 %                       |
| Test‑smell density (per 100 LOC) | 3.2            | 3.8 (slightly higher)      |
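
For readers less familiar with mutation testing: the mutation score reported by PIT is the standard ratio of seeded faults (mutants) that the test suite detects, i.e.

```latex
\text{mutation score} \;=\; \frac{\text{killed mutants}}{\text{generated mutants}} \times 100\%
```
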
  • Coverage & defect detection: When the generated tests compile, they often surpass human tests in raw coverage and mutation score.
  • Prompt impact: Few‑shot prompts consistently outperform plain descriptions, boosting both compilation rates and quality metrics.
  • Compilation bottleneck: A sizable fraction of LLM‑generated tests fail to compile, highlighting the need for better syntax‑aware prompting or post‑processing.
  • Test‑smell trade‑off: LLM tests tend to contain more minor smells (e.g., duplicated assertions; an illustrative snippet follows this list), suggesting that raw generated tests still need refinement.
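
As an illustration of the kind of smell flagged in the last point, a duplicated assertion in a generated JUnit 5 test looks roughly like this (an invented example, not taken from the paper's data):

```java
import java.util.Stack;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class StackTest {

    @Test
    void pushIncreasesSize() {
        Stack<Integer> stack = new Stack<>();
        stack.push(1);
        // Duplicated assertion smell: the same check is repeated verbatim,
        // adding noise without increasing fault-detection power.
        assertEquals(1, stack.size());
        assertEquals(1, stack.size());
    }
}
```

Such repetition does not help the suite detect more faults; it only inflates the smell density reported in the table above.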

Practical Implications

  • Accelerated test scaffolding: Teams can use LLMs to draft initial test suites, especially for legacy code lacking coverage, then manually polish the failing or smelly parts.
  • Prompt engineering as a skill: The study shows that a few well‑chosen examples dramatically improve outcomes, encouraging developers to treat prompt design as part of the testing workflow (a sketch of such a few‑shot prompt follows this list).
  • Continuous integration (CI) hooks: AgoneTest can be integrated into CI pipelines to automatically generate regression tests when new classes are added, providing a safety net while developers write production code.
  • Benchmarking new models: Because the framework is model‑agnostic, organizations can plug in their own proprietary LLMs and obtain comparable metrics before committing to a commercial solution.
  • Education & onboarding: New hires can see instantly generated test examples for unfamiliar codebases, shortening the learning curve.
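
To illustrate the prompt-engineering point above, the sketch below assembles a few-shot prompt from one example class/test pair plus the target class. The template wording and the FewShotPromptBuilder class are illustrative assumptions; the paper's actual prompt templates may differ.

```java
/** Hypothetical few-shot prompt assembly (requires Java 15+ text blocks). */
public class FewShotPromptBuilder {

    static String buildPrompt(String exampleClass, String exampleTest, String targetClass) {
        // One worked class/test pair is shown before the target class,
        // mirroring the paper's "few-shot" prompting strategy.
        return """
               You are an expert Java developer. Write a JUnit 5 test class for the target class.

               Example class:
               %s

               Example JUnit test:
               %s

               Target class:
               %s

               Return only a compilable JUnit 5 test class.
               """.formatted(exampleClass, exampleTest, targetClass);
    }
}
```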

Limitations & Future Work

  • Compilation failures: Over a quarter of generated tests do not compile, limiting the practical utility of raw LLM output.
  • Dataset scope: Classes2Test focuses on medium‑sized, well‑documented open‑source projects; results may differ for highly domain‑specific or poorly documented code.
  • Metric granularity: Mutation testing captures defect detection but does not assess test readability or maintainability, which remain open challenges.
  • Future directions suggested by the authors include: integrating syntax‑aware post‑processors (e.g., static analysis fixers), expanding the benchmark to other languages and testing frameworks, and exploring reinforcement‑learning‑based prompting to reduce test smells automatically.

Authors

  • Andrea Lops
  • Fedelucio Narducci
  • Azzurra Ragone
  • Michelantonio Trizio
  • Claudio Bartolini

Paper Information

  • arXiv ID: 2511.20403v2
  • Categories: cs.SE, cs.AI
  • Published: November 25, 2025