[Paper] Enhancing LLM-Based Test Generation by Eliminating Covered Code
Source: arXiv - 2602.21997v1
Overview
The paper introduces a new way to generate unit tests for large, real‑world methods using Large Language Models (LLMs). By repeatedly stripping out code that’s already been exercised, the authors keep the LLM’s prompt short and focused, which dramatically improves coverage on complex codebases.
Key Contributions
- Context‑aware retrieval: Combines static analysis with LLM prompts to pull in only the most relevant surrounding code and API information.
- Iterative test generation + code elimination: After each test batch, the covered statements are removed from the target slice, shrinking the problem size for the next iteration.
- Scalable pipeline: Works on multi‑thousand‑line methods where prior LLM‑based generators hit token‑limit or reasoning‑breakdown walls.
- Empirical superiority: Outperforms both state‑of‑the‑art LLM test generators (e.g., Codex‑based tools) and classic search‑based generators (e.g., EvoSuite) on several open‑source projects.
Methodology
1. Static‑analysis‑driven context extraction
- The tool parses the target method, builds a call‑graph slice, and selects the most influential variables, types, and dependent functions.
- This slice is fed to an LLM (e.g., GPT‑4) together with a “write unit tests” prompt, keeping the token count well under the model’s limit.
2. Iterative generation loop
- Generate: The LLM produces a set of test cases for the current slice.
- Execute & measure: The generated tests run against the original code; a coverage tool (e.g., JaCoCo, coverage.py) records which lines/branches were hit.
- Eliminate: All statements already covered are removed from the slice (or marked as “done”).
- Repeat: The reduced slice becomes the new prompt input. The loop stops when coverage plateaus or the slice is empty.
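The "Execute & measure" step can be reproduced with any line‑coverage tool. As a minimal, hedged illustration (using Python's stdlib `trace` module in place of JaCoCo or coverage.py, and a toy `target` function standing in for the method under test), the covered line numbers can be collected like this:

```python
import trace

def target(x):
    """Toy stand-in for the method under test."""
    if x > 0:
        return "positive"
    return "non-positive"

# count=True records per-line hit counts; trace=False suppresses line printing.
tracer = trace.Trace(count=True, trace=False)
tracer.runfunc(target, 1)          # one "generated test" input
counts = tracer.results().counts   # {(filename, lineno): hit count}

# Keep only lines belonging to the target function's source file.
fn = target.__code__.co_filename
covered = sorted(ln for f, ln in counts if f == fn)
print(covered)  # the `if` line and the first `return` line were hit
```

A real pipeline would map these line numbers back onto the slice and drop the matching statements before the next prompt is built.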
The authors also add a lightweight “fallback” path that, after a few iterations, re‑prompts the LLM specifically about the branches that remain uncovered, so the loop does not stall once the easy coverage gains are exhausted.
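The generate → execute → eliminate loop can be sketched in a few lines. Here `generate_tests` and `run_with_coverage` are hypothetical stand‑ins for the paper's LLM call and coverage harness, stubbed out so the control flow itself is runnable:

```python
def generate_tests(slice_lines):
    """Stub for the LLM call: returns a test batch for the current slice."""
    return [f"test targeting: {line}" for line in slice_lines[:2]]

def run_with_coverage(tests, all_lines):
    """Stub for test execution: reports which line indices the tests hit."""
    # Pretend each test covers exactly the statement it targets.
    return {all_lines.index(t.split(": ", 1)[1]) for t in tests}

def iterative_generation(target_lines, max_iters=10):
    remaining = list(target_lines)   # the shrinking slice
    covered = set()                  # indices into target_lines
    all_tests = []
    for _ in range(max_iters):
        if not remaining:                                # slice empty -> done
            break
        tests = generate_tests(remaining)                # 1. Generate
        newly = run_with_coverage(tests, target_lines)   # 2. Execute & measure
        if newly <= covered:                             # coverage plateaued
            break
        covered |= newly
        all_tests.extend(tests)
        # 3. Eliminate: drop already-covered statements from the slice
        remaining = [l for i, l in enumerate(target_lines) if i not in covered]
    return all_tests, covered

lines = ["stmt_a", "stmt_b", "stmt_c", "stmt_d"]
tests, covered = iterative_generation(lines)
print(len(covered), len(lines))  # all 4 statements covered in two iterations
```

In the actual system, the stubs would be replaced by a prompt to the LLM and a run of the generated tests under a coverage tool; the termination conditions (empty slice, coverage plateau) mirror those described above.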
Results & Findings
| Benchmark | Baseline (LLM‑only) | Baseline (Search‑based) | Proposed Method |
|---|---|---|---|
| Avg. line coverage (complex methods) | 58 % | 62 % | 78 % |
| Avg. branch coverage | 45 % | 51 % | 70 % |
| Tokens per prompt (avg.) | 3,200 | N/A | 1,100 |
| Generation time per method | 12 s | 45 s | 9 s |
- The elimination step cuts the prompt size by ~65 % after the first iteration, keeping the LLM’s reasoning sharp.
- Coverage gains are especially pronounced for methods with deep call chains or heavy use of external libraries.
- The approach remains compatible with any off‑the‑shelf LLM that supports code generation; the authors demonstrate it with both GPT‑4 and Claude.
Practical Implications
- Developer tooling: IDE plugins could embed this pipeline to auto‑generate high‑coverage unit tests for newly written or refactored methods, reducing manual test‑writing effort.
- CI/CD integration: Running the iterative generator as a nightly job can surface uncovered edge cases before they reach production, improving regression safety.
- Cost efficiency: By shrinking prompt size, the method lowers API usage fees for commercial LLM services—a tangible saving for large teams.
- Legacy code revitalization: Teams tasked with adding tests to old, monolithic codebases can now tackle methods that were previously “too big” for LLM‑assisted generation.
Limitations & Future Work
- Static analysis precision: The slice extraction relies on accurate call‑graph construction; dynamic language features (e.g., reflection) can cause missed dependencies.
- Test quality vs. coverage: The generated tests achieve high coverage but may lack meaningful assertions or readable naming; post‑processing or human review is still needed.
- Scalability to whole projects: The current evaluation focuses on individual complex methods; extending the loop to whole‑module or full‑application testing remains an open challenge.
- Model‑agnostic tuning: Future work could explore adaptive prompt engineering that automatically selects the optimal context size based on the LLM’s observed performance on a given codebase.
Bottom line: By iteratively pruning already‑covered code, the authors turn LLMs into a focused, high‑coverage test generator that scales to the kind of messy, inter‑dependent methods found in production software. For developers looking to boost test automation without sacrificing cost or speed, this technique offers a compelling new tool in the QA toolbox.
Authors
- WeiZhe Xu
- Mengyu Liu
- Fanxin Kong
Paper Information
- arXiv ID: 2602.21997v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: February 25, 2026