[Paper] Enhancing LLM-Based Test Generation by Eliminating Covered Code
Source: arXiv - 2602.21997v1
Overview
The paper introduces a new way to generate unit tests for large, real‑world methods using Large Language Models (LLMs). By repeatedly stripping out code that’s already been exercised, the authors keep the LLM’s prompt short and focused, which dramatically improves coverage on complex codebases.
Key Contributions
- Context‑aware retrieval: Combines static analysis with LLM prompts to pull in only the most relevant surrounding code and API information.
- Iterative test generation + code elimination: After each test batch, the covered statements are removed from the target slice, shrinking the problem size for the next iteration.
- Scalable pipeline: Works on multi‑thousand‑line methods where prior LLM‑based generators hit token‑limit or reasoning‑breakdown walls.
- Empirical superiority: Outperforms both state‑of‑the‑art LLM test generators (e.g., Codex‑based tools) and classic search‑based generators (e.g., EvoSuite) on several open‑source projects.
Methodology
1. Static‑analysis‑driven context extraction
- The tool parses the target method, builds a call‑graph slice, and selects the most influential variables, types, and dependent functions.
- This slice is fed to an LLM (e.g., GPT‑4) together with a “write unit tests” prompt, keeping the token count well under the model’s limit.
2. Iterative generation loop
- Generate: The LLM produces a set of test cases for the current slice.
- Execute & measure: The generated tests run against the original code; a coverage tool (e.g., JaCoCo, coverage.py) records which lines/branches were hit.
- Eliminate: All statements already covered are removed from the slice (or marked as “done”).
- Repeat: The reduced slice becomes the new prompt input. The loop stops when coverage plateaus or the slice is empty.
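The "Execute & measure" step can be reproduced with any line‑coverage tool. As a minimal, hedged illustration (using Python's stdlib `trace` module in place of JaCoCo or coverage.py, and a toy `target` function standing in for the method under test), the covered line numbers can be collected like this:

```python
import trace

def target(x):
    """Toy stand-in for the method under test."""
    if x > 0:
        return "positive"
    return "non-positive"

# count=True records per-line hit counts; trace=False suppresses line printing.
tracer = trace.Trace(count=True, trace=False)
tracer.runfunc(target, 1)          # one "generated test" input
counts = tracer.results().counts   # {(filename, lineno): hit count}

# Keep only lines belonging to the target function's source file.
fn = target.__code__.co_filename
covered = sorted(ln for f, ln in counts if f == fn)
print(covered)  # the `if` line and the first `return` line were hit
```

A real pipeline would map these line numbers back onto the slice and drop the matching statements before the next prompt is built.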
The authors also add a lightweight “fallback” path that, after a few iterations, re‑prompts the LLM specifically about the branches that remain uncovered, so the loop does not stall once the easy coverage gains are exhausted.
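The generate → execute → eliminate loop can be sketched in a few lines. Here `generate_tests` and `run_with_coverage` are hypothetical stand‑ins for the paper's LLM call and coverage harness, stubbed out so the control flow itself is runnable:

```python
def generate_tests(slice_lines):
    """Stub for the LLM call: returns a test batch for the current slice."""
    return [f"test targeting: {line}" for line in slice_lines[:2]]

def run_with_coverage(tests, all_lines):
    """Stub for test execution: reports which line indices the tests hit."""
    # Pretend each test covers exactly the statement it targets.
    return {all_lines.index(t.split(": ", 1)[1]) for t in tests}

def iterative_generation(target_lines, max_iters=10):
    remaining = list(target_lines)   # the shrinking slice
    covered = set()                  # indices into target_lines
    all_tests = []
    for _ in range(max_iters):
        if not remaining:                                # slice empty -> done
            break
        tests = generate_tests(remaining)                # 1. Generate
        newly = run_with_coverage(tests, target_lines)   # 2. Execute & measure
        if newly <= covered:                             # coverage plateaued
            break
        covered |= newly
        all_tests.extend(tests)
        # 3. Eliminate: drop already-covered statements from the slice
        remaining = [l for i, l in enumerate(target_lines) if i not in covered]
    return all_tests, covered

lines = ["stmt_a", "stmt_b", "stmt_c", "stmt_d"]
tests, covered = iterative_generation(lines)
print(len(covered), len(lines))  # all 4 statements covered in two iterations
```

In the actual system, the stubs would be replaced by a prompt to the LLM and a run of the generated tests under a coverage tool; the termination conditions (empty slice, coverage plateau) mirror those described above.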
Results & Findings
| Benchmark | Baseline (LLM‑only) | Baseline (Search‑based) | Proposed Method |
|---|---|---|---|
| Avg. line coverage (complex methods) | 58 % | 62 % | 78 % |
| Avg. branch coverage | 45 % | 51 % | 70 % |
| Tokens per prompt (avg.) | 3,200 | N/A | 1,100 |
| Generation time per method | 12 s | 45 s | 9 s |
- The elimination step cuts the prompt size by ~65 % after the first iteration, keeping the LLM’s reasoning sharp.
- Coverage gains are especially pronounced for methods with deep call chains or heavy use of external libraries.
- The approach remains compatible with any off‑the‑shelf LLM that supports code generation; the authors demonstrate it with both GPT‑4 and Claude.
Practical Implications
- Developer tooling: IDE plugins could embed this pipeline to auto‑generate high‑coverage unit tests for newly written or refactored methods, reducing manual test‑writing effort.
- CI/CD integration: Running the iterative generator as a nightly job can surface uncovered edge cases before they reach production, improving regression safety.
- Cost efficiency: By shrinking prompt size, the method lowers API usage fees for commercial LLM services—a tangible saving for large teams.
- Legacy code revitalization: Teams tasked with adding tests to old, monolithic codebases can now tackle methods that were previously “too big” for LLM‑assisted generation.
Limitations & Future Work
- Static analysis precision: The slice extraction relies on accurate call‑graph construction; dynamic language features (e.g., reflection) can cause missed dependencies.
- Test quality vs. coverage: The generated tests achieve high coverage but may lack meaningful assertions or readable naming; post‑processing or human review is still needed.
- Scalability to whole projects: The current evaluation focuses on individual complex methods; extending the loop to whole‑module or full‑application testing remains an open challenge.
- Model‑agnostic tuning: Future work could explore adaptive prompt engineering that automatically selects the optimal context size based on the LLM’s observed performance on a given codebase.
Bottom line: By iteratively pruning already‑covered code, the authors turn LLMs into a focused, high‑coverage test generator that scales to the kind of messy, inter‑dependent methods found in production software. For developers looking to boost test automation without sacrificing cost or speed, this technique offers a compelling new tool in the QA toolbox.
Authors
- WeiZhe Xu
- Mengyu Liu
- Fanxin Kong
Paper Information
- arXiv ID: 2602.21997v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: February 25, 2026