[Paper] Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation
Source: arXiv - 2601.10942v1
Overview
Developers constantly push new code via pull requests (PRs), but even projects with massive test suites often leave the lines changed in a PR untested—a “last‑mile” regression gap. The paper Change And Cover (ChaCo) proposes an LLM‑driven tool that automatically generates focused tests for exactly those newly‑added or modified lines, stitching the new tests seamlessly into the existing suite.
Key Contributions
- PR‑aware test augmentation – ChaCo measures patch coverage (coverage of the lines touched by a PR; written out as a formula just after this list) and generates tests only for the uncovered parts, keeping developers’ attention on the code they just wrote.
- Context‑rich prompt engineering – The authors devise two techniques to harvest relevant test artefacts (nearby test functions, fixtures, data generators) and feed them to the LLM, dramatically improving the relevance of generated tests.
- Style‑conscious integration – ChaCo adapts the generated test’s structure, naming, and import style to match the surrounding test files and produces a concise summary for code‑review.
- Empirical validation – On 145 PRs from SciPy, Qiskit, and Pandas, ChaCo raises patch coverage to 100 % for 30 % of PRs, at an average cost of $0.11 per PR. Human reviewers rate the added tests highly (≈4.5/5).
- Real‑world impact – 8 of the 12 tests submitted to the upstream projects have already been merged, and the tool uncovered two previously unknown bugs.
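For reference, the patch‑coverage metric behind the first contribution can be stated compactly; this is the standard definition rather than a formula quoted from the paper:

$$
\mathrm{PatchCov}(P) \;=\; \frac{\bigl|\{\,\ell \in \mathrm{Changed}(P) : \ell \text{ is executed by the test suite}\,\}\bigr|}{\bigl|\mathrm{Changed}(P)\bigr|}
$$

ChaCo targets exactly the changed lines that appear in the denominator but not in the numerator.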
Methodology
- Patch Coverage Analysis – When a PR lands, ChaCo computes which lines in the diff are not exercised by the existing test suite (see the first sketch after this list).
- Context Extraction – Two complementary sources of test context are harvested (approximated in the second sketch after this list):
  - Local test context: Scans the repository for test files that touch the same modules, extracting helper functions, fixtures, and data‑generation utilities.
  - Semantic similarity: Uses lightweight static analysis to find test code that shares identifiers or types with the changed code.
- Prompt Construction – The extracted context, the PR diff, and a short instruction (“write a unit test that covers the highlighted lines”) are combined into a prompt for a large language model (e.g., GPT‑4); a combined sketch of this step and the next follows the list.
- Test Generation & Post‑Processing – The LLM’s output is parsed, linted, and reformatted to follow the project’s style guidelines. A short markdown summary (what the test does, why it matters) is attached for the reviewer.
- CI Integration – The generated test file is added to the PR automatically; CI runs the full suite to verify that coverage improves and no regressions are introduced (see the final sketch below).
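A minimal sketch of the Patch Coverage Analysis step, assuming a `coverage.py` data file produced by the existing suite and a zero‑context diff (`git diff -U0`); the paper does not prescribe these particular tools, and the diff parser below is deliberately simplified:

```python
# Sketch: find changed lines that the existing test suite does not execute.
# Assumes `.coverage` was produced by `coverage run -m pytest` and that
# `diff_text` holds the output of `git diff -U0 <base>...<head>`.
import re
import coverage

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_lines(diff_text):
    """Map each modified file to the set of added/modified target line numbers."""
    changes, current_file = {}, None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif current_file and (m := HUNK_RE.match(line)):
            start, count = int(m.group(1)), int(m.group(2) or 1)
            changes.setdefault(current_file, set()).update(range(start, start + count))
    return changes

def uncovered_patch_lines(diff_text, coverage_file=".coverage"):
    """Changed lines per file that no existing test executes."""
    cov = coverage.Coverage(data_file=coverage_file)
    cov.load()
    data = cov.get_data()
    missing = {}
    for path, lines in changed_lines(diff_text).items():
        # NOTE: coverage records paths as seen at run time; a real tool would
        # normalize relative vs. absolute paths before this lookup.
        executed = set(data.lines(path) or [])
        gap = lines - executed
        if gap:
            missing[path] = sorted(gap)
    return missing
```

The returned mapping is exactly the “last mile”: every entry is a changed line that the current suite never runs.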
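The context‑extraction heuristics could be approximated with Python’s `ast` module, scanning test files for fixtures and helpers whose identifiers overlap with names used in the changed code. The fixture check and the overlap score below are illustrative assumptions, not the paper’s exact heuristics:

```python
# Sketch: harvest candidate test context (fixtures, helpers) for changed code.
import ast
from pathlib import Path

def identifiers(source):
    """All bare names appearing in a piece of Python source."""
    return {node.id for node in ast.walk(ast.parse(source)) if isinstance(node, ast.Name)}

def is_fixture(func):
    """True if a function is decorated with pytest.fixture (plain or called)."""
    for dec in func.decorator_list:
        target = dec.func if isinstance(dec, ast.Call) else dec
        if isinstance(target, ast.Attribute) and target.attr == "fixture":
            return True
        if isinstance(target, ast.Name) and target.id == "fixture":
            return True
    return False

def harvest_context(repo_root, changed_source, top_k=5):
    """Rank functions in test files by identifier overlap with the changed code.

    `changed_source` is the source text of the changed functions, not the raw diff.
    """
    changed_names = identifiers(changed_source)
    candidates = []
    for test_file in Path(repo_root).rglob("test_*.py"):
        source = test_file.read_text(encoding="utf-8")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                snippet = ast.get_source_segment(source, node) or ""
                score = len(changed_names & identifiers(snippet))
                if score or is_fixture(node):
                    candidates.append((score, snippet))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [snippet for _, snippet in candidates[:top_k]]
```

Fixtures are always kept as candidates because the generated test usually has to reuse them to integrate cleanly with the surrounding suite.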
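Prompt construction, generation, and a first post‑processing pass might then look roughly like this; the prompt wording, the model name, and the use of the `openai` client are assumptions rather than details fixed by the paper:

```python
# Sketch: build a prompt from the diff plus harvested context, query an LLM,
# and keep the reply only if it is at least syntactically valid Python.
import ast
import re
from openai import OpenAI

def build_prompt(diff_text, uncovered, context_snippets):
    context = "\n\n".join(context_snippets)
    targets = ", ".join(f"{path}:{nums}" for path, nums in uncovered.items())
    return (
        "You are adding a regression test to an existing pytest suite.\n"
        f"Existing test helpers and fixtures:\n{context}\n\n"
        f"Pull-request diff:\n{diff_text}\n\n"
        f"Write a unit test that covers these currently uncovered lines: {targets}.\n"
        "Reuse the fixtures above and match the surrounding test style."
    )

def generate_test(prompt, model="gpt-4o"):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    text = reply.choices[0].message.content
    # Take the first fenced code block if the model used one, else the raw text.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    code = match.group(1) if match else text
    ast.parse(code)  # fail fast on syntax errors instead of committing junk
    return code
```

A real pipeline would additionally run the project’s linter/formatter over the returned code and generate the short reviewer‑facing summary described above.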
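The CI gate itself can be a small script that runs after the generated test file has been committed to the PR branch; the commands below are one plausible wiring, not the paper’s exact pipeline:

```python
# Sketch: CI gate that re-runs the suite (now including the generated test)
# and rejects the augmentation if anything regresses. Commands are illustrative.
import subprocess
import sys

def ci_gate():
    # 1. Run the full suite under coverage; a non-zero exit code means the
    #    generated test (or an existing one) fails.
    suite = subprocess.run(["coverage", "run", "-m", "pytest", "-q"])
    if suite.returncode != 0:
        sys.exit("suite failed with the generated test included -- rejecting it")
    # 2. Surface the coverage numbers for reviewers; a stricter gate would
    #    recompute patch coverage (first sketch) and require an improvement.
    subprocess.run(["coverage", "report", "-m"], check=True)
    print("suite is green with the generated test; see coverage report above")

if __name__ == "__main__":
    ci_gate()
```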
Results & Findings
| Metric | Value |
|---|---|
| PRs achieving full patch coverage | 30 % (44 / 145) |
| Average cost per PR (LLM API usage) | $0.11 |
| Human‑reviewer rating (usefulness) | 4.53 / 5 |
| Human‑reviewer rating (integration) | 4.2 / 5 |
| Human‑reviewer rating (relevance to PR) | 4.7 / 5 |
| Tests merged upstream | 8 / 12 |
| New bugs discovered | 2 |
Ablation studies show that including test context doubles patch coverage compared to a naïve “diff‑only” prompt. Without context, the LLM often produces generic tests or tests that fail to run at all.
Practical Implications
- CI‑first safety net – Teams can plug ChaCo into their continuous‑integration pipelines to automatically close the last‑mile testing gap before a PR is merged, reducing the chance of regressions slipping through.
- Developer productivity – Instead of manually hunting for missing tests, developers receive ready‑to‑review test files that match the project’s coding style, cutting down review friction.
- Cost‑effective quality assurance – At roughly a dime per PR, the approach is cheaper than hiring additional QA engineers or running heavyweight symbolic execution tools.
- Bug discovery – The tool’s focus on newly‑changed code surfaces edge‑case failures that existing tests miss, as demonstrated by the two novel bugs found in the evaluation.
- Language‑agnostic potential – While evaluated on Python scientific libraries, the same workflow (patch coverage → context extraction → LLM prompt) can be adapted to other ecosystems (JavaScript, Java, Rust) with appropriate test‑context parsers.
Limitations & Future Work
- LLM reliability – Generated tests sometimes contain flaky assertions or rely on external resources; a more robust post‑generation validation step is needed.
- Context extraction heuristics – Current static‑analysis heuristics work well for Python but may miss nuanced fixtures in other languages or frameworks.
- Scalability to massive PRs – Extremely large diffs can overwhelm the prompt length limits of current LLM APIs; chunking strategies are an open problem.
- Security considerations – Auto‑generated test code runs in the CI environment; safeguards against malicious payloads (e.g., network calls) must be enforced.
- User control – Future versions could let developers specify coverage targets, test style preferences, or exclude certain modules from augmentation.
By addressing these challenges, ChaCo could become a staple of modern CI pipelines, turning the “last mile” of regression testing from a manual chore into an automated, low‑cost safety net.
Authors
- Zitong Zhou
- Matteo Paltenghi
- Miryung Kim
- Michael Pradel
Paper Information
- arXiv ID: 2601.10942v1
- Categories: cs.SE
- Published: January 16, 2026