[Paper] Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation
Source: arXiv - 2601.10942v1
Overview
Developers constantly push new code via pull requests (PRs), but even projects with massive test suites often leave the lines changed in a PR untested—a “last‑mile” regression gap. The paper Change And Cover (ChaCo) proposes an LLM‑driven tool that automatically generates focused tests for exactly those newly‑added or modified lines, stitching the new tests seamlessly into the existing suite.
Key Contributions
- PR‑aware test augmentation – ChaCo measures patch coverage (coverage of the lines touched by a PR; written out as a formula just after this list) and generates tests only for the uncovered parts, keeping developers’ attention on the code they just wrote.
- Context‑rich prompt engineering – The authors devise two techniques to harvest relevant test artefacts (nearby test functions, fixtures, data generators) and feed them to the LLM, dramatically improving the relevance of generated tests.
- Style‑conscious integration – ChaCo adapts the generated test’s structure, naming, and import style to match the surrounding test files and produces a concise summary for code‑review.
- Empirical validation – On 145 PRs from SciPy, Qiskit, and Pandas, ChaCo raises patch coverage to 100 % for 30 % of PRs, at an average cost of $0.11 per PR. Human reviewers rate the added tests highly (≈4.5/5).
- Real‑world impact – 8 of the 12 tests submitted to the upstream projects have already been merged, and the tool uncovered two previously unknown bugs.
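For reference, the patch‑coverage metric behind the first contribution can be stated compactly; this is the standard definition rather than a formula quoted from the paper:

$$
\mathrm{PatchCov}(P) \;=\; \frac{\bigl|\{\,\ell \in \mathrm{Changed}(P) : \ell \text{ is executed by the test suite}\,\}\bigr|}{\bigl|\mathrm{Changed}(P)\bigr|}
$$

ChaCo targets exactly the changed lines that appear in the denominator but not in the numerator.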
Methodology
- Patch Coverage Analysis – When a PR lands, ChaCo computes which lines in the diff are not exercised by the existing test suite (see the first sketch after this list).
- Context Extraction – Two complementary sources of test context are harvested (approximated in the second sketch after this list):
  - Local test context: Scans the repository for test files that touch the same modules, extracting helper functions, fixtures, and data‑generation utilities.
  - Semantic similarity: Uses lightweight static analysis to find test code that shares identifiers or types with the changed code.
- Prompt Construction – The extracted context, the PR diff, and a short instruction (“write a unit test that covers the highlighted lines”) are combined into a prompt for a large language model (e.g., GPT‑4); a combined sketch of this step and the next follows the list.
- Test Generation & Post‑Processing – The LLM’s output is parsed, linted, and reformatted to follow the project’s style guidelines. A short markdown summary (what the test does, why it matters) is attached for the reviewer.
- CI Integration – The generated test file is added to the PR automatically; CI runs the full suite to verify that coverage improves and no regressions are introduced (see the final sketch below).
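A minimal sketch of the Patch Coverage Analysis step, assuming a `coverage.py` data file produced by the existing suite and a zero‑context diff (`git diff -U0`); the paper does not prescribe these particular tools, and the diff parser below is deliberately simplified:

```python
# Sketch: find changed lines that the existing test suite does not execute.
# Assumes `.coverage` was produced by `coverage run -m pytest` and that
# `diff_text` holds the output of `git diff -U0 <base>...<head>`.
import re
import coverage

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_lines(diff_text):
    """Map each modified file to the set of added/modified target line numbers."""
    changes, current_file = {}, None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif current_file and (m := HUNK_RE.match(line)):
            start, count = int(m.group(1)), int(m.group(2) or 1)
            changes.setdefault(current_file, set()).update(range(start, start + count))
    return changes

def uncovered_patch_lines(diff_text, coverage_file=".coverage"):
    """Changed lines per file that no existing test executes."""
    cov = coverage.Coverage(data_file=coverage_file)
    cov.load()
    data = cov.get_data()
    missing = {}
    for path, lines in changed_lines(diff_text).items():
        # NOTE: coverage records paths as seen at run time; a real tool would
        # normalize relative vs. absolute paths before this lookup.
        executed = set(data.lines(path) or [])
        gap = lines - executed
        if gap:
            missing[path] = sorted(gap)
    return missing
```

The returned mapping is exactly the “last mile”: every entry is a changed line that the current suite never runs.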
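The context‑extraction heuristics could be approximated with Python’s `ast` module, scanning test files for fixtures and helpers whose identifiers overlap with names used in the changed code. The fixture check and the overlap score below are illustrative assumptions, not the paper’s exact heuristics:

```python
# Sketch: harvest candidate test context (fixtures, helpers) for changed code.
import ast
from pathlib import Path

def identifiers(source):
    """All bare names appearing in a piece of Python source."""
    return {node.id for node in ast.walk(ast.parse(source)) if isinstance(node, ast.Name)}

def is_fixture(func):
    """True if a function is decorated with pytest.fixture (plain or called)."""
    for dec in func.decorator_list:
        target = dec.func if isinstance(dec, ast.Call) else dec
        if isinstance(target, ast.Attribute) and target.attr == "fixture":
            return True
        if isinstance(target, ast.Name) and target.id == "fixture":
            return True
    return False

def harvest_context(repo_root, changed_source, top_k=5):
    """Rank functions in test files by identifier overlap with the changed code.

    `changed_source` is the source text of the changed functions, not the raw diff.
    """
    changed_names = identifiers(changed_source)
    candidates = []
    for test_file in Path(repo_root).rglob("test_*.py"):
        source = test_file.read_text(encoding="utf-8")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                snippet = ast.get_source_segment(source, node) or ""
                score = len(changed_names & identifiers(snippet))
                if score or is_fixture(node):
                    candidates.append((score, snippet))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [snippet for _, snippet in candidates[:top_k]]
```

Fixtures are always kept as candidates because the generated test usually has to reuse them to integrate cleanly with the surrounding suite.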
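Prompt construction, generation, and a first post‑processing pass might then look roughly like this; the prompt wording, the model name, and the use of the `openai` client are assumptions rather than details fixed by the paper:

```python
# Sketch: build a prompt from the diff plus harvested context, query an LLM,
# and keep the reply only if it is at least syntactically valid Python.
import ast
import re
from openai import OpenAI

def build_prompt(diff_text, uncovered, context_snippets):
    context = "\n\n".join(context_snippets)
    targets = ", ".join(f"{path}:{nums}" for path, nums in uncovered.items())
    return (
        "You are adding a regression test to an existing pytest suite.\n"
        f"Existing test helpers and fixtures:\n{context}\n\n"
        f"Pull-request diff:\n{diff_text}\n\n"
        f"Write a unit test that covers these currently uncovered lines: {targets}.\n"
        "Reuse the fixtures above and match the surrounding test style."
    )

def generate_test(prompt, model="gpt-4o"):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    text = reply.choices[0].message.content
    # Take the first fenced code block if the model used one, else the raw text.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    code = match.group(1) if match else text
    ast.parse(code)  # fail fast on syntax errors instead of committing junk
    return code
```

A real pipeline would additionally run the project’s linter/formatter over the returned code and generate the short reviewer‑facing summary described above.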
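The CI gate itself can be a small script that runs after the generated test file has been committed to the PR branch; the commands below are one plausible wiring, not the paper’s exact pipeline:

```python
# Sketch: CI gate that re-runs the suite (now including the generated test)
# and rejects the augmentation if anything regresses. Commands are illustrative.
import subprocess
import sys

def ci_gate():
    # 1. Run the full suite under coverage; a non-zero exit code means the
    #    generated test (or an existing one) fails.
    suite = subprocess.run(["coverage", "run", "-m", "pytest", "-q"])
    if suite.returncode != 0:
        sys.exit("suite failed with the generated test included -- rejecting it")
    # 2. Surface the coverage numbers for reviewers; a stricter gate would
    #    recompute patch coverage (first sketch) and require an improvement.
    subprocess.run(["coverage", "report", "-m"], check=True)
    print("suite is green with the generated test; see coverage report above")

if __name__ == "__main__":
    ci_gate()
```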
Results & Findings
| Metric | Value |
|---|---|
| PRs achieving full patch coverage | 30 % (44 / 145) |
| Average cost per PR (LLM API usage) | $0.11 |
| Human‑reviewer rating (usefulness) | 4.53 / 5 |
| Human‑reviewer rating (integration) | 4.2 / 5 |
| Human‑reviewer rating (relevance to PR) | 4.7 / 5 |
| Tests merged upstream | 8 / 12 |
| New bugs discovered | 2 |
Ablation studies show that including test context doubles patch coverage compared to a naïve “diff‑only” prompt. Without context, the LLM often produces generic tests or tests that fail to run at all.
Practical Implications
- CI‑first safety net – Teams can plug ChaCo into their continuous‑integration pipelines to automatically close the last‑mile testing gap before a PR is merged, reducing the chance of regressions slipping through.
- Developer productivity – Instead of manually hunting for missing tests, developers receive ready‑to‑review test files that match the project’s coding style, cutting down review friction.
- Cost‑effective quality assurance – At roughly a dime per PR, the approach is cheaper than hiring additional QA engineers or running heavyweight symbolic execution tools.
- Bug discovery – The tool’s focus on newly‑changed code surfaces edge‑case failures that existing tests miss, as demonstrated by the two novel bugs found in the evaluation.
- Language‑agnostic potential – While evaluated on Python scientific libraries, the same workflow (patch coverage → context extraction → LLM prompt) can be adapted to other ecosystems (JavaScript, Java, Rust) with appropriate test‑context parsers.
Limitations & Future Work
- LLM reliability – Generated tests sometimes contain flaky assertions or rely on external resources; a more robust post‑generation validation step is needed.
- Context extraction heuristics – Current static‑analysis heuristics work well for Python but may miss nuanced fixtures in other languages or frameworks.
- Scalability to massive PRs – Extremely large diffs can overwhelm the prompt length limits of current LLM APIs; chunking strategies are an open problem.
- Security considerations – Auto‑generated test code runs in the CI environment; safeguards against malicious payloads (e.g., network calls) must be enforced.
- User control – Future versions could let developers specify coverage targets, test style preferences, or exclude certain modules from augmentation.
By addressing these challenges, ChaCo could become a staple of modern CI pipelines, turning the “last mile” of regression testing from a manual chore into an automated, low‑cost safety net.
Authors
- Zitong Zhou
- Matteo Paltenghi
- Miryung Kim
- Michael Pradel
Paper Information
- arXiv ID: 2601.10942v1
- Categories: cs.SE
- Published: January 16, 2026