[Paper] Causal Inference for the Effect of Code Coverage on Bug Introduction
Source: arXiv - 2602.03585v1
Overview
The paper investigates whether higher test‑code coverage actually prevents bugs in real‑world JavaScript and TypeScript projects, moving beyond the usual correlation studies. By applying modern causal‑inference techniques, the authors aim to tell developers how much coverage is “enough” and whether the benefit plateaus after a certain point.
Key Contributions
- Causal DAG for software engineering – a directed acyclic graph that maps out the hidden confounders (e.g., code churn, developer experience, CI frequency) linking coverage and bug introduction.
- Generalized propensity‑score (GPS) adjustment for continuous exposure – adapts a technique usually seen in epidemiology to the software‑testing domain, allowing a nuanced dose‑response analysis.
- Large‑scale, open‑source dataset – ≥ 10 k bug‑introducing and non‑bug‑introducing commits from mature JavaScript/TypeScript repositories, enriched with CI logs, review metadata, and static‑analysis metrics.
- Doubly robust estimation – combines outcome regression with GPS weighting, providing protection against model misspecification.
- Empirical dose‑response curve – quantifies the average treatment effect (ATE) of each additional percentage point of coverage and uncovers non‑linear patterns (thresholds, diminishing returns).
Methodology
- Causal Graph Construction – The authors first list every observable factor that could simultaneously affect test coverage and bug likelihood (e.g., file size, recent refactorings, number of reviewers). These relationships are encoded in a DAG to identify a minimal set of confounders that must be controlled for.
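The confounder-identification step can be sketched with a toy DAG. The node names below are illustrative stand-ins (drawn from the confounders the summary mentions), not the paper's exact graph, and the helper implements only the simple case of a flat graph where confounders point directly into both treatment and outcome:

```python
# Toy causal DAG for the coverage -> bug question, as an adjacency dict
# mapping each variable to its children. Node names are illustrative.
EDGES = {
    "code_churn":           ["coverage", "bug_introduced"],
    "developer_experience": ["coverage", "bug_introduced"],
    "ci_frequency":         ["coverage", "bug_introduced"],
    "file_size":            ["coverage", "bug_introduced"],
    "coverage":             ["bug_introduced"],
}

def backdoor_confounders(edges, treatment, outcome):
    """Return variables with a direct edge into both treatment and outcome.
    For this flat DAG that set suffices for backdoor adjustment; a richer
    graph would need a proper graphical-identification algorithm."""
    return sorted(
        v for v, children in edges.items()
        if treatment in children and outcome in children
    )

print(backdoor_confounders(EDGES, "coverage", "bug_introduced"))
```

A full analysis would derive the minimal adjustment set from the complete DAG with a causal-inference library rather than this hand-rolled check.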
- Data Collection – For each commit they extract:
- Coverage metrics (line, branch, statement) from CI reports,
- Bug‑introducing label (determined via the SZZ algorithm on issue trackers),
- Project‑level covariates (developer count, release cadence, code churn, review depth, CI frequency).
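The per-commit record described above might be represented as follows; the field names are illustrative, since the paper's exact schema is not given in this summary:

```python
from dataclasses import dataclass

@dataclass
class CommitRecord:
    # Coverage metrics from CI reports (fractions in [0, 1])
    line_coverage: float
    branch_coverage: float
    statement_coverage: float
    # Outcome label derived via the SZZ algorithm
    bug_introducing: bool
    # Project-level covariates used as confounders
    developer_count: int
    code_churn: int        # lines added + deleted in a recent window
    review_depth: int      # e.g., number of review comments
    ci_frequency: float    # CI runs per day

# Example record for one commit
rec = CommitRecord(0.82, 0.74, 0.80, False, 12, 340, 5, 6.5)
print(rec.line_coverage)
```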
- Generalized Propensity Score (GPS) – Because coverage is a continuous “treatment”, the GPS models the conditional probability density of observing a particular coverage level given the confounders.
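A minimal sketch of GPS estimation for a continuous treatment, in the Hirano–Imbens style (model the treatment given confounders as Gaussian, then evaluate the fitted conditional density at each observed value). The data here are synthetic stand-ins, and the Gaussian model is an assumption; the paper's exact GPS specification may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 commits, 3 standardized confounders.
n = 500
X = rng.normal(size=(n, 3))
coverage = 0.6 + X @ np.array([0.05, -0.03, 0.02]) + rng.normal(0, 0.1, n)

# Model coverage | X as Gaussian via least squares.
Xd = np.column_stack([np.ones(n), X])            # add intercept
beta, *_ = np.linalg.lstsq(Xd, coverage, rcond=None)
resid = coverage - Xd @ beta
sigma2 = resid.var()

# GPS = fitted conditional density at each unit's observed coverage level.
gps = np.exp(-resid**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(gps.shape)  # one conditional-density value per commit
```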
- Doubly Robust Estimation – Two models are fitted: (a) an outcome regression predicting bug introduction from coverage and covariates, and (b) a GPS‑weighted regression. The final estimate averages both, yielding a consistent ATE estimate even if one of the two models is misspecified.
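The two-model combination can be sketched end to end on synthetic data. This is a deliberately simplified illustration (linear-probability outcome model, Gaussian GPS, stabilized inverse-density weights, and a plain average of the two slope estimates as the summary describes), not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-in data; the true coverage effect is built in as negative.
X = rng.normal(size=(n, 2))                               # confounders
coverage = 0.6 + 0.05 * X[:, 0] + rng.normal(0, 0.15, n)  # continuous treatment
logit = 1.0 - 4.0 * coverage + 0.5 * X[:, 0]
bug = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Model (a): outcome regression of bug on coverage + confounders.
Z = np.column_stack([np.ones(n), coverage, X])
beta_out, *_ = np.linalg.lstsq(Z, bug, rcond=None)
slope_outcome = beta_out[1]

# Model (b): GPS-weighted regression of bug on coverage alone,
# with weights stabilized by the marginal coverage density.
Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, coverage, rcond=None)
r = coverage - Xd @ b
gps = np.exp(-r**2 / (2 * r.var())) / np.sqrt(2 * np.pi * r.var())
m = coverage - coverage.mean()
marg = np.exp(-m**2 / (2 * m.var())) / np.sqrt(2 * np.pi * m.var())
w = np.sqrt(marg / gps)                                   # sqrt for weighted OLS
bw, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n), coverage]) * w[:, None], bug * w, rcond=None
)
slope_weighted = bw[1]

# Combine the two estimates by averaging, as the summary describes.
print(0.5 * (slope_outcome + slope_weighted))  # negative: more coverage, fewer bugs
```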
- Dose‑Response Analysis – The continuous exposure allows the authors to plot bug‑introduction probability against coverage levels, testing for non‑linear effects (e.g., a steep drop up to 70 % coverage, then a plateau).
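An empirical dose-response curve can be sketched by binning commits by coverage and computing the bug rate per bin. The data below are synthetic, with the steep-decline-then-plateau shape the paper reports baked in for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic coverage levels and a non-linear bug-risk curve:
# risk falls steeply up to ~70 % coverage, then flattens.
coverage = rng.uniform(0.0, 1.0, n)
p_bug = np.where(coverage < 0.7, 0.30 - 0.30 * coverage, 0.09)
bug = rng.binomial(1, p_bug)

# Empirical dose-response: mean bug rate per coverage decile.
bins = np.linspace(0, 1, 11)
idx = np.digitize(coverage, bins[1:-1])   # decile index 0..9 per commit
curve = np.array([bug[idx == k].mean() for k in range(10)])
for lo, rate in zip(bins[:-1], curve):
    print(f"coverage {lo:.0%}-{lo + 0.1:.0%}: bug rate {rate:.2f}")
```

In the real analysis the curve would be estimated from GPS-adjusted data rather than raw bin means, so that confounding does not distort its shape.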
Results & Findings
- Positive causal effect – Each additional 10 % of test coverage reduces the probability of a commit introducing a bug by roughly 3–5 % on average across the dataset.
- Diminishing returns – The dose‑response curve shows a steep decline in bug risk up to about 75 % coverage; beyond that, the marginal benefit flattens, indicating a practical “sweet spot.”
- Heterogeneity across projects – High‑velocity projects with frequent CI runs exhibit a stronger effect (up to 7 % reduction per 10 % coverage) compared with slower, less‑reviewed projects.
- Robustness checks – Sensitivity analyses (e.g., varying the SZZ window, alternative GPS specifications) confirm that the observed effect is not driven by a single confounder.
Practical Implications
- Coverage targets become data‑driven – Teams can aim for the 70‑80 % range to capture most of the bug‑prevention benefit without over‑investing in marginal coverage.
- Prioritize CI integration – Projects that run tests on every pull request see a larger causal impact, suggesting that when coverage is measured matters as much as how much.
- Resource allocation – Instead of chasing 100 % coverage, developers can redirect effort toward high‑risk areas (large churn files, low‑reviewed modules) where the causal gain is higher.
- Tooling enhancements – CI dashboards could surface the estimated “bug‑risk reduction” per coverage point, helping managers make informed trade‑offs.
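A dashboard along these lines could translate a proposed coverage change into an estimated risk reduction using the paper's headline numbers (roughly 3–5 % per 10 coverage points, with a plateau near 75 %). The function below is a hypothetical back-of-envelope helper, not something the paper provides:

```python
def estimated_risk_reduction(cov_before: float, cov_after: float,
                             effect_per_10pct: float = 0.04) -> float:
    """Linear back-of-envelope estimate of the reduction in
    bug-introduction probability for a coverage change, using the
    midpoint (~4 %) of the paper's 3-5 % effect per 10 coverage points.
    Only a rough guide below the ~75 % plateau the paper reports."""
    delta = cov_after - cov_before
    return (delta / 0.10) * effect_per_10pct

# Raising coverage from 55 % to 70 %:
print(f"{estimated_risk_reduction(0.55, 0.70):.1%}")  # prints 6.0%
```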
Limitations & Future Work
- External validity – The study focuses on JavaScript/TypeScript open‑source projects; results may differ for compiled languages or proprietary codebases.
- Reliance on SZZ – Bug‑introducing labels derived from SZZ can be noisy, potentially biasing the outcome variable.
- Static covariates – Some confounders (e.g., developer expertise) are approximated by proxies; richer data could improve adjustment.
- Future directions – Extending the causal framework to other quality metrics (e.g., mutation testing, static analysis warnings) and exploring interactive effects between coverage and code review practices.
Authors
- Lukas Schulte
- Gordon Fraser
- Steffen Herbold
Paper Information
- arXiv ID: 2602.03585v1
- Categories: cs.SE
- Published: February 3, 2026