[Paper] Causal Inference for the Effect of Code Coverage on Bug Introduction
Source: arXiv - 2602.03585v1
Overview
The paper investigates whether higher test‑code coverage actually prevents bugs in real‑world JavaScript and TypeScript projects, moving beyond the usual correlation studies. By applying modern causal‑inference techniques, the authors aim to tell developers how much coverage is “enough” and whether the benefit plateaus after a certain point.
Key Contributions
- Causal DAG for software engineering – a directed acyclic graph that maps out the hidden confounders (e.g., code churn, developer experience, CI frequency) linking coverage and bug introduction.
- Generalized propensity‑score (GPS) adjustment for continuous exposure – adapts a technique usually seen in epidemiology to the software‑testing domain, allowing a nuanced dose‑response analysis.
- Large‑scale, open‑source dataset – ≥ 10 k bug‑introducing and non‑bug‑introducing commits from mature JavaScript/TypeScript repositories, enriched with CI logs, review metadata, and static‑analysis metrics.
- Doubly robust estimation – combines outcome regression with GPS weighting, providing protection against model misspecification.
- Empirical dose‑response curve – quantifies the average treatment effect (ATE) of each additional percentage point of coverage and uncovers non‑linear patterns (thresholds, diminishing returns).
Methodology
- Causal Graph Construction – The authors first list every observable factor that could simultaneously affect test coverage and bug likelihood (e.g., file size, recent refactorings, number of reviewers). These relationships are encoded in a DAG to identify a minimal set of confounders that must be controlled for.
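The confounder-identification step can be sketched with a toy DAG. The node names below are illustrative stand-ins (drawn from the confounders the summary mentions), not the paper's exact graph, and the helper implements only the simple case of a flat graph where confounders point directly into both treatment and outcome:

```python
# Toy causal DAG for the coverage -> bug question, as an adjacency dict
# mapping each variable to its children. Node names are illustrative.
EDGES = {
    "code_churn":           ["coverage", "bug_introduced"],
    "developer_experience": ["coverage", "bug_introduced"],
    "ci_frequency":         ["coverage", "bug_introduced"],
    "file_size":            ["coverage", "bug_introduced"],
    "coverage":             ["bug_introduced"],
}

def backdoor_confounders(edges, treatment, outcome):
    """Return variables with a direct edge into both treatment and outcome.
    For this flat DAG that set suffices for backdoor adjustment; a richer
    graph would need a proper graphical-identification algorithm."""
    return sorted(
        v for v, children in edges.items()
        if treatment in children and outcome in children
    )

print(backdoor_confounders(EDGES, "coverage", "bug_introduced"))
```

A full analysis would derive the minimal adjustment set from the complete DAG with a causal-inference library rather than this hand-rolled check.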
- Data Collection – For each commit they extract:
- Coverage metrics (line, branch, statement) from CI reports,
- Bug‑introducing label (determined via the SZZ algorithm on issue trackers),
- Project‑level covariates (developer count, release cadence, code churn, review depth, CI frequency).
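The per-commit record described above might be represented as follows; the field names are illustrative, since the paper's exact schema is not given in this summary:

```python
from dataclasses import dataclass

@dataclass
class CommitRecord:
    # Coverage metrics from CI reports (fractions in [0, 1])
    line_coverage: float
    branch_coverage: float
    statement_coverage: float
    # Outcome label derived via the SZZ algorithm
    bug_introducing: bool
    # Project-level covariates used as confounders
    developer_count: int
    code_churn: int        # lines added + deleted in a recent window
    review_depth: int      # e.g., number of review comments
    ci_frequency: float    # CI runs per day

# Example record for one commit
rec = CommitRecord(0.82, 0.74, 0.80, False, 12, 340, 5, 6.5)
print(rec.line_coverage)
```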
- Generalized Propensity Score (GPS) – Because coverage is a continuous “treatment”, the GPS models the conditional probability density of observing a particular coverage level given the confounders.
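A minimal sketch of GPS estimation for a continuous treatment, in the Hirano–Imbens style (model the treatment given confounders as Gaussian, then evaluate the fitted conditional density at each observed value). The data here are synthetic stand-ins, and the Gaussian model is an assumption; the paper's exact GPS specification may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 commits, 3 standardized confounders.
n = 500
X = rng.normal(size=(n, 3))
coverage = 0.6 + X @ np.array([0.05, -0.03, 0.02]) + rng.normal(0, 0.1, n)

# Model coverage | X as Gaussian via least squares.
Xd = np.column_stack([np.ones(n), X])            # add intercept
beta, *_ = np.linalg.lstsq(Xd, coverage, rcond=None)
resid = coverage - Xd @ beta
sigma2 = resid.var()

# GPS = fitted conditional density at each unit's observed coverage level.
gps = np.exp(-resid**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(gps.shape)  # one conditional-density value per commit
```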
- Doubly Robust Estimation – Two models are fitted: (a) an outcome regression predicting bug introduction from coverage and covariates, and (b) a GPS‑weighted regression. The final estimate averages both, yielding a consistent ATE estimate even if one of the two models is misspecified.
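The two-model combination can be sketched end to end on synthetic data. This is a deliberately simplified illustration (linear-probability outcome model, Gaussian GPS, stabilized inverse-density weights, and a plain average of the two slope estimates as the summary describes), not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-in data; the true coverage effect is built in as negative.
X = rng.normal(size=(n, 2))                               # confounders
coverage = 0.6 + 0.05 * X[:, 0] + rng.normal(0, 0.15, n)  # continuous treatment
logit = 1.0 - 4.0 * coverage + 0.5 * X[:, 0]
bug = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Model (a): outcome regression of bug on coverage + confounders.
Z = np.column_stack([np.ones(n), coverage, X])
beta_out, *_ = np.linalg.lstsq(Z, bug, rcond=None)
slope_outcome = beta_out[1]

# Model (b): GPS-weighted regression of bug on coverage alone,
# with weights stabilized by the marginal coverage density.
Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, coverage, rcond=None)
r = coverage - Xd @ b
gps = np.exp(-r**2 / (2 * r.var())) / np.sqrt(2 * np.pi * r.var())
m = coverage - coverage.mean()
marg = np.exp(-m**2 / (2 * m.var())) / np.sqrt(2 * np.pi * m.var())
w = np.sqrt(marg / gps)                                   # sqrt for weighted OLS
bw, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n), coverage]) * w[:, None], bug * w, rcond=None
)
slope_weighted = bw[1]

# Combine the two estimates by averaging, as the summary describes.
print(0.5 * (slope_outcome + slope_weighted))  # negative: more coverage, fewer bugs
```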
- Dose‑Response Analysis – The continuous exposure allows the authors to plot bug‑introduction probability against coverage levels, testing for non‑linear effects (e.g., a steep drop up to 70 % coverage, then a plateau).
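An empirical dose-response curve can be sketched by binning commits by coverage and computing the bug rate per bin. The data below are synthetic, with the steep-decline-then-plateau shape the paper reports baked in for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic coverage levels and a non-linear bug-risk curve:
# risk falls steeply up to ~70 % coverage, then flattens.
coverage = rng.uniform(0.0, 1.0, n)
p_bug = np.where(coverage < 0.7, 0.30 - 0.30 * coverage, 0.09)
bug = rng.binomial(1, p_bug)

# Empirical dose-response: mean bug rate per coverage decile.
bins = np.linspace(0, 1, 11)
idx = np.digitize(coverage, bins[1:-1])   # decile index 0..9 per commit
curve = np.array([bug[idx == k].mean() for k in range(10)])
for lo, rate in zip(bins[:-1], curve):
    print(f"coverage {lo:.0%}-{lo + 0.1:.0%}: bug rate {rate:.2f}")
```

In the real analysis the curve would be estimated from GPS-adjusted data rather than raw bin means, so that confounding does not distort its shape.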
Results & Findings
- Positive causal effect – Each additional 10 % of test coverage reduces the probability of a commit introducing a bug by roughly 3–5 % on average across the dataset.
- Diminishing returns – The dose‑response curve shows a steep decline in bug risk up to about 75 % coverage; beyond that, the marginal benefit flattens, indicating a practical “sweet spot.”
- Heterogeneity across projects – High‑velocity projects with frequent CI runs exhibit a stronger effect (up to 7 % reduction per 10 % coverage) compared with slower, less‑reviewed projects.
- Robustness checks – Sensitivity analyses (e.g., varying the SZZ window, alternative GPS specifications) confirm that the observed effect is not driven by a single confounder.
Practical Implications
- Coverage targets become data‑driven – Teams can aim for the 70‑80 % range to capture most of the bug‑prevention benefit without over‑investing in marginal coverage.
- Prioritize CI integration – Projects that run tests on every pull request see a larger causal impact, suggesting that when coverage is measured matters as much as how much.
- Resource allocation – Instead of chasing 100 % coverage, developers can redirect effort toward high‑risk areas (large churn files, low‑reviewed modules) where the causal gain is higher.
- Tooling enhancements – CI dashboards could surface the estimated “bug‑risk reduction” per coverage point, helping managers make informed trade‑offs.
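A dashboard along these lines could translate a proposed coverage change into an estimated risk reduction using the paper's headline numbers (roughly 3–5 % per 10 coverage points, with a plateau near 75 %). The function below is a hypothetical back-of-envelope helper, not something the paper provides:

```python
def estimated_risk_reduction(cov_before: float, cov_after: float,
                             effect_per_10pct: float = 0.04) -> float:
    """Linear back-of-envelope estimate of the reduction in
    bug-introduction probability for a coverage change, using the
    midpoint (~4 %) of the paper's 3-5 % effect per 10 coverage points.
    Only a rough guide below the ~75 % plateau the paper reports."""
    delta = cov_after - cov_before
    return (delta / 0.10) * effect_per_10pct

# Raising coverage from 55 % to 70 %:
print(f"{estimated_risk_reduction(0.55, 0.70):.1%}")  # prints 6.0%
```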
Limitations & Future Work
- External validity – The study focuses on JavaScript/TypeScript open‑source projects; results may differ for compiled languages or proprietary codebases.
- Reliance on SZZ – Bug‑introducing labels derived from SZZ can be noisy, potentially biasing the outcome variable.
- Static covariates – Some confounders (e.g., developer expertise) are approximated by proxies; richer data could improve adjustment.
- Future directions – Extending the causal framework to other quality metrics (e.g., mutation testing, static analysis warnings) and exploring interactive effects between coverage and code review practices.
Authors
- Lukas Schulte
- Gordon Fraser
- Steffen Herbold
Paper Information
- arXiv ID: 2602.03585v1
- Categories: cs.SE
- Published: February 3, 2026