[Paper] When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
Source: arXiv - 2512.17460v1
Overview
Software defect prediction (SDP) models promise to flag buggy code before it ships, but their success hinges on the quality of the data they are trained on. This paper presents the first large‑scale, joint investigation of five data‑quality problems that commonly appear together in SDP datasets, analyzing how they interact and affect model performance across hundreds of real‑world projects.
Key Contributions
- Comprehensive empirical scope: Analyzed 374 SDP datasets spanning a wide range of domains and evaluated five popular classifiers on them.
- Multi‑issue focus: Simultaneously examined class imbalance, class overlap, irrelevant features, attribute noise, and outliers—the first study to treat these problems as co‑occurring rather than isolated.
- Explainable Boosting Machines (EBM) + interaction analysis: Leveraged an interpretable ML model to quantify both direct effects of each issue and conditional effects when issues appear together.
- Threshold (“tipping‑point”) discovery: Identified stable quality‑metric cut‑offs (e.g., overlap ≈ 0.20, imbalance ≈ 0.65–0.70, irrelevance ≈ 0.94) beyond which most classifiers start to degrade.
- Counter‑intuitive insights: Showed that outliers can improve prediction accuracy when the dataset contains few irrelevant features, highlighting the need for context‑aware preprocessing.
- Performance‑robustness trade‑off: Demonstrated that no single learner dominates across all quality‑issue combinations, encouraging adaptive model selection.
Methodology
- Dataset collection – Gathered 374 publicly available defect‑prediction datasets (e.g., from PROMISE, NASA, and GitHub) that already contain the typical software metrics (lines of code, cyclomatic complexity, etc.).
- Quality‑issue quantification – For each dataset, computed five numeric indicators (a sketch of these computations follows the list):
- Imbalance: proportion of defective vs. clean modules.
- Overlap: measured by the Hellinger distance between class‑conditional feature distributions.
- Irrelevance: fraction of features with near‑zero mutual information to the target.
- Attribute noise: estimated via label‑flipping simulations.
- Outliers: proportion of instances flagged by the Local Outlier Factor.
- Model training – Trained five widely used classifiers (Random Forest, Logistic Regression, SVM, Naïve Bayes, and Gradient Boosting) using default hyper‑parameters to reflect typical practitioner usage (see the training sketch after this list).
- Explainable Boosting Machines – Used EBMs as a meta‑model to predict each classifier’s performance (e.g., AUC) from the five quality metrics, enabling transparent attribution of effects (see the EBM sketch after this list).
- Stratified interaction analysis – Partitioned the data by one quality metric (e.g., overlap) and examined how the impact of another metric (e.g., outliers) changed within each stratum, revealing conditional relationships (illustrated after this list).
- Statistical validation – Applied bootstrapping and non‑parametric tests to ensure that observed patterns are robust across the heterogeneous dataset collection.
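To make the quantification step concrete, the sketch below approximates four of the five indicators (the label‑flipping noise estimate is omitted for brevity), assuming a NumPy feature matrix `X`, binary labels `y` with 1 = defective, and scikit‑learn. The bin count, the 0.01 mutual‑information cutoff, and the majority‑class definition of imbalance are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

def imbalance(y):
    """Majority-class proportion: 0.5 = balanced, close to 1.0 = highly imbalanced."""
    p = float(np.mean(y))
    return max(p, 1.0 - p)

def overlap(X, y, bins=20):
    """Average per-feature class overlap, taken as 1 - Hellinger distance
    between the class-conditional histograms (an assumed proxy)."""
    sims = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        if hi <= lo:  # constant feature: no separation signal
            continue
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(X[y == 1, j], bins=edges)
        q, _ = np.histogram(X[y == 0, j], bins=edges)
        p = p / max(p.sum(), 1)
        q = q / max(q.sum(), 1)
        hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
        sims.append(1.0 - hellinger)
    return float(np.mean(sims))

def irrelevance(X, y, mi_cutoff=0.01):
    """Fraction of features with near-zero mutual information to the defect label."""
    mi = mutual_info_classif(X, y, random_state=0)
    return float(np.mean(mi < mi_cutoff))

def outlier_rate(X, n_neighbors=20):
    """Proportion of instances flagged as outliers by the Local Outlier Factor."""
    flags = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)
    return float(np.mean(flags == -1))
```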
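The default‑hyper‑parameter training setup translates directly to scikit‑learn; the single stratified hold‑out split below is an assumption, as the paper's exact evaluation protocol is not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Defaults throughout; random_state is fixed only for reproducibility and
# max_iter is raised only to avoid convergence warnings.
CLASSIFIERS = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
    "NaiveBayes": GaussianNB(),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

def evaluate_classifiers(X, y):
    """Train each classifier with default settings and report hold-out AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    aucs = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_tr, y_tr)
        aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return aucs
```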
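For the meta‑modeling step, a sketch using the `interpret` package's `ExplainableBoostingRegressor` is shown below. The layout of the `meta` DataFrame (one row per dataset, the five metric columns, and an observed‑AUC column) and the `interactions=10` setting are assumptions about how such a meta‑model could be set up, not the authors' configuration.

```python
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

QUALITY_METRICS = ["imbalance", "overlap", "irrelevance", "noise", "outliers"]

def fit_ebm_meta_model(meta: pd.DataFrame, target: str = "auc"):
    """Fit an EBM predicting a classifier's AUC from the five quality metrics.

    `meta` is assumed to hold one row per dataset with the QUALITY_METRICS
    columns plus an observed-performance column named by `target`.
    """
    ebm = ExplainableBoostingRegressor(interactions=10, random_state=0)
    ebm.fit(meta[QUALITY_METRICS], meta[target])
    # The global explanation exposes per-metric shape functions plus the
    # learned pairwise interaction terms used for effect attribution.
    return ebm, ebm.explain_global()
```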
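The stratified interaction analysis can be approximated by bucketing datasets on one quality metric and measuring, inside each bucket, how another metric relates to performance. The Spearman correlation below is a simple stand‑in for the paper's conditional‑effect estimates; the column names reuse the assumed `meta` layout above.

```python
import pandas as pd
from scipy.stats import spearmanr

def conditional_effect(meta: pd.DataFrame, stratify_by="overlap",
                       effect_of="outliers", target="auc", n_strata=3):
    """Per-stratum association between one quality metric and performance."""
    # Quantile-based strata of the conditioning metric (e.g., low/mid/high overlap).
    strata = pd.Series(
        pd.qcut(meta[stratify_by], q=n_strata, labels=False, duplicates="drop"),
        index=meta.index,
    )
    results = {}
    for s in sorted(strata.dropna().unique()):
        sub = meta[strata == s]
        rho, p_value = spearmanr(sub[effect_of], sub[target])
        results[int(s)] = {"rho": rho, "p": p_value, "n": len(sub)}
    return results
```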
Results & Findings
| Quality Issue | Prevalence (issue present to any degree) | Direct impact on performance | Notable interaction |
|---|---|---|---|
| Class imbalance | ~98 % of datasets | Degrades AUC once imbalance > 0.65–0.70 | Amplifies harm of overlap |
| Class overlap | ~95 % | Most consistently harmful; sharp drop after overlap ≈ 0.20 | Outliers can offset overlap when irrelevance is low |
| Irrelevant features | ~99 % | Performance collapses near irrelevance ≈ 0.94 | Interaction with noise is modest |
| Attribute noise | ~93 % (always co‑occurs) | Small direct effect, but worsens with high overlap | – |
| Outliers | ~90 % | Generally neutral, but improves performance when irrelevant‑feature rate < 0.30 | Positive synergy with low irrelevance |
- Co‑occurrence is the norm: Even the rarest issue (attribute noise) appears alongside at least one other problem in > 93 % of datasets.
- Tipping‑point thresholds are remarkably stable across classifiers, suggesting they could serve as quick sanity checks before model training.
- No universal winner: Random Forest and Gradient Boosting perform best on “clean” datasets, while Logistic Regression and Naïve Bayes are more resilient under extreme imbalance.
Practical Implications
- Pre‑training health checks – Developers can compute the quality metrics and compare them against the three key thresholds (overlap 0.20, imbalance 0.65, irrelevance 0.94) as lightweight diagnostics; crossing a threshold signals that the dataset needs remediation before SDP modeling (a health‑check sketch follows this list).
- Targeted preprocessing (illustrative snippets follow this list) –
- If overlap is high, consider feature engineering or instance‑level resampling (e.g., SMOTE‑ENN).
- When imbalance dominates, apply cost‑sensitive learning or balanced bagging rather than generic oversampling.
- For irrelevant features, aggressive feature selection (mutual information, L1 regularization) yields immediate gains.
- Adaptive model selection – The study’s interaction map can be encoded into a simple rule‑based selector, e.g., “If imbalance > 0.70 and overlap > 0.20 → use Logistic Regression with class weighting” (a toy selector is sketched after this list).
- Tooling integration – The authors released an open‑source Python package that computes the five quality metrics and visualizes interaction heatmaps; teams can plug this into CI pipelines to flag data‑quality regressions.
- Outlier handling nuance – Blind removal of outliers may hurt performance in low‑irrelevance scenarios; instead, evaluate the outlier‑impact curve before pruning.
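The pre‑training health check can be as small as the sketch below, which hard‑codes the tipping points reported in the paper; the function itself and the metric names are illustrative.

```python
# Tipping points reported in the paper; crossing one is a remediation signal.
TIPPING_POINTS = {"overlap": 0.20, "imbalance": 0.65, "irrelevance": 0.94}

def health_check(metrics):
    """Return a warning for every quality metric that crosses its tipping point."""
    return [
        f"{name} = {metrics[name]:.2f} exceeds tipping point {cutoff:.2f}"
        for name, cutoff in TIPPING_POINTS.items()
        if metrics.get(name, 0.0) > cutoff
    ]

# Example: health_check({"overlap": 0.31, "imbalance": 0.58, "irrelevance": 0.97})
# flags overlap and irrelevance but not imbalance.
```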
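For targeted preprocessing, the sketch below pairs each dominant issue with one remediation from the list above. The choice of `imbalanced-learn`'s SMOTEENN, class‑weighted logistic regression, and a mutual‑information filter, as well as the `k=10` feature budget, are tooling assumptions rather than the paper's prescriptions.

```python
from imblearn.combine import SMOTEENN
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

def remediate(X, y, dominant_issue):
    """Return ((X, y) after remediation, optional issue-aware estimator)."""
    if dominant_issue == "overlap":
        # Instance-level resampling that also removes borderline points.
        X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
        return (X_res, y_res), None
    if dominant_issue == "imbalance":
        # Cost-sensitive learning instead of generic oversampling.
        return (X, y), LogisticRegression(class_weight="balanced", max_iter=1000)
    if dominant_issue == "irrelevance":
        # Aggressive filter-style feature selection via mutual information.
        selector = SelectKBest(mutual_info_classif, k=min(10, X.shape[1]))
        return (selector.fit_transform(X, y), y), None
    return (X, y), None
```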
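Finally, a toy version of the rule‑based selector, encoding only the single example rule quoted above; a production selector would encode the study's full interaction map.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def select_model(metrics):
    """Pick a learner from the dataset's quality profile (one illustrative rule)."""
    if metrics["imbalance"] > 0.70 and metrics["overlap"] > 0.20:
        # Heavy imbalance plus overlap: prefer a class-weighted linear model.
        return LogisticRegression(class_weight="balanced", max_iter=1000)
    # Otherwise default to an ensemble that performs best on cleaner data.
    return RandomForestClassifier(random_state=0)
```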
Overall, the paper equips practitioners with a data‑aware checklist that moves defect‑prediction from “train‑and‑hope” to “measure‑and‑adjust”.
Limitations & Future Work
- Default hyper‑parameters only – While reflecting common practice, the study does not explore how extensive hyper‑parameter tuning might mitigate some quality issues.
- Static datasets – All datasets are snapshot versions; the dynamics of evolving code bases (concept drift) are not addressed.
- Metric scope – Only five quality dimensions were examined; other factors like duplicate instances, measurement error, or temporal leakage remain unexplored.
- Generalizability beyond SDP – The findings are specific to defect‑prediction metrics; applying the same framework to other SE prediction tasks (effort estimation, security vulnerability detection) is an open avenue.
Future research directions include (1) integrating automated remediation pipelines that act on the identified thresholds, (2) extending the interaction analysis to deep‑learning‑based SDP models, and (3) studying how co‑occurring quality issues evolve over the software lifecycle.
Authors
- Emmanuel Charleson Dapaah
- Jens Grabowski
Paper Information
- arXiv ID: 2512.17460v1
- Categories: cs.SE, cs.LG
- Published: December 19, 2025