[Paper] When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
Source: arXiv - 2512.17460v1
Overview
Software defect prediction (SDP) models promise to flag buggy code before it ships, but their success hinges on the quality of the data they are trained on. This paper presents the first large‑scale, joint investigation of five data‑quality problems that commonly appear together in SDP datasets, analyzing how they interact and affect model performance across hundreds of real‑world projects.
Key Contributions
- Comprehensive empirical scope: Analyzed 374 SDP datasets spanning a wide range of domains and evaluated five popular classifiers on them.
- Multi‑issue focus: Simultaneously examined class imbalance, class overlap, irrelevant features, attribute noise, and outliers—the first study to treat these problems as co‑occurring rather than isolated.
- Explainable Boosting Machines (EBM) + interaction analysis: Leveraged an interpretable ML model to quantify both direct effects of each issue and conditional effects when issues appear together.
- Threshold (“tipping‑point”) discovery: Identified stable quality‑metric cut‑offs (e.g., overlap ≈ 0.20, imbalance ≈ 0.65–0.70, irrelevance ≈ 0.94) beyond which most classifiers start to degrade.
- Counter‑intuitive insights: Showed that outliers can improve prediction accuracy when the dataset contains few irrelevant features, highlighting the need for context‑aware preprocessing.
- Performance‑robustness trade‑off: Demonstrated that no single learner dominates across all quality‑issue combinations, encouraging adaptive model selection.
Methodology
- Dataset collection – Gathered 374 publicly available defect‑prediction datasets (e.g., from PROMISE, NASA, and GitHub) that already contain the typical software metrics (lines of code, cyclomatic complexity, etc.).
- Quality‑issue quantification – For each dataset, computed five numeric indicators (a sketch of these computations follows the list):
- Imbalance: proportion of defective vs. clean modules.
- Overlap: measured by the Hellinger distance between class‑conditional feature distributions.
- Irrelevance: fraction of features with near‑zero mutual information to the target.
- Attribute noise: estimated via label‑flipping simulations.
- Outliers: proportion of instances flagged by the Local Outlier Factor.
- Model training – Trained five widely used classifiers (Random Forest, Logistic Regression, SVM, Naïve Bayes, and Gradient Boosting) using default hyper‑parameters to reflect typical practitioner usage (see the training sketch after this list).
- Explainable Boosting Machines – Used EBMs as a meta‑model to predict each classifier’s performance (e.g., AUC) from the five quality metrics, enabling transparent attribution of effects (see the EBM sketch after this list).
- Stratified interaction analysis – Partitioned the data by one quality metric (e.g., overlap) and examined how the impact of another metric (e.g., outliers) changed within each stratum, revealing conditional relationships (illustrated after this list).
- Statistical validation – Applied bootstrapping and non‑parametric tests to ensure that observed patterns are robust across the heterogeneous dataset collection.
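To make the quantification step concrete, the sketch below approximates four of the five indicators (the label‑flipping noise estimate is omitted for brevity), assuming a NumPy feature matrix `X`, binary labels `y` with 1 = defective, and scikit‑learn. The bin count, the 0.01 mutual‑information cutoff, and the majority‑class definition of imbalance are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

def imbalance(y):
    """Majority-class proportion: 0.5 = balanced, close to 1.0 = highly imbalanced."""
    p = float(np.mean(y))
    return max(p, 1.0 - p)

def overlap(X, y, bins=20):
    """Average per-feature class overlap, taken as 1 - Hellinger distance
    between the class-conditional histograms (an assumed proxy)."""
    sims = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        if hi <= lo:  # constant feature: no separation signal
            continue
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(X[y == 1, j], bins=edges)
        q, _ = np.histogram(X[y == 0, j], bins=edges)
        p = p / max(p.sum(), 1)
        q = q / max(q.sum(), 1)
        hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
        sims.append(1.0 - hellinger)
    return float(np.mean(sims))

def irrelevance(X, y, mi_cutoff=0.01):
    """Fraction of features with near-zero mutual information to the defect label."""
    mi = mutual_info_classif(X, y, random_state=0)
    return float(np.mean(mi < mi_cutoff))

def outlier_rate(X, n_neighbors=20):
    """Proportion of instances flagged as outliers by the Local Outlier Factor."""
    flags = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)
    return float(np.mean(flags == -1))
```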
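The default‑hyper‑parameter training setup translates directly to scikit‑learn; the single stratified hold‑out split below is an assumption, as the paper's exact evaluation protocol is not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Defaults throughout; random_state is fixed only for reproducibility and
# max_iter is raised only to avoid convergence warnings.
CLASSIFIERS = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
    "NaiveBayes": GaussianNB(),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

def evaluate_classifiers(X, y):
    """Train each classifier with default settings and report hold-out AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    aucs = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_tr, y_tr)
        aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return aucs
```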
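For the meta‑modeling step, a sketch using the `interpret` package's `ExplainableBoostingRegressor` is shown below. The layout of the `meta` DataFrame (one row per dataset, the five metric columns, and an observed‑AUC column) and the `interactions=10` setting are assumptions about how such a meta‑model could be set up, not the authors' configuration.

```python
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

QUALITY_METRICS = ["imbalance", "overlap", "irrelevance", "noise", "outliers"]

def fit_ebm_meta_model(meta: pd.DataFrame, target: str = "auc"):
    """Fit an EBM predicting a classifier's AUC from the five quality metrics.

    `meta` is assumed to hold one row per dataset with the QUALITY_METRICS
    columns plus an observed-performance column named by `target`.
    """
    ebm = ExplainableBoostingRegressor(interactions=10, random_state=0)
    ebm.fit(meta[QUALITY_METRICS], meta[target])
    # The global explanation exposes per-metric shape functions plus the
    # learned pairwise interaction terms used for effect attribution.
    return ebm, ebm.explain_global()
```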
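The stratified interaction analysis can be approximated by bucketing datasets on one quality metric and measuring, inside each bucket, how another metric relates to performance. The Spearman correlation below is a simple stand‑in for the paper's conditional‑effect estimates; the column names reuse the assumed `meta` layout above.

```python
import pandas as pd
from scipy.stats import spearmanr

def conditional_effect(meta: pd.DataFrame, stratify_by="overlap",
                       effect_of="outliers", target="auc", n_strata=3):
    """Per-stratum association between one quality metric and performance."""
    # Quantile-based strata of the conditioning metric (e.g., low/mid/high overlap).
    strata = pd.Series(
        pd.qcut(meta[stratify_by], q=n_strata, labels=False, duplicates="drop"),
        index=meta.index,
    )
    results = {}
    for s in sorted(strata.dropna().unique()):
        sub = meta[strata == s]
        rho, p_value = spearmanr(sub[effect_of], sub[target])
        results[int(s)] = {"rho": rho, "p": p_value, "n": len(sub)}
    return results
```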
Results & Findings
| Quality Issue | Prevalence (issue present to any degree) | Direct impact on performance | Notable interaction |
|---|---|---|---|
| Class imbalance | ~98 % of datasets | Degrades AUC once imbalance > 0.65–0.70 | Amplifies harm of overlap |
| Class overlap | ~95 % | Most consistently harmful; sharp drop after overlap ≈ 0.20 | Outliers can offset overlap when irrelevance is low |
| Irrelevant features | ~99 % | Performance collapses near irrelevance ≈ 0.94 | Interaction with noise is modest |
| Attribute noise | ~93 % (always co‑occurs) | Small direct effect, but worsens with high overlap | – |
| Outliers | ~90 % | Generally neutral, but improves performance when irrelevant‑feature rate < 0.30 | Positive synergy with low irrelevance |
- Co‑occurrence is the norm: Even the rarest issue (attribute noise) appears alongside at least one other problem in > 93 % of datasets.
- Tipping‑point thresholds are remarkably stable across classifiers, suggesting they could serve as quick sanity checks before model training.
- No universal winner: Random Forest and Gradient Boosting perform best on “clean” datasets, while Logistic Regression and Naïve Bayes are more resilient under extreme imbalance.
Practical Implications
- Pre‑training health checks – Developers can compute the quality metrics and compare them against the three key thresholds (overlap 0.20, imbalance 0.65, irrelevance 0.94) as lightweight diagnostics; crossing a threshold signals that the dataset needs remediation before SDP modeling (a health‑check sketch follows this list).
- Targeted preprocessing (illustrative snippets follow this list) –
- If overlap is high, consider feature engineering or instance‑level resampling (e.g., SMOTE‑ENN).
- When imbalance dominates, apply cost‑sensitive learning or balanced bagging rather than generic oversampling.
- For irrelevant features, aggressive feature selection (mutual information, L1 regularization) yields immediate gains.
- Adaptive model selection – The study’s interaction map can be encoded into a simple rule‑based selector, e.g., “If imbalance > 0.70 and overlap > 0.20 → use Logistic Regression with class weighting” (a toy selector is sketched after this list).
- Tooling integration – The authors released an open‑source Python package that computes the five quality metrics and visualizes interaction heatmaps; teams can plug this into CI pipelines to flag data‑quality regressions.
- Outlier handling nuance – Blind removal of outliers may hurt performance in low‑irrelevance scenarios; instead, evaluate the outlier‑impact curve before pruning.
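The pre‑training health check can be as small as the sketch below, which hard‑codes the tipping points reported in the paper; the function itself and the metric names are illustrative.

```python
# Tipping points reported in the paper; crossing one is a remediation signal.
TIPPING_POINTS = {"overlap": 0.20, "imbalance": 0.65, "irrelevance": 0.94}

def health_check(metrics):
    """Return a warning for every quality metric that crosses its tipping point."""
    return [
        f"{name} = {metrics[name]:.2f} exceeds tipping point {cutoff:.2f}"
        for name, cutoff in TIPPING_POINTS.items()
        if metrics.get(name, 0.0) > cutoff
    ]

# Example: health_check({"overlap": 0.31, "imbalance": 0.58, "irrelevance": 0.97})
# flags overlap and irrelevance but not imbalance.
```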
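For targeted preprocessing, the sketch below pairs each dominant issue with one remediation from the list above. The choice of `imbalanced-learn`'s SMOTEENN, class‑weighted logistic regression, and a mutual‑information filter, as well as the `k=10` feature budget, are tooling assumptions rather than the paper's prescriptions.

```python
from imblearn.combine import SMOTEENN
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

def remediate(X, y, dominant_issue):
    """Return ((X, y) after remediation, optional issue-aware estimator)."""
    if dominant_issue == "overlap":
        # Instance-level resampling that also removes borderline points.
        X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
        return (X_res, y_res), None
    if dominant_issue == "imbalance":
        # Cost-sensitive learning instead of generic oversampling.
        return (X, y), LogisticRegression(class_weight="balanced", max_iter=1000)
    if dominant_issue == "irrelevance":
        # Aggressive filter-style feature selection via mutual information.
        selector = SelectKBest(mutual_info_classif, k=min(10, X.shape[1]))
        return (selector.fit_transform(X, y), y), None
    return (X, y), None
```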
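Finally, a toy version of the rule‑based selector, encoding only the single example rule quoted above; a production selector would encode the study's full interaction map.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def select_model(metrics):
    """Pick a learner from the dataset's quality profile (one illustrative rule)."""
    if metrics["imbalance"] > 0.70 and metrics["overlap"] > 0.20:
        # Heavy imbalance plus overlap: prefer a class-weighted linear model.
        return LogisticRegression(class_weight="balanced", max_iter=1000)
    # Otherwise default to an ensemble that performs best on cleaner data.
    return RandomForestClassifier(random_state=0)
```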
Overall, the paper equips practitioners with a data‑aware checklist that moves defect‑prediction from “train‑and‑hope” to “measure‑and‑adjust”.
Limitations & Future Work
- Default hyper‑parameters only – While reflecting common practice, the study does not explore how extensive hyper‑parameter tuning might mitigate some quality issues.
- Static datasets – All datasets are snapshot versions; the dynamics of evolving code bases (concept drift) are not addressed.
- Metric scope – Only five quality dimensions were examined; other factors like duplicate instances, measurement error, or temporal leakage remain unexplored.
- Generalizability beyond SDP – The findings are specific to defect‑prediction metrics; applying the same framework to other SE prediction tasks (effort estimation, security vulnerability detection) is an open avenue.
Future research directions include (1) integrating automated remediation pipelines that act on the identified thresholds, (2) extending the interaction analysis to deep‑learning‑based SDP models, and (3) studying how co‑occurring quality issues evolve over the software lifecycle.
Authors
- Emmanuel Charleson Dapaah
- Jens Grabowski
Paper Information
- arXiv ID: 2512.17460v1
- Categories: cs.SE, cs.LG
- Published: December 19, 2025