[Paper] Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading
Source: arXiv - 2604.20742v1
Overview
This paper questions a long‑standing assumption in software defect prediction (SDP): that the Area Under the ROC Curve (AUC) is a trustworthy, “one‑size‑fits‑all” metric for judging how well a model separates faulty from clean modules. By visualising the ROC curve with explicit threshold markers, the authors show that a high AUC can mask serious shortcomings, especially when a model’s true‑positive rate falls below its false‑positive rate at certain thresholds, i.e., when the model performs worse than random guessing at those operating points.
Key Contributions
- Critical analysis of AUC: Demonstrates concrete scenarios where a model with a high AUC still behaves worse than random for specific threshold choices.
- Decorated ROC curves: Introduces a simple visual augmentation (highlighting threshold points) that makes hidden performance gaps immediately visible.
- Threshold‑based performance plots: Provides alternative graphs that plot TPR and FPR as functions of the decision threshold, offering a more granular view of model behavior.
- Guidelines for SDP evaluation: Recommends using enriched ROC visualisations or complementary metrics instead of relying solely on AUC.
Methodology
- Dataset & Models: Reused several publicly available defect‑prediction datasets (e.g., NASA, PROMISE) and trained typical classifiers (logistic regression, random forests, SVM).
- Standard ROC/AUC Computation: Computed the classic ROC curve and its AUC for each model, as is customary in the literature.
- Decoration Process: Overlaid the ROC curve with markers that correspond to a dense set of probability thresholds (e.g., every 0.01), making it easy to see where the curve crosses the random‑guess diagonal.
- Threshold‑Response Plots: Plotted TPR(θ) and FPR(θ) against the threshold θ, exposing intervals where TPR < FPR (i.e., worse than random).
- Comparative Analysis: Juxtaposed traditional AUC numbers with the decorated visualisations, identifying cases where AUC gave a misleadingly optimistic assessment.
The approach requires only the model’s predicted probabilities—no extra data or complex calculations—making it readily reproducible for any SDP project.
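As a rough sketch of the decoration step (using toy data, not the paper's benchmarks), each marker on the decorated ROC curve is simply the (FPR, TPR) pair obtained at one probability threshold; computing these for a dense grid requires only the predicted probabilities:

```python
def roc_point(y_true, y_score, t):
    """TPR and FPR when modules with predicted probability >= t
    are classified as defective."""
    pos = sum(y_true)              # number of truly defective modules
    neg = len(y_true) - pos        # number of truly clean modules
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
    return tp / pos, fp / neg

# Toy predictions (illustrative only, not from the paper's datasets).
y_true  = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]

# One ROC point per threshold on a dense grid; plotting these points as
# markers over the ROC curve reproduces the paper's "decoration".
grid = [i / 100 for i in range(101)]
decorated = [(t, *roc_point(y_true, y_score, t)) for t in grid]
```

Each tuple `(t, TPR, FPR)` can then be drawn as a labelled marker with any plotting library; thresholds whose point falls below the diagonal are immediately visible.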
Results & Findings
- High AUC ≠ Uniform Superiority: Several models achieved AUC > 0.80 yet exhibited ranges of thresholds where the true‑positive rate fell below the false‑positive rate, meaning they performed worse than random guessing for those operating points.
- Threshold Sensitivity: The shape of the TPR/FPR curves revealed that some models are only reliable when the threshold is set very low (high recall) or very high (high precision), limiting their practical usefulness.
- Visual Diagnosis: Decorated ROC curves instantly highlighted the problematic sections—something that a single AUC number completely hides.
- Alternative Metrics Needed: Metrics that consider specific operating points (e.g., precision‑recall curves, cost‑sensitive loss) or the full TPR/FPR threshold functions provide a more honest picture of model readiness for deployment.
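The worse-than-random condition highlighted above (TPR below FPR for some thresholds) is easy to check programmatically. The sketch below uses a contrived toy dataset, not the paper's benchmarks, to show such a detection:

```python
def worse_than_random_thresholds(y_true, y_score, thresholds):
    """Return the thresholds t at which TPR(t) < FPR(t), i.e. where the
    model does worse than random guessing when modules with score >= t
    are predicted defective."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    bad = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        if tp / pos < fp / neg:
            bad.append(t)
    return bad

# Toy example (not from the paper): both clean modules outscore one of
# the defective modules, so mid-range thresholds behave worse than random.
y_true  = [1, 1, 0, 0]
y_score = [0.10, 0.95, 0.60, 0.70]
bad = worse_than_random_thresholds(y_true, y_score, [i / 100 for i in range(101)])
```

Plotting `TPR(t)` and `FPR(t)` against `t` and shading the returned intervals gives exactly the threshold-response view the paper advocates.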
Practical Implications
- Model Selection: Teams should not discard a model solely because its AUC is modest; conversely, they should not accept a model just because its AUC is high. Inspect the threshold‑specific behavior to ensure the model meets the project’s risk tolerance (e.g., low false‑positive cost).
- Threshold Tuning: When deploying an SDP model, use the TPR/FPR‑vs‑threshold plots to pick a decision threshold that aligns with business constraints (e.g., limited QA resources).
- Reporting Standards: Incorporate decorated ROC curves or threshold‑response plots into internal dashboards, code reviews, or research papers to avoid “AUC‑only” hype.
- Tooling: Adding a few lines of code (e.g., using matplotlib/seaborn in Python or ggplot2 in R) can generate the enriched visualisations, so existing CI pipelines can automatically flag models with hidden performance pitfalls.
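One way such an automatic CI check could look is sketched below; the AUC floor of 0.7 and the threshold grid are illustrative choices, not values prescribed by the paper:

```python
def auc(y_true, y_score):
    """AUC as the probability that a random defective module outranks a
    random clean one (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ci_gate(y_true, y_score, min_auc=0.7, thresholds=None):
    """Pass only if AUC clears a floor AND no threshold yields TPR < FPR,
    i.e. no operating point behaves worse than random guessing."""
    thresholds = thresholds or [i / 100 for i in range(101)]
    pos = sum(y_true)
    neg = len(y_true) - pos
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        if tp / pos < fp / neg:
            return False
    return auc(y_true, y_score) >= min_auc
```

A pipeline could run `ci_gate` on held-out predictions after each training job and reject models that pass on AUC alone but hide worse-than-random operating points.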
Limitations & Future Work
- Dataset Scope: Focused on classic defect‑prediction benchmarks; results may differ on newer, larger‑scale industrial datasets.
- Model Variety: Only a handful of standard classifiers were examined; deep‑learning‑based SDP models might exhibit different AUC‑vs‑threshold dynamics.
- Automated Threshold Selection: No algorithmic method to choose the “best” threshold is proposed; future work could integrate cost‑sensitive optimization with the visual diagnostics.
- User Studies: The authors suggest (but do not conduct) empirical studies on whether developers actually make better decisions when presented with decorated ROC curves.
By urging the community to look beyond a single scalar metric, this work nudges both researchers and practitioners toward more transparent, actionable evaluations of defect‑prediction models.
Authors
- Luigi Lavazza
- Gabriele Rotoloni
- Sandro Morasca
Paper Information
- arXiv ID: 2604.20742v1
- Categories: cs.SE
- Published: April 22, 2026