[Paper] Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading
Source: arXiv - 2604.20742v1
Overview
This paper questions a long‑standing assumption in software defect prediction (SDP): that the Area Under the ROC Curve (AUC) is a trustworthy, “one‑size‑fits‑all” metric for judging how well a model separates faulty from clean modules. By visualising the ROC curve with explicit threshold markers, the authors show that a high AUC can mask serious shortcomings, especially when a model’s true‑positive rate falls below its false‑positive rate at certain thresholds, i.e., when the model performs worse than random guessing at those operating points.
Key Contributions
- Critical analysis of AUC: Demonstrates concrete scenarios where a model with a high AUC still behaves worse than random for specific threshold choices.
- Decorated ROC curves: Introduces a simple visual augmentation (highlighting threshold points) that makes hidden performance gaps immediately visible.
- Threshold‑based performance plots: Provides alternative graphs that plot TPR and FPR as functions of the decision threshold, offering a more granular view of model behavior.
- Guidelines for SDP evaluation: Recommends using enriched ROC visualisations or complementary metrics instead of relying solely on AUC.
Methodology
- Dataset & Models: Reused several publicly available defect‑prediction datasets (e.g., NASA, PROMISE) and trained typical classifiers (logistic regression, random forests, SVM).
- Standard ROC/AUC Computation: Computed the classic ROC curve and its AUC for each model, as is customary in the literature.
- Decoration Process: Overlaid the ROC curve with markers that correspond to a dense set of probability thresholds (e.g., every 0.01), making it easy to see where the curve crosses the random‑guess diagonal.
- Threshold‑Response Plots: Plotted TPR(θ) and FPR(θ) against the threshold θ, exposing intervals where TPR < FPR (i.e., worse than random).
- Comparative Analysis: Juxtaposed traditional AUC numbers with the decorated visualisations, identifying cases where AUC gave a misleadingly optimistic assessment.
The approach requires only the model’s predicted probabilities—no extra data or complex calculations—making it readily reproducible for any SDP project.
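As a rough sketch of the decoration step (using toy data, not the paper's benchmarks), each marker on the decorated ROC curve is simply the (FPR, TPR) pair obtained at one probability threshold; computing these for a dense grid requires only the predicted probabilities:

```python
def roc_point(y_true, y_score, t):
    """TPR and FPR when modules with predicted probability >= t
    are classified as defective."""
    pos = sum(y_true)              # number of truly defective modules
    neg = len(y_true) - pos        # number of truly clean modules
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
    return tp / pos, fp / neg

# Toy predictions (illustrative only, not from the paper's datasets).
y_true  = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]

# One ROC point per threshold on a dense grid; plotting these points as
# markers over the ROC curve reproduces the paper's "decoration".
grid = [i / 100 for i in range(101)]
decorated = [(t, *roc_point(y_true, y_score, t)) for t in grid]
```

Each tuple `(t, TPR, FPR)` can then be drawn as a labelled marker with any plotting library; thresholds whose point falls below the diagonal are immediately visible.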
Results & Findings
- High AUC ≠ Uniform Superiority: Several models achieved AUC > 0.80 yet exhibited ranges of thresholds where the true‑positive rate fell below the false‑positive rate, meaning they performed worse than random guessing for those operating points.
- Threshold Sensitivity: The shape of the TPR/FPR curves revealed that some models are only reliable when the threshold is set very low (high recall) or very high (high precision), limiting their practical usefulness.
- Visual Diagnosis: Decorated ROC curves instantly highlighted the problematic sections—something that a single AUC number completely hides.
- Alternative Metrics Needed: Metrics that consider specific operating points (e.g., precision‑recall curves, cost‑sensitive loss) or the full TPR/FPR threshold functions provide a more honest picture of model readiness for deployment.
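The worse-than-random condition highlighted above (TPR below FPR for some thresholds) is easy to check programmatically. The sketch below uses a contrived toy dataset, not the paper's benchmarks, to show such a detection:

```python
def worse_than_random_thresholds(y_true, y_score, thresholds):
    """Return the thresholds t at which TPR(t) < FPR(t), i.e. where the
    model does worse than random guessing when modules with score >= t
    are predicted defective."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    bad = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        if tp / pos < fp / neg:
            bad.append(t)
    return bad

# Toy example (not from the paper): both clean modules outscore one of
# the defective modules, so mid-range thresholds behave worse than random.
y_true  = [1, 1, 0, 0]
y_score = [0.10, 0.95, 0.60, 0.70]
bad = worse_than_random_thresholds(y_true, y_score, [i / 100 for i in range(101)])
```

Plotting `TPR(t)` and `FPR(t)` against `t` and shading the returned intervals gives exactly the threshold-response view the paper advocates.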
Practical Implications
- Model Selection: Teams should not discard a model solely because its AUC is modest; conversely, they should not accept a model just because its AUC is high. Inspect the threshold‑specific behavior to ensure the model meets the project’s risk tolerance (e.g., low false‑positive cost).
- Threshold Tuning: When deploying an SDP model, use the TPR/FPR‑vs‑threshold plots to pick a decision threshold that aligns with business constraints (e.g., limited QA resources).
- Reporting Standards: Incorporate decorated ROC curves or threshold‑response plots into internal dashboards, code reviews, or research papers to avoid “AUC‑only” hype.
- Tooling: Adding a few lines of code (e.g., using matplotlib/seaborn in Python or ggplot2 in R) can generate the enriched visualisations, so existing CI pipelines can automatically flag models with hidden performance pitfalls.
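One way such an automatic CI check could look is sketched below; the AUC floor of 0.7 and the threshold grid are illustrative choices, not values prescribed by the paper:

```python
def auc(y_true, y_score):
    """AUC as the probability that a random defective module outranks a
    random clean one (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ci_gate(y_true, y_score, min_auc=0.7, thresholds=None):
    """Pass only if AUC clears a floor AND no threshold yields TPR < FPR,
    i.e. no operating point behaves worse than random guessing."""
    thresholds = thresholds or [i / 100 for i in range(101)]
    pos = sum(y_true)
    neg = len(y_true) - pos
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        if tp / pos < fp / neg:
            return False
    return auc(y_true, y_score) >= min_auc
```

A pipeline could run `ci_gate` on held-out predictions after each training job and reject models that pass on AUC alone but hide worse-than-random operating points.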
Limitations & Future Work
- Dataset Scope: Focused on classic defect‑prediction benchmarks; results may differ on newer, larger‑scale industrial datasets.
- Model Variety: Only a handful of standard classifiers were examined; deep‑learning‑based SDP models might exhibit different AUC‑vs‑threshold dynamics.
- Automated Threshold Selection: No algorithmic method to choose the “best” threshold is proposed; future work could integrate cost‑sensitive optimization with the visual diagnostics.
- User Studies: The authors suggest (but do not conduct) empirical studies on whether developers actually make better decisions when presented with decorated ROC curves.
By urging the community to look beyond a single scalar metric, this work nudges both researchers and practitioners toward more transparent, actionable evaluations of defect‑prediction models.
Authors
- Luigi Lavazza
- Gabriele Rotoloni
- Sandro Morasca
Paper Information
- arXiv ID: 2604.20742v1
- Categories: cs.SE
- Published: April 22, 2026