[Paper] Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading

Published: April 22, 2026 at 12:28 PM EDT
4 min read
Source: arXiv - 2604.20742v1

Overview

This paper questions a long‑standing assumption in software defect prediction (SDP): that the Area Under the ROC Curve (AUC) is a trustworthy, “one‑size‑fits‑all” metric for judging how well a model separates faulty from clean modules. By visualising the ROC curve with explicit threshold markers, the authors show that a high AUC can mask serious shortcomings, especially when, at certain thresholds, a model’s true‑positive rate falls below its false‑positive rate, i.e., below random‑guess performance.

Key Contributions

  • Critical analysis of AUC: Demonstrates concrete scenarios where a model with a high AUC still behaves worse than random for specific threshold choices.
  • Decorated ROC curves: Introduces a simple visual augmentation (highlighting threshold points) that makes hidden performance gaps immediately visible.
  • Threshold‑based performance plots: Provides alternative graphs that plot TPR and FPR as functions of the decision threshold, offering a more granular view of model behavior.
  • Guidelines for SDP evaluation: Recommends using enriched ROC visualisations or complementary metrics instead of relying solely on AUC.

Methodology

  1. Dataset & Models: Reused several publicly available defect‑prediction datasets (e.g., NASA, PROMISE) and trained typical classifiers (logistic regression, random forests, SVM).
  2. Standard ROC/AUC Computation: Computed the classic ROC curve and its AUC for each model, as is customary in the literature.
  3. Decoration Process: Overlaid the ROC curve with markers that correspond to a dense set of probability thresholds (e.g., every 0.01), making it easy to see where the curve crosses the random‑guess diagonal.
  4. Threshold‑Response Plots: Plotted TPR(θ) and FPR(θ) against the threshold θ, exposing intervals where TPR < FPR (i.e., worse than random).
  5. Comparative Analysis: Juxtaposed traditional AUC numbers with the decorated visualisations, identifying cases where AUC gave a misleadingly optimistic assessment.

The approach requires only the model’s predicted probabilities—no extra data or complex calculations—making it readily reproducible for any SDP project.
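Steps 3–4 can be sketched in a few lines of Python. The data below is synthetic and the variable names are my own, not the authors' code; the point is only that a dense threshold grid over the predicted probabilities is all the decoration requires.

```python
import numpy as np

# Sketch of steps 3-4 above (synthetic data, not the authors' code):
# tabulate TPR and FPR over a dense threshold grid -- the raw material
# for both the decorated ROC curve and the threshold-response plots.
rng = np.random.default_rng(0)

# 1 = defective module, 0 = clean module; scores are predicted probabilities.
y_true = np.array([1] * 20 + [0] * 80)
scores = np.concatenate([
    rng.uniform(0.4, 0.9, 20),   # defective modules tend to score higher
    rng.uniform(0.0, 0.6, 80),   # clean modules tend to score lower
])

thresholds = np.linspace(0.0, 1.0, 101)   # every 0.01, as in the paper

# TPR(theta) and FPR(theta): predict "defective" whenever score >= theta.
tpr = np.array([np.mean(scores[y_true == 1] >= t) for t in thresholds])
fpr = np.array([np.mean(scores[y_true == 0] >= t) for t in thresholds])

# Thresholds at which the model is worse than random guessing (TPR < FPR).
worse = thresholds[tpr < fpr]
print("worse-than-random thresholds:", worse)

# Plotting (fpr, tpr) with one marker per threshold yields the decorated
# ROC curve; plotting tpr and fpr against `thresholds` gives the
# threshold-response plots.
```

From here, a single matplotlib call per curve (markers on, one point per threshold) reproduces the visualisations the paper describes.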

Results & Findings

  • High AUC ≠ Uniform Superiority: Several models achieved AUC > 0.80 yet exhibited ranges of thresholds where the true‑positive rate fell below the false‑positive rate, meaning they performed worse than random guessing for those operating points.
  • Threshold Sensitivity: The shape of the TPR/FPR curves revealed that some models are only reliable when the threshold is set very low (high recall) or very high (high precision), limiting their practical usefulness.
  • Visual Diagnosis: Decorated ROC curves instantly highlighted the problematic sections—something that a single AUC number completely hides.
  • Alternative Metrics Needed: Metrics that consider specific operating points (e.g., precision‑recall curves, cost‑sensitive loss) or the full TPR/FPR threshold functions provide a more honest picture of model readiness for deployment.
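The first finding is easy to reproduce with a constructed example. The score distribution below is synthetic, not taken from the paper: it is built so that AUC is 0.85, yet every threshold in (0.5, 0.9] flags no defects while still raising false alarms.

```python
import numpy as np

# Synthetic illustration (not the paper's data): AUC = 0.85, yet the model
# is worse than random guessing at every threshold in (0.5, 0.9].
pos = np.full(10, 0.5)                    # defective modules all score 0.5
neg = np.concatenate([np.full(85, 0.1),   # most clean modules score low...
                      np.full(15, 0.9)])  # ...but 15% score very high

# AUC as P(score_pos > score_neg): the probability that a random defective
# module outranks a random clean one (the Wilcoxon/Mann-Whitney view of AUC).
auc = np.mean(pos[:, None] > neg[None, :])
print("AUC:", auc)                        # 0.85

theta = 0.7                               # any operating point in (0.5, 0.9]
tpr = np.mean(pos >= theta)               # 0.0  -- no defects flagged
fpr = np.mean(neg >= theta)               # 0.15 -- yet 15% false alarms
print("TPR:", tpr, "FPR:", fpr)           # TPR < FPR: worse than random
```

A single AUC of 0.85 would pass most SDP papers' bar for a "good" model, which is exactly the kind of hidden failure the decorated ROC curve exposes.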

Practical Implications

  • Model Selection: Teams should not discard a model solely because its AUC is modest; conversely, they should not accept a model just because its AUC is high. Inspect the threshold‑specific behavior to ensure the model meets the project’s risk tolerance (e.g., low false‑positive cost).
  • Threshold Tuning: When deploying an SDP model, use the TPR/FPR‑vs‑threshold plots to pick a decision threshold that aligns with business constraints (e.g., limited QA resources).
  • Reporting Standards: Incorporate decorated ROC curves or threshold‑response plots into internal dashboards, code reviews, or research papers to avoid “AUC‑only” hype.
  • Tooling: Adding a few lines of code (e.g., using matplotlib/seaborn in Python or ggplot2 in R) can generate the enriched visualisations, so existing CI pipelines can automatically flag models with hidden performance pitfalls.
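The threshold-tuning advice above can be operationalised as a small search over the same TPR/FPR tables. The numbers here are made up, and the 10% false-alarm budget is a hypothetical business constraint, not a recommendation from the paper.

```python
import numpy as np

# Sketch of constraint-driven threshold tuning (hypothetical data and
# budget): among a grid of thresholds, pick the one with the highest TPR
# whose FPR stays within the QA team's false-alarm budget.
rng = np.random.default_rng(1)
y = np.array([1] * 30 + [0] * 70)
scores = np.concatenate([rng.beta(5, 2, 30),   # defective modules score high
                         rng.beta(2, 5, 70)])  # clean modules score low

fpr_budget = 0.10                              # at most 10% false alarms
thresholds = np.linspace(0.0, 1.0, 101)
tpr = np.array([np.mean(scores[y == 1] >= t) for t in thresholds])
fpr = np.array([np.mean(scores[y == 0] >= t) for t in thresholds])

# Keep only thresholds that satisfy the constraint, then maximise recall.
feasible = fpr <= fpr_budget
best = thresholds[feasible][np.argmax(tpr[feasible])]
print(f"chosen threshold: {best:.2f}, "
      f"TPR: {tpr[thresholds == best][0]:.2f}")
```

Swapping the FPR cap for a cost-sensitive objective (weighting missed defects against wasted QA effort) fits the same loop unchanged.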

Limitations & Future Work

  • Dataset Scope: Focused on classic defect‑prediction benchmarks; results may differ on newer, larger‑scale industrial datasets.
  • Model Variety: Only a handful of standard classifiers were examined; deep‑learning‑based SDP models might exhibit different AUC‑vs‑threshold dynamics.
  • Automated Threshold Selection: No algorithmic method to choose the “best” threshold is proposed; future work could integrate cost‑sensitive optimization with the visual diagnostics.
  • User Studies: The authors suggest (but do not conduct) empirical studies on whether developers actually make better decisions when presented with decorated ROC curves.

By urging the community to look beyond a single scalar metric, this work nudges both researchers and practitioners toward more transparent, actionable evaluations of defect‑prediction models.

Authors

  • Luigi Lavazza
  • Gabriele Rotoloni
  • Sandro Morasca

Paper Information

  • arXiv ID: 2604.20742v1
  • Categories: cs.SE
  • Published: April 22, 2026
