[Paper] An Empirical Study of the Imbalance Issue in Software Vulnerability Detection
Source: arXiv - 2602.12038v1
Overview
Deep‑learning models are increasingly used to spot software vulnerabilities, but their accuracy can swing wildly from one codebase to another. This paper investigates why—pinpointing the extreme class imbalance (very few vulnerable snippets compared with a flood of clean code) as the main culprit. By systematically testing imbalance‑handling techniques on nine open‑source datasets, the authors reveal which tricks actually help and where they fall short.
Key Contributions
- Empirical confirmation that severe class imbalance drives the unstable performance of DL‑based vulnerability detectors.
- Large‑scale benchmark covering nine publicly available vulnerability datasets and two state‑of‑the‑art neural models.
- Systematic evaluation of four popular imbalance‑mitigation strategies (focal loss, mean false error, class‑balanced loss, random over‑sampling) across multiple metrics (precision, recall, F1).
- Practical guidance on which technique to prioritize depending on the metric most important to a given security workflow.
- Analysis of external factors (dataset size, vulnerability prevalence, code language) that influence the effectiveness of each mitigation method, laying groundwork for future, more adaptive solutions.
Methodology
- Datasets – The authors gathered nine open‑source vulnerability datasets spanning different programming languages (C, C++, Java, etc.) and varying degrees of imbalance (vulnerable code ranging from <1 % to a few percent).
- Models – Two cutting‑edge DL architectures for code analysis (a graph‑based model and a token‑level transformer) were trained on each dataset.
- Imbalance Techniques – Four strategies were applied during training:
  - Focal loss (down‑weights easy negatives)
  - Mean false error (optimizes a balanced error metric)
  - Class‑balanced loss (re‑weights classes by effective number of samples)
  - Random over‑sampling (duplicates minority examples)
- Evaluation – Models were assessed on held‑out test sets using precision, recall, and F1‑score. The authors also tracked how each technique behaved when dataset characteristics changed (e.g., varying vulnerable‑to‑non‑vulnerable ratios).
The whole pipeline was automated to ensure reproducibility, and statistical significance tests were applied to confirm observed differences.
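The paper itself does not ship code, but the four mitigation strategies are standard and compact enough to sketch. Below is a minimal NumPy illustration of each; the hyperparameters (γ=2, α=0.25, β=0.999) are common defaults from the original focal‑loss and class‑balanced‑loss papers, not necessarily the settings used in this study.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Down-weights easy, well-classified examples via the (1 - p_t)^gamma factor,
    # so the loss concentrates on hard (often minority-class) samples.
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)))

def mean_false_error(p, y):
    # Squared error averaged *per class* then summed, so the rare vulnerable
    # class contributes as much to the loss as the abundant clean class.
    fpe = np.mean(p[y == 0] ** 2)        # false-positive error on clean code
    fne = np.mean((1 - p[y == 1]) ** 2)  # false-negative error on vulnerable code
    return float(fpe + fne)

def class_balanced_weights(counts, beta=0.999):
    # Re-weights classes by the inverse "effective number of samples",
    # E_n = (1 - beta^n) / (1 - beta), then normalizes to mean 1.
    eff = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff
    return w / w.sum() * len(counts)

def random_oversample(X, y, rng=None):
    # Duplicates minority examples at random until both classes are equal-sized.
    rng = rng or np.random.default_rng(0)
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=(y != minority).sum() - idx.size, replace=True)
    order = np.concatenate([np.arange(len(y)), extra])
    return X[order], y[order]
```

In practice the first three replace or re‑weight the training loss, while over‑sampling changes the training set itself and can be combined with any loss.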
Results & Findings
| Metric | Best‑performing technique (overall) |
|---|---|
| Precision | Focal loss – reduces false positives, making the detector more trustworthy for security analysts. |
| Recall | Mean false error & Class‑balanced loss – boost detection of rare vulnerable snippets, catching more true bugs. |
| F1‑measure | Random over‑sampling – offers the most balanced trade‑off between precision and recall. |
Key takeaways:
- No single technique dominates across all metrics; the “best” method depends on what the practitioner values (e.g., fewer false alarms vs. catching every possible bug).
- The effectiveness of each method varies noticeably between datasets—what works for a Java project may not help a C codebase.
- Over‑sampling, despite its simplicity, remains competitive, especially when the training set is very small.
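To see why the three metrics can favor different techniques, it helps to compute them from a confusion matrix. The counts below are made up for illustration and show how a detector can look precise while still missing most vulnerabilities:

```python
def precision_recall_f1(tp, fp, fn):
    # precision: of the snippets flagged as vulnerable, how many really are
    precision = tp / (tp + fp)
    # recall: of the truly vulnerable snippets, how many were flagged
    recall = tp / (tp + fn)
    # F1: harmonic mean, which penalizes a large gap between the two
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 12 missed bugs.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=12)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.4 0.53
```

A technique that cuts false positives (focal loss) lifts the first number; one that recovers missed bugs (mean false error, class‑balanced loss) lifts the second; F1 rewards whichever balances them.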
Practical Implications
- Security tooling vendors can embed focal loss into their vulnerability‑scanning pipelines when the goal is to minimize noisy alerts for developers.
- Bug‑bounty platforms and internal audit teams that need to maximize coverage should consider class‑balanced or mean‑false‑error losses to raise recall.
- CI/CD integrations can dynamically switch mitigation strategies based on project‑specific imbalance ratios (e.g., auto‑detect the vulnerable‑code proportion and pick the appropriate loss function).
- Dataset curators are reminded that simply aggregating more clean code does not solve the problem; intentional sampling or synthetic vulnerable examples may be required to achieve a usable balance.
Overall, the study equips developers with a decision tree for choosing the right imbalance‑handling technique, potentially leading to more reliable automated security checks and fewer missed vulnerabilities in production code.
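That decision logic collapses into a tiny lookup. `pick_strategy` is a hypothetical helper name; the mapping simply mirrors the per‑metric winners reported in the results table:

```python
def pick_strategy(priority):
    # Hypothetical helper: maps the metric a security workflow values most
    # to the mitigation technique the study found strongest for it.
    table = {
        "precision": "focal_loss",        # fewest false alarms for analysts
        "recall": "class_balanced_loss",  # mean_false_error performed comparably
        "f1": "random_oversampling",      # best precision/recall trade-off
    }
    return table[priority]

print(pick_strategy("precision"))  # focal_loss
```

Since the study found technique effectiveness varies by dataset, a real deployment would treat this as a starting default and re‑validate on the project's own code.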
Limitations & Future Work
- The study focuses on two DL architectures; newer models (e.g., code‑specific large language models) might interact differently with imbalance solutions.
- Only four mitigation strategies were examined; advanced methods like generative adversarial oversampling or cost‑sensitive learning remain unexplored.
- External factors such as code semantics, developer coding style, or the presence of multi‑line vulnerabilities were not explicitly modeled.
- Future research could develop adaptive loss functions that automatically tune their parameters based on real‑time feedback from security analysts, or combine oversampling with semantic code augmentation to enrich the minority class.
Authors
- Yuejun Guo
- Qiang Hu
- Qiang Tang
- Yves Le Traon
Paper Information
- arXiv ID: 2602.12038v1
- Categories: cs.SE, cs.AI
- Published: February 12, 2026