[Paper] An Empirical Study of the Imbalance Issue in Software Vulnerability Detection
Source: arXiv - 2602.12038v1
Overview
Deep‑learning models are increasingly used to spot software vulnerabilities, but their accuracy can swing wildly from one codebase to another. This paper investigates why—pinpointing the extreme class imbalance (very few vulnerable snippets compared with a flood of clean code) as the main culprit. By systematically testing imbalance‑handling techniques on nine open‑source datasets, the authors reveal which tricks actually help and where they fall short.
Key Contributions
- Empirical confirmation that severe class imbalance drives the unstable performance of DL‑based vulnerability detectors.
- Large‑scale benchmark covering nine publicly available vulnerability datasets and two state‑of‑the‑art neural models.
- Systematic evaluation of four popular imbalance‑mitigation strategies (focal loss, mean false error, class‑balanced loss, random over‑sampling) across multiple metrics (precision, recall, F1).
- Practical guidance on which technique to prioritize depending on the metric most important to a given security workflow.
- Analysis of external factors (dataset size, vulnerability prevalence, code language) that influence the effectiveness of each mitigation method, laying groundwork for future, more adaptive solutions.
Methodology
- Datasets – The authors gathered nine open‑source vulnerability datasets spanning different programming languages (C, C++, Java, etc.) and varying degrees of imbalance (vulnerable code ranging from <1 % to a few percent).
- Models – Two cutting‑edge DL architectures for code analysis (a graph‑based model and a token‑level transformer) were trained on each dataset.
- Imbalance Techniques – Four strategies were applied during training:
  - Focal loss (down‑weights easy negatives)
  - Mean false error (optimizes a balanced error metric)
  - Class‑balanced loss (re‑weights classes by effective number of samples)
  - Random over‑sampling (duplicates minority examples)
- Evaluation – Models were assessed on held‑out test sets using precision, recall, and F1‑score. The authors also tracked how each technique behaved when dataset characteristics changed (e.g., varying vulnerable‑to‑non‑vulnerable ratios).
The whole pipeline was automated to ensure reproducibility, and statistical significance tests were applied to confirm observed differences.
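The paper itself does not ship code, but the four mitigation strategies are standard and compact enough to sketch. Below is a minimal NumPy illustration of each; the hyperparameters (γ=2, α=0.25, β=0.999) are common defaults from the original focal‑loss and class‑balanced‑loss papers, not necessarily the settings used in this study.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Down-weights easy, well-classified examples via the (1 - p_t)^gamma factor,
    # so the loss concentrates on hard (often minority-class) samples.
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)))

def mean_false_error(p, y):
    # Squared error averaged *per class* then summed, so the rare vulnerable
    # class contributes as much to the loss as the abundant clean class.
    fpe = np.mean(p[y == 0] ** 2)        # false-positive error on clean code
    fne = np.mean((1 - p[y == 1]) ** 2)  # false-negative error on vulnerable code
    return float(fpe + fne)

def class_balanced_weights(counts, beta=0.999):
    # Re-weights classes by the inverse "effective number of samples",
    # E_n = (1 - beta^n) / (1 - beta), then normalizes to mean 1.
    eff = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff
    return w / w.sum() * len(counts)

def random_oversample(X, y, rng=None):
    # Duplicates minority examples at random until both classes are equal-sized.
    rng = rng or np.random.default_rng(0)
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=(y != minority).sum() - idx.size, replace=True)
    order = np.concatenate([np.arange(len(y)), extra])
    return X[order], y[order]
```

In practice the first three replace or re‑weight the training loss, while over‑sampling changes the training set itself and can be combined with any loss.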
Results & Findings
| Metric | Best‑performing technique (overall) |
|---|---|
| Precision | Focal loss – reduces false positives, making the detector more trustworthy for security analysts. |
| Recall | Mean false error & Class‑balanced loss – boost detection of rare vulnerable snippets, catching more true bugs. |
| F1‑measure | Random over‑sampling – offers the most balanced trade‑off between precision and recall. |
Key takeaways:
- No single technique dominates across all metrics; the “best” method depends on what the practitioner values (e.g., fewer false alarms vs. catching every possible bug).
- The effectiveness of each method varies noticeably between datasets—what works for a Java project may not help a C codebase.
- Over‑sampling, despite its simplicity, remains competitive, especially when the training set is very small.
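To see why the three metrics can favor different techniques, it helps to compute them from a confusion matrix. The counts below are made up for illustration and show how a detector can look precise while still missing most vulnerabilities:

```python
def precision_recall_f1(tp, fp, fn):
    # precision: of the snippets flagged as vulnerable, how many really are
    precision = tp / (tp + fp)
    # recall: of the truly vulnerable snippets, how many were flagged
    recall = tp / (tp + fn)
    # F1: harmonic mean, which penalizes a large gap between the two
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 12 missed bugs.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=12)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.4 0.53
```

A technique that cuts false positives (focal loss) lifts the first number; one that recovers missed bugs (mean false error, class‑balanced loss) lifts the second; F1 rewards whichever balances them.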
Practical Implications
- Security tooling vendors can embed focal loss into their vulnerability‑scanning pipelines when the goal is to minimize noisy alerts for developers.
- Bug‑bounty platforms and internal audit teams that need to maximize coverage should consider class‑balanced or mean‑false‑error losses to raise recall.
- CI/CD integrations can dynamically switch mitigation strategies based on project‑specific imbalance ratios (e.g., auto‑detect the vulnerable‑code proportion and pick the appropriate loss function).
- Dataset curators are reminded that simply aggregating more clean code does not solve the problem; intentional sampling or synthetic vulnerable examples may be required to achieve a usable balance.
Overall, the study equips developers with a decision tree for choosing the right imbalance‑handling technique, potentially leading to more reliable automated security checks and fewer missed vulnerabilities in production code.
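That decision logic collapses into a tiny lookup. `pick_strategy` is a hypothetical helper name; the mapping simply mirrors the per‑metric winners reported in the results table:

```python
def pick_strategy(priority):
    # Hypothetical helper: maps the metric a security workflow values most
    # to the mitigation technique the study found strongest for it.
    table = {
        "precision": "focal_loss",        # fewest false alarms for analysts
        "recall": "class_balanced_loss",  # mean_false_error performed comparably
        "f1": "random_oversampling",      # best precision/recall trade-off
    }
    return table[priority]

print(pick_strategy("precision"))  # focal_loss
```

Since the study found technique effectiveness varies by dataset, a real deployment would treat this as a starting default and re‑validate on the project's own code.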
Limitations & Future Work
- The study focuses on two DL architectures; newer models (e.g., code‑specific large language models) might interact differently with imbalance solutions.
- Only four mitigation strategies were examined; advanced methods like generative adversarial oversampling or cost‑sensitive learning remain unexplored.
- External factors such as code semantics, developer coding style, or the presence of multi‑line vulnerabilities were not explicitly modeled.
- Future research could develop adaptive loss functions that automatically tune their parameters based on real‑time feedback from security analysts, or combine oversampling with semantic code augmentation to enrich the minority class.
Authors
- Yuejun Guo
- Qiang Hu
- Qiang Tang
- Yves Le Traon
Paper Information
- arXiv ID: 2602.12038v1
- Categories: cs.SE, cs.AI
- Published: February 12, 2026