[Paper] Few-shot learning for security bug report identification
Source: arXiv - 2601.02971v1
Overview
Security‑related bug reports must be spotted quickly to shrink the window of vulnerability in software products. However, building reliable classifiers traditionally demands thousands of labeled examples—something most teams simply don’t have for security bugs. This paper shows how a few‑shot learning framework called SetFit can achieve strong detection performance with only a handful of annotated reports, making security‑bug triage feasible even in data‑starved environments.
Key Contributions
- Application of SetFit to security bug detection: Adapts a state‑of‑the‑art few‑shot approach (sentence‑transformer + contrastive learning) for the binary task of security vs. non‑security bug reports.
- Demonstrated high AUC with minimal data: Achieves up to 0.865 AUC with only a few dozen labeled examples, consistently beating conventional ML baselines across multiple datasets.
- Parameter‑efficient fine‑tuning: Shows that only a tiny fraction of model parameters need updating, drastically reducing training time and computational cost.
- Practical annotation workflow: Provides a recipe for building a usable classifier with limited manual labeling effort, directly addressing real‑world constraints faced by security teams.
Methodology
- Data preparation – Collect bug reports from open‑source repositories and manually label a small subset (typically 10–50 per class) as security or non‑security.
- Sentence‑Transformer encoder – Use a pre‑trained transformer (e.g., all‑MiniLM‑L6‑v2) to embed each report into a dense vector space.
- Contrastive fine‑tuning (SetFit)
- Generate positive pairs (reports of the same class) and negative pairs (different classes).
- Train the encoder with a contrastive loss to pull same‑class vectors together and push different‑class vectors apart.
- Follow with a lightweight classifier head (e.g., a linear layer) that is fine‑tuned on the few labeled examples.
- Evaluation – Perform stratified cross‑validation on the full dataset, reporting ROC‑AUC, precision, recall, and F1. Baselines include traditional classifiers (SVM, Random Forest) and fine‑tuned full‑size transformers.
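The pair-generation step above can be sketched in pure Python. Note that SetFit samples pairs internally during fine-tuning; the `make_pairs` helper and the toy reports below are hypothetical, shown only to make the positive/negative pairing concrete.

```python
import itertools

def make_pairs(examples):
    """Build (text_a, text_b, label) pairs from labeled reports.

    label 1.0 = positive pair (same class), 0.0 = negative pair.
    Mirrors, in spirit, the contrastive pair sampling SetFit performs.
    """
    pairs = []
    for (text_a, cls_a), (text_b, cls_b) in itertools.combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if cls_a == cls_b else 0.0))
    return pairs

# Toy labeled subset: (report text, class), class 1 = security.
reports = [
    ("Buffer overflow in parser", 1),
    ("SQL injection via login form", 1),
    ("Button misaligned on mobile", 0),
    ("Typo in settings page", 0),
]

pairs = make_pairs(reports)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
```

With n labeled reports this yields n·(n−1)/2 pairs, which is why contrastive fine-tuning extracts so much training signal from a small labeled set.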
The whole pipeline runs on a single GPU in minutes, thanks to the parameter‑efficient design.
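Since ROC-AUC is the paper's headline metric, a minimal sketch of the rank-based definition may help: the AUC is the probability that a randomly chosen security report receives a higher score than a randomly chosen non-security one. The scores below are invented for illustration.

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC: P(positive score > negative score),
    counting ties as 0.5. Equivalent to a normalized Mann-Whitney U."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier confidences for six reports (1 = security).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(scores, labels)  # 8 of 9 positive/negative pairs ranked correctly
```

A useful property of this metric in the security-triage setting: it is insensitive to the classification threshold, so a team can tune the alert threshold for recall without changing the reported AUC.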
Results & Findings
| Model | AUC (best) | Relative gain vs. best baseline |
|---|---|---|
| SetFit (few‑shot) | 0.865 | +0.07 |
| Full‑fine‑tuned BERT | 0.795 | – |
| SVM (TF‑IDF) | 0.742 | – |
| Random Forest | 0.718 | – |
- Consistent superiority: SetFit outperformed every baseline on all tested datasets, even when the number of labeled examples was reduced to 10 per class.
- Robustness to class imbalance: Contrastive training mitigated the typical drop in recall that plagues few‑shot classifiers on skewed security‑bug corpora.
- Speed & resource savings: Training required < 5 % of the GPU hours compared to full‑model fine‑tuning, making it viable for continuous integration pipelines.
Practical Implications
- Rapid triage pipelines: Teams can spin up a security‑bug detector after a quick labeling sprint (e.g., a half‑day effort), integrating it into issue‑tracking systems (Jira, GitHub) to auto‑flag suspicious reports.
- Cost‑effective security audits: Smaller organizations lacking large annotated corpora can still benefit from ML‑assisted security review without hiring dedicated data‑science resources.
- Continuous learning: Because SetFit fine‑tunes only a lightweight head, new labeled examples can be added on‑the‑fly, allowing the model to evolve as threat patterns shift.
- Tooling integration: The approach can be wrapped as a lightweight micro‑service (e.g., FastAPI) that consumes bug‑report text and returns a confidence score, fitting neatly into CI/CD or DevSecOps workflows.
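The micro-service idea can be sketched with nothing but the standard library (the summary mentions FastAPI; this version avoids that dependency). The keyword-hit `score_report` function is a stand-in stub: a real deployment would load the fine-tuned SetFit encoder and head and return its predicted probability.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained SetFit model.
SECURITY_HINTS = ("overflow", "injection", "xss", "csrf", "auth")

def score_report(text):
    """Stub scorer: keyword-hit heuristic in place of the real model."""
    hits = sum(1 for hint in SECURITY_HINTS if hint in text.lower())
    return min(1.0, hits / 2)

class TriageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        text = json.loads(body)["text"]
        payload = json.dumps({"security_score": score_report(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port and issue one example request.
server = HTTPServer(("127.0.0.1", 0), TriageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"text": "SQL injection in login"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

An issue-tracker webhook (Jira, GitHub) would POST new report text to this endpoint and auto-flag anything whose `security_score` exceeds a tuned threshold.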
Limitations & Future Work
- Domain transfer: The study focused on open‑source bug repositories; performance on proprietary, domain‑specific bug trackers (e.g., embedded systems) remains untested.
- Label noise: Few‑shot setups are sensitive to mislabeled examples; the paper notes that noisy annotations can degrade contrastive learning.
- Explainability: While SetFit yields high accuracy, it does not inherently provide interpretable reasons for a report’s security label—future work could integrate attention‑based explanations or post‑hoc methods.
- Multi‑label extension: Real‑world bug reports often involve multiple categories (e.g., performance + security). Extending the binary setup to multi‑label classification is an open research direction.
Bottom line: By leveraging few‑shot learning, the paper demonstrates a pragmatic path for developers and security engineers to harness AI for bug triage—even when labeled data is scarce. This could accelerate vulnerability remediation and lower the barrier to adopting ML‑driven security tooling across the software industry.
Authors
- Muhammad Laiq
Paper Information
- arXiv ID: 2601.02971v1
- Categories: cs.SE
- Published: January 6, 2026