[Paper] Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection
Source: arXiv - 2603.17974v1
Overview
The paper tackles a core bottleneck in machine‑learning‑driven security: the lack of large, realistic datasets that reflect how vulnerabilities appear in full‑stack code repositories. By automatically injecting authentic bugs into real open‑source projects and generating reproducible proof‑of‑vulnerability (PoV) exploits, the author creates scalable, repository‑level benchmarks that can be used to train and evaluate next‑generation vulnerability detectors.
Key Contributions
- Automated benchmark generator that inserts diverse, realistic vulnerabilities into live open‑source repositories while preserving buildability and execution semantics.
- Synthetic PoV exploit synthesis that produces end‑to‑end, reproducible attacks for each injected flaw, giving precise ground‑truth labels.
- Adversarial co‑evolution framework where a vulnerability‑injection agent and a detection agent iteratively improve each other, mimicking an arms‑race between attackers and defenders.
- Extensive empirical evaluation showing that the generated datasets scale to thousands of repositories and capture interprocedural, cross‑file interactions absent in function‑centric corpora.
- Open‑source tooling and dataset release to enable reproducibility and community‑wide adoption.
Methodology
- Repository Harvesting – A crawler collects a large pool of popular GitHub repositories that compile and pass their own test suites.
- Vulnerability Injection Engine – Using a catalog of known CWE patterns (e.g., buffer overflow, use‑after‑free, SQL injection), the engine automatically mutates source files, adds or modifies code snippets, and updates build scripts to keep the project buildable.
- PoV Exploit Synthesis – For each injected bug, a lightweight symbolic executor or fuzzing harness generates a minimal input that triggers the vulnerability, producing a reproducible PoV script (e.g., a unit test or exploit binary).
- Label Generation – The injection point, affected functions/files, and the PoV are stored as precise annotations, yielding a fully labeled dataset without manual effort.
- Adversarial Co‑Evolution Loop – An injection model (the “attacker”) proposes new bug instances; a detection model (the “defender”) attempts to flag them. Mis‑detected cases are fed back to improve both models, encouraging robustness against adaptive attacks.
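As a concrete illustration of the injection and labeling steps above, the sketch below mutates a parameterized query into a CWE‑89 (SQL injection) pattern and records a ground‑truth annotation. All names here (`inject_sqli`, `VulnLabel`, the file paths) are hypothetical illustrations, not the paper's actual engine or API:

```python
from dataclasses import dataclass

# A safe, parameterized query and its vulnerable string-concatenation variant.
SAFE = 'cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))'
VULN = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'

def inject_sqli(source: str) -> str:
    """Mutate a parameterized query into string concatenation (CWE-89)."""
    return source.replace(SAFE, VULN)

@dataclass
class VulnLabel:
    cwe: str    # CWE identifier of the injected flaw
    file: str   # file containing the injection point
    line: int   # line number of the mutated statement
    pov: str    # path to the reproducible PoV script

original = "def lookup(user_id):\n    " + SAFE + "\n"
mutated = inject_sqli(original)
label = VulnLabel(cwe="CWE-89", file="app/db.py", line=2, pov="povs/cwe89_0001.py")
```

Because the mutation site is known at injection time, the label is exact by construction, which is what removes the manual-annotation effort the paper highlights.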
The pipeline is containerized, runs on commodity hardware, and can be scheduled to continuously refresh the benchmark as new repositories appear.
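The adversarial co‑evolution loop can be sketched as a toy simulation. The attacker and defender below are stand‑in stubs with a scalar "subtlety" score, not the paper's learned injection and detection agents; only the loop structure (inject, detect, feed missed cases back) mirrors the described setup:

```python
import random

random.seed(0)  # reproducible toy run

def attacker(round_no):
    """Propose a batch of injected bugs; later rounds skew subtler."""
    return [{"subtlety": random.random() + 0.1 * round_no} for _ in range(100)]

def defender(bug, threshold):
    """Flag a bug whose subtlety falls within the detector's current reach."""
    return bug["subtlety"] < threshold

threshold = 0.5
for round_no in range(5):
    bugs = attacker(round_no)
    missed = [b for b in bugs if not defender(b, threshold)]
    # Feedback step: retraining on missed cases is modeled here as
    # extending the detector's reach in proportion to its miss rate.
    threshold += 0.2 * len(missed) / len(bugs)

final_recall = 1 - len(missed) / len(bugs)
```

The key property the loop exercises is that neither side is trained against a fixed opponent: the attacker's later batches are harder, and the defender's threshold only moves in response to what it actually missed.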
Results & Findings
- Scale: The system generated more than 3,000 vulnerable commits across 500+ repositories, a 10× increase over the largest manually curated repo‑level benchmark.
- Realism: 92% of injected bugs compiled without errors, and 87% of PoVs executed successfully on the mutated code, confirming functional realism.
- Detection Gap: State‑of‑the‑art repo‑level vulnerability detectors (e.g., DeepVuln, CodeBERT‑Vul) missed 68% of the newly injected bugs, highlighting a substantial generalization gap.
- Adversarial Gains: After five co‑evolution cycles, the detection model’s recall improved from 32% to 58% on the generated set, while the injection model learned to produce harder‑to‑detect patterns, demonstrating the usefulness of the arms‑race setup.

Practical Implications
- Training Better Models: Developers of ML‑based security tools can now train on datasets that reflect real build pipelines, cross‑file data flows, and exploitability, leading to detectors that work out‑of‑the‑box on production codebases.
- Continuous Benchmarking: Organizations can integrate the generator into CI pipelines to automatically assess their own vulnerability scanners against fresh, realistic threats.
- Red‑Team Automation: Security teams can use the injection engine to simulate “unknown” bugs in their code, testing incident response and patching processes without exposing real vulnerabilities.
- Research Acceleration: The open dataset lowers the entry barrier for academic and industry researchers, fostering reproducible comparisons and faster iteration on novel detection algorithms.
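For the continuous-benchmarking use case, a CI gate could run the organization's scanner over a freshly generated batch and fail the build if recall drops below a floor. Everything in this sketch is hypothetical (`run_scanner`, the benchmark record layout, the 0.5 floor); it only illustrates the shape of such a check, not the paper's tooling:

```python
def run_scanner(commit):
    """Stand-in for invoking the organization's vulnerability scanner
    on one generated vulnerable commit."""
    return commit["expected_cwe"] in commit["scanner_findings"]

def ci_gate(benchmark, floor=0.5):
    """Return (pass/fail, recall) for a scanner run over the benchmark."""
    detected = sum(run_scanner(c) for c in benchmark)
    recall = detected / len(benchmark)
    return recall >= floor, recall

# Two toy benchmark records: one caught SQL injection, one missed use-after-free.
benchmark = [
    {"expected_cwe": "CWE-89", "scanner_findings": {"CWE-89"}},
    {"expected_cwe": "CWE-416", "scanner_findings": set()},
]
ok, recall = ci_gate(benchmark)
```

Because the generator can refresh the benchmark continuously, the gate measures the scanner against threats it has never seen, rather than a static, potentially memorized corpus.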
Limitations & Future Work
- Synthetic Bias: Although the injected bugs follow known CWE patterns, they may not capture the full creativity of human attackers, potentially biasing models toward known flaw signatures.
- Language Coverage: The current implementation focuses on C/C++ projects; extending to managed languages (Java, Python) and mixed‑language ecosystems remains future work.
- Exploit Fidelity: PoV generation relies on lightweight symbolic execution; highly complex, multi‑stage exploits (e.g., ROP chains) are not yet synthesized.
- Adversarial Loop Cost: The co‑evolution process is computationally intensive; scaling to millions of commits will require distributed training strategies and smarter sampling.
The author’s release of the generator and the first batch of datasets invites the community to address these gaps and push the frontier of repo‑level vulnerability detection.
Authors
- Amine Lbath
Paper Information
- arXiv ID: 2603.17974v1
- Categories: cs.SE, cs.AI
- Published: March 18, 2026