[Paper] Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities
Source: arXiv - 2511.23408v1
Overview
The paper investigates how well today’s large language models (LLMs) can automatically generate patches for software bugs that expose security vulnerabilities. By testing both real‑world flaws and synthetically created “artificial” bugs, the authors show that LLMs patch genuine vulnerabilities noticeably more often than artificial ones, and that different models bring complementary strengths.
Key Contributions
- Comprehensive benchmark covering multiple LLM families (OpenAI GPT‑3.5/GPT‑4, LLaMA, DeepSeek, Mistral) on a mixed set of real and artificially injected vulnerabilities.
- Proof‑of‑Vulnerability (PoV) execution framework that re‑runs the original exploit against each patched program to verify whether the vulnerability is truly mitigated.
- Empirical evidence that LLMs patch real vulnerabilities more reliably than artificial ones.
- Analysis of overlap vs. complementarity among models, highlighting which bugs are fixed by several LLMs and which are only fixed by a single model.
- Guidelines for practitioners on selecting and combining LLMs to maximize automated patch coverage.
Methodology
- Dataset construction
  - Real vulnerabilities: 150 publicly disclosed CVEs with reproducible exploit code.
  - Artificial vulnerabilities: 150 synthetic bugs injected into open‑source projects using a mutation engine that mimics common security patterns (e.g., buffer overflows, SQL injection).
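To make the injection idea concrete, here is a minimal, hypothetical sketch of a mutation rule in the spirit described above (it is not the paper's engine): a bounds-checked string copy in C source is weakened into an unchecked one, producing a buffer‑overflow‑style defect.

```python
import re

# Hypothetical mutation rule (illustrative only, not the paper's engine):
# weaken a bounded strncpy into an unbounded strcpy, injecting a
# buffer-overflow-style defect into C source. The toy regex ignores
# nested parentheses in the length argument.
STRNCPY_CALL = re.compile(r"strncpy\(([^,]+),([^,]+),[^)]+\)")

def inject_overflow(c_source: str) -> str:
    """Return a mutated copy of the source with the first safe copy weakened."""
    return STRNCPY_CALL.sub(r"strcpy(\1,\2)", c_source, count=1)

if __name__ == "__main__":
    print(inject_overflow("strncpy(dst, user_input, 63);"))
    # -> strcpy(dst, user_input);
```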
- LLM prompting
  - A uniform “one‑shot” prompt was used: the model receives the vulnerable source file plus a brief description of the exploit and is asked to return a patched version.
  - No fine‑tuning or multi‑turn interaction; the study focuses on the out‑of‑the‑box capability of each model.
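A minimal sketch of how such a single-turn prompt might be assembled; the template wording and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompt:

```python
# Illustrative one-shot prompt builder (not the paper's exact wording);
# it mirrors the described inputs: the vulnerable source file plus a
# brief description of the exploit, with the patched file as the reply.
PROMPT_TEMPLATE = (
    "You are given a source file that contains a security vulnerability.\n\n"
    "Exploit description:\n{exploit_description}\n\n"
    "Vulnerable file ({file_name}):\n{file_contents}\n\n"
    "Return the complete patched file, changing only what is needed "
    "to fix the vulnerability."
)

def build_prompt(file_name: str, file_contents: str, exploit_description: str) -> str:
    """Assemble a single-turn ('one-shot') patching prompt."""
    return PROMPT_TEMPLATE.format(
        file_name=file_name,
        file_contents=file_contents,
        exploit_description=exploit_description,
    )
```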
- Patch validation
  - The authors compile the patched code and run the original PoV test harness.
  - A patch is counted as successful only if the PoV test fails (i.e., the exploit no longer works) and the program’s original functionality remains intact (checked via regression tests).
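A minimal sketch of this accept/reject logic, assuming placeholder build, exploit, and regression-test commands (`make`, an `exploit.sh` PoV script, `make test`); real projects would wire in their own harness:

```python
import subprocess

def run(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def validate_patch(project_dir: str) -> bool:
    """Accept a patched project only if it builds, the PoV exploit no longer
    succeeds, and the original regression tests still pass.

    The three commands below are placeholders for the project's real build,
    exploit-reproduction, and test steps.
    """
    builds = run(["make", "-C", project_dir])
    exploit_reproduces = run(["bash", f"{project_dir}/exploit.sh"])  # PoV harness
    tests_pass = run(["make", "-C", project_dir, "test"])
    return builds and not exploit_reproduces and tests_pass
```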
- Metrics
  - Patch success rate (percentage of vulnerabilities fixed).
  - Overlap: proportion of bugs fixed by multiple models.
  - Complementarity: bugs fixed exclusively by a single model.
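A minimal sketch of how these three metrics can be computed from per-model results; the `fixed` mapping (model name to the set of bug IDs it successfully patched) is an assumed data layout, not the paper's code:

```python
from typing import Dict, Set

def patch_metrics(fixed: Dict[str, Set[str]], total_bugs: int) -> Dict[str, dict]:
    """Per-model success rate, overlap (fixed by >=2 models), and unique fixes."""
    report = {}
    for model, bugs in fixed.items():
        # Bugs fixed by at least one *other* model.
        others = set().union(*(b for m, b in fixed.items() if m != model))
        report[model] = {
            "success_rate": len(bugs) / total_bugs,
            "overlap": len(bugs & others) / total_bugs,   # also fixed elsewhere
            "unique": len(bugs - others) / total_bugs,    # fixed by this model only
        }
    return report
```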
Results & Findings
| Model | Success on Real CVEs | Success on Artificial Bugs | Overlap (≥2 models) | Unique (only this model) |
|---|---|---|---|---|
| GPT‑4 | 68 % | 42 % | 31 % | 12 % |
| GPT‑3.5 | 55 % | 38 % | 28 % | 9 % |
| LLaMA‑2‑13B | 48 % | 30 % | 22 % | 7 % |
| DeepSeek‑7B | 45 % | 28 % | 20 % | 6 % |
| Mistral‑7B | 50 % | 33 % | 24 % | 8 % |
- Real > artificial: Every model performed better on authentic CVEs than on artificial bugs (by 17‑26 percentage points in the table above), suggesting that the natural context and code patterns in real bugs help LLMs generate correct fixes.
- Variability across models: No single model dominates; for many bugs, only one model succeeded, underscoring complementarity.
- Complementary ensembles: Combining the top three models (GPT‑4, GPT‑3.5, Mistral) raises the overall coverage to ~82 % on real vulnerabilities, compared to 68 % for GPT‑4 alone.
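The ensemble figure follows from taking the union of each model's fixed-bug set; a toy illustration with made-up bug IDs and a 6-bug benchmark (not the paper's data):

```python
# Toy data: which bugs each model fixed, out of a hypothetical 6-bug benchmark.
fixed = {
    "gpt-4":   {"CVE-A", "CVE-B", "CVE-C", "CVE-D"},  # 4/6 ≈ 67% alone
    "gpt-3.5": {"CVE-B", "CVE-E"},
    "mistral": {"CVE-A", "CVE-C"},
}
ensemble = set().union(*fixed.values())               # fixed by at least one model
print(f"ensemble coverage: {len(ensemble) / 6:.0%}")  # -> 83% (5 of 6 bugs)
```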
Practical Implications
- Automated triage pipelines: Security teams can integrate LLM‑driven one‑shot patch generation as a first‑line defense, automatically producing candidate patches that are then vetted by human reviewers.
- Model selection matters: Choosing a single “best” LLM may leave many bugs unpatched; a lightweight ensemble (e.g., GPT‑4 + Mistral) can dramatically improve coverage with modest extra compute.
- Focus on real‑world code: Training or prompting strategies that expose models to authentic codebases (rather than synthetic examples) are likely to yield better patch quality.
- Continuous integration: The PoV execution framework can be hooked into CI pipelines to automatically reject patches that fail the exploit test, ensuring only verified fixes are merged.
- Cost‑benefit balance: Since GPT‑4 achieves the highest absolute success rate, organizations with tighter budgets might start with GPT‑3.5 or open‑source alternatives (LLaMA, Mistral) and only invoke the larger model for the hardest cases.
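One way to act on the last two points is a cost-aware cascade: cheaper models propose a patch first, and the larger model is invoked only when their candidates fail validation. The sketch below assumes hypothetical `generate_patch` and `validate_patch` hooks (e.g., the validation logic from the Methodology section):

```python
from typing import Callable, Dict, Optional

def cascade_patch(
    vuln: Dict[str, str],
    models: list[str],
    generate_patch: Callable[[str, Dict[str, str]], str],
    validate_patch: Callable[[str], bool],
) -> Optional[str]:
    """Try models cheapest-first; stop at the first patch that passes validation.

    `generate_patch` and `validate_patch` are assumed hooks: the former calls an
    LLM with a one-shot prompt, the latter runs the PoV and regression checks.
    """
    for model in models:                  # e.g. ["mistral-7b", "gpt-3.5", "gpt-4"]
        candidate = generate_patch(model, vuln)
        if validate_patch(candidate):
            return candidate              # first verified fix wins
    return None                           # escalate to a human reviewer
```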
Limitations & Future Work
- One‑shot prompting only: The study does not explore multi‑turn interactions or iterative refinement, which could boost success rates.
- Synthetic bug realism: Although the artificial vulnerabilities follow common patterns, they may still lack the nuanced context of real bugs, potentially skewing the “real vs. artificial” gap.
- Scalability to large codebases: Experiments were limited to relatively small functions; handling multi‑file projects and complex build systems remains an open challenge.
- Security of generated patches: The paper focuses on functional correctness; future work should assess whether LLM patches introduce new, subtle security issues.
Bottom line: LLMs are already competent at automatically patching many real security flaws, especially when used in a complementary ensemble. As prompting techniques and model capabilities evolve, we can expect even higher automation levels in vulnerability remediation—making LLM‑assisted patching a practical tool for today’s DevSecOps pipelines.
Authors
- Aayush Garg
- Zanis Ali Khan
- Renzo Degiovanni
- Qiang Tang
Paper Information
- arXiv ID: 2511.23408v1
- Categories: cs.CR, cs.AI, cs.SE
- Published: November 28, 2025