[Paper] LLMs in Code Vulnerability Analysis: A Proof of Concept
Source: arXiv - 2601.08691v1
Overview
The paper investigates whether modern Large Language Models (LLMs) can be used to automate core security tasks on C/C++ code—spotting vulnerabilities, estimating their severity, and even generating patches. By testing both code‑focused and general‑purpose open‑source LLMs on two public vulnerability datasets, the authors show that LLM‑driven analysis is feasible and, with fine‑tuning, can outperform zero‑shot prompting.
Key Contributions
- Empirical benchmark of five recent LLM families (code‑specialized and general‑purpose) on vulnerability detection, severity prediction, and automated repair.
- Comparison of fine‑tuning vs. prompt‑based (zero‑shot/few‑shot) strategies, demonstrating a consistent advantage for fine‑tuned models.
- Insight into model behavior: code‑specific models shine in zero/few‑shot settings on harder tasks, while general models stay competitive after fine‑tuning.
- Critical evaluation of existing code‑generation metrics (CodeBLEU, CodeBERTScore, BLEU, ChrF), exposing their limitations in capturing true repair quality.
- Open‑source proof‑of‑concept pipeline that can be extended to other languages or security datasets.
Methodology
- Datasets – The authors used two well‑known C/C++ vulnerability corpora:
  - Big‑Vul – a collection of real‑world vulnerable functions with CVE annotations.
  - Vul‑Repair – pairs of vulnerable code snippets and their human‑written patches.
- Models – Five LLM families were selected, each represented by a code‑specialized and a general‑purpose open‑source variant (e.g., CodeLlama vs. Llama‑2).
- Task formulation (a prompting sketch follows this list):
  - Vulnerability identification: binary classification (vulnerable vs. clean).
  - Severity & access‑complexity prediction: multi‑class classification mirroring CVSS fields.
  - Patch generation: sequence‑to‑sequence generation of a repaired code snippet.
- Training regimes (a fine‑tuning sketch also follows this list):
  - Fine‑tuning: full‑model updates on the task‑specific training split.
  - Prompt‑based: zero‑shot (plain instruction) and few‑shot (≤5 examples) prompting without weight updates.
- Evaluation – Standard classification metrics (accuracy, F1) for detection/severity, and several code‑generation metrics (CodeBLEU, CodeBERTScore, BLEU, ChrF) for repair quality.
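To make the prompt‑based regime concrete, here is a minimal, hypothetical zero‑shot formulation of the vulnerability‑identification task using Hugging Face transformers. The checkpoint name, prompt wording, and one‑word label parsing are illustrative assumptions, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Illustrative checkpoint; any open-source instruction-tuned code LLM could stand in.
model_name = "codellama/CodeLlama-7b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

snippet = """void copy_name(char *dst, const char *src) {
    strcpy(dst, src);   /* no bounds check */
}"""

# Zero-shot: a plain instruction, no examples, no weight updates.
prompt = (
    "You are a security auditor reviewing C/C++ code.\n"
    "Answer with exactly one word, VULNERABLE or CLEAN.\n\n"
    f"Code:\n{snippet}\n\nAnswer:"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
label = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(label.strip())  # parsed into the binary vulnerable/clean decision
```

A few‑shot variant would simply prepend up to five labeled code/answer pairs to the same prompt, still without any weight updates.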
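For the fine‑tuning regime, the sketch below shows the general shape of full‑model updates on a detection‑style training split: tokenize, label, train. It deliberately uses a small encoder classifier (CodeBERT) as a stand‑in for the paper's larger LLMs, and the two toy samples are made up for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for a Big-Vul-style training split: code paired with a binary label.
train = Dataset.from_dict({
    "code": ["void f(char *s){ char b[8]; strcpy(b, s); }",
             "void g(const char *s){ printf(\"%s\", s); }"],
    "label": [1, 0],  # 1 = vulnerable, 0 = clean
})

model_name = "microsoft/codebert-base"  # small stand-in, not the paper's models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def encode(batch):
    return tok(batch["code"], truncation=True, padding="max_length", max_length=256)

train = train.map(encode, batched=True)

args = TrainingArguments(output_dir="ft-vuln-detector", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train).train()
```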
Results & Findings
- Fine‑tuning wins: Across all three tasks, fine‑tuned models achieved higher accuracy/F1 than any zero‑ or few‑shot prompt configuration.
- Code‑specialized models excel in low‑resource prompting: When only a handful of examples were provided, models trained on code data outperformed their general counterparts, especially on the more complex patch‑generation task.
- General‑purpose models close the gap after fine‑tuning: Once fine‑tuned, the performance difference between code‑specific and general models shrank dramatically, suggesting that task‑specific data matters more than pre‑training domain.
- Metric mismatch: High scores on BLEU/ChrF did not always correlate with functional correctness of patches, highlighting that current automatic metrics are insufficient for security‑critical code repair.
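The mismatch is easy to reproduce with surface‑similarity metrics alone. The sketch below uses sacrebleu on a single hypothetical patch pair in which the generated fix inverts the bounds check: token‑ and character‑level overlap stays high even though the overflow is still reachable.

```python
from sacrebleu.metrics import BLEU, CHRF

# Hypothetical patch pair: human-written reference fix vs. model-generated candidate.
reference = ["if (len < buf_size) memcpy(buf, src, len);"]
generated = ["if (len > buf_size) memcpy(buf, src, len);"]  # guard inverted: overflow remains

bleu, chrf = BLEU(), CHRF()
print("BLEU:", round(bleu.corpus_score(generated, [reference]).score, 1))
print("ChrF:", round(chrf.corpus_score(generated, [reference]).score, 1))
# Both scores stay high despite the broken fix -- the kind of gap the paper
# attributes to surface-level similarity metrics.
```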
Practical Implications
- Automated triage – Development teams can plug a fine‑tuned LLM into CI pipelines to flag high‑severity C/C++ bugs early, reducing manual review load.
- Assistive patch generation – Security engineers can use the model’s suggested fixes as a starting point, accelerating remediation while still applying human verification.
- Cost‑effective security tooling – Open‑source LLMs (no licensing fees) can be customized for an organization’s codebase, offering a cheaper alternative to commercial static analysis suites.
- Metric redesign – The study urges tool builders to adopt functional or execution‑based validation (e.g., test‑suite pass rates) rather than relying solely on surface‑level similarity scores.
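One way to act on that recommendation is to score candidate patches by whether they compile and pass tests rather than by textual overlap. Below is a minimal, hypothetical harness: the single‑file cc build, the AddressSanitizer flag, and the pass‑rate aggregation are assumptions standing in for a project's real build system and test suite.

```python
import pathlib
import subprocess
import tempfile

def patch_passes_tests(patched_source: str, test_source: str) -> bool:
    """Compile a candidate patch together with its unit tests, then run them."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = pathlib.Path(tmp)
        (tmp_dir / "patched.c").write_text(patched_source)
        (tmp_dir / "tests.c").write_text(test_source)
        binary = tmp_dir / "tests"
        build = subprocess.run(
            ["cc", "-fsanitize=address", "-o", str(binary),
             str(tmp_dir / "patched.c"), str(tmp_dir / "tests.c")],
            capture_output=True,
        )
        if build.returncode != 0:
            return False                 # candidate does not even compile
        run = subprocess.run([str(binary)], capture_output=True, timeout=30)
        return run.returncode == 0       # tests (and ASan checks) all passed

def pass_rate(candidates: list[tuple[str, str]]) -> float:
    """Fraction of (patch, tests) pairs that build and pass: an execution-based score."""
    return sum(patch_passes_tests(p, t) for p, t in candidates) / len(candidates)
```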
Limitations & Future Work
- Scope limited to C/C++: Results may not transfer directly to other languages (e.g., JavaScript, Rust) without additional data.
- Dataset bias: Both Big‑Vul and Vul‑Repair contain curated, relatively small samples; real‑world codebases may present more diverse vulnerability patterns.
- Security guarantees – Generated patches are not guaranteed to be safe; thorough testing and code review remain essential.
- Metric development – The authors call for new evaluation frameworks that capture functional correctness and security impact, a direction for follow‑up research.
Bottom line: This proof‑of‑concept demonstrates that with modest fine‑tuning, open‑source LLMs can become practical allies in the ongoing battle against software vulnerabilities, opening the door for more intelligent, developer‑friendly security tooling.
Authors
- Shaznin Sultana
- Sadia Afreen
- Nasir U. Eisty
Paper Information
- arXiv ID: 2601.08691v1
- Categories: cs.SE
- Published: January 13, 2026