[Paper] LLMs in Code Vulnerability Analysis: A Proof of Concept
Source: arXiv - 2601.08691v1
Overview
The paper investigates whether modern Large Language Models (LLMs) can be used to automate core security tasks on C/C++ code—spotting vulnerabilities, estimating their severity, and even generating patches. By testing both code‑focused and general‑purpose open‑source LLMs on two public vulnerability datasets, the authors show that LLM‑driven analysis is feasible and, with fine‑tuning, can outperform zero‑shot prompting.
Key Contributions
- Empirical benchmark of five recent LLM families (code‑specialized and general‑purpose) on vulnerability detection, severity prediction, and automated repair.
- Comparison of fine‑tuning vs. prompt‑based (zero‑shot/few‑shot) strategies, demonstrating a consistent advantage for fine‑tuned models.
- Insight into model behavior: code‑specific models shine in zero/few‑shot settings on harder tasks, while general models stay competitive after fine‑tuning.
- Critical evaluation of existing code‑generation metrics (CodeBLEU, CodeBERTScore, BLEU, ChrF), exposing their limitations in capturing true repair quality.
- Open‑source proof‑of‑concept pipeline that can be extended to other languages or security datasets.
Methodology
- Datasets – The authors used two well‑known C/C++ vulnerability corpora:
  - Big‑Vul – a collection of real‑world vulnerable functions with CVE annotations.
  - Vul‑Repair – pairs of vulnerable code snippets and their human‑written patches.
- Models – Five LLM families were selected, each represented by a code‑specialized and a general‑purpose open‑source variant (e.g., CodeLlama vs. Llama‑2).
- Task formulation (a prompting sketch follows this list):
  - Vulnerability identification: binary classification (vulnerable vs. clean).
  - Severity & access‑complexity prediction: multi‑class classification mirroring CVSS fields.
  - Patch generation: sequence‑to‑sequence generation of a repaired code snippet.
- Training regimes (a fine‑tuning sketch also follows this list):
  - Fine‑tuning: full‑model updates on the task‑specific training split.
  - Prompt‑based: zero‑shot (plain instruction) and few‑shot (≤5 examples) prompting without weight updates.
- Evaluation – Standard classification metrics (accuracy, F1) for detection/severity, and several code‑generation metrics (CodeBLEU, CodeBERTScore, BLEU, ChrF) for repair quality.
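To make the prompt‑based regime concrete, here is a minimal, hypothetical zero‑shot formulation of the vulnerability‑identification task using Hugging Face transformers. The checkpoint name, prompt wording, and one‑word label parsing are illustrative assumptions, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Illustrative checkpoint; any open-source instruction-tuned code LLM could stand in.
model_name = "codellama/CodeLlama-7b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

snippet = """void copy_name(char *dst, const char *src) {
    strcpy(dst, src);   /* no bounds check */
}"""

# Zero-shot: a plain instruction, no examples, no weight updates.
prompt = (
    "You are a security auditor reviewing C/C++ code.\n"
    "Answer with exactly one word, VULNERABLE or CLEAN.\n\n"
    f"Code:\n{snippet}\n\nAnswer:"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
label = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(label.strip())  # parsed into the binary vulnerable/clean decision
```

A few‑shot variant would simply prepend up to five labeled code/answer pairs to the same prompt, still without any weight updates.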
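For the fine‑tuning regime, the sketch below shows the general shape of full‑model updates on a detection‑style training split: tokenize, label, train. It deliberately uses a small encoder classifier (CodeBERT) as a stand‑in for the paper's larger LLMs, and the two toy samples are made up for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for a Big-Vul-style training split: code paired with a binary label.
train = Dataset.from_dict({
    "code": ["void f(char *s){ char b[8]; strcpy(b, s); }",
             "void g(const char *s){ printf(\"%s\", s); }"],
    "label": [1, 0],  # 1 = vulnerable, 0 = clean
})

model_name = "microsoft/codebert-base"  # small stand-in, not the paper's models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def encode(batch):
    return tok(batch["code"], truncation=True, padding="max_length", max_length=256)

train = train.map(encode, batched=True)

args = TrainingArguments(output_dir="ft-vuln-detector", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train).train()
```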
Results & Findings
- Fine‑tuning wins: Across all three tasks, fine‑tuned models achieved higher accuracy/F1 than any zero‑ or few‑shot prompt configuration.
- Code‑specialized models excel in low‑resource prompting: When only a handful of examples were provided, models trained on code data outperformed their general counterparts, especially on the more complex patch‑generation task.
- General‑purpose models close the gap after fine‑tuning: Once fine‑tuned, the performance difference between code‑specific and general models shrank dramatically, suggesting that task‑specific data matters more than pre‑training domain.
- Metric mismatch: High scores on BLEU/ChrF did not always correlate with functional correctness of patches, highlighting that current automatic metrics are insufficient for security‑critical code repair.
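The mismatch is easy to reproduce with surface‑similarity metrics alone. The sketch below uses sacrebleu on a single hypothetical patch pair in which the generated fix inverts the bounds check: token‑ and character‑level overlap stays high even though the overflow is still reachable.

```python
from sacrebleu.metrics import BLEU, CHRF

# Hypothetical patch pair: human-written reference fix vs. model-generated candidate.
reference = ["if (len < buf_size) memcpy(buf, src, len);"]
generated = ["if (len > buf_size) memcpy(buf, src, len);"]  # guard inverted: overflow remains

bleu, chrf = BLEU(), CHRF()
print("BLEU:", round(bleu.corpus_score(generated, [reference]).score, 1))
print("ChrF:", round(chrf.corpus_score(generated, [reference]).score, 1))
# Both scores stay high despite the broken fix -- the kind of gap the paper
# attributes to surface-level similarity metrics.
```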
Practical Implications
- Automated triage – Development teams can plug a fine‑tuned LLM into CI pipelines to flag high‑severity C/C++ bugs early, reducing manual review load.
- Assistive patch generation – Security engineers can use the model’s suggested fixes as a starting point, accelerating remediation while still applying human verification.
- Cost‑effective security tooling – Open‑source LLMs (no licensing fees) can be customized for an organization’s codebase, offering a cheaper alternative to commercial static analysis suites.
- Metric redesign – The study urges tool builders to adopt functional or execution‑based validation (e.g., test‑suite pass rates) rather than relying solely on surface‑level similarity scores.
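One way to act on that recommendation is to score candidate patches by whether they compile and pass tests rather than by textual overlap. Below is a minimal, hypothetical harness: the single‑file cc build, the AddressSanitizer flag, and the pass‑rate aggregation are assumptions standing in for a project's real build system and test suite.

```python
import pathlib
import subprocess
import tempfile

def patch_passes_tests(patched_source: str, test_source: str) -> bool:
    """Compile a candidate patch together with its unit tests, then run them."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = pathlib.Path(tmp)
        (tmp_dir / "patched.c").write_text(patched_source)
        (tmp_dir / "tests.c").write_text(test_source)
        binary = tmp_dir / "tests"
        build = subprocess.run(
            ["cc", "-fsanitize=address", "-o", str(binary),
             str(tmp_dir / "patched.c"), str(tmp_dir / "tests.c")],
            capture_output=True,
        )
        if build.returncode != 0:
            return False                 # candidate does not even compile
        run = subprocess.run([str(binary)], capture_output=True, timeout=30)
        return run.returncode == 0       # tests (and ASan checks) all passed

def pass_rate(candidates: list[tuple[str, str]]) -> float:
    """Fraction of (patch, tests) pairs that build and pass: an execution-based score."""
    return sum(patch_passes_tests(p, t) for p, t in candidates) / len(candidates)
```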
Limitations & Future Work
- Scope limited to C/C++: Results may not transfer directly to other languages (e.g., JavaScript, Rust) without additional data.
- Dataset bias: Both Big‑Vul and Vul‑Repair contain curated, relatively small samples; real‑world codebases may present more diverse vulnerability patterns.
- Security guarantees – Generated patches are not guaranteed to be safe; thorough testing and code review remain essential.
- Metric development – The authors call for new evaluation frameworks that capture functional correctness and security impact, a direction for follow‑up research.
Bottom line: This proof‑of‑concept demonstrates that with modest fine‑tuning, open‑source LLMs can become practical allies in the ongoing battle against software vulnerabilities, opening the door for more intelligent, developer‑friendly security tooling.
Authors
- Shaznin Sultana
- Sadia Afreen
- Nasir U. Eisty
Paper Information
- arXiv ID: 2601.08691v1
- Categories: cs.SE
- Published: January 13, 2026