[Paper] From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection
Source: arXiv - 2512.10485v1
Overview
The paper From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection examines whether state‑of‑the‑art deep‑learning (DL) and large‑language‑model (LLM) approaches that excel on academic benchmarks can actually spot bugs in real‑world code. By testing two popular DL models (ReVeal and LineVul) and four leading LLMs on a freshly collected, out‑of‑distribution set of Linux‑kernel fixes, the authors expose a stark performance gap between “lab” results and production‑level security needs.
Key Contributions
- Comprehensive cross‑dataset study – Trains and evaluates ReVeal and LineVul on four widely used vulnerability datasets (Juliet, Devign, BigVul, ICVul) and visualizes their learned code embeddings with t‑SNE.
- Real‑world benchmark (VentiVul) – Curates a time‑wise out‑of‑distribution dataset of 20 Linux‑kernel vulnerabilities fixed in May 2025, representing the kind of code developers actually encounter.
- LLM comparison – Benchmarks four pretrained LLMs (Claude 3.5 Sonnet, GPT‑o3‑mini, GPT‑4o, GPT‑5) on VentiVul using the same detection pipeline as the DL models.
- Empirical evidence of poor generalization – Shows that the DL models lose most of their benchmark performance on VentiVul and that zero‑shot LLMs fare only marginally better, highlighting over‑fitting to dataset‑specific patterns.
- Evaluation framework – Proposes a deployment‑oriented methodology (independent training per dataset, representation analysis, out‑of‑distribution testing) that can be reused by researchers and security teams.
Methodology
- Model selection – Two representative DL detectors:
- ReVeal (graph‑neural‑network based; operates on code property graphs that combine syntax, control‑flow, and data‑dependence information).
- LineVul (transformer‑based, built on CodeBERT; takes tokenized function source as input and scores individual lines for localization).
- Dataset preparation – Each model is trained from scratch on one of four benchmark datasets (Juliet, Devign, BigVul, ICVul). No cross‑dataset fine‑tuning is performed, mirroring typical academic practice.
- Embedding inspection – After training, each model’s internal code representations are projected into 2‑D with t‑SNE, and the authors check whether vulnerable and non‑vulnerable snippets form separate clusters (a minimal t‑SNE sketch follows this list).
- Real‑world test set (VentiVul) – 20 vulnerability patches from the Linux kernel, all fixed in May 2025, i.e., after the cut‑off dates of the training datasets. Each patch is split into a “vulnerable” (pre‑fix) and a “non‑vulnerable” (post‑fix) snippet (a patch‑extraction sketch follows this list).
- LLM prompting – The same VentiVul snippets are fed to the four LLMs via a zero‑shot prompt asking the model to label the code as vulnerable or safe; no fine‑tuning or few‑shot examples are used (a prompting‑and‑scoring sketch follows this list).
- Metrics – Standard detection metrics (precision, recall, F1) are reported for each model/dataset combination, with a focus on the drop‑off when moving to VentiVul.
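To make the representation analysis concrete, here is a minimal sketch of the kind of t‑SNE inspection described above. The paper does not publish its analysis scripts, so the `embed` hook (a function returning one fixed‑size vector per code snippet from a trained detector such as ReVeal or LineVul) and all parameter choices are illustrative assumptions:

```python
# Minimal t-SNE inspection sketch (illustrative; `embed` is a hypothetical
# hook into a trained detector such as ReVeal or LineVul).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(snippets, labels, embed):
    """Project learned code embeddings to 2-D and color by label.

    snippets: list of code strings
    labels:   list of 0/1 ints (1 = vulnerable)
    embed:    callable mapping a list of snippets to an (n, d) array
    """
    X = np.asarray(embed(snippets))            # (n, d) learned representations
    Z = TSNE(n_components=2, perplexity=30,    # common default-ish settings
             init="pca", random_state=0).fit_transform(X)
    labels = np.asarray(labels)
    for value, name in [(0, "non-vulnerable"), (1, "vulnerable")]:
        pts = Z[labels == value]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.title("t-SNE of learned code embeddings")
    plt.show()
```

A well‑separated plot would suggest the model has learned security‑relevant structure; the paper reports the opposite (see “Embedding collapse” below).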
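The pre‑fix / post‑fix split can be reproduced from a fix commit with plain git plumbing. The sketch below is one plausible way to build such pairs, assuming a local Linux‑kernel checkout and a list of fix‑commit hashes; the paper’s exact extraction tooling is not described:

```python
# Sketch: extract vulnerable (pre-fix) and patched (post-fix) versions of the
# files touched by a fix commit. Assumes a local git checkout of the kernel.
# Files newly added by the fix have no pre-fix version and need extra handling.
import subprocess

def git(repo, *args):
    """Run a git command in `repo` and return its stdout as text."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def pre_post_pairs(repo, fix_commit):
    """Yield (path, pre_fix_source, post_fix_source) for each file in a fix."""
    changed = git(repo, "diff", "--name-only",
                  f"{fix_commit}^", fix_commit).splitlines()
    for path in changed:
        pre = git(repo, "show", f"{fix_commit}^:{path}")   # vulnerable version
        post = git(repo, "show", f"{fix_commit}:{path}")   # fixed version
        yield path, pre, post
```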
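For the LLM arm, the evaluation amounts to a zero‑shot classification prompt plus the standard metrics. The sketch below uses the OpenAI Python SDK as an example client; the prompt wording, model identifier, and answer parsing are assumptions, not the paper’s exact setup:

```python
# Sketch: zero-shot vulnerability labelling with an LLM, then precision/recall/F1.
# Prompt text and model name are illustrative, not the paper's exact configuration.
from openai import OpenAI
from sklearn.metrics import precision_recall_fscore_support

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("You are a security auditor. Answer with exactly one word, "
          "VULNERABLE or SAFE, for the following C code:\n\n{code}")

def classify(code: str, model: str = "gpt-4o") -> int:
    """Return 1 if the model labels the snippet vulnerable, else 0."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    answer = resp.choices[0].message.content.strip().upper()
    return 1 if answer.startswith("VULNERABLE") else 0

def evaluate(snippets, labels, model="gpt-4o"):
    """Compute precision, recall, and F1 for zero-shot predictions."""
    preds = [classify(code, model) for code in snippets]
    p, r, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0)
    return {"precision": p, "recall": r, "f1": f1}
```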
Results & Findings
| Model (training dataset) | Avg. F1 on benchmark | F1 on VentiVul |
|---|---|---|
| ReVeal (Juliet) | 0.84 | 0.31 |
| ReVeal (Devign) | 0.78 | 0.28 |
| LineVul (BigVul) | 0.81 | 0.34 |
| LineVul (ICVul) | 0.77 | 0.30 |
| Claude 3.5 Sonnet (zero‑shot) | – | 0.36 |
| GPT‑o3‑mini (zero‑shot) | – | 0.22 |
| GPT‑4o (zero‑shot) | – | 0.38 |
| GPT‑5 (zero‑shot) | – | 0.41 |
Key observations
- Embedding collapse – t‑SNE plots show little separation between vulnerable and safe code, indicating that the learned representations are not capturing robust security semantics.
- Dataset over‑fitting – F1 scores drop from roughly 0.8 on the original benchmarks to about 0.3 on VentiVul, confirming poor cross‑distribution generalization.
- LLMs are not a silver bullet – Even the most advanced LLM (GPT‑5) only marginally outperforms the DL models, and all still miss a majority of real vulnerabilities.
- Time‑wise OOD effect – The fact that VentiVul consists of patches written after the training data cut‑off amplifies the distribution shift, mirroring a realistic deployment scenario.
Practical Implications
- Security tooling teams should be skeptical of benchmark‑only claims – A model that reaches an F1 around 0.8 on a curated benchmark such as Devign may still be of little use for day‑to‑day code review.
- Dataset quality matters – Curated, up‑to‑date, and diverse code corpora (including recent kernel patches, open‑source libraries, and real‑world CI logs) are essential for training models that survive production drift.
- Hybrid approaches – Combining static‑analysis heuristics with DL/LLM predictions could mitigate false negatives, especially when models are uncertain (see the sketch after this list).
- Continuous re‑training – Deployments need pipelines that ingest newly fixed vulnerabilities (e.g., from CVE databases) to keep the model’s knowledge current.
- Explainability hooks – Since the embeddings do not clearly separate vulnerable from safe code, adding attention visualization or program‑analysis‑backed explanations can help developers trust (or reject) model suggestions.
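As a rough illustration of the hybrid idea above, a reviewer‑facing tool could union a conservative static‑analysis verdict with the learned detector’s score, surfacing a finding if either source fires. The `static_findings` and `model_score` inputs below are hypothetical placeholders for whatever analyzer and detector a team already runs; this is a sketch of one possible policy, not a method from the paper:

```python
# Sketch of a hybrid verdict: flag code when either a static analyzer or a
# learned detector is sufficiently suspicious. All inputs are placeholders.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    flagged: bool
    reasons: list = field(default_factory=list)

def hybrid_verdict(static_findings, model_score,
                   model_threshold=0.5, low_confidence_band=(0.35, 0.65)):
    """Combine heuristic and learned signals into one review decision.

    static_findings: list of strings from a static analyzer (empty if clean)
    model_score:     detector's probability that the snippet is vulnerable
    """
    reasons = list(static_findings)
    if model_score >= model_threshold:
        reasons.append(f"model score {model_score:.2f} >= {model_threshold}")
    # When the model sits in its uncertain band and the analyzer is silent,
    # route the snippet to manual review instead of passing it through.
    lo, hi = low_confidence_band
    if lo <= model_score <= hi and not static_findings:
        reasons.append("model uncertain; route to manual review")
    return Verdict(flagged=bool(reasons), reasons=reasons)
```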
Limitations & Future Work
- Small real‑world test set – VentiVul contains only 20 patches; larger, more varied OOD datasets are needed to confirm the trends.
- Zero‑shot LLM evaluation – The study does not explore fine‑tuning or few‑shot prompting, which could improve LLM performance.
- Focus on C/Linux kernel – Results may differ for other languages or ecosystems (e.g., JavaScript, Rust).
- Representation analysis limited to t‑SNE – More rigorous probing (e.g., linear‑separability tests, mutual‑information estimates) would better characterize embedding quality (a linear‑probe sketch follows this list).
- Future directions suggested by the authors include building a continuously updated “vulnerability stream” dataset, exploring contrastive learning objectives for code security, and integrating dynamic execution traces to enrich model inputs.
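One of the probing ideas mentioned above, a linear‑separability test, is straightforward to run once embeddings are available. The sketch below fits a logistic‑regression probe on frozen embeddings; the `embed` hook is the same hypothetical interface assumed in the t‑SNE sketch earlier:

```python
# Sketch: linear probe on frozen code embeddings. If a linear classifier
# cannot beat chance, the representation carries little vulnerability signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_score(snippets, labels, embed, folds=5):
    """Return mean cross-validated accuracy of a linear probe on embeddings."""
    X = np.asarray(embed(snippets))     # frozen (n, d) embeddings
    y = np.asarray(labels)              # 0 = non-vulnerable, 1 = vulnerable
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, X, y, cv=folds)
    return float(scores.mean())
```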
Authors
- Chaomeng Lu
- Bert Lagaisse
Paper Information
- arXiv ID: 2512.10485v1
- Categories: cs.CR, cs.LG, cs.SE
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10485v1