[Paper] From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection
Source: arXiv - 2512.10485v1
Overview
The paper From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection examines whether state‑of‑the‑art deep‑learning (DL) and large‑language‑model (LLM) approaches that excel on academic benchmarks can actually spot bugs in real‑world code. By testing two popular DL models (ReVeal and LineVul) and four leading LLMs on a freshly collected, out‑of‑distribution set of Linux‑kernel fixes, the authors expose a stark performance gap between “lab” results and production‑level security needs.
Key Contributions
- Comprehensive cross‑dataset study – Trains and evaluates ReVeal and LineVul on four widely used vulnerability datasets (Juliet, Devign, BigVul, ICVul) and visualizes their learned code embeddings with t‑SNE.
- Real‑world benchmark (VentiVul) – Curates a time‑wise out‑of‑distribution dataset of 20 Linux‑kernel vulnerabilities fixed in May 2025, representing the kind of code developers actually encounter.
- LLM comparison – Benchmarks four pretrained LLMs (Claude 3.5 Sonnet, GPT‑o3‑mini, GPT‑4o, GPT‑5) on VentiVul using the same detection pipeline as the DL models.
- Empirical evidence of poor generalization – Shows that the DL models lose most of their benchmark performance on VentiVul and that zero‑shot LLMs fare only marginally better, highlighting over‑fitting to dataset‑specific patterns.
- Evaluation framework – Proposes a deployment‑oriented methodology (independent training per dataset, representation analysis, out‑of‑distribution testing) that can be reused by researchers and security teams.
Methodology
- Model selection – Two representative DL detectors:
- ReVeal (graph‑neural‑network based; operates on code property graphs that combine syntax, control‑flow, and data‑dependence information).
- LineVul (transformer‑based, built on CodeBERT; takes tokenized function source as input and scores individual lines for localization).
- Dataset preparation – Each model is trained from scratch on one of four benchmark datasets (Juliet, Devign, BigVul, ICVul). No cross‑dataset fine‑tuning is performed, mirroring typical academic practice.
- Embedding inspection – After training, each model’s internal code representations are projected into 2‑D with t‑SNE, and the authors check whether vulnerable and non‑vulnerable snippets form separate clusters (a minimal t‑SNE sketch follows this list).
- Real‑world test set (VentiVul) – 20 vulnerability patches from the Linux kernel, all fixed in May 2025, i.e., after the cut‑off dates of the training datasets. Each patch is split into a “vulnerable” (pre‑fix) and a “non‑vulnerable” (post‑fix) snippet (a patch‑extraction sketch follows this list).
- LLM prompting – The same VentiVul snippets are fed to the four LLMs via a zero‑shot prompt asking the model to label the code as vulnerable or safe; no fine‑tuning or few‑shot examples are used (a prompting‑and‑scoring sketch follows this list).
- Metrics – Standard detection metrics (precision, recall, F1) are reported for each model/dataset combination, with a focus on the drop‑off when moving to VentiVul.
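To make the representation analysis concrete, here is a minimal sketch of the kind of t‑SNE inspection described above. The paper does not publish its analysis scripts, so the `embed` hook (a function returning one fixed‑size vector per code snippet from a trained detector such as ReVeal or LineVul) and all parameter choices are illustrative assumptions:

```python
# Minimal t-SNE inspection sketch (illustrative; `embed` is a hypothetical
# hook into a trained detector such as ReVeal or LineVul).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(snippets, labels, embed):
    """Project learned code embeddings to 2-D and color by label.

    snippets: list of code strings
    labels:   list of 0/1 ints (1 = vulnerable)
    embed:    callable mapping a list of snippets to an (n, d) array
    """
    X = np.asarray(embed(snippets))            # (n, d) learned representations
    Z = TSNE(n_components=2, perplexity=30,    # common default-ish settings
             init="pca", random_state=0).fit_transform(X)
    labels = np.asarray(labels)
    for value, name in [(0, "non-vulnerable"), (1, "vulnerable")]:
        pts = Z[labels == value]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.title("t-SNE of learned code embeddings")
    plt.show()
```

A well‑separated plot would suggest the model has learned security‑relevant structure; the paper reports the opposite (see “Embedding collapse” below).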
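The pre‑fix / post‑fix split can be reproduced from a fix commit with plain git plumbing. The sketch below is one plausible way to build such pairs, assuming a local Linux‑kernel checkout and a list of fix‑commit hashes; the paper’s exact extraction tooling is not described:

```python
# Sketch: extract vulnerable (pre-fix) and patched (post-fix) versions of the
# files touched by a fix commit. Assumes a local git checkout of the kernel.
# Files newly added by the fix have no pre-fix version and need extra handling.
import subprocess

def git(repo, *args):
    """Run a git command in `repo` and return its stdout as text."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def pre_post_pairs(repo, fix_commit):
    """Yield (path, pre_fix_source, post_fix_source) for each file in a fix."""
    changed = git(repo, "diff", "--name-only",
                  f"{fix_commit}^", fix_commit).splitlines()
    for path in changed:
        pre = git(repo, "show", f"{fix_commit}^:{path}")   # vulnerable version
        post = git(repo, "show", f"{fix_commit}:{path}")   # fixed version
        yield path, pre, post
```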
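For the LLM arm, the evaluation amounts to a zero‑shot classification prompt plus the standard metrics. The sketch below uses the OpenAI Python SDK as an example client; the prompt wording, model identifier, and answer parsing are assumptions, not the paper’s exact setup:

```python
# Sketch: zero-shot vulnerability labelling with an LLM, then precision/recall/F1.
# Prompt text and model name are illustrative, not the paper's exact configuration.
from openai import OpenAI
from sklearn.metrics import precision_recall_fscore_support

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("You are a security auditor. Answer with exactly one word, "
          "VULNERABLE or SAFE, for the following C code:\n\n{code}")

def classify(code: str, model: str = "gpt-4o") -> int:
    """Return 1 if the model labels the snippet vulnerable, else 0."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    answer = resp.choices[0].message.content.strip().upper()
    return 1 if answer.startswith("VULNERABLE") else 0

def evaluate(snippets, labels, model="gpt-4o"):
    """Compute precision, recall, and F1 for zero-shot predictions."""
    preds = [classify(code, model) for code in snippets]
    p, r, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0)
    return {"precision": p, "recall": r, "f1": f1}
```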
Results & Findings
| Model (training dataset) | Avg. F1 on benchmark | F1 on VentiVul |
|---|---|---|
| ReVeal (Juliet) | 0.84 | 0.31 |
| ReVeal (Devign) | 0.78 | 0.28 |
| LineVul (BigVul) | 0.81 | 0.34 |
| LineVul (ICVul) | 0.77 | 0.30 |
| Claude 3.5 Sonnet (zero‑shot) | – | 0.36 |
| GPT‑o3‑mini (zero‑shot) | – | 0.22 |
| GPT‑4o (zero‑shot) | – | 0.38 |
| GPT‑5 (zero‑shot) | – | 0.41 |
Key observations
- Embedding collapse – t‑SNE plots show little separation between vulnerable and safe code, indicating that the learned representations are not capturing robust security semantics.
- Dataset over‑fitting – F1 scores drop from roughly 0.8 on the original benchmarks to about 0.3 on VentiVul, confirming poor cross‑distribution generalization.
- LLMs are not a silver bullet – Even the most advanced LLM (GPT‑5) only marginally outperforms the DL models, and all still miss a majority of real vulnerabilities.
- Time‑wise OOD effect – The fact that VentiVul consists of patches written after the training data cut‑off amplifies the distribution shift, mirroring a realistic deployment scenario.
Practical Implications
- Security tooling teams should be skeptical of benchmark‑only claims – A model that reaches an F1 around 0.8 on a curated benchmark such as Devign may still be of little use for day‑to‑day code review.
- Dataset quality matters – Curated, up‑to‑date, and diverse code corpora (including recent kernel patches, open‑source libraries, and real‑world CI logs) are essential for training models that survive production drift.
- Hybrid approaches – Combining static‑analysis heuristics with DL/LLM predictions could mitigate false negatives, especially when models are uncertain (see the sketch after this list).
- Continuous re‑training – Deployments need pipelines that ingest newly fixed vulnerabilities (e.g., from CVE databases) to keep the model’s knowledge current.
- Explainability hooks – Since the embeddings do not clearly separate vulnerable from safe code, adding attention visualization or program‑analysis‑backed explanations can help developers trust (or reject) model suggestions.
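As a rough illustration of the hybrid idea above, a reviewer‑facing tool could union a conservative static‑analysis verdict with the learned detector’s score, surfacing a finding if either source fires. The `static_findings` and `model_score` inputs below are hypothetical placeholders for whatever analyzer and detector a team already runs; this is a sketch of one possible policy, not a method from the paper:

```python
# Sketch of a hybrid verdict: flag code when either a static analyzer or a
# learned detector is sufficiently suspicious. All inputs are placeholders.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    flagged: bool
    reasons: list = field(default_factory=list)

def hybrid_verdict(static_findings, model_score,
                   model_threshold=0.5, low_confidence_band=(0.35, 0.65)):
    """Combine heuristic and learned signals into one review decision.

    static_findings: list of strings from a static analyzer (empty if clean)
    model_score:     detector's probability that the snippet is vulnerable
    """
    reasons = list(static_findings)
    if model_score >= model_threshold:
        reasons.append(f"model score {model_score:.2f} >= {model_threshold}")
    # When the model sits in its uncertain band and the analyzer is silent,
    # route the snippet to manual review instead of passing it through.
    lo, hi = low_confidence_band
    if lo <= model_score <= hi and not static_findings:
        reasons.append("model uncertain; route to manual review")
    return Verdict(flagged=bool(reasons), reasons=reasons)
```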
Limitations & Future Work
- Small real‑world test set – VentiVul contains only 20 patches; larger, more varied OOD datasets are needed to confirm the trends.
- Zero‑shot LLM evaluation – The study does not explore fine‑tuning or few‑shot prompting, which could improve LLM performance.
- Focus on C/Linux kernel – Results may differ for other languages or ecosystems (e.g., JavaScript, Rust).
- Representation analysis limited to t‑SNE – More rigorous probing (e.g., linear‑separability tests, mutual‑information estimates) would better characterize embedding quality (a linear‑probe sketch follows this list).
- Future directions suggested by the authors include building a continuously updated “vulnerability stream” dataset, exploring contrastive learning objectives for code security, and integrating dynamic execution traces to enrich model inputs.
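One of the probing ideas mentioned above, a linear‑separability test, is straightforward to run once embeddings are available. The sketch below fits a logistic‑regression probe on frozen embeddings; the `embed` hook is the same hypothetical interface assumed in the t‑SNE sketch earlier:

```python
# Sketch: linear probe on frozen code embeddings. If a linear classifier
# cannot beat chance, the representation carries little vulnerability signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_score(snippets, labels, embed, folds=5):
    """Return mean cross-validated accuracy of a linear probe on embeddings."""
    X = np.asarray(embed(snippets))     # frozen (n, d) embeddings
    y = np.asarray(labels)              # 0 = non-vulnerable, 1 = vulnerable
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, X, y, cv=folds)
    return float(scores.mean())
```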
Authors
- Chaomeng Lu
- Bert Lagaisse
Paper Information
- arXiv ID: 2512.10485v1
- Categories: cs.CR, cs.LG, cs.SE
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10485v1