[Paper] Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

Published: May 7, 2026 at 09:52 AM EDT

Source: arXiv - 2605.06279v1

Overview

Large language models (LLMs) are now a common coding assistant for developers, and the Python snippets they produce often pin explicit third‑party library versions. This paper presents the first systematic, large‑scale measurement of how those version choices affect security and compatibility. By probing ten popular LLMs on a curated benchmark of 1,000 real‑world Stack Overflow tasks, the authors uncover a hidden risk surface: many LLM‑suggested versions have known vulnerabilities or fail to install and run.

Key Contributions

PinTrace benchmark – a publicly released dataset of 1,000 Python coding tasks (derived from Stack Overflow) with ground‑truth library requirements.
Empirical measurement of version‑level risk across 10 LLMs, covering both code‑generation prompts and manifest‑file generation.
Vulnerability exposure analysis showing that, for 36 %–56 % of tasks, the LLM‑specified versions carry at least one known CVE, with 63 %–74 % of those CVEs rated Critical or High.
Compatibility assessment (static dependency resolution and dynamic test execution) revealing 19 %–63 % static success rates and 6 %–49 % dynamic pass rates.
Root‑cause experiments confirming that the failures stem from the chosen library versions rather than the quality of the generated code.
Mitigation insight: anchoring version constraints to external, up‑to‑date sources dramatically cuts both vulnerability exposure and install‑time failures.

Methodology

Task Selection – The authors harvested 1,000 Python programming problems from Stack Overflow, each paired with the “canonical” solution and its required third‑party libraries.
Prompt Design – For each task, they issued two kinds of prompts to each LLM:
- Direct code generation (e.g., “Write a script that does X”).
- Manifest generation (e.g., “Create a requirements.txt for the solution”).
Model Suite – Ten LLMs spanning open‑source and commercial offerings (e.g., GPT‑4, Claude, LLaMA‑based models) were evaluated using the same prompts.
Version Extraction – The generated snippets were parsed to collect any explicit package==x.y.z specifications.
Vulnerability Mapping – Each version was cross‑referenced against the National Vulnerability Database (NVD) to flag known CVEs and their severity.
Compatibility Checks –
- Static: Dependency solvers (pip‑deptree, poetry) attempted to resolve the full dependency graph.
- Dynamic: The generated code was executed against the specified versions in an isolated container; test cases from the original Stack Overflow post were run to see if the solution actually works.
Bias & Mitigation Experiments – The team swapped out the LLM‑chosen versions with the latest safe releases (or with versions suggested by a “version‑anchor” service) to measure the impact on vulnerability and install‑time success.
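
The version‑extraction step can be illustrated with a short sketch. This is not the authors' parser: the regular expression, the function name extract_pins, and the name normalization are assumptions made for illustration, and the pattern only handles exact `==` pins with dotted numeric versions.

```python
import re

# Matches exact pins such as "requests==2.19.0". Versions are limited to
# dotted numeric releases; pre-release tags and other operators are ignored.
# Caveat: a comparison like "x==1" in source code would also match.
PIN_RE = re.compile(r"(?<![\w.-])([A-Za-z0-9][A-Za-z0-9._-]*)==(\d+(?:\.\d+)*)")


def extract_pins(text: str) -> dict[str, str]:
    """Return {normalized package name: pinned version} found in text."""
    pins: dict[str, str] = {}
    for name, version in PIN_RE.findall(text):
        # Treat "-", "_" and "." as equivalent and ignore case in names.
        key = re.sub(r"[-_.]+", "-", name).lower()
        pins.setdefault(key, version)  # keep the first pin seen per package
    return pins


if __name__ == "__main__":
    response = """
    pip install requests==2.19.0 numpy==1.16.0
    # requirements.txt
    Flask==2.0.1
    """
    print(extract_pins(response))
```

A production extractor would likely need to handle range operators (>=, ~=) and pins embedded in prose, which this sketch skips.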

All scripts, data, and analysis notebooks are released under an open‑science license.

Results & Findings

Metric	Direct Code Prompt	Manifest Prompt
Responses with a version pin	26.8 % – 95.2 %	6.5 % – 59.2 %
Tasks with at least one CVE	36.7 % – 55.7 %	not reported
Critical/High share of those CVEs	62.8 % – 74.5 %	not reported
CVEs disclosed before model cutoff	72.3 % – 91.4 %	not reported
Static compatibility	19.7 % – 63.2 %	not reported
Dynamic pass rate	6.5 % – 48.6 %	not reported
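
To make the metrics concrete, here is a minimal sketch of how such rates could be aggregated from per‑task records. The TaskResult fields, the summarize function, and the choice of denominators are assumptions for illustration; the paper's actual schema and counting rules may differ, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task record; field names are hypothetical, not the paper's schema."""
    task_id: str
    cve_severities: list[str]  # severities of CVEs matched to the pinned versions
    resolved: bool             # static dependency resolution succeeded
    tests_passed: bool         # dynamic execution passed its tests


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task records into the four rates discussed above."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    all_cves = [s for r in results for s in r.cve_severities]
    severe = [s for s in all_cves if s in ("CRITICAL", "HIGH")]
    return {
        "tasks_with_cve": sum(bool(r.cve_severities) for r in results) / n,
        # Assumed denominator: all CVEs found, not all tasks.
        "critical_or_high_share": len(severe) / len(all_cves) if all_cves else 0.0,
        "static_success": sum(r.resolved for r in results) / n,
        "dynamic_pass": sum(r.tests_passed for r in results) / n,
    }


if __name__ == "__main__":
    sample = [  # invented data, for illustration only
        TaskResult("t1", ["HIGH", "MEDIUM"], True, True),
        TaskResult("t2", [], True, False),
        TaskResult("t3", ["CRITICAL"], False, False),
        TaskResult("t4", [], True, True),
    ]
    print(summarize(sample))
```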

Key takeaways

Systemic bias: Across all models, the same handful of outdated, vulnerable releases (e.g., requests==2.19.0, numpy==1.16.0) dominate the suggestions.
Security‑first failure: Most of the CVEs are rated Critical or High (62.8 %–74.5 %), and 72.3 %–91.4 % were disclosed before the model’s training cutoff, which suggests the models are “memorizing” stale dependency data.
Compatibility is a bigger practical blocker: Installation failures (missing wheels, dependency conflicts) are the leading cause of dynamic test failures, not logical bugs in the generated code.
Anchoring works: When the version constraints are replaced with the latest non‑vulnerable releases (or left unspecified for the package manager to resolve), both vulnerability exposure and install‑time failures drop by >80 %.
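
A rough sketch of what anchoring could look like for a requirements file follows. The LATEST_SAFE table stands in for an external, up‑to‑date source and its version numbers are placeholders, not vetted recommendations; the function name anchor_requirements is likewise hypothetical. Pins for packages missing from the table are dropped so the package manager resolves them.

```python
import re

# Hypothetical snapshot of an external, up-to-date source. The version
# numbers are placeholders for illustration, not vetted advice.
LATEST_SAFE = {"requests": "2.32.3", "numpy": "2.0.0"}

# An exact pin at the start of a requirements line, e.g. "requests==2.19.0".
PIN_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*[^\s#;]+")


def anchor_requirements(lines, latest_safe=LATEST_SAFE):
    """Rewrite exact pins using an external table; unpin unknown packages."""
    out = []
    for line in lines:
        match = PIN_LINE.match(line)
        if not match:
            out.append(line.rstrip("\n"))  # comments, blanks, other specifiers
            continue
        name = match.group(1)
        key = re.sub(r"[-_.]+", "-", name).lower()
        if key in latest_safe:
            out.append(f"{name}=={latest_safe[key]}")
        else:
            out.append(name)  # leave unpinned for the resolver
    return out


if __name__ == "__main__":
    generated = ["requests==2.19.0", "numpy==1.16.0", "leftpad==0.1", "# dev tools"]
    print("\n".join(anchor_requirements(generated)))
```

Note that this sketch discards anything after the pin on a rewritten line, such as environment markers or trailing comments.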

Practical Implications

Tooling Adjustments – IDE plugins and CI pipelines that surface LLM‑generated snippets should strip explicit version pins automatically or replace them with a vetted current release, unless the developer explicitly requests a specific version.
Security Audits – Organizations that have adopted LLM‑assisted coding need to add a dependency‑version scan to their SBOM generation step, treating version pins as a first‑class security artifact.
Model Providers – Vendors can improve safety by post‑processing model outputs with a vulnerability database lookup, warning users when a suggested version is known to be insecure.
Developer Education – The findings reinforce the old best practice: don’t hard‑code library versions unless you have a reason; let package managers resolve the most recent compatible releases.
Open‑source Ecosystem – The PinTrace dataset offers a ready‑made benchmark for future research on “dependency‑aware” LLMs, encouraging the community to build models that reason about version safety.
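
The post‑processing lookup suggested for model providers and audit pipelines could be sketched as below. The ADVISORIES table is a made‑up local stand‑in for a real vulnerability database such as the NVD, and the package name, advisory ids, and ranges are invented. Version parsing is a simplification that only handles dotted numeric releases; a real tool would more likely rely on packaging.version.Version.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a dotted numeric version; pre/post-release tags are not handled."""
    parts = [int(p) for p in version.split(".")]
    # Drop trailing zeros so that "1.4" and "1.4.0" compare as equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


# Hypothetical advisory records: (advisory id, first affected, first fixed).
# The ids and ranges are invented for illustration, not real CVE data.
ADVISORIES = {
    "examplelib": [
        ("EXAMPLE-0001", "1.0.0", "1.4.2"),
        ("EXAMPLE-0002", "1.2.0", "2.0.0"),
    ],
}


def advisories_for(package, version, advisories=ADVISORIES):
    """Return ids of advisories whose [introduced, fixed) range covers version."""
    v = parse_version(version)
    hits = []
    for adv_id, introduced, fixed in advisories.get(package.lower(), []):
        if parse_version(introduced) <= v < parse_version(fixed):
            hits.append(adv_id)
    return hits


if __name__ == "__main__":
    for pin in ("1.3.0", "1.4.2", "2.0"):
        found = advisories_for("examplelib", pin)
        status = ", ".join(found) if found else "no known advisories"
        print(f"examplelib=={pin}: {status}")
```

A warning surfaced next to the generated snippet, rather than a silent rewrite, keeps the developer in control of the final version choice.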

Limitations & Future Work

Language & Ecosystem Scope – The study focuses exclusively on Python; other ecosystems (Node.js, Java, Rust) may exhibit different version‑selection patterns.
Prompt Diversity – Only two prompt styles were examined; real‑world developer interactions can be more nuanced (e.g., multi‑turn conversations, partial code edits).
Static vs. Dynamic Gap – The static resolver sometimes reports success while the dynamic test fails due to runtime incompatibilities not captured by the resolver. More sophisticated environment simulation could narrow this gap.
Model Knowledge Cutoff – The analysis assumes the model’s training data ends at a fixed date; newer models with more recent data may behave differently, but the systemic bias observed suggests the problem will persist unless explicitly mitigated.
Future Directions – Proposed next steps include extending the benchmark to other languages, integrating real‑time vulnerability feeds into LLM generation pipelines, and exploring reinforcement‑learning‑based fine‑tuning to teach models “safe version selection.”

The authors have open‑sourced the PinTrace benchmark and all analysis scripts, inviting developers, security teams, and AI researchers to build safer LLM‑driven development tools.

Authors

Chengjie Wang
Jingzheng Wu
Xiang Ling
Tianyue Luo
Chen Zhao

Paper Information

arXiv ID: 2605.06279v1
Categories: cs.SE, cs.AI
Published: May 7, 2026
