[Paper] Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
Source: arXiv - 2605.06279v1
Overview
Large language models (LLMs) are now a common co‑pilot for developers, often spitting out Python snippets that include explicit third‑party library versions. This paper presents the first systematic, large‑scale measurement of how those version choices affect security and compatibility. By probing ten popular LLMs on a curated benchmark of 1,000 real‑world Stack Overflow tasks, the authors uncover a hidden risk surface: many LLM‑suggested versions are already known to be vulnerable or incompatible.
Key Contributions
- PinTrace benchmark – a publicly released dataset of 1,000 Python coding tasks (derived from Stack Overflow) with ground‑truth library requirements.
- Empirical measurement of version‑level risk across 10 LLMs, covering both code‑generation prompts and manifest‑file generation.
- Vulnerability exposure analysis showing that 36.7 %–55.7 % of generated tasks contain at least one known CVE, with 62.8 %–74.5 % of those CVEs rated Critical or High.
- Compatibility assessment (static dependency resolution and dynamic test execution) revealing 19.7 %–63.2 % static success rates and 6.5 %–48.6 % dynamic pass rates.
- Root‑cause experiments confirming that the failures stem from the chosen library versions rather than the quality of the generated code.
- Mitigation insight: anchoring version constraints to external, up‑to‑date sources dramatically cuts both vulnerability exposure and install‑time failures.
Methodology
- Task Selection – The authors harvested 1,000 Python programming problems from Stack Overflow, each paired with the “canonical” solution and its required third‑party libraries.
- Prompt Design – For each task, they issued two kinds of prompts to each LLM:
  - Direct code generation (e.g., “Write a script that does X”).
  - Manifest generation (e.g., “Create a `requirements.txt` for the solution”).
- Model Suite – Ten LLMs spanning open‑source and commercial offerings (e.g., GPT‑4, Claude, LLaMA‑based models) were evaluated using the same prompts.
- Version Extraction – The generated snippets were parsed to collect any explicit `package==x.y.z` specifications.
- Vulnerability Mapping – Each version was cross‑referenced against the National Vulnerability Database (NVD) to flag known CVEs and their severity.
- Compatibility Checks – Two complementary checks were run:
  - Static: Dependency solvers (pip‑deptree, poetry) attempted to resolve the full dependency graph.
  - Dynamic: The generated code was executed against the specified versions in an isolated container; test cases from the original Stack Overflow post were run to see if the solution actually works.
- Bias & Mitigation Experiments – The team swapped out the LLM‑chosen versions with the latest safe releases (or with versions suggested by a “version‑anchor” service) to measure the impact on vulnerability and install‑time success.
All scripts, data, and analysis notebooks are released under an open‑science license.
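The Version Extraction step can be pictured with a short sketch. This is not the authors' parser; it is a minimal regex heuristic, and the names `PIN_RE` and `extract_pins` are made up for illustration. It only recognizes exact pins with a dotted numeric version, so pre-release suffixes and other specifiers such as `>=` are ignored.

```python
import re

# Heuristic: a package name, "==", and a dotted numeric version (at least two parts).
# Requiring a dot avoids matching ordinary comparisons such as `x==1` in code.
PIN_RE = re.compile(
    r"(?<![\w.-])([A-Za-z0-9][A-Za-z0-9._-]*)==([0-9]+(?:\.[0-9]+)+)"
)


def extract_pins(text: str) -> list[tuple[str, str]]:
    """Return unique (package, version) pairs for exact pins found in text."""
    pins: list[tuple[str, str]] = []
    for name, version in PIN_RE.findall(text):
        # Normalize case and underscores so "Requests" and "requests" collapse.
        pin = (name.lower().replace("_", "-"), version)
        if pin not in pins:
            pins.append(pin)
    return pins


response = "Run `pip install requests==2.19.0 numpy==1.16.0` before the script."
print(extract_pins(response))
```

A real pipeline would likely need to handle shell commands, manifest files, and inline comments separately; the single regex here is only meant to show the shape of the step.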
Results & Findings
| Metric | Direct Code Prompt | Manifest Prompt |
|---|---|---|
| Responses with a version pin | 26.8 % – 95.2 % | 6.5 % – 59.2 % |
| Tasks with at least one CVE | 36.7 % – 55.7 % | — |
| Critical/High CVEs among those | 62.8 % – 74.5 % | — |
| CVE disclosed before model cutoff | 72.3 % – 91.4 % | — |
| Static compatibility | 19.7 % – 63.2 % | — |
| Dynamic pass rate | 6.5 % – 48.6 % | — |
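To make the two exposure rows concrete, here is a rough sketch of how such percentages could be aggregated from per-task findings. The `TaskFinding` record, the function name, and the sample data are all hypothetical, and the sketch assumes "Critical/High CVEs among those" means the share of flagged CVEs (not tasks) with that severity, which matches the wording in Key Contributions.

```python
from dataclasses import dataclass, field


@dataclass
class TaskFinding:
    """Hypothetical record: one task and the severity of each CVE hit."""
    task_id: str
    severities: list[str] = field(default_factory=list)


def exposure_metrics(findings: list[TaskFinding]) -> dict[str, float]:
    """Compute the two exposure percentages over a list of task findings."""
    exposed = [f for f in findings if f.severities]
    all_cves = [s for f in exposed for s in f.severities]
    severe = [s for s in all_cves if s in {"CRITICAL", "HIGH"}]
    return {
        # Share of tasks with at least one flagged CVE.
        "tasks_with_cve_pct": 100 * len(exposed) / len(findings) if findings else 0.0,
        # Share of flagged CVEs rated Critical or High.
        "critical_high_pct": 100 * len(severe) / len(all_cves) if all_cves else 0.0,
    }


# Illustrative data only; these are not results from the paper.
demo = [
    TaskFinding("t1", ["HIGH", "LOW"]),
    TaskFinding("t2"),
    TaskFinding("t3", ["CRITICAL"]),
    TaskFinding("t4"),
]
print(exposure_metrics(demo))
```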
Key takeaways
- Systemic bias: Across all models, the same handful of outdated, vulnerable releases (e.g., `requests==2.19.0`, `numpy==1.16.0`) dominate the suggestions.
- Security‑first failure: Most of the flagged CVEs are rated Critical or High (62.8 %–74.5 %) and were disclosed before the model’s training cutoff (72.3 %–91.4 %), suggesting that the models are “memorizing” stale dependency data.
- Compatibility is a bigger practical blocker: Installation failures (missing wheels, dependency conflicts) are the leading cause of dynamic test failures, not logical bugs in the generated code.
- Anchoring works: When the version constraints are replaced with the latest non‑vulnerable releases (or left unspecified for the package manager to resolve), both vulnerability exposure and install‑time failures drop by >80 %.
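The anchoring idea in the last takeaway can be sketched as "pick the newest release that is not flagged". The snippet below is an assumption-laden illustration, not the paper's mitigation code: `latest_safe` and `version_key` are invented names, the release list and flagged set describe an imaginary package, and versions are assumed to be plain dotted integers.

```python
from typing import Iterable, Optional


def version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_safe(available: Iterable[str], flagged: set) -> Optional[str]:
    """Return the newest available version that is not flagged, or None."""
    safe = [v for v in available if v not in flagged]
    return max(safe, key=version_key) if safe else None


# Hypothetical release history and advisory flags for an imaginary package.
releases = ["1.4.0", "1.9.2", "1.10.0", "1.10.1"]
flagged = {"1.4.0", "1.10.1"}
print(latest_safe(releases, flagged))  # 1.10.0
```

Comparing numeric tuples rather than strings matters here: a plain string comparison would rank `1.9.2` above `1.10.0`.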
Practical Implications
- Tooling Adjustments – IDE plugins and CI pipelines that surface LLM‑generated snippets should automatically strip explicit version pins or replace them with the latest non‑vulnerable release, unless the developer explicitly requests a specific version.
- Security Audits – Organizations that have adopted LLM‑assisted coding need to add a dependency‑version scan to their SBOM generation step, treating version pins as a first‑class security artifact.
- Model Providers – Vendors can improve safety by post‑processing model outputs with a vulnerability database lookup, warning users when a suggested version is known to be insecure.
- Developer Education – The findings reinforce the old best practice: don’t hard‑code library versions unless you have a reason; let package managers resolve the most recent compatible releases.
- Open‑source Ecosystem – The PinTrace dataset offers a ready‑made benchmark for future research on “dependency‑aware” LLMs, encouraging the community to build models that reason about version safety.
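As one possible shape for the tooling adjustment described above, the sketch below strips exact `==` pins from requirements-style text while leaving comments, environment markers, and other specifiers alone. `EXACT_PIN`, `strip_pins`, and the `keep` parameter (packages whose pins the developer explicitly wants to preserve) are hypothetical names, and the regex covers only simple one-requirement-per-line manifests.

```python
import re

# Package name (with optional extras), then "==" and a version token.
EXACT_PIN = re.compile(
    r"^(\s*[A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?)\s*==\s*[^\s;#]+"
)


def strip_pins(requirements: str, keep: frozenset = frozenset()) -> str:
    """Remove exact pins, except for packages listed in keep (lowercase names)."""
    out = []
    for line in requirements.splitlines():
        match = EXACT_PIN.match(line)
        if match:
            name = match.group(1).strip().split("[")[0].lower()
            if name not in keep:
                # Keep the name and whatever follows the version (markers, comments).
                line = match.group(1) + line[match.end():]
        out.append(line)
    return "\n".join(out)


manifest = "requests==2.19.0\nnumpy==1.16.0  # pinned on purpose\nflask>=2.0"
print(strip_pins(manifest, keep=frozenset({"numpy"})))
```

Unpinning trades reproducibility for freshness, so a tool built this way would probably want to pair it with a lock file or a vulnerability scan rather than rely on it alone.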
Limitations & Future Work
- Language & Ecosystem Scope – The study focuses exclusively on Python; other ecosystems (Node.js, Java, Rust) may exhibit different version‑selection patterns.
- Prompt Diversity – Only two prompt styles were examined; real‑world developer interactions can be more nuanced (e.g., multi‑turn conversations, partial code edits).
- Static vs. Dynamic Gap – The static resolver sometimes reports success while the dynamic test fails due to runtime incompatibilities not captured by the resolver. More sophisticated environment simulation could narrow this gap.
- Model Knowledge Cutoff – The analysis assumes the model’s training data ends at a fixed date; newer models with more recent data may behave differently, but the systemic bias observed suggests the problem will persist unless explicitly mitigated.
- Future Directions – Extending the benchmark to other languages, integrating real‑time vulnerability feeds into LLM generation pipelines, and exploring reinforcement‑learning‑based fine‑tuning to teach models “safe version selection.”
The authors have open‑sourced the PinTrace benchmark and all analysis scripts, inviting developers, security teams, and AI researchers to build safer LLM‑driven development tools.
Authors
- Chengjie Wang
- Jingzheng Wu
- Xiang Ling
- Tianyue Luo
- Chen Zhao
Paper Information
- arXiv ID: 2605.06279v1
- Categories: cs.SE, cs.AI
- Published: May 7, 2026