[Paper] Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification
Source: arXiv - 2602.16304v1
Overview
Open‑source package ecosystems like PyPI are increasingly weaponised by attackers who slip malicious code into otherwise legitimate libraries. The paper “Mind the Gap: Evaluating LLMs for High‑Level Malicious Package Detection vs. Fine‑Grained Indicator Identification” investigates whether today’s Large Language Models (LLMs) can reliably spot these rogue packages—and, crucially, whether they can also pinpoint the exact malicious behaviours hidden inside them. The authors discover a striking “granularity gap”: LLMs excel at saying “this package is bad” but stumble when asked to name the specific harmful patterns it contains.
Key Contributions
- Systematic benchmark of 13 LLMs on a curated dataset of 4,070 Python packages (3,700 benign, 370 malicious).
- Two‑tiered evaluation:
  - Binary classification (is the package malicious?).
  - Multi‑label classification (which malicious indicators are present?).
- Prompt engineering analysis covering zero‑shot, few‑shot, temperature variations, and model‑specific prompting styles.
- Discovery of the “granularity gap” – a ~41 % drop in F1‑score when moving from package‑level detection to indicator‑level identification, even for the strongest model (GPT‑4.1).
- Insight into model specialisation: general‑purpose LLMs are best at coarse‑grained filtering, while coder‑oriented models (e.g., CodeLlama, StarCoder) outperform on attacks that follow predictable code patterns.
- Correlation study showing that sheer parameter count or context window size does not explain detection performance.
Methodology
- Dataset construction – The authors gathered 370 known malicious PyPI packages (e.g., credential stealers, backdoors) and paired them with 3,700 benign packages sampled across popularity tiers. Each package was manually annotated with up to 12 fine‑grained malicious indicators (e.g., “exec‑shell”, “obfuscated strings”, “network exfiltration”).
- Prompt design – For each LLM they crafted a baseline prompt (“Is this package malicious?”) and a richer prompt that listed possible indicators and asked the model to output a comma‑separated list. They also experimented with few‑shot examples and temperature settings (0.0, 0.7, 1.0).
- Evaluation metrics – Binary detection used precision, recall, and F1. Multi‑label detection employed micro‑averaged F1 and exact‑match accuracy.
- Statistical analysis – Linear regression and Spearman correlation were used to test whether model size, context length, or training corpus (general vs. code‑focused) predicted performance.
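The multi‑label metrics named above are standard; as a reference point, micro‑averaged F1 and exact‑match accuracy over indicator sets can be computed as follows (the indicator names in the toy data are illustrative examples from the taxonomy, not the paper's annotations):

```python
from typing import List, Set

def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and negatives
    across all packages before computing precision and recall."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def exact_match(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of packages whose predicted indicator set matches exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy example: the model misses one indicator in the first package
gold = [{"exec-shell", "network exfiltration"}, {"obfuscated strings"}]
pred = [{"exec-shell"}, {"obfuscated strings"}]
```

Micro‑averaging rewards partial credit per indicator, which is why it can sit well above exact‑match accuracy on the same predictions.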
The pipeline is deliberately lightweight: a package’s setup.py/pyproject.toml and source files are concatenated (up to the model’s context limit) and fed to the LLM via the API, mimicking a realistic “on‑the‑fly” security scan.
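A minimal sketch of that concatenate‑and‑truncate step, assuming a character budget as a stand‑in for the model's token limit (MAX_CHARS, the prompt wording, and the manifest‑first ordering are illustrative choices, not the authors' exact pipeline):

```python
from pathlib import Path

MAX_CHARS = 120_000  # stand-in for the model's context limit (assumption)

def build_scan_prompt(package_dir: str) -> str:
    """Concatenate manifest and source files, clipped to the context budget."""
    root = Path(package_dir)
    # Put manifests first so they survive truncation of large packages
    ordered = [root / name for name in ("setup.py", "pyproject.toml")
               if (root / name).exists()]
    ordered += sorted(p for p in root.rglob("*.py") if p not in ordered)
    parts = [f"# === {path.name} ===\n{path.read_text(errors='replace')}"
             for path in ordered]
    code = "\n\n".join(parts)[:MAX_CHARS]
    return "Is the following Python package malicious? Answer yes or no.\n\n" + code
```

In a real scanner the returned string would be sent to the LLM API; the truncation step is what motivates the context‑clipping limitation discussed later.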
Results & Findings
| Model (type) | Binary F1 | Multi‑label micro‑F1 | Δ (drop) |
|---|---|---|---|
| GPT‑4.1 (general) | 0.99 | 0.58 | ≈ 41 % |
| Claude‑3 (general) | 0.96 | 0.55 | ≈ 41 % |
| CodeLlama‑34B (coder) | 0.92 | 0.63 | ≈ 31 % |
| StarCoder‑16B (coder) | 0.90 | 0.61 | ≈ 32 % |
| Smaller open‑source LLMs (≤7B) | 0.78–0.84 | 0.34–0.42 | ≈ 45–50 % |
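Assuming the Δ column reports the relative (not absolute) drop in F1, it can be reproduced from the other two columns; the GPT‑4.1 row works out to roughly 41 %:

```python
def relative_drop(binary_f1: float, multilabel_f1: float) -> float:
    """Relative fall in F1 when moving from package-level detection
    to indicator-level identification, as a percentage."""
    return (binary_f1 - multilabel_f1) / binary_f1 * 100

# GPT-4.1 row from the table above: 0.99 -> 0.58
gap = relative_drop(0.99, 0.58)  # ≈ 41.4
```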
Take‑aways
- Near‑perfect binary detection is achievable with top‑tier LLMs, even with zero‑shot prompts.
- Indicator identification suffers dramatically; models often output “none” or miss subtle patterns like “dynamic import” or “obfuscated bytecode”.
- Prompt richness helps: few‑shot examples improve multi‑label F1 by ~6 % on average, but never close the gap to binary performance.
- Coder‑specialised models close the gap slightly for attacks that follow a known code template (e.g., classic “setup‑script backdoor”), but they still lag behind on more creative obfuscation.
- Model size & context window show negligible correlation (Spearman ρ ≈ 0.07), suggesting architecture and training data composition matter more than raw scale.
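The Spearman analysis behind that last point can be illustrated with a hand‑rolled rank correlation (no‑ties case; the parameter counts and F1 values below are hypothetical, not the paper's measurements):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))
    formula; assumes no tied values, which keeps the example simple."""
    n = len(xs)
    def ranks(vals):
        order = sorted(range(n), key=lambda i: vals[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data: model size in billions of parameters vs. binary F1.
# A weak rho despite growing size mirrors the paper's negative result.
params = [7, 13, 16, 34, 70]
f1s    = [0.78, 0.84, 0.90, 0.92, 0.83]
```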
Practical Implications
- First‑line defence – Deploy a strong general‑purpose LLM (e.g., GPT‑4.1) as a fast filter in CI pipelines to flag suspicious packages before they reach downstream developers. The low false‑negative rate makes it suitable for “gatekeeper” roles.
- Complementary tooling – Because LLMs can’t reliably enumerate the exact malicious behaviours, they should be paired with static analysis, sandboxing, or signature‑based scanners that can surface the fine‑grained indicators.
- Prompt engineering as a product feature – Security‑as‑a‑service platforms can expose configurable prompt templates (zero‑shot vs. few‑shot) to let teams trade off speed for deeper insight.
- Coder‑model selection for niche threats – Organizations that frequently encounter “template‑driven” supply‑chain attacks (e.g., a malicious setup.py that injects a known backdoor) may benefit from integrating a coder‑oriented LLM for secondary analysis.
- Cost‑benefit awareness – Running GPT‑4.1 on every package upload can be expensive; the study shows that smaller coder models achieve comparable indicator‑level performance for a subset of attacks, offering a cheaper tiered approach.
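The tiered approach described in these bullets might look like the following sketch, with trivial stub classifiers standing in for the cheap LLM filter and the deeper indicator‑level analyzer (all names and stubs are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ScanResult:
    malicious: bool
    indicators: List[str] = field(default_factory=list)

def tiered_scan(
    package_source: str,
    cheap_filter: Callable[[str], bool],
    deep_analyzer: Callable[[str], List[str]],
) -> ScanResult:
    """Two-stage triage: a fast binary filter gates the expensive
    indicator-level pass, so most benign uploads exit early."""
    if not cheap_filter(package_source):
        return ScanResult(malicious=False)
    # Only flagged packages pay for fine-grained analysis
    return ScanResult(malicious=True, indicators=deep_analyzer(package_source))

# Stub classifiers standing in for LLM calls (hypothetical)
flag_if_exec = lambda src: "exec(" in src
list_indicators = lambda src: ["exec-shell"] if "exec(" in src else []
```

Swapping the stubs for a general‑purpose LLM (stage 1) and a coder‑oriented model or static analyzer (stage 2) gives the division of labour the paper recommends.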
In short, LLMs are ready to become the “security triage” layer in modern DevOps, but they are not yet a replacement for deep code‑level inspection.
Limitations & Future Work
- Dataset bias – The malicious sample set, while curated, leans heavily toward known PyPI‑based attacks; emerging threat vectors (e.g., supply‑chain attacks via compiled wheels) are not represented.
- Context truncation – Packages exceeding the model’s context window are clipped, potentially discarding critical code sections.
- Prompt scope – Only English‑language prompts were explored; multilingual or domain‑specific prompts could affect performance.
- Model updates – The evaluation reflects a snapshot of each LLM; rapid iteration (e.g., GPT‑4.2) may shift the granularity gap.
- Future directions suggested by the authors include:
  - Training or fine‑tuning LLMs on a dedicated “malicious‑package” corpus,
  - Integrating retrieval‑augmented generation (RAG) to pull in external threat intel during inference, and
  - Expanding the indicator taxonomy to cover additional attack classes (e.g., binary‑level exploits and supply‑chain typosquatting).
Authors
- Ahmed Ryan
- Ibrahim Khalil
- Abdullah Al Jahid
- Md Erfan
- Akond Ashfaque Ur Rahman
- Md Rayhanur Rahman
Paper Information
- arXiv ID: 2602.16304v1
- Categories: cs.CR, cs.SE
- Published: February 18, 2026