[Paper] CHASE: LLM Agents for Dissecting Malicious PyPI Packages
Source: arXiv - 2601.06838v1
Overview
The paper introduces CHASE, a multi‑agent system that harnesses large language models (LLMs) to automatically dissect and flag malicious Python packages on PyPI. By combining LLM‑driven semantic analysis with deterministic security tools, CHASE achieves near‑human‑level detection accuracy while keeping analysis time short enough for real‑world CI/CD pipelines.
Key Contributions
- Collaborative Hierarchical Agent Architecture – a “Plan‑and‑Execute” framework that coordinates a central Planner with specialized Worker Agents (e.g., static analysis, dependency graphing, behavior simulation).
- Reliability‑by‑Design – isolates LLM‑prone errors (hallucination, context loss) by delegating safety‑critical steps to proven security tools (e.g., sandboxed execution, signature scanners).
- High‑Performance Evaluation – on a curated dataset of 3,000 packages (500 malicious), CHASE reaches 98.4 % recall and a 0.08 % false‑positive rate, with a median runtime of 4.5 minutes per package.
- Human‑Centric Report Generation – produces structured analysis reports that were validated through a survey of cybersecurity professionals, highlighting usability for security teams.
- Open‑Source Blueprint – the authors release code, data, and a demo site, offering a practical starting point for building AI‑augmented supply‑chain defenses.
Methodology
- Planning Layer – a central LLM receives the package metadata (name, version, description) and decides which analysis steps are needed. It creates a task graph that assigns work to the appropriate agents.
- Worker Agents – each agent is a lightweight LLM instance fine‑tuned for a narrow sub‑task:
  - Static Code Agent – parses source files, extracts imports, and flags suspicious patterns.
  - Dependency Agent – builds a full dependency tree and checks for known compromised libraries.
  - Dynamic Agent – runs the package in a sandbox, logs system calls, and looks for malicious behaviors.
- Deterministic Guardrails – whenever a worker must make a security‑critical decision (e.g., “does this call open a network socket?”), the system invokes a traditional tool (e.g., strace, a signature database) instead of relying on the LLM’s judgment.
- Result Aggregation – the Planner consolidates the agents’ outputs, applies a simple voting/weighting scheme, and generates a human‑readable report.
- Feedback Loop – false‑positive/negative cases collected from the survey are fed back to fine‑tune the agents and adjust the planning heuristics.
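The plan‑and‑execute loop described above can be sketched in a few lines. This is an illustrative toy, not the paper’s implementation: the `Task` dataclass, the heuristics inside `plan`, and the per‑agent weights are all invented for demonstration; the real system dispatches LLM agents and deterministic tools rather than lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    """One unit of work the Planner assigns to a worker agent."""
    name: str
    run: Callable[[dict], float]  # returns a suspicion score in [0, 1]
    weight: float = 1.0

def plan(metadata: dict) -> List[Task]:
    """Toy Planner: always run static and dependency checks; add the
    sandbox (most expensive, highest weight) only when code is present."""
    tasks = [
        Task("static",
             lambda m: 0.9 if "eval(" in m.get("source", "") else 0.1,
             weight=2.0),
        Task("deps",
             lambda m: 0.8 if "typosquat-lib" in m.get("deps", []) else 0.0),
    ]
    if metadata.get("has_code", True):
        tasks.append(Task("dynamic",
                          lambda m: 0.7 if m.get("opened_socket") else 0.0,
                          weight=3.0))
    return tasks

def run_pipeline(metadata: dict, threshold: float = 0.5) -> Dict:
    """Execute the task graph and aggregate scores by weighted voting."""
    tasks = plan(metadata)
    total_w = sum(t.weight for t in tasks)
    score = sum(t.weight * t.run(metadata) for t in tasks) / total_w
    return {"score": round(score, 3),
            "malicious": score >= threshold,
            "tasks": [t.name for t in tasks]}
```

A benign package (`{"source": "print('hi')", "deps": []}`) scores low across all three workers, while one that calls `eval`, pulls a known‑bad dependency, and opens a socket in the sandbox crosses the threshold.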
Results & Findings
| Metric | Value |
|---|---|
| Recall (malicious detection) | 98.4 % |
| False‑positive rate | 0.08 % |
| Median analysis time per package | 4.5 min |
| Average report satisfaction (survey) | 4.2 / 5 |
Key observations:
- The hierarchical design dramatically reduces hallucination‑induced errors; most missed detections stem from obscure obfuscation techniques not covered by the sandbox.
- Integrating deterministic tools for low‑level system calls cuts false positives by an order of magnitude compared to a pure‑LLM baseline.
- Security analysts appreciated the structured “attack‑chain” view produced by CHASE, which speeds up triage and remediation.
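The second observation, replacing LLM judgment with deterministic rules for low‑level decisions, amounts to scanning a sandbox trace with fixed patterns. A minimal sketch, assuming strace‑style log lines (the format and syscall list here are illustrative, not taken from the paper):

```python
import re
from typing import List

# Deterministic guardrail: answer "did this package open a network
# socket?" from the syscall trace itself, not from an LLM's judgment.
NETWORK_SYSCALLS = re.compile(r"^(socket|connect|sendto)\(")

def network_activity(trace_lines: List[str]) -> List[str]:
    """Return the trace lines that show network-related syscalls."""
    return [line for line in trace_lines if NETWORK_SYSCALLS.match(line)]

trace = [
    'openat(AT_FDCWD, "setup.py", O_RDONLY) = 3',
    'socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 4',
    'connect(4, {sa_family=AF_INET, sin_port=4444}, 16) = 0',
]
hits = network_activity(trace)
```

Because the check is a pure string match over recorded behavior, it cannot hallucinate a socket that was never opened, which is what drives the order‑of‑magnitude false‑positive reduction over a pure‑LLM baseline.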
Practical Implications
- Automated Package Screening – CI/CD systems can plug CHASE into their dependency‑resolution step, automatically rejecting or quarantining suspicious wheels before they reach production.
- Supply‑Chain Risk Management – security teams gain a scalable way to monitor the entire PyPI ecosystem, focusing human effort on the few high‑confidence alerts.
- Extensible Framework – the agent‑based architecture can be adapted to other ecosystems (npm, Maven) by swapping language‑specific Workers while keeping the Planner logic unchanged.
- Compliance & Auditing – generated reports provide evidence for regulatory audits (e.g., SOC 2, ISO 27001) by documenting the exact analysis path taken for each package.
- Cost‑Effective Defense – with a median runtime of under 5 minutes, organizations can run CHASE nightly on all new dependencies without prohibitive compute costs.
Limitations & Future Work
- Obfuscation & Packed Code – CHASE struggles with heavily obfuscated payloads that evade static parsing and sandbox instrumentation.
- LLM Dependency – planning quality hinges on prompt engineering for the underlying LLM; model updates may require re‑tuning.
- Scalability to Massive Registries – while feasible for per‑project scans, scanning the entire PyPI index in real time would need distributed execution and caching strategies.
- Cross‑Language Extensions – future work includes generalizing the agent hierarchy to other package managers and integrating threat‑intel feeds for richer context.
Overall, CHASE demonstrates that a thoughtfully engineered combination of LLMs and traditional security tooling can deliver reliable, production‑grade malware detection for modern software supply chains.
Authors
- Takaaki Toda
- Tatsuya Mori
Paper Information
- arXiv ID: 2601.06838v1
- Categories: cs.CR, cs.SE
- Published: January 11, 2026