[Paper] PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
Source: arXiv - 2512.14233v1
Overview
The paper introduces PentestEval, the first benchmark that systematically measures how well large language models (LLMs) can automate penetration testing. By breaking the testing workflow into six distinct stages and providing a large, expert‑curated dataset, the authors expose the current strengths and glaring weaknesses of popular LLMs when they are asked to act like security analysts.
Key Contributions
- Modular benchmark design – Six clearly defined penetration‑testing stages (information collection, weakness gathering & filtering, attack decision‑making, exploit generation, exploit revision, and end‑to‑end success) enable fine‑grained performance analysis.
- Large, realistic dataset – 346 annotated tasks spanning 12 vulnerable real‑world scenarios, complete with ground‑truth exploit steps and outcomes.
- Fully automated evaluation pipeline – Scripts that execute LLM outputs, verify exploit success, and compute stage‑level metrics without human intervention.
- Comprehensive LLM comparison – Empirical study of nine widely used LLMs (including GPT‑4, Claude, and LLaMA‑2) and three existing LLM‑powered pentesting tools (PentestGPT, PentestAgent, VulnBot).
- Insightful findings on modularization – Demonstrates that a modular, stage‑by‑stage approach dramatically improves success rates versus monolithic “black‑box” prompting.
Methodology
- Task Decomposition – The authors mapped a typical penetration‑testing workflow into six sequential stages, each with a concrete input/output contract (e.g., “given a target IP, list open services”); a sketch of such contracts appears after this list.
- Scenario Construction – Twelve vulnerable environments were built using common CVEs and misconfigurations (e.g., outdated web servers, insecure Docker setups).
- Ground‑Truth Annotation – Security experts manually performed each stage on every scenario, producing the “gold‑standard” outputs (service fingerprints, CVE IDs, exploit scripts, etc.).
- Prompt Templates – For each stage, a set of carefully engineered prompts was created (few‑shot examples, system messages, and explicit instructions); an illustrative template appears after this list.
- LLM Evaluation – Nine LLMs were queried with the same prompts. Their responses were fed into an automated pipeline (sketched after this list) that:
  - Parses the output,
  - Executes generated commands or exploit code in a sandbox,
  - Checks whether the exploit succeeded (e.g., shell access, privilege escalation).
- Metrics – Stage‑level accuracy (precision/recall for information gathering), decision‑making correctness, exploit generation success rate, and overall end‑to‑end success percentage.
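To make the stage contracts concrete, the sketch below expresses the six stages' inputs and outputs as typed records in Python. All class and field names are illustrative assumptions made for this summary, not the benchmark's actual schema.

```python
# Illustrative per-stage input/output contracts; the names are assumptions,
# not PentestEval's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class InfoCollectionTask:
    """Stage 1 input: given a target IP, list open services."""
    target_ip: str


@dataclass
class ServiceFingerprint:
    """Stage 1 output item: one discovered service."""
    port: int
    service: str   # e.g. "http", "ssh"
    version: str   # e.g. "Apache 2.4.29"


@dataclass
class WeaknessCandidates:
    """Stages 2-3 output: gathered CVE IDs, then the filtered shortlist."""
    gathered: List[str]
    filtered: List[str]


@dataclass
class AttackDecision:
    """Stage 4 output: the single vector chosen for exploitation."""
    cve_id: str
    rationale: str


@dataclass
class ExploitAttempt:
    """Stages 5-6 output: generated exploit plus the sandbox verdict."""
    exploit_code: str
    succeeded: bool = False
    sandbox_feedback: str = ""
```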
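The stage prompts combine a system message, explicit output instructions, and few‑shot examples. The template below is a hypothetical reconstruction for the information‑collection stage; its wording, its JSON output format, and the `render_messages` helper are assumptions, not the authors' actual prompts.

```python
# Hypothetical prompt template for the information-collection stage; the real
# PentestEval prompts differ in wording and structure.
INFO_COLLECTION_PROMPT = {
    "system": (
        "You are a penetration-testing assistant. Given a target IP address, "
        "enumerate open ports and fingerprint the services behind them. "
        "Respond only with a JSON list of objects with keys port, service, version."
    ),
    "few_shot": [
        {
            "input": "Target: 10.0.0.5",
            "output": '[{"port": 80, "service": "http", "version": "Apache 2.4.29"}]',
        }
    ],
    "user": "Target: {target_ip}",
}


def render_messages(template: dict, **kwargs) -> list:
    """Expand a template into a chat-message list for an LLM API call."""
    messages = [{"role": "system", "content": template["system"]}]
    for shot in template["few_shot"]:
        messages.append({"role": "user", "content": shot["input"]})
        messages.append({"role": "assistant", "content": shot["output"]})
    messages.append({"role": "user", "content": template["user"].format(**kwargs)})
    return messages
```

Usage would be something like `render_messages(INFO_COLLECTION_PROMPT, target_ip="10.0.0.7")`, producing a standard chat‑message list.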
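The automated evaluation loop (parse, execute in a sandbox, verify) and the stage‑level metrics can be pictured roughly as below. The helper names, the Docker invocation, and the success markers are assumptions for illustration; the paper's actual harness may work differently.

```python
# Rough sketch of the evaluation loop: parse the LLM output, run it in an
# isolated container, and check for a success marker. All helper names and
# the Docker invocation are illustrative assumptions.
import json
import subprocess


def parse_services(llm_output: str) -> list:
    """Stage 1 outputs are expected as JSON; anything unparsable scores zero."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return []


def run_in_sandbox(exploit_code: str, image: str = "pentesteval/sandbox") -> str:
    """Execute generated exploit code inside a throwaway container."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", "--network", "testnet", image,
             "python3", "-c", exploit_code],
            capture_output=True, text=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return proc.stdout + proc.stderr


def exploit_succeeded(sandbox_log: str) -> bool:
    """Very coarse success check: look for evidence of shell access."""
    return "uid=0(root)" in sandbox_log or "SHELL_OBTAINED" in sandbox_log


def precision_recall(predicted: set, gold: set) -> tuple:
    """Stage-level scoring for information gathering."""
    if not predicted or not gold:
        return 0.0, 0.0
    true_pos = len(predicted & gold)
    return true_pos / len(predicted), true_pos / len(gold)
```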
Results & Findings
| Stage | Best‑performing LLM (accuracy) | Typical failure mode |
|---|---|---|
| Information Collection | GPT‑4 (78%) | Missed hidden services, noisy output |
| Weakness Gathering & Filtering | Claude‑2 (62%) | Over‑generating CVE IDs, low relevance |
| Attack Decision‑Making | LLaMA‑2‑70B (55%) | Choosing non‑exploitable vectors |
| Exploit Generation | GPT‑4 (48%) | Syntax errors, missing payloads |
| Exploit Revision | Claude‑2 (41%) | Inability to adapt to sandbox feedback |
| End‑to‑End Success | ~31% (overall pipeline) | — |
Key Takeaways
- Even the strongest LLM (GPT‑4) fails on more than two‑thirds of the tasks when evaluated stage‑by‑stage.
- Autonomous agents built on these models (PentestGPT, PentestAgent, VulnBot) barely succeed at any stage, confirming that “prompt‑only” automation is insufficient.
- Modularizing the workflow (running each stage with a dedicated prompt) improves success from <10 % (monolithic) to ~31 % overall, but the ceiling remains low.
Practical Implications
- Tool developers should adopt a pipeline architecture: separate LLM calls for reconnaissance, vulnerability mapping, and exploit crafting, rather than a single “do‑everything” prompt (see the sketch after this list).
- Security teams can use PentestEval as a sanity check before trusting LLM‑generated scripts in production; the benchmark highlights where human review is still mandatory (e.g., exploit generation).
- CI/CD security integrations could embed stage‑level LLM checks to automatically flag missing patches or misconfigurations, but must retain a fallback to expert validation.
- LLM vendors now have a concrete, reproducible test suite to benchmark future model releases for security‑oriented tasks, encouraging fine‑tuning on penetration‑testing data.
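As a concrete picture of the pipeline architecture suggested above, the sketch below chains one dedicated LLM call per stage instead of a single monolithic prompt. `query_llm` is a placeholder for whatever chat‑completion client a tool uses, and the prompts are illustrative only; this is not the interface of PentestGPT, PentestAgent, or VulnBot.

```python
# Minimal sketch of a staged pipeline: one dedicated LLM call per stage,
# with each stage's output feeding the next. `query_llm` is a placeholder
# for any chat-completion client; the prompts are illustrative only.
from typing import Callable


def run_staged_pentest(target_ip: str, query_llm: Callable[[str], str]) -> dict:
    results = {}

    # Stage 1: information collection
    results["services"] = query_llm(
        f"Enumerate open ports and service versions on {target_ip}. "
        "Answer as a JSON list."
    )

    # Stages 2-3: weakness gathering and filtering
    results["weaknesses"] = query_llm(
        "Given these services, list plausible CVEs and drop irrelevant ones:\n"
        + results["services"]
    )

    # Stage 4: attack decision-making
    results["decision"] = query_llm(
        "Pick the single most exploitable weakness and justify the choice:\n"
        + results["weaknesses"]
    )

    # Stage 5: exploit generation (human review strongly advised before running)
    results["exploit"] = query_llm(
        "Write a proof-of-concept exploit script for:\n" + results["decision"]
    )
    return results
```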
Limitations & Future Work
- Scenario diversity – The benchmark covers 12 setups; while varied, it does not span all network topologies, cloud services, or IoT devices.
- Prompt engineering bias – Results depend on the quality of the handcrafted prompts; alternative prompt designs could shift performance.
- Sandbox realism – Execution occurs in isolated containers, which may not capture timing‑side‑channel or hardware‑level nuances of real attacks.
- Future directions suggested by the authors include: expanding the dataset to cover more modern attack surfaces (e.g., Kubernetes, serverless), incorporating reinforcement‑learning‑based agents that can iteratively refine exploits, and exploring model fine‑tuning on stage‑specific security corpora.
Authors
- Ruozhao Yang
- Mingfei Cheng
- Gelei Deng
- Tianwei Zhang
- Junjie Wang
- Xiaofei Xie
Paper Information
- arXiv ID: 2512.14233v1
- Categories: cs.SE, cs.AI, cs.CR
- Published: December 16, 2025