[Paper] PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
Source: arXiv - 2512.14233v1
Overview
The paper introduces PentestEval, the first benchmark that systematically measures how well large language models (LLMs) can automate penetration testing. By breaking the testing workflow into six distinct stages and providing a large, expert‑curated dataset, the authors expose the current strengths and glaring weaknesses of popular LLMs when they are asked to act like security analysts.
Key Contributions
- Modular benchmark design – Six clearly defined penetration‑testing stages (information collection, weakness gathering & filtering, attack decision‑making, exploit generation, exploit revision, and end‑to‑end success) enable fine‑grained performance analysis.
- Large, realistic dataset – 346 annotated tasks spanning 12 vulnerable real‑world scenarios, complete with ground‑truth exploit steps and outcomes.
- Fully automated evaluation pipeline – Scripts that execute LLM outputs, verify exploit success, and compute stage‑level metrics without human intervention.
- Comprehensive LLM comparison – Empirical study of nine widely used LLMs (including GPT‑4, Claude, and LLaMA‑2) and three existing LLM‑powered pentesting tools (PentestGPT, PentestAgent, VulnBot).
- Insightful findings on modularization – Demonstrates that a modular, stage‑by‑stage approach dramatically improves success rates versus monolithic “black‑box” prompting.
Methodology
- Task Decomposition – The authors mapped a typical penetration‑testing workflow into six sequential stages, each with a concrete input/output contract (e.g., “given a target IP, list open services”); a sketch of such contracts appears after this list.
- Scenario Construction – Twelve vulnerable environments were built using common CVEs and misconfigurations (e.g., outdated web servers, insecure Docker setups).
- Ground‑Truth Annotation – Security experts manually performed each stage on every scenario, producing the “gold‑standard” outputs (service fingerprints, CVE IDs, exploit scripts, etc.).
- Prompt Templates – For each stage, a set of carefully engineered prompts was created (few‑shot examples, system messages, and explicit instructions); an illustrative template appears after this list.
- LLM Evaluation – Nine LLMs were queried with the same prompts. Their responses were fed into an automated pipeline (sketched after this list) that:
  - Parses the output,
  - Executes generated commands or exploit code in a sandbox,
  - Checks whether the exploit succeeded (e.g., shell access, privilege escalation).
- Metrics – Stage‑level accuracy (precision/recall for information gathering), decision‑making correctness, exploit generation success rate, and overall end‑to‑end success percentage.
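To make the stage contracts concrete, the sketch below expresses the six stages' inputs and outputs as typed records in Python. All class and field names are illustrative assumptions made for this summary, not the benchmark's actual schema.

```python
# Illustrative per-stage input/output contracts; the names are assumptions,
# not PentestEval's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class InfoCollectionTask:
    """Stage 1 input: given a target IP, list open services."""
    target_ip: str


@dataclass
class ServiceFingerprint:
    """Stage 1 output item: one discovered service."""
    port: int
    service: str   # e.g. "http", "ssh"
    version: str   # e.g. "Apache 2.4.29"


@dataclass
class WeaknessCandidates:
    """Stages 2-3 output: gathered CVE IDs, then the filtered shortlist."""
    gathered: List[str]
    filtered: List[str]


@dataclass
class AttackDecision:
    """Stage 4 output: the single vector chosen for exploitation."""
    cve_id: str
    rationale: str


@dataclass
class ExploitAttempt:
    """Stages 5-6 output: generated exploit plus the sandbox verdict."""
    exploit_code: str
    succeeded: bool = False
    sandbox_feedback: str = ""
```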
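The stage prompts combine a system message, explicit output instructions, and few‑shot examples. The template below is a hypothetical reconstruction for the information‑collection stage; its wording, its JSON output format, and the `render_messages` helper are assumptions, not the authors' actual prompts.

```python
# Hypothetical prompt template for the information-collection stage; the real
# PentestEval prompts differ in wording and structure.
INFO_COLLECTION_PROMPT = {
    "system": (
        "You are a penetration-testing assistant. Given a target IP address, "
        "enumerate open ports and fingerprint the services behind them. "
        "Respond only with a JSON list of objects with keys port, service, version."
    ),
    "few_shot": [
        {
            "input": "Target: 10.0.0.5",
            "output": '[{"port": 80, "service": "http", "version": "Apache 2.4.29"}]',
        }
    ],
    "user": "Target: {target_ip}",
}


def render_messages(template: dict, **kwargs) -> list:
    """Expand a template into a chat-message list for an LLM API call."""
    messages = [{"role": "system", "content": template["system"]}]
    for shot in template["few_shot"]:
        messages.append({"role": "user", "content": shot["input"]})
        messages.append({"role": "assistant", "content": shot["output"]})
    messages.append({"role": "user", "content": template["user"].format(**kwargs)})
    return messages
```

Usage would be something like `render_messages(INFO_COLLECTION_PROMPT, target_ip="10.0.0.7")`, producing a standard chat‑message list.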
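The automated evaluation loop (parse, execute in a sandbox, verify) and the stage‑level metrics can be pictured roughly as below. The helper names, the Docker invocation, and the success markers are assumptions for illustration; the paper's actual harness may work differently.

```python
# Rough sketch of the evaluation loop: parse the LLM output, run it in an
# isolated container, and check for a success marker. All helper names and
# the Docker invocation are illustrative assumptions.
import json
import subprocess


def parse_services(llm_output: str) -> list:
    """Stage 1 outputs are expected as JSON; anything unparsable scores zero."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return []


def run_in_sandbox(exploit_code: str, image: str = "pentesteval/sandbox") -> str:
    """Execute generated exploit code inside a throwaway container."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", "--network", "testnet", image,
             "python3", "-c", exploit_code],
            capture_output=True, text=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return proc.stdout + proc.stderr


def exploit_succeeded(sandbox_log: str) -> bool:
    """Very coarse success check: look for evidence of shell access."""
    return "uid=0(root)" in sandbox_log or "SHELL_OBTAINED" in sandbox_log


def precision_recall(predicted: set, gold: set) -> tuple:
    """Stage-level scoring for information gathering."""
    if not predicted or not gold:
        return 0.0, 0.0
    true_pos = len(predicted & gold)
    return true_pos / len(predicted), true_pos / len(gold)
```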
Results & Findings
| Stage | Best‑performing LLM (accuracy) | Typical failure mode |
|---|---|---|
| Information Collection | GPT‑4 (78%) | Missed hidden services, noisy output |
| Weakness Gathering & Filtering | Claude‑2 (62%) | Over‑generating CVE IDs, low relevance |
| Attack Decision‑Making | LLaMA‑2‑70B (55%) | Choosing non‑exploitable vectors |
| Exploit Generation | GPT‑4 (48%) | Syntax errors, missing payloads |
| Exploit Revision | Claude‑2 (41%) | Inability to adapt to sandbox feedback |
| End‑to‑End Success | ~31% (overall pipeline) | — |
Key Takeaways
- Even the strongest LLM (GPT‑4) fails on more than two‑thirds of the tasks when evaluated stage‑by‑stage.
- Autonomous agents built on these models (PentestGPT, PentestAgent, VulnBot) barely succeed at any stage, confirming that “prompt‑only” automation is insufficient.
- Modularizing the workflow (running each stage with a dedicated prompt) improves success from <10 % (monolithic) to ~31 % overall, but the ceiling remains low.
Practical Implications
- Tool developers should adopt a pipeline architecture: separate LLM calls for reconnaissance, vulnerability mapping, and exploit crafting, rather than a single “do‑everything” prompt (see the sketch after this list).
- Security teams can use PentestEval as a sanity check before trusting LLM‑generated scripts in production; the benchmark highlights where human review is still mandatory (e.g., exploit generation).
- CI/CD security integrations could embed stage‑level LLM checks to automatically flag missing patches or misconfigurations, but must retain a fallback to expert validation.
- LLM vendors now have a concrete, reproducible test suite to benchmark future model releases for security‑oriented tasks, encouraging fine‑tuning on penetration‑testing data.
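As a concrete picture of the pipeline architecture suggested above, the sketch below chains one dedicated LLM call per stage instead of a single monolithic prompt. `query_llm` is a placeholder for whatever chat‑completion client a tool uses, and the prompts are illustrative only; this is not the interface of PentestGPT, PentestAgent, or VulnBot.

```python
# Minimal sketch of a staged pipeline: one dedicated LLM call per stage,
# with each stage's output feeding the next. `query_llm` is a placeholder
# for any chat-completion client; the prompts are illustrative only.
from typing import Callable


def run_staged_pentest(target_ip: str, query_llm: Callable[[str], str]) -> dict:
    results = {}

    # Stage 1: information collection
    results["services"] = query_llm(
        f"Enumerate open ports and service versions on {target_ip}. "
        "Answer as a JSON list."
    )

    # Stages 2-3: weakness gathering and filtering
    results["weaknesses"] = query_llm(
        "Given these services, list plausible CVEs and drop irrelevant ones:\n"
        + results["services"]
    )

    # Stage 4: attack decision-making
    results["decision"] = query_llm(
        "Pick the single most exploitable weakness and justify the choice:\n"
        + results["weaknesses"]
    )

    # Stage 5: exploit generation (human review strongly advised before running)
    results["exploit"] = query_llm(
        "Write a proof-of-concept exploit script for:\n" + results["decision"]
    )
    return results
```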
Limitations & Future Work
- Scenario diversity – The benchmark covers 12 setups; while varied, it does not span all network topologies, cloud services, or IoT devices.
- Prompt engineering bias – Results depend on the quality of the handcrafted prompts; alternative prompt designs could shift performance.
- Sandbox realism – Execution occurs in isolated containers, which may not capture timing‑side‑channel or hardware‑level nuances of real attacks.
- Future directions suggested by the authors include: expanding the dataset to cover more modern attack surfaces (e.g., Kubernetes, serverless), incorporating reinforcement‑learning‑based agents that can iteratively refine exploits, and exploring model fine‑tuning on stage‑specific security corpora.
Authors
- Ruozhao Yang
- Mingfei Cheng
- Gelei Deng
- Tianwei Zhang
- Junjie Wang
- Xiaofei Xie
Paper Information
- arXiv ID: 2512.14233v1
- Categories: cs.SE, cs.AI, cs.CR
- Published: December 16, 2025