[Paper] What Makes a Good LLM Agent for Real-world Penetration Testing?
Source: arXiv - 2602.17622v1
Overview
The paper investigates why large‑language‑model (LLM) agents perform so inconsistently when tasked with real‑world penetration testing. By dissecting 28 existing systems and testing five representative agents, the authors pinpoint two fundamental failure modes and propose Excalibur, a new agent that combines robust tooling with “difficulty‑aware” planning to dramatically improve attack success rates.
Key Contributions
- Systematic taxonomy of failures: Identifies Type A (tooling & prompt gaps) and Type B (planning & state‑management gaps) failures across a wide range of LLM‑based pentesting agents.
- Root cause analysis of Type B failures: Shows that all agents, regardless of model size, lack real‑time task‑difficulty estimation, leading to wasted effort and context overflow.
- Excalibur architecture:
  - Tool & Skill Layer – typed interfaces and retrieval‑augmented knowledge that eliminate Type A failures.
  - Task Difficulty Assessment (TDA) – quantifies four dimensions (horizon, evidence confidence, context load, historical success) to estimate the tractability of each sub‑task.
  - Evidence‑Guided Attack Tree Search (EGATS) – uses TDA scores to balance exploration vs. exploitation during attack planning.
- Empirical validation: Demonstrates up to 91 % task completion on CTF benchmarks (39‑49 % relative gain) and compromises 4/5 hosts in the GOAD Active Directory environment, outperforming prior state‑of‑the‑art agents.
- Insight on scaling limits: Shows that merely using larger LLMs does not solve Type B failures; intelligent planning is required.
Methodology
- Survey & Classification – Collected 28 publicly available LLM‑based pentesting systems, categorizing their architectures, toolsets, and prompting strategies.
- Benchmark Suite – Built three progressively harder testbeds:
  - Simple CTF‑style challenges (single‑step exploits).
  - Multi‑step attack chains requiring state tracking.
  - Realistic enterprise scenario (GOAD Active Directory).
- Failure Mode Diagnosis – Traced each agent’s failures back to missing capabilities (Type A) or to poor decision‑making and context handling (Type B).
- Design of Excalibur –
  - Tool & Skill Layer supplies a typed API catalog (e.g., port scanner, credential dumper) and augments prompts with knowledge retrieved from security databases.
  - Task Difficulty Assessment computes a scalar difficulty score per candidate sub‑task using:
    - Horizon estimation (how many steps ahead are needed)
    - Evidence confidence (certainty of gathered intel)
    - Context load (size of prompt/context consumed)
    - Historical success (past success rate on similar tasks)
  - EGATS uses these scores to prioritize low‑cost, high‑value branches in the attack tree, pruning paths that would exceed context limits.
- Evaluation – Ran Excalibur and baseline agents on the three benchmarks, measuring task completion ratio, number of compromised hosts, and LLM token usage.
Results & Findings
| Benchmark | Baseline Avg. Completion | Excalibur Completion | Relative Gain |
|---|---|---|---|
| Simple CTF | ~55 % | 91 % | +39 % |
| Multi‑step CTF | ~48 % | 84 % | +46 % |
| GOAD AD (5 hosts) | 2 compromised | 4 compromised | +100 % |
- Token efficiency: Difficulty‑aware pruning reduced context overflow by ~30 %, allowing the LLM to stay within token limits longer.
- Model‑agnostic improvement: Gains were consistent across frontier LLMs (GPT‑4, Claude‑2, Llama‑2‑70B), confirming that the advantage stems from planning, not raw model size.
- Failure reduction: Type A failures dropped to near‑zero thanks to the standardized tool layer; Type B failures were cut by >70 % after integrating TDA.
Practical Implications
- More reliable automated red‑team tools: Security teams can adopt Excalibur‑style agents to continuously probe internal assets without the high false‑positive rate of current LLM bots.
- Cost‑effective pentesting: By avoiding wasted token consumption, organizations can run large‑scale assessments on cheaper LLM APIs while still achieving deep coverage.
- Framework for other domains: The difficulty‑aware planning paradigm can be transplanted to LLM agents in DevOps (e.g., automated incident response), code synthesis, or data‑pipeline orchestration where task branching and context limits are critical.
- Better integration with existing tooling: The typed Tool & Skill Layer aligns naturally with security frameworks (Metasploit, Nmap, BloodHound), simplifying deployment in CI/CD pipelines for continuous security validation.
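The typed Tool & Skill Layer mentioned above could be mirrored with a structural interface that every wrapped tool must satisfy, so the planner always receives well-typed results instead of free-form text. The sketch below is a hypothetical illustration: `Tool`, `NmapWrapper`, and `PortScanResult` are invented names, and the scan result is stubbed rather than produced by a real `nmap` invocation.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class PortScanResult:
    """Structured output the planner can consume without parsing prose."""
    host: str
    open_ports: list[int]

class Tool(Protocol):
    """Structural interface every tool in the catalog must satisfy."""
    name: str
    def run(self, target: str) -> object: ...

@dataclass
class NmapWrapper:
    """Hypothetical adapter exposing a port scanner through the typed
    interface. A real version would shell out to nmap and parse its
    output; here the result is stubbed for illustration."""
    name: str = "port_scanner"

    def run(self, target: str) -> PortScanResult:
        # Real implementation would be something like:
        #   subprocess.run(["nmap", "-p-", target], capture_output=True)
        # followed by output parsing; the stub returns fixed data.
        return PortScanResult(host=target, open_ports=[22, 80])
```

Because the planner only depends on the `Tool` protocol, swapping Nmap for another scanner (or adding a BloodHound or Metasploit adapter) would not require touching the planning logic, which is what makes this layer attractive for CI/CD integration.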
Limitations & Future Work
- Domain specificity: Current implementation focuses on network‑level penetration testing; extending to web‑application or cloud‑native attack surfaces may require new tool abstractions.
- Difficulty metric calibration: TDA relies on heuristics (e.g., horizon estimation) tuned on the authors’ benchmarks; broader validation is needed for diverse enterprise environments.
- Real‑time adaptation: Excalibur estimates difficulty before execution but does not yet adjust on the fly when unexpected evidence appears—a promising direction for future research.
- Human‑in‑the‑loop safety: The paper does not address safeguards against malicious misuse of highly capable agents; integrating policy‑driven constraints is an open challenge.
Authors
- Gelei Deng
- Yi Liu
- Yuekang Li
- Ruozhao Yang
- Xiaofei Xie
- Jie Zhang
- Han Qiu
- Tianwei Zhang
Paper Information
- arXiv ID: 2602.17622v1
- Categories: cs.CR, cs.SE
- Published: February 19, 2026