[Paper] PenForge: On-the-Fly Expert Agent Construction for Automated Penetration Testing
Source: arXiv - 2601.06910v1
Overview
PenForge tackles a long‑standing pain point in automated security testing: static, one‑size‑fits‑all AI agents either miss complex bugs or can’t generalize across different vulnerability families. By building specialized LLM‑driven agents on the fly, PenForge adapts to the unique context of each target web application, delivering a three‑fold boost in exploit success on a challenging zero‑day benchmark.
Key Contributions
- Dynamic expert‑agent construction: Introduces a pipeline that creates context‑aware LLM agents during a penetration test instead of pre‑defining them.
- Integrated reconnaissance‑to‑exploitation loop: Automatically discovers attack surfaces, selects the most relevant expertise, and spawns a tailored agent to carry out exploitation.
- Empirical breakthrough: Achieves a 30 % exploit success rate (12/40) on the CVE‑Bench zero‑day suite, roughly 3× the rate of the best prior LLM‑based system.
- Open research agenda: Highlights three concrete avenues—richer tool‑usage knowledge, broader benchmark coverage, and explainable human‑in‑the‑loop review—to push the field forward.
Methodology
- Automated Reconnaissance – PenForge first runs lightweight scanners (e.g., OWASP ZAP, custom crawlers) to map endpoints, parameters, and technology stacks.
- Context Extraction – The gathered data is fed to a large language model that extracts salient cues (e.g., “uses outdated jQuery”, “exposes admin API”).
- On‑the‑Fly Agent Synthesis – Based on these cues, PenForge prompts the LLM to generate a micro‑agent equipped with the right exploitation tactics and tool commands (e.g., SQLi payload generators, XSS payloads, Metasploit modules).
- Execution & Feedback – The micro‑agent runs the crafted payloads against the target, monitors responses, and iteratively refines its approach using a short‑term memory buffer.
- Result Aggregation – Successful exploits are logged, and the system optionally hands them off to a human analyst for verification.
The whole pipeline runs autonomously, but each step is modular, allowing developers to swap in alternative scanners, LLM back‑ends, or custom toolkits.
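To make the pipeline above concrete, here is a minimal Python sketch of how the recon → cue‑extraction → agent‑synthesis → execution loop could be wired together. Everything here is an illustrative assumption rather than the authors' implementation: the names (`ReconFinding`, `MicroAgent`, `synthesize_agent`, the `flag{` success check) are hypothetical, and the LLM is abstracted as a plain `llm(prompt) -> str` callable so any back‑end can be swapped in, mirroring the paper's emphasis on modularity.

```python
# Hypothetical sketch of PenForge-style on-the-fly agent construction.
# All names are illustrative assumptions, not the authors' actual code.
from dataclasses import dataclass, field
from typing import Callable

import requests  # used only for the execution step against the target


@dataclass
class ReconFinding:
    """A single observation produced by automated reconnaissance (step 1)."""
    endpoint: str          # e.g. "/admin/api/users"
    parameters: list[str]  # e.g. ["id", "sort"]
    technology: str        # e.g. "outdated jQuery 1.8"


@dataclass
class MicroAgent:
    """A context-specific exploitation agent generated at test time (step 3)."""
    system_prompt: str                                # specialization built from recon cues
    payload_generator: Callable[[], str]              # produces one candidate payload per call
    memory: list[str] = field(default_factory=list)   # short-term feedback buffer (step 4)


def extract_cues(llm: Callable[[str], str], findings: list[ReconFinding]) -> list[str]:
    """Step 2: ask the LLM for salient cues ("exposes admin API", ...)."""
    report = "\n".join(
        f"{f.endpoint} params={f.parameters} tech={f.technology}" for f in findings
    )
    answer = llm(f"List exploitable cues in this recon report:\n{report}")
    return [line.strip("- ").strip() for line in answer.splitlines() if line.strip()]


def synthesize_agent(llm: Callable[[str], str], cues: list[str]) -> MicroAgent:
    """Step 3: build a micro-agent specialized for the observed cues."""
    spec = llm(
        "You are constructing a penetration-testing micro-agent.\n"
        f"Target cues: {cues}\n"
        "Return a system prompt describing the exploitation tactic to use."
    )
    return MicroAgent(
        system_prompt=spec,
        payload_generator=lambda: llm(f"{spec}\nProduce one candidate payload."),
    )


def run_agent(agent: MicroAgent, url: str, max_rounds: int = 5) -> bool:
    """Steps 4-5: send payloads, feed responses back, stop on success."""
    for _ in range(max_rounds):
        payload = agent.payload_generator()
        response = requests.post(url, data={"input": payload}, timeout=10)
        agent.memory.append(f"payload={payload!r} -> status={response.status_code}")
        if "flag{" in response.text:  # success criterion is benchmark-specific; shown for illustration
            return True
    return False
```

In a real run, step 1 would populate the `ReconFinding` list from a scanner such as OWASP ZAP, and `agent.memory` would feed both the iterative refinement loop and the audit trail discussed under Practical Implications below.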
Results & Findings
- Success Rate: 12 out of 40 zero‑day CVEs were fully exploited, a 30 % success rate versus ~10 % for the previous best LLM‑based tester.
- Speed: Average time‑to‑exploit per vulnerability dropped from ~8 min (static agents) to ~4 min, thanks to the targeted nature of the generated agents.
- Diversity: PenForge succeeded across a broader spectrum of vulnerability classes (SQL injection, SSRF, deserialization bugs) than static agents, which tended to excel only in a narrow subset.
- Failure Analysis: Most missed exploits stemmed from insufficient knowledge of obscure third‑party tools (e.g., niche fuzzers) and from ambiguous reconnaissance data that led to sub‑optimal agent specialization.
Practical Implications
- Scalable Red‑Team Automation: Security teams can deploy PenForge as a “continuous pen‑test” service that automatically adapts to new code releases without hand‑crafting test scripts for each component.
- Developer‑Friendly Findings: Because each exploit is generated by a context‑aware agent, the resulting proof‑of‑concept payloads are more realistic and easier for developers to reproduce and patch.
- Tool‑Chain Integration: PenForge’s modular design lets DevSecOps pipelines plug it into CI/CD workflows, automatically triggering a reconnaissance‑to‑exploitation run on staging environments.
- Cost Reduction: By reducing reliance on senior manual pentesters for routine vulnerability hunting, organizations can allocate human expertise to higher‑impact threat modeling and remediation.
- Foundation for Explainable AI Security: The on‑the‑fly agent logs the reasoning steps (recon → cue extraction → agent prompt → payload), offering a transparent audit trail that can be presented to auditors or compliance officers.
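As an illustration of what such an audit trail could look like, the snippet below serializes one exploit attempt as a structured record covering the four reasoning stages. The schema and field names are hypothetical, not the paper's actual log format.

```python
# Hypothetical audit-trail record for one PenForge-style exploit attempt;
# field names and values are illustrative, not the paper's log schema.
import json
from datetime import datetime, timezone

audit_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "target": "https://staging.example.com/login",
    "recon": {"endpoints": ["/login", "/admin/api"], "stack": ["PHP 7.2", "jQuery 1.8"]},
    "cues": ["exposes admin API", "uses outdated jQuery"],
    "agent_prompt": "Specialize in SQL injection against /login form parameters.",
    "payload": "' OR 1=1 --",
    "outcome": {"success": True, "http_status": 200},
}

# One JSON record per attempt lets analysts or compliance tooling replay the
# full recon -> cue extraction -> agent prompt -> payload chain.
print(json.dumps(audit_record, indent=2))
```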
Limitations & Future Work
- Tool‑Usage Knowledge Gaps: The LLM sometimes generates payloads that assume the presence of tools or libraries not installed on the target, limiting exploit reliability.
- Benchmark Scope: Evaluation was confined to the CVE‑Bench suite; broader, industry‑scale benchmarks (including mobile, API‑first, and cloud‑native services) are needed to validate generality.
- Explainability & Human Oversight: While logs are produced, the current system lacks a polished UI for security analysts to review and intervene, which is crucial for building trust in fully automated testing.
PenForge marks a promising shift toward adaptive, LLM‑driven security automation, and its open research agenda invites the community to refine the approach into a production‑ready, trustworthy component of the modern software security arsenal.
Authors
- Huihui Huang
- Jieke Shi
- Junkai Chen
- Ting Zhang
- Yikun Li
- Chengran Yang
- Eng Lieh Ouh
- Lwin Khin Shar
- David Lo
Paper Information
- arXiv ID: 2601.06910v1
- Categories: cs.SE
- Published: January 11, 2026