[Paper] A Dual-Loop Agent Framework for Automated Vulnerability Reproduction

Published: February 5, 2026 at 09:47 AM EST

Source: arXiv - 2602.05721v1

Overview

The paper introduces Cve2PoC, a dual‑loop framework that uses large language model (LLM) agents to turn CVE descriptions into working proof‑of‑concept (PoC) exploits automatically. By separating strategic planning from tactical code generation, the system aims to cut the manual effort and expertise that reproducing a vulnerability traditionally requires.

Key Contributions

  • Dual‑loop architecture (Strategic Planner ↔ Tactical Executor ↔ Adaptive Refiner) that routes failures to the appropriate level of remediation.
  • Plan‑execute‑evaluate paradigm that first creates a high‑level attack plan, then incrementally builds and validates PoC code.
  • Progressive verification within the Tactical Loop, allowing early detection of syntax or API‑usage errors before full execution.
  • Empirical validation on two large benchmarks (SecBench.js – 617 CVEs, PatchEval – 617 CVEs) showing state‑of‑the‑art reproduction rates (82.9 % and 54.3 %).
  • Human‑centered evaluation confirming that generated PoCs match human‑written exploits in readability and reusability.

Methodology

  1. Strategic Planner – An LLM (e.g., GPT‑4) parses the CVE text and the target codebase, extracts the vulnerability semantics (e.g., “use‑after‑free in malloc”), and produces a structured attack plan (steps, required primitives, entry points).
  2. Tactical Executor – A second LLM takes the plan and incrementally writes PoC snippets. After each snippet, a lightweight sandbox runs a progressive verification (syntax check → unit‑test → full exploit run).
  3. Adaptive Refiner – An evaluation module inspects the sandbox output.
    • If the failure is code‑level (e.g., missing import, wrong API usage), the loop stays in the Tactical branch to refine the snippet.
    • If the failure is strategy‑level (e.g., the attack vector does not trigger the bug), the system jumps back to the Strategic Planner to revise the high‑level plan.
  4. The process repeats until the PoC either reproduces the vulnerability or a timeout/iteration limit is reached.
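
The four steps above can be sketched as two nested loops. This is only an illustrative skeleton, not the authors' implementation: the names `dual_loop`, `Verdict`, and `Result`, the callable signatures, and the iteration limits are all assumptions made for this example. Treating an exhausted tactical budget as a reason to replan is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class Verdict(Enum):
    SUCCESS = auto()           # PoC reproduced the vulnerability
    CODE_FAILURE = auto()      # e.g. missing import, wrong API usage
    STRATEGY_FAILURE = auto()  # e.g. attack vector never reaches the bug


@dataclass
class Result:
    verdict: Verdict
    feedback: str = ""


def dual_loop(
    cve_text: str,
    plan: Callable[[str, str], str],       # (cve_text, feedback) -> attack plan
    write_poc: Callable[[str, str], str],  # (plan, feedback) -> PoC source
    evaluate: Callable[[str], Result],     # PoC source -> sandbox verdict
    max_plans: int = 3,                    # hypothetical strategic budget
    max_fixes: int = 5,                    # hypothetical tactical budget
) -> Optional[str]:
    """Return a working PoC, or None once the iteration budget is spent."""
    plan_feedback = ""
    for _ in range(max_plans):                 # strategic (outer) loop
        attack_plan = plan(cve_text, plan_feedback)
        code_feedback = ""
        for _ in range(max_fixes):             # tactical (inner) loop
            poc = write_poc(attack_plan, code_feedback)
            result = evaluate(poc)
            if result.verdict is Verdict.SUCCESS:
                return poc
            if result.verdict is Verdict.STRATEGY_FAILURE:
                plan_feedback = result.feedback
                break                          # escalate to the planner
            code_feedback = result.feedback    # stay tactical, refine the code
        else:
            # Assumption: running out of tactical attempts also triggers a replan.
            plan_feedback = "tactical budget exhausted: " + code_feedback
    return None
```

In a real system `plan` and `write_poc` would wrap LLM calls and `evaluate` would wrap the sandbox. Passing them in as plain callables keeps the control flow testable with stubs.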

The separation of concerns prevents the classic “debug‑the‑code‑while‑still‑guessing‑the‑attack” dead‑end that plagues prior single‑loop LLM approaches.
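
The progressive verification in step 2 amounts to running checks from cheapest to most expensive and stopping at the first failure. The sketch below assumes a Python PoC purely so that the syntax stage can use the built-in `compile`; the paper's benchmarks target other languages. The helper names and the `exploit` entry point are hypothetical, and a real pipeline would end with a sandboxed exploit run.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check returning (passed, diagnostic message).
Stage = Tuple[str, Callable[[str], Tuple[bool, str]]]


def syntax_check(source: str) -> Tuple[bool, str]:
    """Cheapest stage: does the snippet parse at all?"""
    try:
        compile(source, "<poc>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


def has_entry_point(source: str) -> Tuple[bool, str]:
    """Stand-in for a unit-test stage: looks for a hypothetical exploit() function."""
    ok = "def exploit(" in source
    return ok, ("" if ok else "no exploit() entry point defined")


def progressive_verify(source: str, stages: List[Stage]) -> Tuple[bool, str, str]:
    """Run stages in order and stop at the first failure.

    Returns (passed, name of the failing stage or "", diagnostic message).
    """
    for name, check in stages:
        ok, message = check(source)
        if not ok:
            return False, name, message
    return True, "", ""


stages: List[Stage] = [
    ("syntax", syntax_check),
    ("entry-point", has_entry_point),
    # A full sandboxed exploit run would be appended here as the last stage.
]
```

The failing stage name gives the refiner a cheap signal: a failure in an early stage is almost certainly code‑level, so it can be fixed without spending a full exploit run.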

Results & Findings

Benchmark     # CVEs   Success Rate (Cve2PoC)   Best Baseline   Gain (percentage points)
SecBench.js   617      82.9 %                   71.6 %          +11.3
PatchEval     617      54.3 %                   33.9 %          +20.4

  • Speed: Average time to a successful PoC was ~2.3 min per CVE, ~30 % faster than the baseline.
  • Code Quality: Human reviewers rated the generated PoCs on a 5‑point Likert scale; scores for readability (4.3) and reusability (4.1) were statistically indistinguishable from human‑written exploits.
  • Failure Distribution: 68 % of failures were resolved in the Tactical Loop, while 32 % required strategic replanning, which supports the case for the dual‑loop split.
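
The reported gains are absolute differences in success rate (percentage points), not relative improvements. The short calculation below, using only the figures reported above, shows both views. The function name `gains` is just for illustration.

```python
from typing import Dict, Tuple

# (Cve2PoC success rate %, best baseline success rate %), as reported above.
results: Dict[str, Tuple[float, float]] = {
    "SecBench.js": (82.9, 71.6),
    "PatchEval": (54.3, 33.9),
}


def gains(ours: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in % of baseline)."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (ours, baseline) in results.items():
    absolute, relative = gains(ours, baseline)
    print(f"{name}: +{absolute} pp absolute, +{relative} % relative")
```

Relative to the baseline, the improvement is about 15.8 % on SecBench.js and about 60.2 % on PatchEval, so the PatchEval gain is considerably larger in relative terms than the percentage‑point figures alone suggest.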

Practical Implications

  • Security Operations Centers (SOCs): Automating PoC generation can accelerate triage pipelines, allowing analysts to focus on impact assessment rather than low‑level exploit coding.
  • Vulnerability Management Platforms: Integration of Cve2PoC can enrich CVE entries with executable demos, improving communication with developers and auditors.
  • Bug‑Bounty & Red‑Team Tooling: Teams can quickly prototype exploits for newly disclosed bugs, shortening the feedback loop between discovery and disclosure.
  • Education & Training: The framework can serve as a teaching aid, automatically producing step‑by‑step exploit walkthroughs for classroom labs.
  • Patch Verification: By reproducing the vulnerability before and after a patch, organizations can automatically confirm that a fix truly mitigates the issue.
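
The patch‑verification use case can be sketched as running the same PoC against a vulnerable and a patched build and comparing the outcomes. This is a minimal sketch under stated assumptions, not part of Cve2PoC: it assumes the PoC exits with status 0 only when it triggers the bug, and the function names are hypothetical.

```python
import subprocess
from typing import Sequence


def poc_triggers(cmd: Sequence[str], timeout: float = 30.0) -> bool:
    """Run the PoC command; assumed convention: exit status 0 means 'bug triggered'."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption: a hang counts as "not triggered". For denial-of-service
        # bugs a timeout may itself be the symptom, so adapt this as needed.
        return False
    return done.returncode == 0


def patch_verdict(vulnerable_cmd: Sequence[str], patched_cmd: Sequence[str]) -> str:
    """Compare PoC behaviour before and after the patch."""
    before = poc_triggers(vulnerable_cmd)
    after = poc_triggers(patched_cmd)
    if before and not after:
        return "fix confirmed"
    if before and after:
        return "fix ineffective"
    # The PoC never reproduced the bug on the vulnerable build.
    return "inconclusive"
```

The "inconclusive" case matters: a PoC that fails on both builds says nothing about the patch, so it should not be reported as a successful fix.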

Limitations & Future Work

  • Language Scope: The current implementation focuses on JavaScript (SecBench.js) and C/C++ (PatchEval). Extending to other ecosystems (e.g., Java, Rust) will require additional prompt engineering and sandbox support.
  • LLM Dependence: Performance hinges on the underlying LLM’s knowledge of APIs and security primitives; outdated or proprietary libraries may cause hallucinations.
  • Resource Overhead: Running sandboxed executions for each refinement step can be compute‑intensive; future work could explore static analysis shortcuts to prune unpromising paths earlier.
  • Adversarial Robustness: An attacker could potentially feed malformed CVE texts to confuse the planner; hardening the parsing stage is an open research direction.

Overall, Cve2PoC demonstrates that a carefully structured LLM agent loop can turn manual exploit writing into a scalable, repeatable process, which opens the way to faster and more reliable vulnerability management.

Authors

  • Bin Liu
  • Yanjie Zhao
  • Zhenpeng Chen
  • Guoai Xu
  • Haoyu Wang

Paper Information

  • arXiv ID: 2602.05721v1
  • Categories: cs.SE
  • Published: February 5, 2026