[Paper] PBFuzz: Agentic Directed Fuzzing for PoV Generation

Published: December 4, 2025 at 04:34 AM EST
4 min read
Source: arXiv - 2512.04611v1

Overview

The paper introduces PBFuzz, an “agentic” directed fuzzing system that automatically generates Proof‑of‑Vulnerability (PoV) inputs. By mimicking the iterative reasoning of a human security analyst, PBFuzz dramatically speeds up the discovery of exploitable bugs, cutting the time‑to‑exposure from hours to minutes while keeping the cost of large‑language‑model (LLM) calls low.

Key Contributions

  • Agentic fuzzing loop that combines code‑level reasoning, hypothesis generation, and feedback‑driven refinement, emulating how experts hunt for PoVs.
  • Semantic constraint extraction via autonomous code analysis, separating reachability (getting to the vulnerable code) from triggering (activating the bug); a sketch of this split follows the list.
  • Custom program‑analysis plug‑ins that feed precise, target‑specific information to the LLM without exposing the whole codebase.
  • Persistent memory store that retains hypotheses across iterations, preventing “drift” and enabling cumulative learning.
  • Property‑based testing engine that solves extracted constraints while preserving the original input structure, improving efficiency over raw symbolic execution.
  • Empirical validation on the Magma benchmark: 57 vulnerabilities triggered (17 uniquely), 25.6× faster than AFL++ + CmpLog, with an average API cost of only $1.83 per vulnerability.
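
To make the reachability/triggering split and the persistent hypothesis memory concrete, here is a minimal illustrative sketch. It is not the paper's implementation; every class and field name below is invented for this post.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """A single semantic condition over the program input (hypothetical schema)."""
    location: str      # e.g. "png_read_chunk:412"
    expression: str    # e.g. "chunk_length > 0x7fffffff"


@dataclass
class TargetConstraints:
    """PBFuzz separates how to reach the code from how to trigger the bug."""
    reachability: list[Constraint] = field(default_factory=list)  # conditions to reach the vulnerable site
    triggering: list[Constraint] = field(default_factory=list)    # conditions that make the bug manifest


@dataclass
class Hypothesis:
    """An LLM-proposed candidate input, retained in memory across iterations."""
    rationale: str            # natural-language reasoning from the LLM
    candidate_input: bytes    # concrete (partial) byte sequence
    satisfied: list[str] = field(default_factory=list)  # constraints confirmed by execution feedback
```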

Methodology

  1. Target selection – The system receives a binary and a list of suspected vulnerable locations (e.g., a strcmp check).
  2. Semantic analysis – Lightweight static analyses (control‑flow, data‑flow, taint) automatically infer two sets of constraints:
    • Reachability: conditions needed to execute the vulnerable instruction.
    • Triggering: conditions that cause the vulnerability to manifest (e.g., specific buffer sizes, magic values).
  3. Hypothesis generation – The extracted constraints are fed to an LLM that proposes concrete input “hypotheses” (partial byte sequences, format strings, etc.).
  4. Property‑based testing – Each hypothesis is turned into a test case that is executed against the program under a fuzzing harness. The harness checks whether the constraints are satisfied, using fast instrumentation (e.g., AFL‑style coverage).
  5. Feedback loop – Execution results (crash, coverage, constraint satisfaction) are stored in a persistent memory module. The LLM revisits the memory, refines or discards hypotheses, and generates new candidates.
  6. Termination – The loop stops when a PoV is found (crash with the targeted bug) or the time budget expires.

The whole pipeline runs autonomously, requiring only the initial target specification and a modest LLM API quota.
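
Putting the six steps together, a minimal sketch of the loop might look like the following. This is illustrative only: the callables passed in stand for PBFuzz's analysis plug-ins, LLM client, constraint solver, and fuzzing harness, and their names and return shapes are assumptions rather than the authors' API.

```python
import time


def pbfuzz_loop(extract_constraints, propose, solve, run_harness, budget_s=30 * 60):
    """Illustrative agentic loop (not the authors' code). The four callables are
    hypothetical stand-ins for PBFuzz's analysis plug-ins, LLM, solver, and harness."""
    constraints = extract_constraints()          # step 2: reachability + triggering constraints
    memory = []                                  # step 5: persistent hypothesis store
    deadline = time.time() + budget_s

    while time.time() < deadline:                # step 6: stop when the time budget expires
        # Step 3: the LLM proposes concrete input hypotheses from constraints + past feedback.
        for hypothesis in propose(constraints, memory):
            # Step 4: property-based testing solves constraints while preserving input structure.
            test_input = solve(hypothesis, constraints)
            result = run_harness(test_input)     # crash / coverage / constraint-satisfaction signal

            # Step 5: feedback is retained so later iterations refine rather than restart.
            memory.append((hypothesis, result))
            if result.get("crash_at_target"):    # PoV found: crash at the targeted bug
                return test_input
    return None                                  # budget exhausted without a PoV
```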

Results & Findings

| Metric | PBFuzz | Best Baseline (AFL++ + CmpLog) |
| --- | --- | --- |
| Vulnerabilities triggered | 57 (17 exclusive) | 40 |
| Median time-to-exposure | 339 s | 8,680 s |
| Time budget per target | 30 min | 24 h |
| API cost per PoV | $1.83 | N/A (no LLM) |
| Success rate within budget | 84% | 58% |

Key takeaways

  • Speed – By focusing on semantic constraints, PBFuzz avoids the blind exploration that dominates traditional greybox fuzzers.
  • Coverage of hard bugs – The property‑based testing step preserves input structure, allowing the system to satisfy complex triggering conditions that fuzzers typically miss.
  • Cost‑effectiveness – Even with LLM calls, the per‑vulnerability expense stays under $2, making the approach viable for continuous integration pipelines.

Practical Implications

  • Security‑focused CI/CD – Teams can plug PBFuzz into nightly builds to automatically generate PoVs for newly introduced code paths, catching exploitable bugs before release.
  • Bug‑bounty automation – Researchers can use the tool to quickly produce PoVs for disclosed CVEs, reducing manual reverse‑engineering effort.
  • Fuzzing-as-a‑Service – Cloud providers could offer an “agentic fuzzing” tier that leverages LLM reasoning to deliver faster results for high‑value targets.
  • Toolchain integration – Because PBFuzz works with standard instrumentation (e.g., AFL’s coverage map) and accepts custom analysis plug‑ins, it can be layered on top of existing fuzzers without a complete rewrite.
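
As a rough illustration of the plug-in idea, a custom analysis could be exposed to the agent as a small object that answers queries about a specific program location. The interface below is invented for this post, not taken from the paper.

```python
from typing import Protocol


class AnalysisPlugin(Protocol):
    """Hypothetical plug-in shape: supply precise, target-specific facts to the LLM
    without exposing the whole codebase."""
    def query(self, location: str) -> dict: ...


class TaintSummaryPlugin:
    """Toy example: report which input byte offsets influence a given program location."""
    def __init__(self, taint_db: dict[str, list[int]]):
        self.taint_db = taint_db

    def query(self, location: str) -> dict:
        return {"tainted_input_offsets": self.taint_db.get(location, [])}
```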

Overall, the approach bridges the gap between human expertise and automated testing, promising a new class of “smart” fuzzers that can reason about program semantics.

Limitations & Future Work

  • LLM dependence – The quality of generated hypotheses hinges on the underlying model; edge‑case bugs may still require manual guidance.
  • Static analysis precision – The lightweight analyses may miss deep data‑flow relationships, leading to incomplete constraint sets for highly obfuscated code.
  • Scalability to large codebases – While the current prototype handles isolated binaries well, scaling the semantic extraction to massive applications could increase overhead.
  • Future directions suggested by the authors include: tighter integration with symbolic execution for harder constraints, adaptive budgeting of LLM calls based on confidence scores, and extending the framework to multi‑process or networked targets.

Authors

  • Haochen Zeng
  • Andrew Bao
  • Jiajun Cheng
  • Chengyu Song

Paper Information

  • arXiv ID: 2512.04611v1
  • Categories: cs.CR, cs.SE
  • Published: December 4, 2025