[Paper] PBFuzz: Agentic Directed Fuzzing for PoV Generation
Source: arXiv - 2512.04611v1
Overview
The paper introduces PBFuzz, an “agentic” directed fuzzing system that automatically generates Proof‑of‑Vulnerability (PoV) inputs. By mimicking the iterative reasoning of a human security analyst, PBFuzz speeds up the discovery of exploitable bugs, cutting median time‑to‑exposure from hours to minutes while keeping the cost of large‑language‑model (LLM) calls low.
Key Contributions
- Agentic fuzzing loop that combines code‑level reasoning, hypothesis generation, and feedback‑driven refinement, emulating how experts hunt for PoVs.
- Semantic constraint extraction via autonomous code analysis, separating reachability (getting to vulnerable code) from triggering (activating the bug); a data‑model sketch follows this list.
- Custom program‑analysis plug‑ins that feed precise, target‑specific information to the LLM without exposing the whole codebase.
- Persistent memory store that retains hypotheses across iterations, preventing “drift” and enabling cumulative learning.
- Property‑based testing engine that solves extracted constraints while preserving the original input structure, improving efficiency over raw symbolic execution.
- Empirical validation on the Magma benchmark: 57 vulnerabilities triggered (17 uniquely), 25.6× faster than AFL++ + CmpLog, with an average API cost of only $1.83 per vulnerability.
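To make the reachability/triggering split concrete, here is a minimal Python sketch of how such constraints could be represented. The `Constraint` and `TargetSpec` names and the PNG‑flavoured example are hypothetical illustrations, not PBFuzz's actual data model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Constraint:
    """A named predicate over a candidate input."""
    description: str
    check: Callable[[bytes], bool]


@dataclass
class TargetSpec:
    """What the agent needs to know about one suspected vulnerable location."""
    location: str
    reachability: List[Constraint] = field(default_factory=list)  # reach the code
    triggering: List[Constraint] = field(default_factory=list)    # make the bug fire

    def satisfied(self, candidate: bytes) -> bool:
        return all(c.check(candidate) for c in self.reachability + self.triggering)


# Hypothetical example: reaching a PNG chunk handler requires a valid signature,
# triggering it requires an oversized declared chunk length.
target = TargetSpec(
    location="png_handle_iCCP (illustrative)",
    reachability=[Constraint("starts with the PNG magic",
                             lambda b: b.startswith(b"\x89PNG\r\n\x1a\n"))],
    triggering=[Constraint("declared chunk length exceeds 0x7FFFFFFF",
                           lambda b: len(b) >= 12 and
                                     int.from_bytes(b[8:12], "big") > 0x7FFFFFFF)],
)
print(target.satisfied(b"\x89PNG\r\n\x1a\n" + b"\xff\xff\xff\xff"))
```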
Methodology
- Target selection – The system receives a binary and a list of suspected vulnerable locations (e.g., a `strcmp` check).
- Semantic analysis – Lightweight static analyses (control‑flow, data‑flow, taint) automatically infer two sets of constraints:
  - Reachability: conditions needed to execute the vulnerable instruction.
  - Triggering: conditions that cause the vulnerability to manifest (e.g., specific buffer sizes, magic values).
- Hypothesis generation – The extracted constraints are fed to an LLM that proposes concrete input “hypotheses” (partial byte sequences, format strings, etc.).
- Property‑based testing – Each hypothesis is turned into a test case that is executed against the program under a fuzzing harness. The harness checks whether the constraints are satisfied, using fast instrumentation (e.g., AFL‑style coverage).
- Feedback loop – Execution results (crash, coverage, constraint satisfaction) are stored in a persistent memory module. The LLM revisits the memory, refines or discards hypotheses, and generates new candidates.
- Termination – The loop stops when a PoV is found (crash with the targeted bug) or the time budget expires.
The whole pipeline runs autonomously, requiring only the initial target specification and a modest LLM API quota.
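The following self‑contained sketch illustrates the shape of this loop. The helpers `ask_llm_for_hypotheses` and `run_target` are stand‑in stubs (no real LLM or target is invoked) and are not PBFuzz's interfaces; they only show how hypotheses, execution feedback, and persistent memory fit together.

```python
import time
from typing import Dict, List, Optional


def ask_llm_for_hypotheses(memory: List[Dict]) -> List[bytes]:
    """Placeholder for the LLM call: propose candidate inputs given past attempts."""
    # A real system would serialize `memory` into the prompt; here we just mutate
    # the last candidate so the example runs without an API key.
    last = memory[-1]["input"] if memory else b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
    return [last, last[:-1] + bytes([(last[-1] + 1) % 256])]


def run_target(candidate: bytes) -> Dict:
    """Placeholder harness: report crash, coverage, and constraint satisfaction."""
    return {"crashed": candidate.endswith(b"\xff"),
            "new_coverage": True,
            "constraints_satisfied": candidate.startswith(b"\x89PNG")}


def pov_search(time_budget_s: float = 1800.0) -> Optional[bytes]:
    memory: List[Dict] = []          # persistent store of hypotheses and outcomes
    deadline = time.time() + time_budget_s
    while time.time() < deadline:
        for candidate in ask_llm_for_hypotheses(memory):
            result = run_target(candidate)
            memory.append({"input": candidate, **result})
            if result["crashed"] and result["constraints_satisfied"]:
                return candidate     # PoV found
    return None                      # budget exhausted


if __name__ == "__main__":
    pov = pov_search(time_budget_s=5.0)
    print("PoV found" if pov else "no PoV within budget")
```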
Results & Findings
| Metric | PBFuzz | Best Baseline (AFL++ + CmpLog) |
|---|---|---|
| Vulnerabilities triggered | 57 (17 exclusive) | 40 |
| Median time‑to‑exposure | 339 s | 8 680 s |
| Time budget per target | 30 min | 24 h |
| API cost per PoV | $1.83 | N/A (no LLM) |
| Success rate within budget | 84 % | 58 % |
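For context, the headline 25.6× speed‑up follows directly from the median times in the table: 8 680 s ÷ 339 s ≈ 25.6.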
Key takeaways
- Speed – By focusing on semantic constraints, PBFuzz avoids the blind exploration that dominates traditional greybox fuzzers.
- Coverage of hard bugs – The property‑based testing step preserves input structure, allowing the system to satisfy complex triggering conditions that fuzzers typically miss (a structure‑preserving generation sketch follows this list).
- Cost‑effectiveness – Even with LLM calls, the per‑vulnerability expense stays under $2, making the approach viable for continuous integration pipelines.
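As a concrete illustration of structure‑preserving generation, the sketch below mutates one field of a well‑formed, PNG‑style container while recomputing its framing, so the file stays parseable (reachability) while boundary values probe the triggering condition. The chunk layout and field choices are illustrative assumptions, not the paper's generator.

```python
import struct
import zlib
from typing import List, Optional

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype: bytes, payload: bytes, declared_len: Optional[int] = None) -> bytes:
    """Build a PNG-style chunk; optionally lie about the declared length (the triggering knob)."""
    length = len(payload) if declared_len is None else declared_len
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", length) + ctype + payload + struct.pack(">I", crc)


def structured_candidates() -> List[bytes]:
    """Keep the file well formed (reachability) while pushing one field to boundary values."""
    ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    candidates = []
    for declared in (0, 0x7FFFFFFF, 0xFFFFFFFF):   # boundary values for the length field
        candidates.append(PNG_MAGIC + ihdr +
                          make_chunk(b"iCCP", b"x" * 4, declared_len=declared))
    return candidates


for c in structured_candidates():
    print(len(c), c[:16].hex())
```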
Practical Implications
- Security‑focused CI/CD – Teams can plug PBFuzz into nightly builds to automatically generate PoVs for newly introduced code paths, catching exploitable bugs before release.
- Bug‑bounty automation – Researchers can use the tool to quickly produce PoVs for disclosed CVEs, reducing manual reverse‑engineering effort.
- Fuzzing-as-a‑Service – Cloud providers could offer an “agentic fuzzing” tier that leverages LLM reasoning to deliver faster results for high‑value targets.
- Toolchain integration – Because PBFuzz works with standard instrumentation (e.g., AFL’s coverage map) and accepts custom analysis plug‑ins, it can be layered on top of existing fuzzers without a complete rewrite; a hypothetical plug‑in interface is sketched at the end of this section.
Overall, the approach bridges the gap between human expertise and automated testing, promising a new class of “smart” fuzzers that can reason about program semantics.
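As an example of the plug‑in layering mentioned above, the sketch below shows a query interface that answers narrow, target‑specific questions (here, the callers of a function from a prebuilt call graph) so the LLM never needs the whole codebase. The `AnalysisPlugin` protocol, the `CallersOf` class, and the call‑graph contents are assumptions for illustration, not PBFuzz's published API.

```python
from typing import Dict, List, Protocol


class AnalysisPlugin(Protocol):
    """Answers narrow, target-specific questions without exposing the full codebase."""
    def query(self, question: str) -> str: ...


class CallersOf:
    """Example plug-in: report direct callers of a function from a prebuilt call graph."""
    def __init__(self, call_graph: Dict[str, List[str]]):
        self.call_graph = call_graph

    def query(self, function: str) -> str:
        callers = self.call_graph.get(function, [])
        return ", ".join(callers) or "no known callers"


# The fuzzing agent would consult registered plug-ins instead of raw source.
plugins: Dict[str, AnalysisPlugin] = {
    "callers_of": CallersOf({"png_handle_iCCP": ["png_read_info"]}),
}
print(plugins["callers_of"].query("png_handle_iCCP"))
```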
Limitations & Future Work
- LLM dependence – The quality of generated hypotheses hinges on the underlying model; edge‑case bugs may still require manual guidance.
- Static analysis precision – The lightweight analyses may miss deep data‑flow relationships, leading to incomplete constraint sets for highly obfuscated code.
- Scalability to large codebases – While the current prototype handles isolated binaries well, scaling the semantic extraction to massive applications could increase overhead.
- Future directions suggested by the authors include: tighter integration with symbolic execution for harder constraints, adaptive budgeting of LLM calls based on confidence scores, and extending the framework to multi‑process or networked targets.
Authors
- Haochen Zeng
- Andrew Bao
- Jiajun Cheng
- Chengyu Song
Paper Information
- arXiv ID: 2512.04611v1
- Categories: cs.CR, cs.SE
- Published: December 4, 2025