[Paper] PBFuzz: Agentic Directed Fuzzing for PoV Generation
Source: arXiv - 2512.04611v1
Overview
The paper introduces PBFuzz, an “agentic” directed fuzzing system that automatically generates Proof‑of‑Vulnerability (PoV) inputs. By mimicking the iterative reasoning of a human security analyst, PBFuzz speeds up the discovery of exploitable bugs, cutting median time‑to‑exposure from hours to minutes while keeping the cost of large‑language‑model (LLM) calls low.
Key Contributions
- Agentic fuzzing loop that combines code‑level reasoning, hypothesis generation, and feedback‑driven refinement, emulating how experts hunt for PoVs.
- Semantic constraint extraction via autonomous code analysis, separating reachability (getting to vulnerable code) from triggering (activating the bug); a data‑model sketch follows this list.
- Custom program‑analysis plug‑ins that feed precise, target‑specific information to the LLM without exposing the whole codebase.
- Persistent memory store that retains hypotheses across iterations, preventing “drift” and enabling cumulative learning.
- Property‑based testing engine that solves extracted constraints while preserving the original input structure, improving efficiency over raw symbolic execution.
- Empirical validation on the Magma benchmark: 57 vulnerabilities triggered (17 uniquely), 25.6× faster than AFL++ + CmpLog, with an average API cost of only $1.83 per vulnerability.
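To make the reachability/triggering split concrete, here is a minimal Python sketch of how such constraints could be represented. The `Constraint` and `TargetSpec` names and the PNG‑flavoured example are hypothetical illustrations, not PBFuzz's actual data model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Constraint:
    """A named predicate over a candidate input."""
    description: str
    check: Callable[[bytes], bool]


@dataclass
class TargetSpec:
    """What the agent needs to know about one suspected vulnerable location."""
    location: str
    reachability: List[Constraint] = field(default_factory=list)  # reach the code
    triggering: List[Constraint] = field(default_factory=list)    # make the bug fire

    def satisfied(self, candidate: bytes) -> bool:
        return all(c.check(candidate) for c in self.reachability + self.triggering)


# Hypothetical example: reaching a PNG chunk handler requires a valid signature,
# triggering it requires an oversized declared chunk length.
target = TargetSpec(
    location="png_handle_iCCP (illustrative)",
    reachability=[Constraint("starts with the PNG magic",
                             lambda b: b.startswith(b"\x89PNG\r\n\x1a\n"))],
    triggering=[Constraint("declared chunk length exceeds 0x7FFFFFFF",
                           lambda b: len(b) >= 12 and
                                     int.from_bytes(b[8:12], "big") > 0x7FFFFFFF)],
)
print(target.satisfied(b"\x89PNG\r\n\x1a\n" + b"\xff\xff\xff\xff"))
```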
Methodology
- Target selection – The system receives a binary and a list of suspected vulnerable locations (e.g., a `strcmp` check).
- Semantic analysis – Lightweight static analyses (control‑flow, data‑flow, taint) automatically infer two sets of constraints:
  - Reachability: conditions needed to execute the vulnerable instruction.
  - Triggering: conditions that cause the vulnerability to manifest (e.g., specific buffer sizes, magic values).
- Hypothesis generation – The extracted constraints are fed to an LLM that proposes concrete input “hypotheses” (partial byte sequences, format strings, etc.).
- Property‑based testing – Each hypothesis is turned into a test case that is executed against the program under a fuzzing harness. The harness checks whether the constraints are satisfied, using fast instrumentation (e.g., AFL‑style coverage).
- Feedback loop – Execution results (crash, coverage, constraint satisfaction) are stored in a persistent memory module. The LLM revisits the memory, refines or discards hypotheses, and generates new candidates.
- Termination – The loop stops when a PoV is found (crash with the targeted bug) or the time budget expires.
The whole pipeline runs autonomously, requiring only the initial target specification and a modest LLM API quota.
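The following self‑contained sketch illustrates the shape of this loop. The helpers `ask_llm_for_hypotheses` and `run_target` are stand‑in stubs (no real LLM or target is invoked) and are not PBFuzz's interfaces; they only show how hypotheses, execution feedback, and persistent memory fit together.

```python
import time
from typing import Dict, List, Optional


def ask_llm_for_hypotheses(memory: List[Dict]) -> List[bytes]:
    """Placeholder for the LLM call: propose candidate inputs given past attempts."""
    # A real system would serialize `memory` into the prompt; here we just mutate
    # the last candidate so the example runs without an API key.
    last = memory[-1]["input"] if memory else b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
    return [last, last[:-1] + bytes([(last[-1] + 1) % 256])]


def run_target(candidate: bytes) -> Dict:
    """Placeholder harness: report crash, coverage, and constraint satisfaction."""
    return {"crashed": candidate.endswith(b"\xff"),
            "new_coverage": True,
            "constraints_satisfied": candidate.startswith(b"\x89PNG")}


def pov_search(time_budget_s: float = 1800.0) -> Optional[bytes]:
    memory: List[Dict] = []          # persistent store of hypotheses and outcomes
    deadline = time.time() + time_budget_s
    while time.time() < deadline:
        for candidate in ask_llm_for_hypotheses(memory):
            result = run_target(candidate)
            memory.append({"input": candidate, **result})
            if result["crashed"] and result["constraints_satisfied"]:
                return candidate     # PoV found
    return None                      # budget exhausted


if __name__ == "__main__":
    pov = pov_search(time_budget_s=5.0)
    print("PoV found" if pov else "no PoV within budget")
```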
Results & Findings
| Metric | PBFuzz | Best Baseline (AFL++ + CmpLog) |
|---|---|---|
| Vulnerabilities triggered | 57 (17 exclusive) | 40 |
| Median time‑to‑exposure | 339 s | 8 680 s |
| Time budget per target | 30 min | 24 h |
| API cost per PoV | $1.83 | N/A (no LLM) |
| Success rate within budget | 84 % | 58 % |
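For context, the headline 25.6× speed‑up follows directly from the median times in the table: 8 680 s ÷ 339 s ≈ 25.6.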
Key takeaways
- Speed – By focusing on semantic constraints, PBFuzz avoids the blind exploration that dominates traditional greybox fuzzers.
- Coverage of hard bugs – The property‑based testing step preserves input structure, allowing the system to satisfy complex triggering conditions that fuzzers typically miss (a structure‑preserving generation sketch follows this list).
- Cost‑effectiveness – Even with LLM calls, the per‑vulnerability expense stays under $2, making the approach viable for continuous integration pipelines.
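As a concrete illustration of structure‑preserving generation, the sketch below mutates one field of a well‑formed, PNG‑style container while recomputing its framing, so the file stays parseable (reachability) while boundary values probe the triggering condition. The chunk layout and field choices are illustrative assumptions, not the paper's generator.

```python
import struct
import zlib
from typing import List, Optional

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype: bytes, payload: bytes, declared_len: Optional[int] = None) -> bytes:
    """Build a PNG-style chunk; optionally lie about the declared length (the triggering knob)."""
    length = len(payload) if declared_len is None else declared_len
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", length) + ctype + payload + struct.pack(">I", crc)


def structured_candidates() -> List[bytes]:
    """Keep the file well formed (reachability) while pushing one field to boundary values."""
    ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    candidates = []
    for declared in (0, 0x7FFFFFFF, 0xFFFFFFFF):   # boundary values for the length field
        candidates.append(PNG_MAGIC + ihdr +
                          make_chunk(b"iCCP", b"x" * 4, declared_len=declared))
    return candidates


for c in structured_candidates():
    print(len(c), c[:16].hex())
```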
Practical Implications
- Security‑focused CI/CD – Teams can plug PBFuzz into nightly builds to automatically generate PoVs for newly introduced code paths, catching exploitable bugs before release.
- Bug‑bounty automation – Researchers can use the tool to quickly produce PoVs for disclosed CVEs, reducing manual reverse‑engineering effort.
- Fuzzing-as-a‑Service – Cloud providers could offer an “agentic fuzzing” tier that leverages LLM reasoning to deliver faster results for high‑value targets.
- Toolchain integration – Because PBFuzz works with standard instrumentation (e.g., AFL’s coverage map) and accepts custom analysis plug‑ins, it can be layered on top of existing fuzzers without a complete rewrite; a hypothetical plug‑in interface is sketched at the end of this section.
Overall, the approach bridges the gap between human expertise and automated testing, promising a new class of “smart” fuzzers that can reason about program semantics.
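As an example of the plug‑in layering mentioned above, the sketch below shows a query interface that answers narrow, target‑specific questions (here, the callers of a function from a prebuilt call graph) so the LLM never needs the whole codebase. The `AnalysisPlugin` protocol, the `CallersOf` class, and the call‑graph contents are assumptions for illustration, not PBFuzz's published API.

```python
from typing import Dict, List, Protocol


class AnalysisPlugin(Protocol):
    """Answers narrow, target-specific questions without exposing the full codebase."""
    def query(self, question: str) -> str: ...


class CallersOf:
    """Example plug-in: report direct callers of a function from a prebuilt call graph."""
    def __init__(self, call_graph: Dict[str, List[str]]):
        self.call_graph = call_graph

    def query(self, function: str) -> str:
        callers = self.call_graph.get(function, [])
        return ", ".join(callers) or "no known callers"


# The fuzzing agent would consult registered plug-ins instead of raw source.
plugins: Dict[str, AnalysisPlugin] = {
    "callers_of": CallersOf({"png_handle_iCCP": ["png_read_info"]}),
}
print(plugins["callers_of"].query("png_handle_iCCP"))
```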
Limitations & Future Work
- LLM dependence – The quality of generated hypotheses hinges on the underlying model; edge‑case bugs may still require manual guidance.
- Static analysis precision – The lightweight analyses may miss deep data‑flow relationships, leading to incomplete constraint sets for highly obfuscated code.
- Scalability to large codebases – While the current prototype handles isolated binaries well, scaling the semantic extraction to massive applications could increase overhead.
- Future directions suggested by the authors include: tighter integration with symbolic execution for harder constraints, adaptive budgeting of LLM calls based on confidence scores, and extending the framework to multi‑process or networked targets.
Authors
- Haochen Zeng
- Andrew Bao
- Jiajun Cheng
- Chengyu Song
Paper Information
- arXiv ID: 2512.04611v1
- Categories: cs.CR, cs.SE
- Published: December 4, 2025