[Paper] ChainFuzzer: Greybox Fuzzing for Workflow-Level Multi-Tool Vulnerabilities in LLM Agents
Source: arXiv:2603.12614v1
Overview
Large Language Model (LLM) agents are increasingly being equipped with a toolbox of external utilities—search, code execution, file handling, etc.—to tackle complex, real‑world tasks. While this multi‑tool orchestration boosts capability, it also creates hidden data‑flow pathways where the output of one tool becomes the input of another, opening the door to workflow‑level vulnerabilities that traditional single‑tool testing can’t catch.
The paper introduces ChainFuzzer, a greybox fuzzing framework that automatically discovers, reproduces, and documents these multi‑tool attack chains in LLM agents.
Key Contributions
- Formalization of multi‑tool vulnerabilities in LLM agents, focusing on source‑to‑sink data flows that span several tool invocations.
- Chain extraction engine that builds high‑impact tool‑chain candidates from static analysis of tool dependencies, achieving >96 % edge precision.
- Trace‑guided Prompt Solving (TPS) – a prompt‑generation technique that reliably steers the LLM to follow a target chain (reachability ↑ from 27 % to 95 %).
- Guardrail‑aware fuzzing that mutates payloads while respecting LLM safety filters, boosting trigger rates from 18 % to 89 %.
- Large‑scale empirical evaluation on 20 open‑source LLM‑agent applications (≈ 1 000 tools), uncovering 365 reproducible bugs, 302 of which require multi‑tool execution.
Methodology
Static Dependency Mining
- The framework parses each tool’s API schema (input/output types) and builds a directed graph of possible data flows.
- Edges that connect a source (e.g., user‑provided string) to a sink (e.g., command execution) are flagged as high‑impact.
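The dependency-mining step can be illustrated with a minimal sketch. It assumes each tool is described only by the data types it accepts and produces; the `Tool` class, the type labels, and the three example tools are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset        # data types the tool accepts (hypothetical labels)
    outputs: frozenset       # data types the tool produces
    is_source: bool = False  # receives attacker-influenced data
    is_sink: bool = False    # performs a sensitive operation


def build_graph(tools):
    """Add an edge a -> b when an output type of a matches an input type of b."""
    graph = {t.name: set() for t in tools}
    for a in tools:
        for b in tools:
            if a.name != b.name and a.outputs & b.inputs:
                graph[a.name].add(b.name)
    return graph


TOOLS = [
    Tool("web_search", frozenset({"query"}), frozenset({"text"}), is_source=True),
    Tool("write_file", frozenset({"text"}), frozenset({"path"})),
    Tool("run_shell", frozenset({"path", "command"}), frozenset({"stdout"}), is_sink=True),
]

graph = build_graph(TOOLS)
for name, successors in graph.items():
    print(name, "->", sorted(successors))
```

Matching on type labels alone over-approximates real data flow, which is one reason a later dynamic stage is needed to confirm that a chain is actually reachable.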
Candidate Chain Generation
- Using the graph, ChainFuzzer extracts plausible tool chains that satisfy a strict source‑to‑sink ordering.
- Implausible chains are pruned; the chains that remain reach 91.50 % strict‑chain precision in the evaluation.
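One way to picture chain extraction is a bounded depth-first search over the dependency graph. The sketch below is an assumption about how such a search could look, not the paper's algorithm; the graph, the `max_len` bound, and the tool names are hypothetical.

```python
def enumerate_chains(graph, sources, sinks, max_len=4):
    """Collect simple paths that start at a source and end at a sink."""
    chains = []

    def dfs(path):
        node = path[-1]
        if node in sinks and len(path) > 1:
            chains.append(tuple(path))
        if len(path) == max_len:
            return  # length bound keeps the search finite and the chains short
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # never revisit a tool within one chain
                dfs(path + [nxt])

    for src in sorted(sources):
        dfs([src])
    return chains


# Hypothetical dependency graph: tool name -> tools that can consume its output.
GRAPH = {
    "web_search": {"write_file", "summarize"},
    "summarize": {"write_file"},
    "write_file": {"run_shell"},
    "run_shell": set(),
}

chains = enumerate_chains(GRAPH, sources={"web_search"}, sinks={"run_shell"})
for chain in chains:
    print(" -> ".join(chain))
```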
Trace‑Guided Prompt Solving (TPS)
- For each candidate chain, the system records a trace of the desired tool calls.
- TPS then asks the LLM to produce a prompt that, when fed to the agent, reproduces that trace.
- The process iterates until the generated prompt consistently triggers the full chain.
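The iterate-until-stable loop described above can be sketched as follows. Both `run_agent` and `refine_prompt` are hypothetical stand-ins: a real setup would call the agent under test and an LLM-based solver, and the keyword matching here exists only to make the example runnable.

```python
def run_agent(prompt):
    """Hypothetical agent stub: returns the ordered tool calls a prompt triggers."""
    trace = []
    if "search" in prompt:
        trace.append("web_search")
    if "save" in prompt:
        trace.append("write_file")
    if "execute" in prompt:
        trace.append("run_shell")
    return trace


def refine_prompt(prompt, target, observed):
    """Hypothetical solver stub: append a hint for the first tool the trace missed."""
    hints = {
        "web_search": "search the web",
        "write_file": "save the result to a file",
        "run_shell": "execute the saved file",
    }
    for i, tool in enumerate(target):
        if i >= len(observed) or observed[i] != tool:
            return f"{prompt} Then {hints[tool]}."
    return prompt


def solve_prompt(target, seed, max_rounds=10, stable_runs=3):
    """Refine the prompt until several consecutive runs reproduce the target trace."""
    prompt = seed
    for _ in range(max_rounds):
        traces = [run_agent(prompt) for _ in range(stable_runs)]
        if all(t == list(target) for t in traces):
            return prompt
        prompt = refine_prompt(prompt, target, traces[0])
    return None  # no stable prompt found within the round budget


TARGET = ("web_search", "write_file", "run_shell")
prompt = solve_prompt(TARGET, "Please help with my task.")
print(prompt)
```

Requiring several matching runs (`stable_runs`) reflects the stability requirement: a prompt that reaches the chain only occasionally is a poor base for later payload mutation.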
Guardrail‑Aware Fuzzing
- Once a stable prompt is obtained, ChainFuzzer mutates the payload (e.g., injecting shell‑escape characters) while monitoring LLM guardrails (content filters, refusal detectors).
- Specialized oracles check whether the final sink behaved maliciously (e.g., executed a command).
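A toy version of the mutate, filter, and check loop is sketched below. Everything in it is an assumption made for illustration: the guardrail is a simple keyword filter, the sink only builds a command string and never executes it, and the oracle looks for a harmless marker command.

```python
import re

MARKER = "CHAINFUZZ_MARKER"


def guardrail_blocks(payload):
    """Hypothetical guardrail: rejects payloads with obvious shell separators."""
    return ";" in payload or "&&" in payload


def simulated_sink(payload):
    """Hypothetical vulnerable sink: concatenates input into a command string.
    Nothing is executed; the string is returned for inspection."""
    return "cat " + payload


def oracle(command):
    """Report a hit when the marker command appears as its own shell statement."""
    parts = re.split(r"[;\n|]|&&", command)
    return any(part.strip() == f"echo {MARKER}" for part in parts)


# Each mutator wraps the seed command in a different injection shape.
MUTATORS = [
    lambda p: f"notes.txt; {p}",
    lambda p: f"notes.txt && {p}",
    lambda p: f"notes.txt\n{p}",
    lambda p: f"notes.txt | {p}",
]


def fuzz(seed):
    findings = []
    for mutate in MUTATORS:
        payload = mutate(seed)
        if guardrail_blocks(payload):
            continue  # rejected by the guardrail; move on to the next mutation
        if oracle(simulated_sink(payload)):
            findings.append(payload)
    return findings


findings = fuzz(f"echo {MARKER}")
print(findings)
```

In this toy run the two payloads that use `;` and `&&` are filtered out, while the newline and pipe variants reach the sink, which shows why mutation has to account for what the guardrail rejects.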
Evidence Collection
- Successful runs are logged with:
  - the full prompt,
  - the mutated payload, and
  - a step‑by‑step execution trace.
- This provides auditable proof of the vulnerability.
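The three logged items could be captured in a small serializable record, sketched below. The field names and example values are hypothetical; the paper's actual log format is not specified in this summary.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    tool: str
    arguments: dict
    output: str


@dataclass
class Evidence:
    prompt: str                                # full prompt that drives the chain
    payload: str                               # mutated payload that triggered the sink
    trace: list = field(default_factory=list)  # ordered TraceStep entries

    def to_json(self):
        # asdict() converts nested dataclasses inside the list as well.
        return json.dumps(asdict(self), indent=2)


evidence = Evidence(
    prompt="Read notes.txt and print its contents.",
    payload="notes.txt\necho CHAINFUZZ_MARKER",
    trace=[
        TraceStep("read_file", {"path": "notes.txt"}, "file contents"),
        TraceStep("run_shell", {"command": "cat notes.txt\necho CHAINFUZZ_MARKER"},
                  "CHAINFUZZ_MARKER"),
    ],
)

record = evidence.to_json()
print(record)
```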
Results & Findings
| Metric | Value |
|---|---|
| Candidate tool chains extracted | 2,388 |
| Stable prompts synthesized | 2,213 |
| Reproducible vulnerabilities found | 365 (19/20 apps) |
| Vulnerabilities requiring multi‑tool execution | 302 |
| Edge precision (tool‑graph) | 96.49 % |
| Strict chain precision | 91.50 % |
| Chain reachability after TPS | 95.45 % (↑ from 27.05 %) |
| Payload trigger rate after guardrail‑aware fuzzing | 88.60 % (↑ from 18.20 %) |
| Vulnerabilities per 1 M tokens processed | 3.02 |
Interpretation
- Most of the bugs (302 of 365, about 83 %) are invisible to single‑hop testing; they become exploitable only when tool calls are stitched together into multi‑step chains.
- TPS (Trace‑guided Prompt Solving) dramatically improves the ability to drive the agent along the intended path, raising chain reachability from 27.05 % to 95.45 %.
- Guardrail‑aware fuzzing shows that even LLM safety filters can be bypassed when the attack is distributed across multiple tools, boosting the payload trigger rate from 18.20 % to 88.60 %.
Practical Implications
- Security‑testing pipelines – Developers of LLM‑agent platforms can integrate ChainFuzzer (or its concepts) into CI/CD workflows to automatically hunt for hidden workflow bugs before release.
- Tool‑design guidelines – The high‑impact source‑to‑sink patterns identified (e.g., unvalidated file content → shell execution) provide concrete “hardening” rules: sanitize outputs before they become inputs to downstream tools.
- Guardrail evaluation – The study shows that existing LLM content filters are insufficient against multi‑step attacks. Vendors should augment guardrails with cross‑tool context awareness.
- Incident response – The auditable evidence (prompt + trace) generated by ChainFuzzer can be used to quickly reproduce and patch a vulnerability, reducing mean‑time‑to‑remediation.
- Compliance & auditing – For regulated industries (finance, healthcare), the framework offers a systematic way to demonstrate that an LLM‑agent’s toolchain has been security‑tested, supporting audit requirements.
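The hardening rule from the tool-design guideline (sanitize one tool's output before it becomes another tool's input) can be illustrated with a short sketch. It uses Python's standard `shlex` module; the function names and the untrusted value are hypothetical, and no command is executed.

```python
import shlex


def build_command_unsafe(filename):
    """Vulnerable pattern: upstream tool output is pasted into a shell string."""
    return "cat " + filename


def build_command_quoted(filename):
    """Hardened pattern: the value is quoted so the shell sees a single argument."""
    return "cat " + shlex.quote(filename)


# Value as it might arrive from an upstream tool (hypothetical).
untrusted = "notes.txt; echo INJECTED"

unsafe = build_command_unsafe(untrusted)
quoted = build_command_quoted(untrusted)

print(unsafe)               # the ';' would start a second shell statement
print(shlex.split(quoted))  # the whole value stays one argument
```

Where possible, passing an argument list to `subprocess.run` without `shell=True` avoids shell parsing altogether, which is usually preferable to quoting.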
Limitations & Future Work
Limitations
- Tool‑coverage bias – The evaluation focuses on open‑source agents with relatively well‑documented APIs; proprietary or dynamically generated tools may evade static dependency extraction.
- Prompt stability – TPS works well for the tested models, but prompt brittleness can increase with larger, more stochastic LLMs, requiring additional stabilization heuristics.
- Guardrail modeling – The current fuzzing assumes static guardrails; adaptive or learning‑based filters could change behavior mid‑execution, demanding more sophisticated oracle designs.
- Scalability to massive agent ecosystems – While roughly 1 000 tools were handled, scaling to tens of thousands may need graph‑partitioning or sampling strategies.
Future Research Directions
- Extend the framework to runtime monitoring (detecting malicious chains in production).
- Integrate formal verification of tool contracts.
- Explore defense mechanisms such as automated input‑sanitization pipelines that are aware of cross‑tool data flows.
Authors
- Jiangrong Wu
- Zitong Yao
- Yuhong Nan
- Zibin Zheng
Paper Information
| Field | Details |
|---|---|
| arXiv ID | 2603.12614v1 |
| Categories | cs.SE, cs.CR |
| Published | March 13, 2026 |