[Paper] ChainFuzzer: Greybox Fuzzing for Workflow-Level Multi-Tool Vulnerabilities in LLM Agents

Published: March 12, 2026 at 11:35 PM EDT

Source: arXiv

Overview

Large Language Model (LLM) agents are increasingly equipped with a toolbox of external utilities (search, code execution, file handling, etc.) to tackle complex, real‑world tasks. While this multi‑tool orchestration boosts capability, it also creates hidden data‑flow pathways where the output of one tool becomes the input of another, opening the door to workflow‑level vulnerabilities that traditional single‑tool testing can’t catch.

The paper introduces ChainFuzzer, a greybox fuzzing framework that automatically discovers, reproduces, and documents these multi‑tool attack chains in LLM agents.

Key Contributions

Formalization of multi‑tool vulnerabilities in LLM agents, focusing on source‑to‑sink data flows that span several tool invocations.
Chain extraction engine that builds high‑impact tool‑chain candidates from static analysis of tool dependencies, achieving >96 % edge precision.
Trace‑guided Prompt Solving (TPS) – a prompt‑generation technique that reliably steers the LLM to follow a target chain (reachability ↑ from 27 % to 95 %).
Guardrail‑aware fuzzing that mutates payloads while respecting LLM safety filters, boosting trigger rates from 18 % to 89 %.
Large‑scale empirical evaluation on 20 open‑source LLM‑agent applications (≈ 1 000 tools), uncovering 365 reproducible bugs, 302 of which require multi‑tool execution.

Methodology

Static Dependency Mining
- The framework parses each tool’s API schema (input/output types) and builds a directed graph of possible data flows.
- Edges that connect a source (e.g., user‑provided string) to a sink (e.g., command execution) are flagged as high‑impact.
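
The dependency-mining step above can be illustrated with a minimal sketch. It assumes tools are described only by the type names they accept and produce, and that an edge exists whenever one tool's output type matches another tool's input type. The `Tool` class, the three tool names, and the type labels are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset          # type names the tool accepts (assumed schema)
    outputs: frozenset         # type names the tool produces (assumed schema)
    is_source: bool = False    # receives attacker-influenced data
    is_sink: bool = False      # performs a sensitive action


def build_graph(tools):
    """Add an edge a -> b when some output type of a is an input type of b."""
    edges = {t.name: set() for t in tools}
    for a in tools:
        for b in tools:
            if a.name != b.name and a.outputs & b.inputs:
                edges[a.name].add(b.name)
    return edges


def high_impact_edges(tools, edges):
    """Return the edges that lead directly from a source tool to a sink tool."""
    by_name = {t.name: t for t in tools}
    return sorted(
        (a, b)
        for a, targets in edges.items()
        for b in targets
        if by_name[a].is_source and by_name[b].is_sink
    )


# Hypothetical toolbox used only for illustration.
TOOLS = [
    Tool("fetch_url", frozenset({"url"}), frozenset({"text"}), is_source=True),
    Tool("summarize", frozenset({"text"}), frozenset({"text"})),
    Tool("run_shell", frozenset({"text"}), frozenset({"exit_code"}), is_sink=True),
]
GRAPH = build_graph(TOOLS)
FLAGGED = high_impact_edges(TOOLS, GRAPH)
```

Type matching on schemas is a deliberately coarse heuristic here; a real analysis would likely need richer signals to reach the edge precision reported in the paper.
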
Candidate Chain Generation
- Using the graph, ChainFuzzer extracts plausible tool chains that satisfy a strict source‑to‑sink ordering.
- Chains are pruned to keep the candidate set precise; the evaluation reports 91.50 % strict‑chain precision.
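
One way to picture candidate chain generation is a bounded depth-first walk over the tool graph that starts at a source and stops at the first sink. The sketch below is an assumption about how such an enumeration could look; the graph, the tool names, and the `max_len` bound are hypothetical, and the paper's actual pruning criteria are not reproduced.

```python
def enumerate_chains(graph, sources, sinks, max_len=4):
    """Return simple paths that start at a source and end at the first sink reached."""
    chains = []

    def walk(path):
        node = path[-1]
        if node in sinks and len(path) > 1:
            chains.append(list(path))
            return  # a chain ends at its sink
        if len(path) == max_len:
            return  # bound the search depth
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # never revisit a tool within one chain
                walk(path + [nxt])

    for source in sorted(sources):
        walk([source])
    return chains


# Hypothetical graph: keys are tools, values are downstream tools.
GRAPH = {
    "fetch_url": {"summarize", "run_shell"},
    "summarize": {"run_shell"},
    "run_shell": set(),
}
CHAINS = enumerate_chains(GRAPH, sources={"fetch_url"}, sinks={"run_shell"})
```

For this toy graph the walk yields a two-tool chain and a three-tool chain, both ending at the sink.
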
Trace‑Guided Prompt Solving (TPS)
- For each candidate chain, the system records a trace of the desired tool calls.
- TPS then asks the LLM to produce a prompt that, when fed to the agent, reproduces that trace.
- The process iterates until the generated prompt consistently triggers the full chain.
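
The iterate-until-stable loop described for TPS can be sketched as follows. This is not the authors' algorithm: `toy_proposer` stands in for the LLM that writes prompts, `toy_agent` stands in for the agent under test, and both are hypothetical stubs so that the loop runs on its own.

```python
def solve_prompt(target, propose, run_agent, max_rounds=5, repeats=3):
    """Ask for prompts until one reproduces the target trace on every repeat."""
    feedback = None
    for _ in range(max_rounds):
        prompt = propose(target, feedback)
        traces = [run_agent(prompt) for _ in range(repeats)]
        if all(trace == target for trace in traces):
            return prompt  # stable: every run followed the full chain
        # Feed the first deviating trace back to the proposer.
        feedback = next(trace for trace in traces if trace != target)
    return None


def toy_agent(prompt):
    """Stub agent: 'calls' each known tool named in the prompt, in order of mention."""
    known = ["fetch_url", "summarize", "run_shell"]
    hits = sorted((prompt.find(name), name) for name in known if name in prompt)
    return [name for _, name in hits]


def toy_proposer(target, feedback):
    """Stub LLM: the first attempt is vague, later attempts spell out every step."""
    if feedback is None:
        return f"Use {target[0]} and handle the result."
    return " then ".join(f"call {name}" for name in target)


TARGET = ["fetch_url", "summarize", "run_shell"]
PROMPT = solve_prompt(TARGET, toy_proposer, toy_agent)
```

Repeating each prompt several times is one simple way to approximate the "consistently triggers" condition when the agent is nondeterministic; the stub agent here is deterministic, so the repeats are only illustrative.
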
Guardrail‑Aware Fuzzing
- Once a stable prompt is obtained, ChainFuzzer mutates the payload (e.g., injecting shell‑escape characters) while monitoring LLM guardrails (content filters, refusal detectors).
- Specialized oracles check whether the final sink behaved maliciously (e.g., executed a command).
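
A toy version of the mutate, filter, and check cycle might look like the sketch below. Everything in it is an assumption made for illustration: the guardrail is a one-line stub, the sink only pretends to split text into commands, and the payload carries a harmless canary marker so the oracle has something to detect.

```python
import re

CANARY = "echo CANARY_7f3a"  # harmless marker command (hypothetical)
SEPARATORS = [";", "&&", "\n"]  # candidate mutations of the command separator


def guardrail_blocks(text):
    """Stub filter (assumption): refuses any text containing a bare ';'."""
    return ";" in text


def toy_sink(text):
    """Stub sink: splits text on shell-like separators and returns the 'commands'."""
    return [part.strip() for part in re.split(r";|&&|\n", text) if part.strip()]


def oracle(commands):
    """Report success only if the canary reached the sink as its own command."""
    return CANARY in commands


def fuzz(benign_arg):
    """Try each separator; skip payloads the guardrail refuses."""
    for sep in SEPARATORS:
        payload = f"{benign_arg} {sep} {CANARY}"
        if guardrail_blocks(payload):
            continue
        if oracle(toy_sink(payload)):
            return payload
    return None


FOUND = fuzz("report.txt")
```

Checking for a canary at the sink, instead of inspecting the agent's text reply, is what makes the verdict an observation of sink behavior.
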
Evidence Collection
- Successful runs are logged with:
  - the full prompt,
  - the mutated payload, and
  - a step‑by‑step execution trace.
- This provides auditable proof of the vulnerability.
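
The three logged items could be captured in a small serializable record, as in the sketch below. The field names and the JSON layout are assumptions; the summary does not specify ChainFuzzer's actual log format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    prompt: str                                  # full prompt sent to the agent
    payload: str                                 # mutated payload that triggered the sink
    trace: list = field(default_factory=list)    # ordered tool calls

    def to_json(self):
        """Serialize with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical record for illustration only.
record = Evidence(
    prompt="call fetch_url then call run_shell",
    payload="report.txt && echo CANARY_7f3a",
    trace=[
        {"step": 1, "tool": "fetch_url", "output": "report.txt && echo CANARY_7f3a"},
        {"step": 2, "tool": "run_shell", "output": "CANARY_7f3a"},
    ],
)
serialized = record.to_json()
```
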

Results & Findings

Metric	Value
Candidate tool chains extracted	2,388
Stable prompts synthesized	2,213
Reproducible vulnerabilities found	365 (19/20 apps)
Vulnerabilities requiring multi‑tool execution	302
Edge precision (tool‑graph)	96.49 %
Strict chain precision	91.50 %
Chain reachability after TPS	95.45 % (↑ from 27.05 %)
Payload trigger rate after guardrail‑aware fuzzing	88.60 % (↑ from 18.20 %)
Vulnerabilities per 1 M tokens processed	3.02
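
A few ratios follow directly from the table and help put the numbers in proportion. The short calculation below derives them; these derived figures are not reported in the summary itself, and the variable names are only labels.

```python
# Values copied from the results table above.
candidate_chains = 2388
stable_prompts = 2213
vulnerabilities = 365
multi_tool_vulnerabilities = 302

multi_tool_share = multi_tool_vulnerabilities / vulnerabilities   # fraction needing several tools
prompt_yield = stable_prompts / candidate_chains                  # chains with a stable prompt
reach_gain = 95.45 / 27.05                                        # reachability with TPS vs. without
trigger_gain = 88.60 / 18.20                                      # trigger rate with vs. without guardrail-aware fuzzing

print(f"multi-tool share: {multi_tool_share:.1%}")   # 82.7%
print(f"prompt yield:     {prompt_yield:.1%}")       # 92.7%
print(f"reach gain:       {reach_gain:.2f}x")        # 3.53x
print(f"trigger gain:     {trigger_gain:.2f}x")      # 4.87x
```

Roughly four out of five reported bugs therefore need more than one tool, which is the basis for the first point in the interpretation that follows.
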

Interpretation

The majority of bugs are invisible to single‑hop testing; they become exploitable only when tool calls are stitched together into multi‑step chains.
TPS (Trace‑guided Prompt Solving) makes the agent far more likely to follow the intended path, raising chain reachability from 27.05 % to 95.45 %.
Guardrail‑aware fuzzing shows that even LLM safety filters can be bypassed when the attack is distributed across multiple tools, boosting the payload trigger rate from 18.20 % to 88.60 %.

Practical Implications

Security‑testing pipelines – Developers of LLM‑agent platforms can integrate ChainFuzzer (or its concepts) into CI/CD workflows to automatically hunt for hidden workflow bugs before release.
Tool‑design guidelines – The high‑impact source‑to‑sink patterns identified (e.g., unvalidated file content → shell execution) provide concrete “hardening” rules: sanitize outputs before they become inputs to downstream tools.
Guardrail evaluation – The study shows that existing LLM content filters are insufficient against multi‑step attacks. Vendors should augment guardrails with cross‑tool context awareness.
Incident response – The auditable evidence (prompt + trace) generated by ChainFuzzer can be used to quickly reproduce and patch a vulnerability, reducing mean‑time‑to‑remediation.
Compliance & auditing – For regulated industries (finance, healthcare), the framework offers a systematic way to demonstrate that an LLM‑agent’s toolchain has been security‑tested, supporting audit requirements.
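
The tool-design guideline above (sanitize outputs before they become inputs to downstream tools) can be made concrete with a small sketch. It assumes a downstream tool that launches a subprocess: passing the upstream output as a single argument-list element means no shell ever parses it, and `shlex.quote` is shown for cases where a shell command string cannot be avoided. The helper name and the sample value are hypothetical.

```python
import shlex
import subprocess
import sys


def run_with_untrusted_arg(value):
    """Pass upstream tool output as one argv element; no shell interprets it."""
    script = "import sys; print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", script, value],  # list form, shell=False by default
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# Hypothetical output of an upstream tool containing a command separator.
untrusted = "report.txt; echo CANARY_7f3a"

echoed = run_with_untrusted_arg(untrusted)   # the child sees the text as plain data
quoted = shlex.quote(untrusted)              # safe to embed in a POSIX shell string
```

In this sketch the separator arrives at the child process as ordinary text, so the second command is never run. Note that `shlex.quote` targets POSIX shells only.
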

Limitations & Future Work

Limitations

Tool‑coverage bias – The evaluation focuses on open‑source agents with relatively well‑documented APIs; proprietary or dynamically generated tools may evade static dependency extraction.
Prompt stability – TPS works well for the tested models, but prompt brittleness can increase with larger, more stochastic LLMs, requiring additional stabilization heuristics.
Guardrail modeling – The current fuzzing assumes static guardrails; adaptive or learning‑based filters could change behavior mid‑execution, demanding more sophisticated oracle designs.
Scalability to massive agent ecosystems – While 1 000 tools were handled, scaling to tens of thousands may need graph‑partitioning or sampling strategies.

Future Research Directions

Extend the framework to runtime monitoring (detecting malicious chains in production).
Integrate formal verification of tool contracts.
Explore defense mechanisms such as automated input‑sanitization pipelines that are aware of cross‑tool data flows.

Authors

Jiangrong Wu
Zitong Yao
Yuhong Nan
Zibin Zheng

Paper Information

Field	Details
arXiv ID	`2603.12614v1`
Categories	`cs.SE`, `cs.CR`
Published	March 13, 2026

