[Paper] Coverage-Guided Multi-Agent Harness Generation for Java Library Fuzzing

Published: March 9, 2026
Source: arXiv - 2603.08616v1

Overview

The paper introduces an automated system that builds fuzz‑testing harnesses for Java libraries using a team of specialized LLM‑driven agents. By letting these agents collaborate to understand documentation, synthesize code, fix compilation issues, and focus coverage on the exact method under test, the approach removes the tedious manual work that usually blocks library‑level fuzzing.

Key Contributions

  • Multi‑agent architecture: Five ReAct agents (research, synthesis, compilation repair, coverage analysis, refinement) work together to generate, compile, and polish harnesses.
  • Model Context Protocol: Agents query documentation, source files, and call‑graph data on demand, keeping the LLM’s context small and relevant.
  • Method‑targeted coverage metric: Coverage is measured only while the target API method runs, isolating its behavior from surrounding setup code.
  • Agent‑guided termination: The system stops refinement when uncovered code no longer yields useful fuzzing opportunities, avoiding wasted compute.
  • Empirical validation: Tested on 7 methods from 6 popular Java libraries (≈115 k Maven dependents), achieving a median 26 % coverage improvement over OSS‑Fuzz baselines, compared with 5 % for Jazzer AutoFuzz.
  • Cost‑effective: Average generation cost of $3.20 and ~10 minutes per harness, making it viable for CI pipelines.
  • Real‑world bug discovery: In a 12‑hour fuzzing run, the generated harnesses uncovered 3 previously unknown bugs in projects already part of OSS‑Fuzz.
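The method‑targeted coverage metric above can be sketched as a gating mechanism: coverage probes only record edges while the target API method is executing, so setup and teardown code contribute nothing. This is our simplified illustration of the idea, not the paper's implementation; the probe/wrapper names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class TargetedCoverage {
    private static final Set<Integer> edges = new HashSet<>();
    private static boolean inTarget = false;

    // Called by (hypothetical) instrumentation at every branch edge.
    public static void probe(int edgeId) {
        if (inTarget) edges.add(edgeId);
    }

    // Wraps the call to the method under test; probes fired outside
    // this wrapper are ignored by the metric.
    public static void callTarget(Runnable target) {
        inTarget = true;
        try {
            target.run();
        } finally {
            inTarget = false;
        }
    }

    public static int coveredEdges() {
        return edges.size();
    }

    public static void main(String[] args) {
        probe(1);                                   // setup code: not counted
        callTarget(() -> { probe(2); probe(3); });  // target method: counted
        System.out.println(coveredEdges());         // prints 2
    }
}
```

The payoff of the gate is the clean improvement signal: a harness that exercises more of the target method scores higher even if it simplifies its setup code.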

Methodology

  1. Agent Roles

    • Research Agent: Pulls API docs, Javadoc, and call‑graph snippets to understand required object construction and preconditions.
    • Synthesis Agent: Writes the initial harness code (object creation, method call, input handling) using the gathered context.
    • Compilation‑Repair Agent: Attempts to compile the harness, then iteratively patches syntax or type errors reported by the Java compiler.
    • Coverage‑Analysis Agent: Executes the harness under a coverage‑instrumented fuzzer, records method‑targeted coverage, and flags uncovered lines.
    • Refinement Agent: Uses the uncovered‑code report to request additional context from the Research Agent and iterates synthesis/repair until marginal gains disappear.
  2. Model Context Protocol

    • Instead of feeding the entire library to the LLM, agents request focused snippets (e.g., a constructor signature or a specific exception contract).
    • This keeps the prompt size low, reduces hallucinations, and speeds up inference.
  3. Fuzzing Loop

    • The generated harness is handed to Jazzer (the Java fuzzing engine).
    • Coverage is filtered to count only instructions executed inside the target method, providing a clean signal for the agents to improve the harness.
  4. Termination Logic

    • When the Coverage‑Analysis Agent sees that additional uncovered lines are outside the target method or belong to boilerplate code, the Refinement Agent stops further iterations.
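A harness produced by the Synthesis Agent follows Jazzer's static `fuzzerTestOneInput(byte[])` entry-point convention. As a minimal sketch, the harness below targets a JDK method (`LocalDate.parse`) so it compiles without any library under test; a real generated harness would instead call the library API identified by the Research Agent, and the `main` driver exists only so the sketch runs without Jazzer.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class LocalDateParseHarness {
    // Jazzer invokes this entry point with fuzzer-generated bytes.
    public static void fuzzerTestOneInput(byte[] data) {
        String input = new String(data, StandardCharsets.UTF_8);
        try {
            LocalDate.parse(input); // the target method under fuzzing
        } catch (DateTimeParseException expected) {
            // Malformed input is a documented outcome, not a bug;
            // any other exception would surface as a fuzzer finding.
        }
    }

    // Stand-in driver so the sketch executes without the fuzzing engine.
    public static void main(String[] args) {
        fuzzerTestOneInput("2026-03-09".getBytes(StandardCharsets.UTF_8));
        fuzzerTestOneInput("not a date".getBytes(StandardCharsets.UTF_8));
        System.out.println("harness executed");
    }
}
```

Catching only the documented exception is the part the agents must get right: swallowing too much hides bugs, while catching too little floods the campaign with expected failures.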
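The on-demand context idea from step 2 can be illustrated as a lookup service: an agent issues one small, focused query (a constructor signature, an exception contract) and receives only that snippet, never the whole library. The class, query keys, and snippets below are entirely hypothetical, chosen to show the request/response shape rather than the paper's actual protocol messages.

```java
import java.util.Map;

public class ContextServer {
    // Pretend index from focused queries to focused snippets; a real
    // server would resolve these against docs, sources, and call graphs.
    private static final Map<String, String> INDEX = Map.of(
        "constructor:com.example.Parser",
        "public Parser(java.nio.charset.Charset cs)",
        "javadoc:com.example.Parser#parse",
        "@throws ParseException if the input is malformed");

    // One tool call: small request in, small snippet out, keeping the
    // LLM prompt limited to what the current step actually needs.
    public static String lookup(String query) {
        return INDEX.getOrDefault(query, "<not indexed>");
    }

    public static void main(String[] args) {
        System.out.println(lookup("constructor:com.example.Parser"));
    }
}
```

Because each response is a few lines instead of a few thousand, prompts stay small, which is what the paper credits for fewer hallucinations and faster inference.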

Results & Findings

| Metric | Baseline (OSS‑Fuzz) | Jazzer AutoFuzz | Proposed System |
| --- | --- | --- | --- |
| Median coverage improvement (target method) | 0 % | +5 % | +26 % |
| Generation time per harness | – | – | ~10 min |
| Cost per harness (LLM inference) | – | – | $3.20 |
| Bugs found in 12‑hour campaign | – | – | 3 new bugs |

  • The method‑targeted coverage metric proved crucial: it prevented agents from over‑optimizing surrounding setup code that does not affect the fuzzed API.
  • Agent‑guided termination saved ~30 % of compilation‑repair cycles compared with a naïve “keep trying until timeout” strategy.
  • The generated harnesses were compatible with existing CI fuzzing pipelines, requiring only a single command to plug into OSS‑Fuzz.
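The reported ~30 % savings come from stopping on diminishing returns rather than on a timeout. As our simplification of that rule (the threshold and the coverage trace are invented for illustration), the loop below halts the first time an iteration's coverage gain drops below a minimum:

```java
public class RefinementLoop {
    // Runs through per-iteration coverage totals and returns how many
    // refinement iterations execute before the stopping rule fires.
    public static int refine(int[] coveragePerIteration, int minGain) {
        int iterations = 0;
        int covered = 0;
        for (int newCoverage : coveragePerIteration) {
            int gain = newCoverage - covered;
            iterations++;
            covered = newCoverage;
            if (gain < minGain) {
                break; // marginal gains disappeared: stop refining
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Hypothetical coverage after each iteration: big gains, then a plateau.
        int[] trace = {40, 55, 58, 59, 59};
        System.out.println(refine(trace, 3)); // prints 4
    }
}
```

A timeout-based loop would have run all five iterations here; the gain-based rule skips the final no-progress cycle, which is the kind of waste the paper's agent-guided termination avoids.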

Practical Implications

  • Continuous Integration: Teams can automatically generate harnesses for newly added or changed public APIs, keeping fuzzing coverage up‑to‑date without manual effort.
  • Library Maintainers: Open‑source projects can ship ready‑to‑use harnesses alongside their releases, lowering the barrier for downstream users to adopt fuzzing.
  • Security Audits: Security engineers can spin up targeted fuzzing campaigns on critical methods (e.g., deserialization, crypto primitives) within minutes, accelerating vulnerability discovery.
  • Cost Management: At a few dollars per harness, the approach scales to large monorepos or ecosystems (e.g., all Maven Central libraries) without blowing budgets.
  • Tooling Integration: The architecture can be wrapped as a CLI or GitHub Action that calls the LLM service, compiles the harness, and registers it with Jazzer or other Java fuzzers.

Limitations & Future Work

  • LLM Dependency: The quality of generated harnesses hinges on the underlying language model; cheaper models may produce more compilation errors, increasing repair cycles.
  • Documentation Gaps: When APIs lack comprehensive Javadoc or examples, the Research Agent may miss subtle preconditions, leading to incomplete harnesses.
  • Scope of Evaluation: Experiments covered only seven methods; broader studies across diverse library categories (e.g., UI, networking) are needed to confirm generality.
  • Cross‑Language Extensions: The current design is Java‑centric; adapting the protocol for other JVM languages or native libraries would require additional agents and tooling.
  • Dynamic Dependency Resolution: Some libraries load native binaries at runtime; handling such cases remains an open challenge.

Future work could explore hybrid agents that combine LLM reasoning with static analysis tools, expand the benchmark suite, and integrate cost‑aware scheduling to prioritize high‑impact APIs in large ecosystems.

Authors

  • Nils Loose
  • Nico Winkel
  • Kristoffer Hempel
  • Felix Mächtle
  • Julian Hans
  • Thomas Eisenbarth

Paper Information

  • arXiv ID: 2603.08616v1
  • Categories: cs.SE, cs.CR
  • Published: March 9, 2026
