[Paper] HarnessAgent: Scaling Automatic Fuzzing Harness Construction with Tool-Augmented LLM Pipelines

Published: December 2, 2025 at 10:55 PM EST
4 min read

Source: arXiv - 2512.03420v1

Overview

The paper introduces HarnessAgent, an automated pipeline that combines large language models (LLMs) with traditional software‑engineering tools to generate fuzzing harnesses for thousands of functions in open‑source projects. By tackling the missing‑context problem and filtering out “syntactically‑correct but semantically‑useless” code, HarnessAgent pushes LLM‑driven fuzzing from proof‑of‑concept to a scalable, production‑ready workflow.

Key Contributions

  • Rule‑based compilation error mitigation – a lightweight static‑analysis layer that detects and rewrites common build‑time failures before they stop the pipeline.
  • Hybrid tool‑pool for symbol retrieval – integrates clang‑based indexers, language‑server protocols, and repository‑wide grep utilities to locate the exact source of a target function with >90 % success (a minimal sketch of the fallback chain follows this list).
  • Enhanced harness validation – a multi‑stage checker that spots “fake” definitions (e.g., placeholder stubs that only satisfy the LLM’s validation metric) and rejects them early.
  • Empirical evaluation on OSS‑Fuzz – tested on 243 target functions (65 from C projects and 178 from C++ projects), achieving an 87 % three‑shot success rate for C and 81 % for C++ (≈20 % improvement over prior art).
  • Real‑world fuzzing impact – >75 % of generated harnesses increased coverage of the target function in a one‑hour fuzzing run, outperforming baselines by >10 %.
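
The tool‑pool can be pictured as a simple fallback chain: query the precise indexer first, and fall back to coarser text search only when it misses. The C++ sketch below illustrates that idea under loud assumptions: `clangd-query` is a hypothetical stand‑in for an LSP definition request, and shelling out via POSIX `popen` to `readtags` (the ctags query tool) and `git grep` is an illustrative choice, not the paper's implementation.

```cpp
#include <array>
#include <cstdio>
#include <optional>
#include <string>

// Run a shell command (POSIX popen) and capture the first line of output.
static std::optional<std::string> first_line_of(const std::string& cmd) {
    std::array<char, 4096> buf{};
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return std::nullopt;
    const bool got = fgets(buf.data(), buf.size(), pipe) != nullptr;
    pclose(pipe);
    if (!got) return std::nullopt;
    return std::string(buf.data());
}

// Fallback chain: precise AST-level lookup first, coarse text search last.
// "clangd-query" is a placeholder for an LSP textDocument/definition call.
std::optional<std::string> locate_symbol(const std::string& symbol) {
    if (auto hit = first_line_of("clangd-query --definition " + symbol))
        return hit;                                      // AST-precise match
    if (auto hit = first_line_of("readtags -t tags " + symbol))
        return hit;                                      // fast ctags index
    return first_line_of("git grep -nw " + symbol +
                         " -- '*.c' '*.cc' '*.h'");      // last-resort search
}
```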

Methodology

  1. Target Selection – Functions are sampled from OSS‑Fuzz projects, focusing on internal (non‑public) APIs that lack ready‑made harnesses.
  2. Context Enrichment
    • A rule engine parses compilation errors (missing headers, undefined symbols, type mismatches) and automatically injects fixes such as adding #includes or typedef stubs; a minimal rule sketch appears after this list.
    • A hybrid retrieval stack queries:
      • Clangd/LSP for precise AST‑level symbol locations,
      • csearch/ctags for fast coarse‑grained matches, and
      • git‑grep as a fallback.
  3. LLM Prompting – The enriched context (function signature, surrounding code snippets, usage examples) is fed to a state‑of‑the‑art LLM (e.g., GPT‑4), which emits a candidate harness in C/C++; an example harness appears after this list.
  4. Validation Pipeline
    • Static checks (compiles with clang, runs clang‑tidy).
    • Dynamic sanity check – executes the harness on a minimal input to ensure the target function is actually invoked.
    • Fake‑definition detector – compares the generated stub against the retrieved symbol to catch placeholder code that merely satisfies surface‑level validation checks (a heuristic sketch appears after this list).
  5. Iterative Refinement – If any check fails, the pipeline loops back, providing the LLM with error feedback until the harness passes all stages or a timeout is reached.
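
To make step 2's rule engine concrete, the sketch below handles one common error class, clang's missing‑header diagnostic, and maps it to an #include to inject. The regex and the single‑rule structure are assumptions for illustration, not the paper's actual rule set.

```cpp
#include <iostream>
#include <optional>
#include <regex>
#include <string>

// Hypothetical rewrite rule: map clang's missing-header diagnostic, e.g.
//   harness.c:3:10: fatal error: 'openssl/ssl.h' file not found
// to an #include line that the pipeline prepends to the harness.
std::optional<std::string> fix_for(const std::string& diagnostic) {
    static const std::regex missing_header(
        R"(fatal error: '([^']+)' file not found)");
    std::smatch m;
    if (std::regex_search(diagnostic, m, missing_header))
        return "#include \"" + m[1].str() + "\"";
    return std::nullopt;  // unrecognized error class: defer to LLM feedback
}

int main() {
    const std::string diag =
        "harness.c:3:10: fatal error: 'openssl/ssl.h' file not found";
    if (auto fix = fix_for(diag))
        std::cout << "inject: " << *fix << "\n";  // inject: #include "openssl/ssl.h"
}
```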
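The candidate harness emitted in step 3 is typically a libFuzzer‑style entry point. Below is a minimal hand‑written example for a hypothetical internal function `png_parse_chunk`; the target name and signature are invented for illustration, and a real generated harness would also set up whatever state the target needs around the call.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical internal target. In practice its declaration comes from the
// project's headers (recovered by the retrieval stack), and the definition
// is linked in from the project's object files, never re-implemented here.
extern "C" int png_parse_chunk(const uint8_t* data, size_t len);

// libFuzzer entry point: hand the fuzzer's byte buffer to the target.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    if (size == 0) return 0;      // skip the degenerate empty input
    png_parse_chunk(data, size);  // exercise the target function
    return 0;                     // return values other than 0 are reserved
}
```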
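One way to approximate step 4's fake‑definition detector is a source‑level heuristic: flag harness files that define a body for the target symbol instead of declaring it and linking against the project's real definition. The sketch below is an assumed heuristic of that kind, not the paper's checker, which compares against the retrieved symbol.

```cpp
#include <regex>
#include <string>

// Assumed heuristic: a harness containing "target(...) {" is supplying its
// own (likely placeholder) body for the target function, whereas a genuine
// harness only declares it ("target(...);") and links against the project's
// object files. `target` must be regex-escaped before real use.
bool looks_like_fake_definition(const std::string& harness_src,
                                const std::string& target) {
    const std::regex defines(target + R"(\s*\([^;{]*\)\s*\{)");
    return std::regex_search(harness_src, defines);
}
```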

Results & Findings

| Metric | C projects | C++ projects |
| --- | --- | --- |
| Three‑shot success rate (harness compiles & validates) | 87 % | 81 % |
| Coverage gain (≥1 % increase in target‑function coverage after 1 h of fuzzing) | >75 % of generated harnesses | >75 % of generated harnesses |
| Source‑code retrieval hit rate | >90 % (vs. 58 % for Fuzz Introspector) | |
| Overall improvement vs. prior art | +20 % success, +10 % coverage boost | +20 % success, +10 % coverage boost |

Key takeaways

  • The rule‑based error fixer eliminates the majority of build‑breakers that previously forced manual intervention.
  • Combining multiple retrieval tools dramatically reduces “missing symbol” failures, a common bottleneck for internal functions.
  • The fake‑definition filter prevents the LLM from gaming the validation metric, leading to genuinely executable harnesses.

Practical Implications

  • Faster security testing pipelines – Teams can auto‑generate harnesses for legacy codebases without hand‑crafting them, cutting weeks of manual effort down to hours.
  • Broader OSS‑Fuzz participation – Projects that previously lacked harnesses can now be onboarded automatically, expanding the surface area of continuous fuzzing.
  • Developer tooling – The rule‑engine and hybrid retrieval stack can be repurposed as IDE plugins that suggest missing includes or stub implementations on the fly.
  • Cost reduction – Higher‑quality harnesses mean fuzzers spend more time exercising real logic rather than hitting dead ends, improving bug‑finding ROI.
  • LLM safety – The validation pipeline showcases a practical pattern for guarding LLM outputs against superficial correctness tricks, a lesson applicable to code‑generation assistants beyond fuzzing.

Limitations & Future Work

  • Language scope – The current implementation targets C and C++; extending to Rust, Go, or Java would require language‑specific retrieval and compilation fixes.
  • LLM dependence – While the pipeline mitigates many LLM quirks, it still relies on a powerful model; performance may degrade with smaller, open‑source LLMs.
  • Dynamic analysis depth – Validation only checks that the target function is called, not that the generated inputs meaningfully explore edge cases; integrating coverage‑guided input synthesis is a next step.
  • Scalability to millions of functions – The study evaluated 243 functions; future work will stress‑test the system on entire codebases to assess runtime and resource footprints.

Overall, HarnessAgent demonstrates that a carefully orchestrated blend of traditional tooling and LLMs can make automatic fuzzing harness generation a practical, scalable reality for modern software development.

Authors

  • Kang Yang
  • Yunhang Zhang
  • Zichuan Li
  • Guanhong Tao
  • Jun Xu
  • Xiaojing Liao

Paper Information

  • arXiv ID: 2512.03420v1
  • Categories: cs.CR, cs.SE
  • Published: December 3, 2025