[Paper] LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer

Published: February 26, 2026 at 09:53 AM EST
5 min read
Source: arXiv


Overview

Deep learning frameworks such as PyTorch, TensorFlow, and MindSpore power everything from research prototypes to production‑grade AI services. While existing fuzzers can easily surface crashes, they miss the far more insidious silent bugs—errors that don’t crash the program but silently corrupt model results. The paper introduces TransFuzz, a novel approach that harnesses large language models (LLMs) to “transfer” bug‑finding knowledge from historic issue reports into new, targeted test cases, enabling systematic discovery of silent bugs across DL libraries.

Key Contributions

  • LLM‑driven bug pattern extraction – parses historical issue tickets to learn context‑aware descriptions of silent bugs.
  • Functionality‑based API matching – uses semantic embeddings to locate APIs in other libraries that are behaviorally similar to the buggy ones.
  • Automated test‑case synthesis with custom oracles – generates inputs and checks (oracles) that can detect subtle misbehaviors rather than just crashes.
  • Self‑validation module – an LLM‑powered step that automatically verifies whether a transferred bug instance is plausible before fuzzing.
  • TransFuzz prototype – evaluated on PyTorch, TensorFlow, and MindSpore, uncovering 79 previously unknown bugs, including 12 CVEs spanning 10 bug categories.
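To make the functionality-based API matching concrete, here is a minimal, self-contained sketch of the idea: represent each API by a vector built from its documentation and pick the nearest candidate by cosine similarity. Everything here is illustrative — the toy bag-of-words "embedding", the example doc strings, and the API names are assumptions standing in for the paper's learned semantic embeddings, not the authors' actual pipeline.

```python
from collections import Counter
import math

def embed(doc: str) -> Counter:
    """Toy bag-of-words 'embedding' of an API's documentation string."""
    return Counter(doc.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical doc strings: a buggy source API and candidate targets in another framework.
source_doc = "pads a tensor along given dimensions with a constant value"
candidates = {
    "tf.pad": "pads a tensor along specified dimensions with constant values",
    "tf.reshape": "reshapes a tensor to a new shape without changing data",
}

src_vec = embed(source_doc)
best = max(candidates, key=lambda name: cosine(src_vec, embed(candidates[name])))
print(best)  # the candidate whose documentation overlaps most with the source API's
```

In TransFuzz the vectors are derived from documentation, type signatures, and example code together, so matches reflect behavioral similarity rather than mere word overlap — but the nearest-neighbor selection step works the same way.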

Methodology

  1. Mining historic bug reports – The system crawls issue trackers (GitHub, mailing lists, etc.) and feeds each report to an LLM (e.g., GPT‑4). The model extracts a bug pattern: the API involved, the erroneous condition, and the oracle that would expose the bug.
  2. Embedding APIs – Every public API in the target DL library is represented by a vector derived from its documentation, type signatures, and example code. These vectors capture functional similarity.
  3. Controlled bug transfer – For a given historic bug pattern, the system finds the nearest APIs in the embedding space. It then asks the LLM to rewrite the pattern for the new API, preserving the logical flaw while adapting argument names, data shapes, etc.
  4. Test case generation – The LLM produces concrete Python snippets that invoke the target API with realistic tensors and embed the custom oracle (e.g., “output should be numerically identical to a reference implementation”).
  5. Self‑validation – Before launching a fuzzing campaign, the LLM checks the generated test for syntactic correctness and logical consistency, discarding dubious transfers.
  6. Fuzzing loop – The validated tests are fed to a coverage‑guided fuzzer that mutates input tensors. Whenever the oracle signals a deviation, the bug is recorded for manual triage.

The whole pipeline is fully automated, requiring only the initial set of historic bug reports as seed data.
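Steps 4–6 above hinge on differential oracles: because silent bugs produce no crash, the only signal is a numerical deviation from a trusted reference. The sketch below illustrates that idea with a deliberately buggy toy `mean` function in place of a real DL API; the function names and the simple random-input loop are illustrative assumptions, not the paper's coverage-guided fuzzer.

```python
import random

def reference_mean(xs):
    """Trusted reference implementation used as the oracle's ground truth."""
    return sum(xs) / len(xs)

def target_mean(xs):
    """Hypothetical function under test with a silent off-by-one bug:
    it drops the last element, so it never crashes but returns wrong values."""
    return sum(xs[:-1]) / (len(xs) - 1)

def oracle(xs, tol=1e-9):
    """Differential oracle: True when target agrees with the reference within tol."""
    return abs(target_mean(xs) - reference_mean(xs)) <= tol

# Minimal stand-in for the mutation loop: try many random inputs,
# record every case where the oracle detects a silent deviation.
random.seed(0)
failures = []
for _ in range(100):
    xs = [random.uniform(-10, 10) for _ in range(random.randint(2, 8))]
    if not oracle(xs):
        failures.append(xs)

print(f"oracle flagged {len(failures)} of 100 inputs")
```

A crash-only fuzzer would report nothing here — every call returns cleanly — which is exactly the gap the custom oracles close.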

Results & Findings

| Library | Bugs discovered | Confirmed CVEs | Bug types covered |
| --- | --- | --- | --- |
| PyTorch | 31 | 5 | shape‑mismatch, precision loss, gradient‑incorrectness, etc. |
| TensorFlow | 28 | 4 | dtype conversion errors, silent overflow, memory‑leak bugs |
| MindSpore | 20 | 3 | broadcasting mistakes, optimizer state drift |
  • 79 new bugs were reported; 12 were accepted as CVEs by the respective vendors.
  • The silent‑bug detection rate was ~3× higher than a state‑of‑the‑art DL fuzzer that only looks for crashes.
  • The self‑validation step filtered out ~22 % of generated transfers, reducing wasted fuzzing time without sacrificing recall.

Practical Implications

  • Proactive security testing – Developers can integrate TransFuzz into CI pipelines to catch silent correctness regressions before release, complementing traditional crash‑only fuzzers.
  • Cross‑library robustness – Because bug patterns are transferred across frameworks, a fix discovered in PyTorch can quickly surface analogous issues in TensorFlow or MindSpore, accelerating patch propagation.
  • Reduced manual effort – The LLM handles the heavy lifting of oracle design, which is usually the bottleneck for silent‑bug fuzzing. Teams can focus on triage and remediation rather than writing bespoke checks.
  • Better model reliability – Detecting silent numerical or gradient errors helps prevent downstream model drift, which is critical for regulated domains (healthcare, finance, autonomous systems).

Limitations & Future Work

  • LLM dependence – The quality of bug transfer hinges on the LLM’s ability to understand API semantics; inaccurate rewrites can generate false positives or miss subtle bugs.
  • Scalability of embeddings – Embedding every public API works for the three evaluated libraries, but scaling to massive, rapidly evolving codebases may require incremental or hierarchical embedding strategies.
  • Oracle expressiveness – Current oracles are primarily equality‑ or tolerance‑based checks; more complex properties (e.g., probabilistic guarantees) remain out of scope.
  • Future directions suggested by the authors include:
    1. Fine‑tuning LLMs on domain‑specific code to improve pattern fidelity.
    2. Extending the approach to other ML ecosystems (e.g., JAX, ONNX).
    3. Exploring hybrid static‑dynamic analyses to prune the search space further.

Authors

  • Kunpeng Zhang
  • Dongwei Xiao
  • Daoyuan Wu
  • Jiali Zhao
  • Yuanyi Lin
  • Tongtong Xu
  • Shaohua Wang
  • Shuai Wang

Paper Information

  • arXiv ID: 2602.23065v1
  • Categories: cs.SE
  • Published: February 26, 2026