[Paper] LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer

Published: February 26, 2026 at 09:53 AM EST
5 min read
Source: arXiv


Overview

Deep learning frameworks such as PyTorch, TensorFlow, and MindSpore power everything from research prototypes to production‑grade AI services. While existing fuzzers can easily surface crashes, they miss the far more insidious silent bugs—errors that don’t crash the program but silently corrupt model results. The paper introduces TransFuzz, a novel approach that harnesses large language models (LLMs) to “transfer” bug‑finding knowledge from historic issue reports into new, targeted test cases, enabling systematic discovery of silent bugs across DL libraries.

Key Contributions

  • LLM‑driven bug pattern extraction – parses historical issue tickets to learn context‑aware descriptions of silent bugs.
  • Functionality‑based API matching – uses semantic embeddings to locate APIs in other libraries that are behaviorally similar to the buggy ones.
  • Automated test‑case synthesis with custom oracles – generates inputs and checks (oracles) that can detect subtle misbehaviors rather than just crashes.
  • Self‑validation module – an LLM‑powered step that automatically verifies whether a transferred bug instance is plausible before fuzzing.
  • TransFuzz prototype – evaluated on PyTorch, TensorFlow, and MindSpore, uncovering 79 previously unknown bugs, including 12 CVEs spanning 10 bug categories.
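To make the functionality-based API matching concrete, here is a minimal, self-contained sketch of the idea: represent each API by a vector built from its documentation and pick the nearest candidate by cosine similarity. Everything here is illustrative — the toy bag-of-words "embedding", the example doc strings, and the API names are assumptions standing in for the paper's learned semantic embeddings, not the authors' actual pipeline.

```python
from collections import Counter
import math

def embed(doc: str) -> Counter:
    """Toy bag-of-words 'embedding' of an API's documentation string."""
    return Counter(doc.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical doc strings: a buggy source API and candidate targets in another framework.
source_doc = "pads a tensor along given dimensions with a constant value"
candidates = {
    "tf.pad": "pads a tensor along specified dimensions with constant values",
    "tf.reshape": "reshapes a tensor to a new shape without changing data",
}

src_vec = embed(source_doc)
best = max(candidates, key=lambda name: cosine(src_vec, embed(candidates[name])))
print(best)  # the candidate whose documentation overlaps most with the source API's
```

In TransFuzz the vectors are derived from documentation, type signatures, and example code together, so matches reflect behavioral similarity rather than mere word overlap — but the nearest-neighbor selection step works the same way.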

Methodology

  1. Mining historic bug reports – The system crawls issue trackers (GitHub, mailing lists, etc.) and feeds each report to an LLM (e.g., GPT‑4). The model extracts a bug pattern: the API involved, the erroneous condition, and the oracle that would expose the bug.
  2. Embedding APIs – Every public API in the target DL library is represented by a vector derived from its documentation, type signatures, and example code. These vectors capture functional similarity.
  3. Controlled bug transfer – For a given historic bug pattern, the system finds the nearest APIs in the embedding space. It then asks the LLM to rewrite the pattern for the new API, preserving the logical flaw while adapting argument names, data shapes, etc.
  4. Test case generation – The LLM produces concrete Python snippets that invoke the target API with realistic tensors and embed the custom oracle (e.g., “output should be numerically identical to a reference implementation”).
  5. Self‑validation – Before launching a fuzzing campaign, the LLM checks the generated test for syntactic correctness and logical consistency, discarding dubious transfers.
  6. Fuzzing loop – The validated tests are fed to a coverage‑guided fuzzer that mutates input tensors. Whenever the oracle signals a deviation, the bug is recorded for manual triage.

The whole pipeline is fully automated, requiring only the initial set of historic bug reports as seed data.
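Steps 4–6 above hinge on differential oracles: because silent bugs produce no crash, the only signal is a numerical deviation from a trusted reference. The sketch below illustrates that idea with a deliberately buggy toy `mean` function in place of a real DL API; the function names and the simple random-input loop are illustrative assumptions, not the paper's coverage-guided fuzzer.

```python
import random

def reference_mean(xs):
    """Trusted reference implementation used as the oracle's ground truth."""
    return sum(xs) / len(xs)

def target_mean(xs):
    """Hypothetical function under test with a silent off-by-one bug:
    it drops the last element, so it never crashes but returns wrong values."""
    return sum(xs[:-1]) / (len(xs) - 1)

def oracle(xs, tol=1e-9):
    """Differential oracle: True when target agrees with the reference within tol."""
    return abs(target_mean(xs) - reference_mean(xs)) <= tol

# Minimal stand-in for the mutation loop: try many random inputs,
# record every case where the oracle detects a silent deviation.
random.seed(0)
failures = []
for _ in range(100):
    xs = [random.uniform(-10, 10) for _ in range(random.randint(2, 8))]
    if not oracle(xs):
        failures.append(xs)

print(f"oracle flagged {len(failures)} of 100 inputs")
```

A crash-only fuzzer would report nothing here — every call returns cleanly — which is exactly the gap the custom oracles close.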

Results & Findings

| Library | Bugs discovered | Confirmed CVEs | Bug types covered |
| --- | --- | --- | --- |
| PyTorch | 31 | 5 | shape‑mismatch, precision loss, gradient‑incorrectness, etc. |
| TensorFlow | 28 | 4 | dtype conversion errors, silent overflow, memory‑leak bugs |
| MindSpore | 20 | 3 | broadcasting mistakes, optimizer state drift |
  • 79 new bugs were reported; 12 were accepted as CVEs by the respective vendors.
  • The silent‑bug detection rate was ~3× higher than a state‑of‑the‑art DL fuzzer that only looks for crashes.
  • The self‑validation step filtered out ~22 % of generated transfers, reducing wasted fuzzing time without sacrificing recall.

Practical Implications

  • Proactive security testing – Developers can integrate TransFuzz into CI pipelines to catch silent correctness regressions before release, complementing traditional crash‑only fuzzers.
  • Cross‑library robustness – Because bug patterns are transferred across frameworks, a fix discovered in PyTorch can quickly surface analogous issues in TensorFlow or MindSpore, accelerating patch propagation.
  • Reduced manual effort – The LLM handles the heavy lifting of oracle design, which is usually the bottleneck for silent‑bug fuzzing. Teams can focus on triage and remediation rather than writing bespoke checks.
  • Better model reliability – Detecting silent numerical or gradient errors helps prevent downstream model drift, which is critical for regulated domains (healthcare, finance, autonomous systems).

Limitations & Future Work

  • LLM dependence – The quality of bug transfer hinges on the LLM’s ability to understand API semantics; inaccurate rewrites can generate false positives or miss subtle bugs.
  • Scalability of embeddings – Embedding every public API works for the three evaluated libraries, but scaling to massive, rapidly evolving codebases may require incremental or hierarchical embedding strategies.
  • Oracle expressiveness – Current oracles are primarily equality‑ or tolerance‑based checks; more complex properties (e.g., probabilistic guarantees) remain out of scope.
  • Future directions suggested by the authors include:
    1. Fine‑tuning LLMs on domain‑specific code to improve pattern fidelity.
    2. Extending the approach to other ML ecosystems (e.g., JAX, ONNX).
    3. Exploring hybrid static‑dynamic analyses to prune the search space further.

Authors

  • Kunpeng Zhang
  • Dongwei Xiao
  • Daoyuan Wu
  • Jiali Zhao
  • Yuanyi Lin
  • Tongtong Xu
  • Shaohua Wang
  • Shuai Wang

Paper Information

  • arXiv ID: 2602.23065v1
  • Categories: cs.SE
  • Published: February 26, 2026