[Paper] ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs

Published: May 1, 2026 at 01:19 AM EDT
4 min read

Source: arXiv - 2605.00413v1

Overview

The paper presents ClozeMaster, a novel technique that leverages large language models (LLMs) to generate realistic Rust programs for fuzz‑testing the Rust compiler. By masking and “infilling” snippets taken from real bug reports, the authors achieve a high yield of valid, bug‑triggering test cases—something that pure LLM generation has struggled with.

Key Contributions

  • ClozeMask strategy: a bracket‑based masking scheme that extracts code fragments from historical Rust compiler issues, masks them, and asks an LLM to fill the gaps, preserving syntactic and semantic plausibility.
  • CLOZEMASTER prototype: an end‑to‑end pipeline that automates issue mining, mask creation, LLM‑driven infilling, compilation, and bug detection for both rustc and the alternative mrustc.
  • Empirical validation: discovery of 27 previously unknown compiler bugs (10 already fixed) and a measurable boost in code‑coverage compared with state‑of‑the‑art Rust fuzzers (e.g., cargo‑fuzz, AFL‑rs).
  • Open‑source artifacts: the authors release the masking scripts, prompts, and a dataset of masked snippets, enabling reproducibility and community extension.
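As a rough illustration of the bracket‑based masking idea, the sketch below replaces one region of a Rust snippet with the `[[MASK]]` token and builds an infilling prompt. The mask token matches the paper's notation, but the region‑selection logic, offsets, and prompt wording here are illustrative assumptions, not the authors' implementation.

```python
MASK = "[[MASK]]"  # bracket token from the paper; everything else is illustrative


def mask_region(source: str, start: int, end: int) -> tuple[str, str]:
    """Replace source[start:end] with the mask token.

    Returns the masked program and the original (ground-truth) fragment.
    """
    original = source[start:end]
    masked = source[:start] + MASK + source[end:]
    return masked, original


# A snippet resembling one extracted from a compiler bug report.
snippet = 'fn main() { let x: u8 = 255u8.wrapping_add(1); println!("{x}"); }'

# Mask the initializer expression (offsets chosen by hand for this example;
# the real pipeline picks "interesting" regions such as macro calls or unsafe blocks).
start = snippet.index("255u8")
end = snippet.index(";", start)
masked, ground_truth = mask_region(snippet, start, end)

# The masked program is then embedded in a prompt asking the LLM to infill.
prompt = f"Fill in {MASK} so that the Rust program compiles:\n{masked}"
print(masked)
```

Each LLM completion for the mask yields a new candidate program that is structurally close to a real bug‑triggering input.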

Methodology

  1. Issue Mining – Crawl the Rust compiler’s GitHub issue tracker and collect patches that fixed bugs.
  2. Snippet Extraction – Identify “interesting” code regions (e.g., macro invocations, unsafe blocks) and replace each one with a special bracket token [[MASK]].
  3. LLM Infilling – Feed the masked program plus a concise prompt to a powerful LLM (e.g., GPT‑4). The model generates plausible replacements for each [[MASK]].
  4. Validation & Filtering – Run the generated program through the compiler; discard programs that fail to parse, or that compile without exercising the targeted language features.
  5. Bug Detection – Execute the compiled program under various compiler flags and runtime sanitizers; any crash, assertion failure, or mis‑compilation is reported as a candidate bug.

The pipeline is fully automated, allowing thousands of masked programs to be turned into test cases with minimal human oversight.
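Steps 4 and 5 hinge on classifying each compiler invocation. The hedged sketch below shows one way to bucket outcomes; the `internal compiler error` marker and negative‑returncode check reflect common rustc behavior, but the category names, timeout, and flag handling are assumptions rather than the paper's exact logic.

```python
import subprocess

ICE_MARKER = "internal compiler error"  # rustc prints this on an ICE


def classify_run(returncode: int, stderr: str) -> str:
    """Bucket a single compiler invocation for the filtering step.

    - 'candidate_bug': the compiler itself crashed or reported an ICE
    - 'valid': the program compiled cleanly and proceeds to execution
    - 'reject': an ordinary compile error; the infilled program is discarded
    """
    if ICE_MARKER in stderr or returncode < 0:  # negative = killed by a signal
        return "candidate_bug"
    if returncode == 0:
        return "valid"
    return "reject"


def try_compile(path: str, flags: list[str]) -> str:
    """Compile one generated program under a given flag set (e.g. opt levels)."""
    proc = subprocess.run(
        ["rustc", *flags, path],
        capture_output=True, text=True, timeout=60,
    )
    return classify_run(proc.returncode, proc.stderr)
```

In the full pipeline, each surviving program would be compiled under several flag sets (and under mrustc), with any crash, assertion failure, or output divergence reported as a candidate bug.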

Results & Findings

| Metric | CLOZEMASTER | Baseline fuzzers |
|---|---|---|
| Unique bugs found (rustc) | 27 (10 fixed) | 12 |
| Code coverage (lines) | ↑ 23 % vs. cargo‑fuzz | — |
| Valid test‑case ratio (compiles) | 78 % | 41 % |
| Time to first bug (hours) | 4.2 | 9.7 |
  • Bug diversity: The discovered bugs span parsing edge‑cases, macro expansion, unsafe‑code handling, and even optimizer mis‑behaviors.
  • Higher coverage: Because the masked snippets come from real‑world bug fixes, the generated programs naturally explore under‑tested language constructs.
  • Efficiency: The LLM‑infilling step adds only a few seconds per mask, making the overall pipeline comparable in speed to traditional fuzzers while delivering richer test inputs.

Practical Implications

  • For compiler developers – CLOZEMASTER can be integrated into nightly CI pipelines to continuously harvest new test cases from the ever‑growing issue backlog, catching regressions early.
  • For Rust library authors – The technique can be repurposed to generate edge‑case usage patterns of a library’s API, helping spot unsafe or undefined‑behavior scenarios before release.
  • For security teams – Because many bugs involve unsafe blocks or memory‑layout assumptions, the generated tests can surface potentially security‑relevant vulnerabilities.
  • Tooling ecosystem – The masking‑infilling paradigm is language‑agnostic; other systems (e.g., Swift, Zig) could adopt a similar pipeline to boost their compiler fuzzing efforts without building massive grammar‑aware generators from scratch.

Limitations & Future Work

  • Dependence on existing bug reports – The approach works best when a rich history of issue patches is available; nascent languages may lack sufficient seed material.
  • LLM hallucination risk – Occasionally the model inserts code that compiles but does not meaningfully exercise the intended feature, leading to false positives that must be filtered.
  • Prompt engineering overhead – Tuning prompts for different compiler versions or LLM providers requires manual effort.
  • Future directions suggested by the authors include: (1) extending the mask granularity to whole functions or modules, (2) exploring smaller, open‑source LLMs for cost‑effective scaling, and (3) applying reinforcement learning to guide the LLM toward generating higher‑impact test cases.

Authors

  • Hongyan Gao
  • Yibiao Yang
  • Maolin Sun
  • Jiangchang Wu
  • Yuming Zhou
  • Baowen Xu

Paper Information

  • arXiv ID: 2605.00413v1
  • Categories: cs.SE
  • Published: May 1, 2026
  • PDF: Download PDF