[Paper] ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs

Published: May 1, 2026 at 01:19 AM EDT
4 min read

Source: arXiv - 2605.00413v1

Overview

The paper presents ClozeMaster, a novel technique that leverages large language models (LLMs) to generate realistic Rust programs for fuzz‑testing the Rust compiler. By masking and “infilling” snippets taken from real bug reports, the authors achieve a high yield of valid, bug‑triggering test cases—something that pure LLM generation has struggled with.

Key Contributions

  • ClozeMask strategy: a bracket‑based masking scheme that extracts code fragments from historical Rust compiler issues, masks them, and asks an LLM to fill the gaps, preserving syntactic and semantic plausibility.
  • CLOZEMASTER prototype: an end‑to‑end pipeline that automates issue mining, mask creation, LLM‑driven infilling, compilation, and bug detection for both rustc and the alternative mrustc.
  • Empirical validation: discovery of 27 previously unknown compiler bugs (10 already fixed) and a measurable boost in code‑coverage compared with state‑of‑the‑art Rust fuzzers (e.g., cargo‑fuzz, AFL‑rs).
  • Open‑source artifacts: the authors release the masking scripts, prompts, and a dataset of masked snippets, enabling reproducibility and community extension.
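As a rough illustration of the bracket‑based masking idea, the sketch below replaces one region of a Rust snippet with the `[[MASK]]` token and builds an infilling prompt. The mask token matches the paper's notation, but the region‑selection logic, offsets, and prompt wording here are illustrative assumptions, not the authors' implementation.

```python
MASK = "[[MASK]]"  # bracket token from the paper; everything else is illustrative


def mask_region(source: str, start: int, end: int) -> tuple[str, str]:
    """Replace source[start:end] with the mask token.

    Returns the masked program and the original (ground-truth) fragment.
    """
    original = source[start:end]
    masked = source[:start] + MASK + source[end:]
    return masked, original


# A snippet resembling one extracted from a compiler bug report.
snippet = 'fn main() { let x: u8 = 255u8.wrapping_add(1); println!("{x}"); }'

# Mask the initializer expression (offsets chosen by hand for this example;
# the real pipeline picks "interesting" regions such as macro calls or unsafe blocks).
start = snippet.index("255u8")
end = snippet.index(";", start)
masked, ground_truth = mask_region(snippet, start, end)

# The masked program is then embedded in a prompt asking the LLM to infill.
prompt = f"Fill in {MASK} so that the Rust program compiles:\n{masked}"
print(masked)
```

Each LLM completion for the mask yields a new candidate program that is structurally close to a real bug‑triggering input.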

Methodology

  1. Issue Mining – Crawl the Rust compiler’s GitHub issue tracker and collect patches that fixed bugs.
  2. Snippet Extraction – Identify “interesting” code regions (e.g., macro invocations, unsafe blocks) and replace each one with a special bracket token [[MASK]].
  3. LLM Infilling – Feed the masked program plus a concise prompt to a powerful LLM (e.g., GPT‑4). The model generates plausible replacements for each [[MASK]].
  4. Validation & Filtering – Run the generated program through the compiler; discard programs that fail to parse, or that compile without exercising the targeted language features.
  5. Bug Detection – Execute the compiled program under various compiler flags and runtime sanitizers; any crash, assertion failure, or mis‑compilation is reported as a candidate bug.

The pipeline is fully automated, allowing thousands of masked programs to be turned into test cases with minimal human oversight.
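Steps 4 and 5 hinge on classifying each compiler invocation. The hedged sketch below shows one way to bucket outcomes; the `internal compiler error` marker and negative‑returncode check reflect common rustc behavior, but the category names, timeout, and flag handling are assumptions rather than the paper's exact logic.

```python
import subprocess

ICE_MARKER = "internal compiler error"  # rustc prints this on an ICE


def classify_run(returncode: int, stderr: str) -> str:
    """Bucket a single compiler invocation for the filtering step.

    - 'candidate_bug': the compiler itself crashed or reported an ICE
    - 'valid': the program compiled cleanly and proceeds to execution
    - 'reject': an ordinary compile error; the infilled program is discarded
    """
    if ICE_MARKER in stderr or returncode < 0:  # negative = killed by a signal
        return "candidate_bug"
    if returncode == 0:
        return "valid"
    return "reject"


def try_compile(path: str, flags: list[str]) -> str:
    """Compile one generated program under a given flag set (e.g. opt levels)."""
    proc = subprocess.run(
        ["rustc", *flags, path],
        capture_output=True, text=True, timeout=60,
    )
    return classify_run(proc.returncode, proc.stderr)
```

In the full pipeline, each surviving program would be compiled under several flag sets (and under mrustc), with any crash, assertion failure, or output divergence reported as a candidate bug.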

Results & Findings

| Metric | CLOZEMASTER | Baseline fuzzers |
|---|---|---|
| Unique bugs found (rustc) | 27 (10 fixed) | 12 |
| Code coverage (lines) | ↑ 23 % vs. cargo‑fuzz | — |
| Valid test‑case ratio (compiles) | 78 % | 41 % |
| Time to first bug (hours) | 4.2 | 9.7 |
  • Bug diversity: The discovered bugs span parsing edge‑cases, macro expansion, unsafe‑code handling, and even optimizer mis‑behaviors.
  • Higher coverage: Because the masked snippets come from real‑world bug fixes, the generated programs naturally explore under‑tested language constructs.
  • Efficiency: The LLM‑infilling step adds only a few seconds per mask, making the overall pipeline comparable in speed to traditional fuzzers while delivering richer test inputs.

Practical Implications

  • For compiler developers – CLOZEMASTER can be integrated into nightly CI pipelines to continuously harvest new test cases from the ever‑growing issue backlog, catching regressions early.
  • For Rust library authors – The technique can be repurposed to generate edge‑case usage patterns of a library’s API, helping spot unsafe or undefined‑behavior scenarios before release.
  • For security teams – Because many bugs involve unsafe blocks or memory‑layout assumptions, the generated tests can surface potentially security‑relevant vulnerabilities.
  • Tooling ecosystem – The masking‑infilling paradigm is language‑agnostic; other systems (e.g., Swift, Zig) could adopt a similar pipeline to boost their compiler fuzzing efforts without building massive grammar‑aware generators from scratch.

Limitations & Future Work

  • Dependence on existing bug reports – The approach works best when a rich history of issue patches is available; nascent languages may lack sufficient seed material.
  • LLM hallucination risk – Occasionally the model inserts code that compiles but does not meaningfully exercise the intended feature, leading to false positives that must be filtered.
  • Prompt engineering overhead – Tuning prompts for different compiler versions or LLM providers requires manual effort.
  • Future directions suggested by the authors include: (1) extending the mask granularity to whole functions or modules, (2) exploring smaller, open‑source LLMs for cost‑effective scaling, and (3) applying reinforcement learning to guide the LLM toward generating higher‑impact test cases.

Authors

  • Hongyan Gao
  • Yibiao Yang
  • Maolin Sun
  • Jiangchang Wu
  • Yuming Zhou
  • Baowen Xu

Paper Information

  • arXiv ID: 2605.00413v1
  • Categories: cs.SE
  • Published: May 1, 2026
  • PDF: Download PDF