[Paper] Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using Language Models
Source: arXiv - 2512.05887v1
Overview
The paper introduces Germinator, a tool that automatically creates high-quality test seeds for compiler dialects built on the MLIR framework. By combining automatically extracted dialect grammars with large language models (LLMs), the authors obtain fuzzing that generalizes across dialects while remaining effective at exposing dialect-specific bugs, a long-standing pain point for developers of low-resource language extensions.
Key Contributions
- Dialect‑agnostic seed generation: Extracts formal grammars directly from MLIR dialect specifications, eliminating the need for hand‑crafted seed corpora.
- LLM‑driven seed bootstrapping: Uses pre‑trained large language models to sample diverse, type‑correct programs from the extracted grammars, requiring no additional training data.
- Coverage‑guided fuzzing integration: Feeds the LLM‑generated seeds into a standard coverage‑guided fuzzer, dramatically improving code‑coverage metrics.
- Empirical validation on a large scale: Tested on six MLIR projects covering 91 dialects, showing 10‑120 % line‑coverage gains over grammar‑only baselines.
- Bug discovery: Uncovered 88 new bugs (40 confirmed) across 23 dialects that previously had no automated testing support.
Methodology
- Grammar Extraction – Each MLIR dialect ships with a declarative specification (e.g., TableGen). The authors parse these specs to automatically build a context-free grammar that captures the dialect's syntactic forms and type constraints (first sketch after this list).
- LLM Prompting – The extracted grammar is translated into a prompt for a large language model (e.g., GPT-3.5). The model is asked to generate diverse programs that respect the grammar rules, effectively turning the LLM into a "smart seed generator" (second sketch below).
- Seed Filtering – Generated programs are checked with the dialect's verifier; only those that pass are kept as valid seeds (third sketch below).
- Bootstrapping Fuzzers – The validated programs seed a coverage-guided fuzzer (e.g., AFL++), which mutates them under runtime coverage feedback to explore deeper compiler paths (fourth sketch below).
- Evaluation – The authors compare Germinator’s seed set against a pure‑grammar baseline on line‑coverage and bug‑finding metrics across multiple real‑world dialects.
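To make the pipeline concrete, the sketches below walk through the four generation steps in Python. The spec format, prompts, and helper names are illustrative assumptions, not the paper's implementation. First, grammar extraction: a toy dialect spec is turned into type-aware context-free productions (Germinator works from MLIR's actual TableGen/ODS definitions, which carry far richer structure).

```python
# Minimal sketch: derive CFG productions from a toy dialect spec.
# The spec format below is hypothetical; Germinator extracts grammars
# from MLIR's TableGen (ODS) definitions.

# Toy spec: op name -> (operand types, result type)
TOY_SPEC = {
    "toy.add": (["i32", "i32"], "i32"),
    "toy.neg": (["i32"], "i32"),
    "toy.const": ([], "i32"),
}

def spec_to_productions(spec):
    """Build productions of the form  expr_<type> ::= op(expr_<type>, ...)."""
    productions = {}
    for op, (operands, result) in spec.items():
        lhs = f"expr_{result}"
        rhs = (op, [f"expr_{t}" for t in operands])
        productions.setdefault(lhs, []).append(rhs)
    return productions

if __name__ == "__main__":
    for lhs, alternatives in spec_to_productions(TOY_SPEC).items():
        for op, args in alternatives:
            print(f"{lhs} ::= {op}({', '.join(args)})")
```

Typing the nonterminals by result type is what lets downstream sampling respect the dialect's type constraints rather than producing arbitrary strings.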
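Second, LLM prompting: a minimal sketch that embeds the extracted grammar in a prompt and samples candidate programs, here via the OpenAI chat API. The model choice, prompt wording, and sampling settings are assumptions for illustration.

```python
# Sketch of grammar-conditioned seed generation with an LLM.
# Model, prompt wording, and sampling settings are illustrative
# assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRAMMAR = """\
expr_i32 ::= toy.add(expr_i32, expr_i32)
expr_i32 ::= toy.neg(expr_i32)
expr_i32 ::= toy.const()
"""

PROMPT = (
    "You are generating test programs for an MLIR dialect.\n"
    "Write one well-formed MLIR function that only uses operations\n"
    "derivable from this grammar, with correct operand types:\n\n"
    + GRAMMAR
    + "\nOutput only MLIR code."
)

def sample_seeds(n=10):
    """Sample n candidate programs at high temperature for diversity."""
    seeds = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,
        )
        seeds.append(resp.choices[0].message.content)
    return seeds
```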
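Third, seed filtering: candidates are kept only if the dialect's verifier accepts them. Driving mlir-opt as the verifier is a tooling assumption; mlir-opt parses and verifies its input by default, so a nonzero exit code marks an invalid seed.

```python
# Sketch of the filtering step: keep only seeds the dialect verifier
# accepts. Any binary that parses and verifies the dialect would do.
import subprocess
import tempfile

def passes_verifier(seed_text: str, mlir_opt: str = "mlir-opt") -> bool:
    """True if mlir-opt can parse and verify the candidate program."""
    with tempfile.NamedTemporaryFile("w", suffix=".mlir") as f:
        f.write(seed_text)
        f.flush()
        # mlir-opt verifies its input by default; a nonzero exit
        # code means the candidate was rejected.
        result = subprocess.run(
            [mlir_opt, f.name, "-o", "/dev/null"],
            capture_output=True,
        )
    return result.returncode == 0

# Usage: valid_seeds = [s for s in sample_seeds() if passes_verifier(s)]
```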
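Finally, fuzzer bootstrapping: the surviving seeds form the initial corpus for a coverage-guided fuzzer. The sketch uses standard afl-fuzz flags; the harness binary name "./mlir-fuzz-target" is hypothetical.

```python
# Sketch of bootstrapping AFL++: write the validated seeds into a
# corpus directory and hand it to afl-fuzz.
import pathlib
import subprocess

def launch_fuzzing(valid_seeds, corpus="corpus", findings="findings"):
    corpus_dir = pathlib.Path(corpus)
    corpus_dir.mkdir(exist_ok=True)
    for i, seed in enumerate(valid_seeds):
        (corpus_dir / f"seed_{i:04d}.mlir").write_text(seed)
    # afl-fuzz mutates the seeds under coverage feedback; @@ is
    # replaced with the path of each generated input file.
    subprocess.run([
        "afl-fuzz", "-i", corpus, "-o", findings,
        "--", "./mlir-fuzz-target", "@@",
    ])
```

Since afl-fuzz substitutes @@ with each mutated input's path, the harness only needs to accept an input file as its argument.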
Results & Findings
- Coverage Boost: Across the 91 dialects, Germinator’s seeds increased line coverage by 10 % to 120 % compared with a grammar‑only seed generator.
- Bug Yield: The fuzzing campaign discovered 88 previously unknown bugs, of which 40 were confirmed by the dialect maintainers.
- Low‑Resource Success: For 23 dialects that lacked any prior automated test generators, Germinator still produced effective seeds, demonstrating that the approach works even when developer resources are scarce.
- Scalability: The entire pipeline—from grammar extraction to bug discovery—ran automatically on all dialects without manual tuning, demonstrating practical scalability.
Practical Implications
- Faster Compiler Development: Teams building new MLIR dialects can plug Germinator into their CI pipelines and obtain immediate, high‑coverage test suites without writing seed corpora.
- Reduced Maintenance Overhead: Because seeds are regenerated from the dialect spec, they stay in sync with evolving language features, cutting down on brittle test‑suite maintenance.
- Improved Reliability of Domain‑Specific Languages: Industries that rely on custom DSLs (e.g., graphics, ML, hardware design) can catch subtle compiler crashes early, leading to more stable toolchains for end‑users.
- Leverage Existing LLM Investments: Organizations already using LLM APIs can reuse those services for seed generation, turning a “nice‑to‑have” AI capability into a concrete quality‑assurance asset.
Limitations & Future Work
- Dependence on Grammar Quality: If a dialect’s specification is incomplete or contains errors, the extracted grammar (and thus the seeds) may miss critical constructs.
- LLM Prompt Sensitivity: The diversity of generated programs can vary with prompt design and model version; tuning may be required for edge‑case dialects.
- Runtime Overhead: The initial LLM generation step adds latency compared to pure grammar sampling, though this cost is amortized over the fuzzing campaign.
- Future Directions: The authors suggest tighter integration with type‑inference engines to prune invalid seeds earlier, exploring smaller, fine‑tuned LLMs for faster generation, and extending the approach to other extensible compiler infrastructures beyond MLIR.
Authors
- Sairam Vaidya
- Marcel Böhme
- Loris D’Antoni
Paper Information
- arXiv ID: 2512.05887v1
- Categories: cs.SE, cs.LG, cs.PL
- Published: December 5, 2025