[Paper] Fine-Tuning GPT-5 for GPU Kernel Generation
Source: arXiv - 2602.11000v1
Overview
The paper presents Makora, a reinforcement‑learning (RL) framework that fine‑tunes GPT‑5 to generate high‑performance GPU kernels written in Triton. Supervised fine‑tuning is hampered by the scarcity of high‑quality kernel data; by moving to RL, the authors show that both the correctness and the speed of generated kernels improve dramatically, narrowing the gap with compiler baselines such as TorchInductor.
Key Contributions
- RL‑based fine‑tuning pipeline for a frontier LLM (GPT‑5) targeting GPU kernel generation.
- Makora environment that integrates a Triton compiler, runtime validator, and performance profiler for automated reward calculation.
- Significant accuracy boost: single‑attempt kernel correctness rises from 43.7 % to 77.0 % (+33.3 pp).
- Performance gains: the fraction of problems beating TorchInductor increases from 14.8 % to 21.8 %; the full agent solves 97.4 % of an expanded KernelBench suite and outperforms TorchInductor on 72.9 % of problems, with a 2.12× geometric‑mean speedup on those cases.
- Demonstration of data‑efficient learning in a domain where labeled data is prohibitively expensive.
Methodology
- Problem Set & Benchmark – The authors curated a diverse set of kernel synthesis tasks from KernelBench, covering matrix ops, reductions, and convolutions across multiple GPU architectures.
- Makora RL Loop:
  - Policy Model: GPT‑5 (pre‑trained on general code).
  - Action: Generate a Triton kernel given a high‑level description.
  - Environment:
    - Compilation with the Triton compiler.
    - Execution on a target GPU to verify functional correctness.
    - Profiling to measure runtime and memory usage.
  - Reward Function: Composite score = correctness (binary) + scaled speed‑up over a baseline (TorchInductor), minus penalties for compilation failures or excessive resource use (see the reward sketch after this list).
- RL Algorithm – Proximal Policy Optimization (PPO) with a modest batch size (≈256 kernels per update) to keep training compute‑efficient (a clipped‑objective sketch also follows this list).
- Curriculum & Sampling – Early epochs focus on simpler kernels (e.g., element‑wise ops) before moving to more complex patterns, ensuring stable learning despite sparse high‑reward events.
- Evaluation – After each training checkpoint, the model is tested on a held‑out KernelBench split, measuring both single‑attempt correctness and full‑agent success (allowing iterative refinement).
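A minimal sketch of the composite reward described above. The function signature, the weights, the speedup cap, and the memory budget are all illustrative assumptions; the paper names the reward's ingredients but not its exact form.

```python
# Hypothetical composite reward in the spirit of Makora's environment.
# All constants and the signature are assumptions, not the paper's API.
SPEEDUP_CLIP = 4.0      # assumed cap so outlier speedups don't dominate
MEM_BUDGET_MB = 1024    # assumed per-kernel memory budget

def makora_reward(compiled: bool, correct: bool, runtime_ms: float,
                  baseline_ms: float, peak_mem_mb: float) -> float:
    if not compiled:
        return -1.0                                         # penalty: compilation failure
    if not correct:
        return 0.0                                          # compiled but numerically wrong
    reward = 1.0                                            # binary correctness term
    reward += min(baseline_ms / runtime_ms, SPEEDUP_CLIP)   # scaled speedup vs. TorchInductor
    if peak_mem_mb > MEM_BUDGET_MB:
        reward -= 0.5                                       # penalty: excessive resource use
    return reward

# Example: a correct kernel running in 0.8 ms against a 1.6 ms baseline
print(makora_reward(True, True, runtime_ms=0.8, baseline_ms=1.6, peak_mem_mb=512))  # 3.0
```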
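For reference, the surrogate objective PPO optimizes is the standard clipped loss; this is generic PPO, not anything Makora‑specific, since the paper reports only the algorithm name and batch size.

```python
# Standard PPO clipped surrogate loss; nothing here is Makora-specific.
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that sampled the kernels.
    ratio = torch.exp(logp_new - logp_old)
    # Take the pessimistic minimum of the unclipped and clipped terms to keep updates conservative.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```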
Results & Findings
| Metric | Baseline GPT‑5 | Fine‑tuned (RL) |
|---|---|---|
| Single‑attempt correctness | 43.7 % | 77.0 % (+33.3 pp) |
| Fraction beating TorchInductor | 14.8 % | 21.8 % (+7.0 pp) |
| Full‑agent solve rate (expanded suite) | — | 97.4 % |
| Problems where agent outperforms TorchInductor | — | 72.9 % |
| Geometric‑mean speedup (outperforming cases) | 1.0× | 2.12× |
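Since the headline speedup is a geometric mean, it aggregates per‑problem speedups multiplicatively rather than additively. A quick worked illustration with made‑up values (not from the paper):

```python
# Geometric mean = n-th root of the product of per-problem speedups.
# The speedup values below are invented for illustration only.
import math

speedups = [1.3, 4.0, 1.8, 2.5]
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"{geo_mean:.2f}x")  # ~2.20x for these sample values
```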
Key takeaways:
- RL can compensate for the lack of labeled data by turning the compiler/runtime itself into a source of feedback.
- The fine‑tuned model learns hardware‑aware idioms (e.g., shared‑memory tiling, illustrated in the sketch after this list) that were rarely present in the pre‑training corpus.
- Even a single‑attempt generation sees a dramatic jump in functional correctness, making the model viable for integration into developer tools.
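To make the tiling idiom concrete, here is a textbook block‑tiled matmul in Triton: a generic hand‑written example, not a kernel generated by the model, with arbitrary block sizes. Each program instance computes one BLOCK_M × BLOCK_N output tile, and the Triton compiler stages the operand tiles of `tl.dot` through shared memory.

```python
# Generic block-tiled Triton matmul; illustrative, not from the paper.
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Offsets of this program's output tile.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # March along K in BLOCK_K-sized tiles; the compiler stages each
    # operand tile through shared memory for the tl.dot below.
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```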
Practical Implications
- Developer Assistants – IDE plugins could suggest ready‑to‑run Triton kernels, cutting down the trial‑and‑error cycle that currently dominates GPU performance engineering.
- Accelerator‑Specific Codegen – Companies building custom ASICs or newer GPU generations can plug their own compiler/runtime into Makora, quickly adapting a generic LLM to their hardware without needing massive kernel datasets.
- Automated Optimization Pipelines – CI/CD systems could automatically replace hand‑written kernels with RL‑enhanced LLM outputs, yielding immediate runtime savings (e.g., 2× speedup on common linear‑algebra kernels).
- Lower Barrier to Entry – Researchers and startups lacking deep GPU‑kernel expertise can leverage the model to prototype high‑performance kernels, accelerating AI model scaling on limited hardware budgets.
Limitations & Future Work
- Hardware Diversity – Experiments focus on a single GPU family; cross‑generation generalization remains an open question.
- Reward Sparsity – Extremely complex kernels still suffer from low reward signal, limiting scalability to very large workloads.
- Safety & Correctness Guarantees – The RL loop validates functional correctness but does not formally verify numerical stability or memory safety.
- Future Directions – Extending Makora to other accelerator languages (e.g., CUDA, SYCL), incorporating multi‑objective rewards (energy, memory bandwidth), and exploring hybrid supervised‑RL curricula to further reduce training time.
Authors
- Ali Tehrani
- Yahya Emara
- Essam Wissam
- Wojciech Paluch
- Waleed Atallah
- Łukasz Dudziak
- Mohamed S. Abdelfattah
Paper Information
- arXiv ID: 2602.11000v1
- Categories: cs.DC, cs.AI, cs.LG
- Published: February 11, 2026