[Paper] Fine-Tuning GPT-5 for GPU Kernel Generation
Source: arXiv - 2602.11000v1
Overview
The paper presents Makora, a reinforcement‑learning (RL) framework that fine‑tunes GPT‑5 to generate high‑performance GPU kernels written in Triton. Supervised fine‑tuning is hampered by the scarcity of high‑quality kernel data; by moving to RL, the authors show that both the correctness and the speed of generated kernels improve dramatically, narrowing the gap with compiler baselines such as TorchInductor.
Key Contributions
- RL‑based fine‑tuning pipeline for a frontier LLM (GPT‑5) targeting GPU kernel generation.
- Makora environment that integrates a Triton compiler, runtime validator, and performance profiler for automated reward calculation.
- Significant accuracy boost: single‑attempt kernel correctness rises from 43.7 % to 77.0 % (+33.3 pp).
- Performance gains: the fraction of problems beating TorchInductor increases from 14.8 % to 21.8 %; the full agent solves 97.4 % of an expanded KernelBench suite and outperforms TorchInductor on 72.9 % of problems, with a 2.12× geometric‑mean speedup on those cases.
- Demonstration of data‑efficient learning in a domain where labeled data is prohibitively expensive.
Methodology
- Problem Set & Benchmark – The authors curated a diverse set of kernel synthesis tasks from KernelBench, covering matrix ops, reductions, and convolutions across multiple GPU architectures.
- Makora RL Loop:
  - Policy Model: GPT‑5 (pre‑trained on general code).
  - Action: Generate a Triton kernel given a high‑level description.
  - Environment:
    - Compilation with the Triton compiler.
    - Execution on a target GPU to verify functional correctness.
    - Profiling to measure runtime and memory usage.
  - Reward Function: Composite score = correctness (binary) + scaled speed‑up over a baseline (TorchInductor), minus penalties for compilation failures or excessive resource use (see the reward sketch after this list).
- RL Algorithm – Proximal Policy Optimization (PPO) with a modest batch size (≈256 kernels per update) to keep training compute‑efficient (a clipped‑objective sketch also follows this list).
- Curriculum & Sampling – Early epochs focus on simpler kernels (e.g., element‑wise ops) before moving to more complex patterns, ensuring stable learning despite sparse high‑reward events.
- Evaluation – After each training checkpoint, the model is tested on a held‑out KernelBench split, measuring both single‑attempt correctness and full‑agent success (allowing iterative refinement).
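A minimal sketch of the composite reward described above. The function signature, the weights, the speedup cap, and the memory budget are all illustrative assumptions; the paper names the reward's ingredients but not its exact form.

```python
# Hypothetical composite reward in the spirit of Makora's environment.
# All constants and the signature are assumptions, not the paper's API.
SPEEDUP_CLIP = 4.0      # assumed cap so outlier speedups don't dominate
MEM_BUDGET_MB = 1024    # assumed per-kernel memory budget

def makora_reward(compiled: bool, correct: bool, runtime_ms: float,
                  baseline_ms: float, peak_mem_mb: float) -> float:
    if not compiled:
        return -1.0                                         # penalty: compilation failure
    if not correct:
        return 0.0                                          # compiled but numerically wrong
    reward = 1.0                                            # binary correctness term
    reward += min(baseline_ms / runtime_ms, SPEEDUP_CLIP)   # scaled speedup vs. TorchInductor
    if peak_mem_mb > MEM_BUDGET_MB:
        reward -= 0.5                                       # penalty: excessive resource use
    return reward

# Example: a correct kernel running in 0.8 ms against a 1.6 ms baseline
print(makora_reward(True, True, runtime_ms=0.8, baseline_ms=1.6, peak_mem_mb=512))  # 3.0
```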
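For reference, the surrogate objective PPO optimizes is the standard clipped loss; this is generic PPO, not anything Makora‑specific, since the paper reports only the algorithm name and batch size.

```python
# Standard PPO clipped surrogate loss; nothing here is Makora-specific.
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that sampled the kernels.
    ratio = torch.exp(logp_new - logp_old)
    # Take the pessimistic minimum of the unclipped and clipped terms to keep updates conservative.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```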
Results & Findings
| Metric | Baseline GPT‑5 | Fine‑tuned (RL) |
|---|---|---|
| Single‑attempt correctness | 43.7 % | 77.0 % (+33.3 pp) |
| Fraction beating TorchInductor | 14.8 % | 21.8 % (+7.0 pp) |
| Full‑agent solve rate (expanded suite) | — | 97.4 % |
| Problems where agent outperforms TorchInductor | — | 72.9 % |
| Geometric‑mean speedup (outperforming cases) | 1.0× | 2.12× |
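Since the headline speedup is a geometric mean, it aggregates per‑problem speedups multiplicatively rather than additively. A quick worked illustration with made‑up values (not from the paper):

```python
# Geometric mean = n-th root of the product of per-problem speedups.
# The speedup values below are invented for illustration only.
import math

speedups = [1.3, 4.0, 1.8, 2.5]
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"{geo_mean:.2f}x")  # ~2.20x for these sample values
```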
Key takeaways:
- RL can compensate for the lack of labeled data by turning the compiler/runtime itself into a source of feedback.
- The fine‑tuned model learns hardware‑aware idioms (e.g., shared‑memory tiling, illustrated in the sketch after this list) that were rarely present in the pre‑training corpus.
- Even a single‑attempt generation sees a dramatic jump in functional correctness, making the model viable for integration into developer tools.
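To make the tiling idiom concrete, here is a textbook block‑tiled matmul in Triton: a generic hand‑written example, not a kernel generated by the model, with arbitrary block sizes. Each program instance computes one BLOCK_M × BLOCK_N output tile, and the Triton compiler stages the operand tiles of `tl.dot` through shared memory.

```python
# Generic block-tiled Triton matmul; illustrative, not from the paper.
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Offsets of this program's output tile.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # March along K in BLOCK_K-sized tiles; the compiler stages each
    # operand tile through shared memory for the tl.dot below.
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```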
Practical Implications
- Developer Assistants – IDE plugins could suggest ready‑to‑run Triton kernels, cutting down the trial‑and‑error cycle that currently dominates GPU performance engineering.
- Accelerator‑Specific Codegen – Companies building custom ASICs or newer GPU generations can plug their own compiler/runtime into Makora, quickly adapting a generic LLM to their hardware without needing massive kernel datasets.
- Automated Optimization Pipelines – CI/CD systems could automatically replace hand‑written kernels with RL‑enhanced LLM outputs, yielding immediate runtime savings (e.g., 2× speedup on common linear‑algebra kernels).
- Lower Barrier to Entry – Researchers and startups lacking deep GPU‑kernel expertise can leverage the model to prototype high‑performance kernels, accelerating AI model scaling on limited hardware budgets.
Limitations & Future Work
- Hardware Diversity – Experiments focus on a single GPU family; cross‑generation generalization remains an open question.
- Reward Sparsity – Extremely complex kernels still suffer from low reward signal, limiting scalability to very large workloads.
- Safety & Correctness Guarantees – The RL loop validates functional correctness but does not formally verify numerical stability or memory safety.
- Future Directions – Extending Makora to other accelerator languages (e.g., CUDA, SYCL), incorporating multi‑objective rewards (energy, memory bandwidth), and exploring hybrid supervised‑RL curricula to further reduce training time.
Authors
- Ali Tehrani
- Yahya Emara
- Essam Wissam
- Wojciech Paluch
- Waleed Atallah
- Łukasz Dudziak
- Mohamed S. Abdelfattah
Paper Information
- arXiv ID: 2602.11000v1
- Categories: cs.DC, cs.AI, cs.LG
- Published: February 11, 2026