[Paper] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Published: February 27, 2026 at 01:58 PM EST
5 min read
Source: arXiv


Overview

The paper introduces CUDA Agent, a reinforcement‑learning (RL) system that teaches a large language model (LLM) to write high‑performance CUDA kernels. By combining massive synthetic data, a skill‑augmented development environment, and stable RL training tricks, the authors achieve kernel speeds that beat both open‑source compiler pipelines (e.g., torch.compile) and leading proprietary LLMs on the demanding KernelBench benchmark.

Key Contributions

  • Scalable data synthesis pipeline – automatically generates millions of diverse CUDA kernels together with ground‑truth performance labels, eliminating the need for costly hand‑crafted datasets.
  • Skill‑augmented development environment – integrates a sandboxed CUDA compiler, runtime verifier, and profiler that feed precise reward signals (correctness + execution time) back to the model.
  • Agentic RL framework – adapts Proximal Policy Optimization (PPO) with custom reward shaping and curriculum learning to stabilize training on the massive synthetic corpus.
  • State‑of‑the‑art benchmark results – on the KernelBench Level‑1, Level‑2, and Level‑3 splits, CUDA Agent is 100%, 100%, and 92% faster than torch.compile, respectively, and ~40% faster than top proprietary models (Claude Opus 4.5, Gemini 3 Pro) on the hardest tasks.
  • Open‑source tooling – the authors release the data generation scripts, the verification/profiling sandbox, and a lightweight inference wrapper, enabling the community to reproduce and extend the work.

Methodology

  1. Synthetic Kernel Generation

    • A “kernel generator” program creates random but syntactically valid CUDA kernels across a wide range of patterns (memory accesses, thread block configurations, math operations).
    • Each generated kernel is compiled and profiled on a suite of GPU hardware (e.g., NVIDIA A100, RTX 4090) to obtain ground‑truth latency and resource‑usage metrics.
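The generation step can be pictured as template sampling. The sketch below is purely illustrative (the paper's generator, template names, and operation set are not specified in this summary); it only shows the idea of sampling structural choices and emitting syntactically valid CUDA source paired with a launch configuration to profile.

```python
import random

# Hypothetical template; the actual pipeline covers far more patterns
# (memory accesses, thread-block shapes, math operations).
KERNEL_TEMPLATE = """\
__global__ void generated_kernel(const float* __restrict__ x,
                                 float* __restrict__ y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        y[i] = {expr};
    }}
}}
"""

# Small illustrative pool of element-wise expressions.
OPS = ["x[i] * 2.0f", "x[i] + x[i] * x[i]", "sqrtf(fabsf(x[i]))"]

def sample_kernel(seed=None):
    """Return (cuda_source, launch_config) for one random kernel variant."""
    rng = random.Random(seed)
    expr = rng.choice(OPS)
    block = rng.choice([64, 128, 256, 512])  # thread-block size to profile
    return KERNEL_TEMPLATE.format(expr=expr), {"block_dim": block}

src, cfg = sample_kernel(seed=0)
```

In the real system, each sampled source would then be compiled with nvcc and profiled on the target GPUs to attach ground‑truth latency labels.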
  2. Skill‑Augmented Development Environment

    • The environment wraps nvcc/clang‑cuda and a custom runtime harness that automatically checks functional correctness (via unit‑test inputs) and measures execution time.
    • Rewards are a weighted sum: correctness (binary) + speedup (relative to a baseline kernel) + resource efficiency (e.g., register pressure).
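A minimal sketch of such a weighted reward, with illustrative weights (the paper's exact coefficients and efficiency metric are not given here): a hard correctness gate, a speedup term relative to the baseline, and a register-pressure penalty standing in for resource efficiency.

```python
def kernel_reward(passed, baseline_ms, candidate_ms, regs_used,
                  max_regs=255, w_speed=1.0, w_eff=0.1):
    """Weighted reward: correctness (binary gate) + speedup + efficiency.

    Weights and the register-based efficiency term are assumptions for
    illustration, not the paper's exact formulation.
    """
    if not passed:
        return 0.0  # incorrect kernels earn nothing, regardless of speed
    speedup = baseline_ms / candidate_ms          # >1.0 means faster than baseline
    efficiency = 1.0 - regs_used / max_regs       # fewer registers -> higher score
    return 1.0 + w_speed * speedup + w_eff * efficiency
```

Gating on correctness (rather than merely weighting it) prevents the policy from trading functional safety for raw speed.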
  3. Reinforcement Learning Loop

    • The LLM (a 13‑B parameter transformer) is fine‑tuned with PPO.
    • Curriculum learning: start with easy kernels (Level‑1), gradually introduce more complex patterns (Level‑2/3).
    • Reward shaping: penalize large code size and excessive shared‑memory usage to avoid “over‑optimizing” at the cost of portability.
    • Stabilization tricks: KL‑control, adaptive learning rates, and gradient clipping keep training from diverging despite the noisy reward signal from profiling.
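The clipped-objective-plus-KL-penalty combination above can be sketched per sample as follows. Plain floats stand in for tensors, and `clip_eps`/`kl_coef` are illustrative hyperparameters, not values from the paper.

```python
def ppo_step(ratio, advantage, kl, clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO objective with a KL control term, for one sampled action.

    ratio = pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage;
    kl is the per-sample KL divergence from the reference policy.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Take the pessimistic (smaller) objective so large policy jumps earn no
    # extra credit, then subtract the KL penalty to keep updates conservative.
    return min(unclipped, clipped) - kl_coef * kl
```

The clipping caps how much a single noisy profiling reward can move the policy, which is exactly why it helps here: latency measurements are inherently jittery.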
  4. Evaluation

    • The final model is tested on the public KernelBench suite, which groups kernels by difficulty (Level‑1: simple element‑wise ops, Level‑3: complex reductions, tiling, and memory‑bound workloads).

Results & Findings

| Benchmark split | torch.compile speedup | CUDA Agent speedup | Proprietary LLM (Claude Opus 4.5) |
|---|---|---|---|
| Level‑1 | 1.0× (baseline) | 2.0× | ~1.4× |
| Level‑2 | 1.0× | 2.0× | ~1.5× |
| Level‑3 | 1.0× | 1.92× | ~1.4× |
  • Correctness: > 99 % of generated kernels pass the automated test suite, showing the reward system effectively balances speed with functional safety.
  • Hardware portability: The same model achieves comparable gains on both NVIDIA A100 and consumer‑grade RTX 4090, indicating learned optimizations are not hardware‑overfitted.
  • Ablation studies: Removing the profiling‑based reward drops speedup to ~1.2×, confirming that fine‑grained performance feedback is crucial.

Practical Implications

  • Developer productivity: Integrating CUDA Agent into IDEs or CI pipelines could auto‑generate optimized kernels from high‑level specifications, freeing engineers from low‑level tuning.
  • Framework acceleration: Deep‑learning libraries (PyTorch, TensorFlow) could replace their static compilation back‑ends with a learned kernel generator, yielding immediate runtime gains for custom ops.
  • Cost reduction: Faster kernels mean lower GPU time for training and inference, directly translating to cheaper cloud bills, especially for large‑scale models.
  • Hardware‑agnostic optimization: Because the model learns patterns that generalize across GPU generations, it can future‑proof codebases as new architectures arrive.
  • Open‑source ecosystem: The released data pipeline enables other teams to train domain‑specific agents (e.g., for OpenCL, SYCL, or FPGA HLS), potentially democratizing high‑performance code generation beyond CUDA.

Limitations & Future Work

  • Synthetic bias: Although the generator covers many patterns, it may miss niche kernels that arise in specialized scientific codes, limiting out‑of‑distribution performance.
  • Profiling overhead: The reward loop requires actual GPU execution, which is expensive for very large datasets; future work could explore surrogate models to predict performance.
  • Model size: The current 13‑B transformer is still sizable for on‑device inference; distillation or quantization techniques are needed for tight integration in edge environments.
  • Safety guarantees: While functional correctness is verified, the system does not yet enforce memory‑safety or power‑efficiency constraints that some production settings demand.

The authors suggest extending the curriculum to include multi‑kernel pipelines, exploring cross‑hardware transfer (e.g., AMD GPUs), and integrating static analysis to further tighten the safety net.

Authors

  • Weinan Dai
  • Hanlin Wu
  • Qiying Yu
  • Huan-ang Gao
  • Jiahao Li
  • Chengquan Jiang
  • Weiqiang Lou
  • Yufan Song
  • Hongli Yu
  • Jiaze Chen
  • Wei-Ying Ma
  • Ya-Qin Zhang
  • Jingjing Liu
  • Mingxuan Wang
  • Xin Liu
  • Hao Zhou

Paper Information

  • arXiv ID: 2602.24286v1
  • Categories: cs.LG, cs.AI
  • Published: February 27, 2026