[Paper] PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Published: December 21, 2025
Source: arXiv - 2512.19018v1

Overview

The paper presents PEAK, an AI‑assistant that leverages large language models (LLMs) to automatically transform and optimize GPU kernel code. By expressing low‑level performance tweaks as natural‑language instructions, PEAK can generate, validate, and benchmark optimized kernels across multiple GPU back‑ends (CUDA, HIP, HLSL), achieving performance on par with hand‑tuned vendor libraries.

Key Contributions

  • Natural‑language transformation pipeline – Encodes iterative GPU optimizations as plain English prompts that LLMs can execute.
  • Modular, extensible infrastructure – Handles code generation, correctness validation, and performance measurement for any GPU backend.
  • Cross‑backend support – Demonstrated on CUDA, AMD’s HIP, and DirectX HLSL, showing the approach is hardware‑agnostic.
  • Empirical evaluation – Implemented 16 transformations for matrix‑multiplication kernels; results match or exceed vendor‑provided libraries, and for HLSL reach the documented FLOPS ceiling.
  • Research platform – Enables systematic study of LLM behavior on low‑level code, error patterns, and performance trajectories across optimization sequences.

Methodology

  1. Define transformations in natural language – Each optimization (e.g., “tile the loops with a 32×32 block”, “unroll the innermost loop three times”) is written as a concise English instruction.
  2. Prompt the LLM – The instruction, together with the current kernel source, is fed to a large language model (e.g., GPT‑4). The model returns modified source code.
  3. Validate & benchmark – An automated harness compiles the generated kernel for the target backend, runs a correctness test suite, and measures runtime/FLOPS.
  4. Iterate – Successful transformations are chained; failures trigger fallback or re‑prompting.
  5. Extensibility – New backends or transformations are added by supplying the appropriate compiler wrappers and natural‑language specs—no changes to the core LLM logic are required.
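
The five steps above can be sketched as a small driver loop. This is an illustrative reconstruction, not the paper's actual code: the names `apply_transformation`, `optimize`, and the `llm`/`harness` callables are hypothetical stand-ins for PEAK's LLM interface and its compile/test/benchmark harness.

```python
def apply_transformation(kernel_src, instruction, llm, harness, max_retries=3):
    """Ask the LLM to apply one natural-language transformation; keep the
    result only if it compiles, passes the correctness suite, and times out
    a valid runtime. On repeated failure, fall back to the input kernel."""
    for _ in range(max_retries):
        candidate = llm(
            f"Apply this optimization to the kernel below:\n"
            f"{instruction}\n---\n{kernel_src}"
        )
        ok, runtime = harness(candidate)  # compile + correctness + timing
        if ok:
            return candidate, runtime
    return kernel_src, None               # fallback: unmodified kernel

def optimize(kernel_src, instructions, llm, harness):
    """Chain transformations in order, skipping any that fail validation."""
    for instruction in instructions:
        kernel_src, _ = apply_transformation(kernel_src, instruction, llm, harness)
    return kernel_src
```

In practice the `harness` callable is where backend extensibility lives: swapping the compiler wrapper (nvcc, hipcc, a DXC invocation) changes the target without touching the loop.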

Results & Findings

| Backend | Baseline (naïve kernel) | Optimized by PEAK | Vendor library (if any) |
|---------|-------------------------|-------------------|-------------------------|
| CUDA | 1.2× slower than cuBLAS | ≈ 0.95× cuBLAS | cuBLAS (reference) |
| HIP | 1.4× slower than rocBLAS | ≈ 1.0× rocBLAS | rocBLAS (reference) |
| HLSL | 2.3× slower than theoretical peak | ≈ 1.0× hardware FLOPS limit | No official library |
  • Correctness: All generated kernels passed the supplied test suite; the infrastructure caught and rejected transformations that introduced subtle bugs.
  • Error patterns: Most LLM mistakes were syntactic (missing semicolons) or semantic (using an undefined variable). Prompt engineering and post‑processing filters reduced these to <5 % of attempts.
  • Optimization trajectory: Performance gains were not monotonic; certain early transformations (e.g., aggressive unrolling) could degrade later tiling steps, highlighting the need for a feedback loop.
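
Because gains are not monotonic, a greedy feedback loop that benchmarks after every step and keeps only improvements is one natural mitigation. The sketch below is a hypothetical illustration of that idea (the `transform` and `benchmark` callables are assumptions, not PEAK's API):

```python
def greedy_optimize(kernel_src, instructions, transform, benchmark):
    """Apply transformations in sequence, but accept a candidate only if
    its measured runtime strictly improves on the best kernel so far."""
    best_src, best_time = kernel_src, benchmark(kernel_src)
    for instruction in instructions:
        candidate = transform(best_src, instruction)
        if candidate is None:          # validation failed: skip this step
            continue
        t = benchmark(candidate)
        if t < best_time:              # reject regressions (e.g. unrolling
            best_src, best_time = candidate, t   # that hurts later tiling)
    return best_src, best_time
```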

Practical Implications

  • Accelerated performance engineering – Developers can describe a desired optimization in plain English and obtain a working, benchmarked kernel in minutes, cutting down the manual tuning cycle that traditionally takes days or weeks.
  • Cross‑platform portability – Because transformations are backend‑agnostic, a single set of natural‑language specs can generate tuned kernels for NVIDIA, AMD, and DirectX GPUs, simplifying multi‑vendor codebases.
  • AI‑augmented CI pipelines – PEAK can be integrated into continuous integration to automatically suggest or apply performance improvements whenever a kernel changes.
  • Rapid prototyping for emerging hardware – When new GPU architectures appear, engineers only need to update the backend compiler wrappers; the same natural‑language transformations can be re‑run to discover optimal settings without hand‑crafting new kernels.
  • Foundation for autonomous agents – PEAK’s “plug‑and‑play” design enables higher‑level AI agents to drive end‑to‑end kernel optimization without human intervention, opening the door to self‑optimizing libraries.
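
As a minimal illustration of the CI idea above, a pipeline could gate merges on a benchmarked regression threshold. This is a sketch under assumed semantics, not a feature PEAK documents; the name `ci_gate` and the 5% tolerance are illustrative choices:

```python
def ci_gate(old_runtime, new_runtime, tolerance=0.05):
    """Pass the build only if the changed kernel is at most `tolerance`
    (fractionally) slower than the previously recorded runtime."""
    return new_runtime <= old_runtime * (1.0 + tolerance)
```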

Limitations & Future Work

  • LLM dependence – Quality of generated code hinges on the underlying model; newer, more capable models are expected to reduce error rates.
  • Transformation expressiveness – Complex, non‑local optimizations (e.g., algorithmic redesign) are still hard to capture with simple natural‑language prompts.
  • Scalability of validation – Full correctness testing for large kernels can be time‑consuming; future work may incorporate static analysis or symbolic execution to speed up validation.
  • Broader kernel families – The study focused on matrix multiplication; extending the approach to irregular workloads (e.g., graph kernels, sparse linear algebra) is an open research direction.

PEAK demonstrates that natural language can serve as a practical bridge between high‑level performance intent and low‑level GPU code, turning LLMs into genuine performance‑engineering partners.

Authors

  • Muhammad Usman Tariq
  • Abhinav Jangda
  • Angelica Moreira
  • Madan Musuvathi
  • Tyler Sorensen

Paper Information

  • arXiv ID: 2512.19018v1
  • Categories: cs.SE
  • Published: December 22, 2025