[Paper] PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Published: December 21, 2025
Source: arXiv - 2512.19018v1

Overview

The paper presents PEAK, an AI‑assistant that leverages large language models (LLMs) to automatically transform and optimize GPU kernel code. By expressing low‑level performance tweaks as natural‑language instructions, PEAK can generate, validate, and benchmark optimized kernels across multiple GPU back‑ends (CUDA, HIP, HLSL), achieving performance on par with hand‑tuned vendor libraries.

Key Contributions

  • Natural‑language transformation pipeline – Encodes iterative GPU optimizations as plain English prompts that LLMs can execute.
  • Modular, extensible infrastructure – Handles code generation, correctness validation, and performance measurement for any GPU backend.
  • Cross‑backend support – Demonstrated on CUDA, AMD’s HIP, and DirectX HLSL, showing the approach is hardware‑agnostic.
  • Empirical evaluation – Implemented 16 transformations for matrix‑multiplication kernels; results match or exceed vendor‑provided libraries, and for HLSL reach the documented FLOPS ceiling.
  • Research platform – Enables systematic study of LLM behavior on low‑level code, error patterns, and performance trajectories across optimization sequences.

Methodology

  1. Define transformations in natural language – Each optimization (e.g., “tile the loops with a 32×32 block”, “unroll the innermost loop three times”) is written as a concise English instruction.
  2. Prompt the LLM – The instruction, together with the current kernel source, is fed to a large language model (e.g., GPT‑4). The model returns modified source code.
  3. Validate & benchmark – An automated harness compiles the generated kernel for the target backend, runs a correctness test suite, and measures runtime/FLOPS.
  4. Iterate – Successful transformations are chained; failures trigger fallback or re‑prompting.
  5. Extensibility – New backends or transformations are added by supplying the appropriate compiler wrappers and natural‑language specs—no changes to the core LLM logic are required.
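
The five steps above can be sketched as a small driver loop. This is an illustrative reconstruction, not the paper's actual code: the names `apply_transformation`, `optimize`, and the `llm`/`harness` callables are hypothetical stand-ins for PEAK's LLM interface and its compile/test/benchmark harness.

```python
def apply_transformation(kernel_src, instruction, llm, harness, max_retries=3):
    """Ask the LLM to apply one natural-language transformation; keep the
    result only if it compiles, passes the correctness suite, and times out
    a valid runtime. On repeated failure, fall back to the input kernel."""
    for _ in range(max_retries):
        candidate = llm(
            f"Apply this optimization to the kernel below:\n"
            f"{instruction}\n---\n{kernel_src}"
        )
        ok, runtime = harness(candidate)  # compile + correctness + timing
        if ok:
            return candidate, runtime
    return kernel_src, None               # fallback: unmodified kernel

def optimize(kernel_src, instructions, llm, harness):
    """Chain transformations in order, skipping any that fail validation."""
    for instruction in instructions:
        kernel_src, _ = apply_transformation(kernel_src, instruction, llm, harness)
    return kernel_src
```

In practice the `harness` callable is where backend extensibility lives: swapping the compiler wrapper (nvcc, hipcc, a DXC invocation) changes the target without touching the loop.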

Results & Findings

| Backend | Baseline (naïve kernel) | Optimized by PEAK | Vendor library (if any) |
|---------|-------------------------|-------------------|-------------------------|
| CUDA | 1.2× slower than cuBLAS | ≈ 0.95× cuBLAS | cuBLAS (reference) |
| HIP | 1.4× slower than rocBLAS | ≈ 1.0× rocBLAS | rocBLAS (reference) |
| HLSL | 2.3× slower than theoretical peak | ≈ 1.0× hardware FLOPS limit | No official library |
  • Correctness: All generated kernels passed the supplied test suite; the infrastructure caught and rejected transformations that introduced subtle bugs.
  • Error patterns: Most LLM mistakes were syntactic (missing semicolons) or semantic (using an undefined variable). Prompt engineering and post‑processing filters reduced these to <5 % of attempts.
  • Optimization trajectory: Performance gains were not monotonic; certain early transformations (e.g., aggressive unrolling) could degrade later tiling steps, highlighting the need for a feedback loop.
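
Because gains are not monotonic, a greedy feedback loop that benchmarks after every step and keeps only improvements is one natural mitigation. The sketch below is a hypothetical illustration of that idea (the `transform` and `benchmark` callables are assumptions, not PEAK's API):

```python
def greedy_optimize(kernel_src, instructions, transform, benchmark):
    """Apply transformations in sequence, but accept a candidate only if
    its measured runtime strictly improves on the best kernel so far."""
    best_src, best_time = kernel_src, benchmark(kernel_src)
    for instruction in instructions:
        candidate = transform(best_src, instruction)
        if candidate is None:          # validation failed: skip this step
            continue
        t = benchmark(candidate)
        if t < best_time:              # reject regressions (e.g. unrolling
            best_src, best_time = candidate, t   # that hurts later tiling)
    return best_src, best_time
```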

Practical Implications

  • Accelerated performance engineering – Developers can describe a desired optimization in plain English and obtain a working, benchmarked kernel in minutes, cutting down the manual tuning cycle that traditionally takes days or weeks.
  • Cross‑platform portability – Because transformations are backend‑agnostic, a single set of natural‑language specs can generate tuned kernels for NVIDIA, AMD, and DirectX GPUs, simplifying multi‑vendor codebases.
  • AI‑augmented CI pipelines – PEAK can be integrated into continuous integration to automatically suggest or apply performance improvements whenever a kernel changes.
  • Rapid prototyping for emerging hardware – When new GPU architectures appear, engineers only need to update the backend compiler wrappers; the same natural‑language transformations can be re‑run to discover optimal settings without hand‑crafting new kernels.
  • Foundation for autonomous agents – PEAK’s “plug‑and‑play” design enables higher‑level AI agents to drive end‑to‑end kernel optimization without human intervention, opening the door to self‑optimizing libraries.
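
As a minimal illustration of the CI idea above, a pipeline could gate merges on a benchmarked regression threshold. This is a sketch under assumed semantics, not a feature PEAK documents; the name `ci_gate` and the 5% tolerance are illustrative choices:

```python
def ci_gate(old_runtime, new_runtime, tolerance=0.05):
    """Pass the build only if the changed kernel is at most `tolerance`
    (fractionally) slower than the previously recorded runtime."""
    return new_runtime <= old_runtime * (1.0 + tolerance)
```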

Limitations & Future Work

  • LLM dependence – Quality of generated code hinges on the underlying model; newer, more capable models are expected to reduce error rates.
  • Transformation expressiveness – Complex, non‑local optimizations (e.g., algorithmic redesign) are still hard to capture with simple natural‑language prompts.
  • Scalability of validation – Full correctness testing for large kernels can be time‑consuming; future work may incorporate static analysis or symbolic execution to speed up validation.
  • Broader kernel families – The study focused on matrix multiplication; extending the approach to irregular workloads (e.g., graph kernels, sparse linear algebra) is an open research direction.

PEAK demonstrates that natural language can serve as a practical bridge between high‑level performance intent and low‑level GPU code, turning LLMs into genuine performance‑engineering partners.

Authors

  • Muhammad Usman Tariq
  • Abhinav Jangda
  • Angelica Moreira
  • Madan Musuvathi
  • Tyler Sorensen

Paper Information

  • arXiv ID: 2512.19018v1
  • Categories: cs.SE
  • Published: December 22, 2025