[Paper] KEET: Explaining Performance of GPU Kernels Using LLM Agents

Published: May 5, 2026 at 11:47 PM EDT

Source: arXiv - 2605.04467v1

Overview

The paper introduces KEET (Kernel Execution Explanation Toolkit), a framework that leverages large language models (LLMs) as “agents” to turn the dense, technical output of NVIDIA’s Nsight Compute profiling tool into clear, natural‑language explanations of why a GPU kernel is slow and how it can be tuned. By bridging the gap between raw performance counters and developer‑friendly insights, KEET aims to cut down the time engineers spend digging through graphs and tables to locate bottlenecks.

Key Contributions

Agentic LLM pipeline that parses Nsight Compute reports, extracts relevant metrics, and generates grounded textual explanations of performance issues.
Optimization suggestion module that couples the explanations with concrete code‑level recommendations (e.g., memory‑access patterns, occupancy tweaks).
Empirical evaluation on a suite of CUDA kernels (simple to complex) running on NVIDIA H100 GPUs, showing that KEET‑augmented prompts improve downstream LLM tasks such as code optimization and multiple‑choice Q&A.
Scalable analysis demonstrated by processing large batches of profiles, enabling the tool to surface common patterns across many kernels and feed better optimization advice back to developers.

Methodology

Profile Extraction – Nsight Compute is run on a target kernel; its CSV/JSON output (thousands of counters, roofline metrics, etc.) is collected.
LLM Agent Design – A chain‑of‑thought prompting strategy drives three agents in sequence:
- Parsing Agent identifies the most salient counters (e.g., SM utilization, memory bandwidth, stall reasons).
- Explanation Agent translates those numbers into plain English, citing the specific metric values that support each claim.
- Recommendation Agent maps the identified bottleneck to a set of known CUDA best‑practice fixes (e.g., increase thread‑block size, use shared memory, adjust launch configuration).
Prompt Engineering – The generated explanation is inserted as context into subsequent LLM queries (e.g., “Suggest code changes to improve this kernel”).
Evaluation – The authors compare three setups:
- (a) raw LLM with no profile context,
- (b) LLM with raw numeric counters, and
- (c) LLM with KEET’s natural‑language explanation.
  Accuracy on optimization‑suggestion tasks and multiple‑choice questions is measured.
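
The three evaluation setups differ only in what context precedes the question. The sketch below illustrates that difference under loud assumptions: the counter names and values are hypothetical simplifications (real Nsight Compute reports use different identifiers and far more counters), and `explain` is a rule-based stand-in for the Explanation Agent rather than an LLM call or the authors' implementation.

```python
from typing import Dict

# Hypothetical, simplified counter names with made-up values.
PROFILE: Dict[str, float] = {
    "sm_throughput_pct": 22.0,
    "dram_throughput_pct": 91.0,
    "achieved_occupancy_pct": 48.0,
    "l2_hit_rate_pct": 35.0,
}

# Assumption: these two counters are the "salient" ones for this toy example.
SALIENT_KEYS = ("sm_throughput_pct", "dram_throughput_pct")


def select_salient(metrics: Dict[str, float]) -> Dict[str, float]:
    """Stand-in for the Parsing Agent: keep only the salient counters."""
    return {key: metrics[key] for key in SALIENT_KEYS if key in metrics}


def explain(salient: Dict[str, float]) -> str:
    """Stand-in for the Explanation Agent: cite each value behind the claim."""
    sm = salient["sm_throughput_pct"]
    dram = salient["dram_throughput_pct"]
    if dram > sm:
        return (f"The kernel looks memory-bound: DRAM throughput is {dram:.0f}% "
                f"of peak while SM throughput is only {sm:.0f}%.")
    return (f"The kernel looks compute-bound: SM throughput is {sm:.0f}% "
            f"of peak while DRAM throughput is only {dram:.0f}%.")


def build_context(setup: str, metrics: Dict[str, float]) -> str:
    """Return the context for setup (a), (b), or (c)."""
    if setup == "a":  # no profile context
        return ""
    if setup == "b":  # raw numeric counters
        return "\n".join(f"{name}={value}" for name, value in sorted(metrics.items()))
    if setup == "c":  # natural-language explanation
        return explain(select_salient(metrics))
    raise ValueError(f"unknown setup: {setup!r}")


def build_prompt(setup: str, question: str, metrics: Dict[str, float]) -> str:
    """Prepend the context, if any, to the question."""
    context = build_context(setup, metrics)
    return f"{context}\n\n{question}" if context else question
```

Calling `build_prompt("c", "Suggest code changes to improve this kernel", PROFILE)` produces a prompt that opens with a one-sentence diagnosis, while setup `"b"` opens with the unprocessed counter list.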

Results & Findings

Explanation quality matters – When KEET’s textual summary is provided as context, LLMs achieve up to 23% higher accuracy on multiple‑choice questions about kernel performance compared to raw numeric input.
Better optimization suggestions – In code‑generation experiments, the LLM produces more relevant and implementable changes (e.g., “replace global loads with __ldg intrinsics”) when guided by KEET explanations.
Batch processing gains – Analyzing hundreds of profiles at once lets KEET surface recurring issues (e.g., low L2 hit rate) and automatically prioritize them for the development team.
Human validation – A small user study with GPU developers reported that KEET’s explanations reduced the time to pinpoint the primary bottleneck by about 40% on average.
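
To make the batch-processing finding concrete, here is a minimal sketch of ranking recurring issues across many profiles. The counter names, the 50% thresholds, and the issue labels are all assumptions chosen for illustration; the summary does not state how KEET itself detects or prioritizes issues.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical thresholds, chosen only for this example.
L2_HIT_RATE_MIN = 50.0
OCCUPANCY_MIN = 50.0


def flag_issues(metrics: Dict[str, float]) -> List[str]:
    """Return the issue labels triggered by one kernel's counters."""
    issues = []
    if metrics.get("l2_hit_rate_pct", 100.0) < L2_HIT_RATE_MIN:
        issues.append("low L2 hit rate")
    if metrics.get("achieved_occupancy_pct", 100.0) < OCCUPANCY_MIN:
        issues.append("low occupancy")
    return issues


def rank_issues(profiles: Dict[str, Dict[str, float]]) -> List[Tuple[str, int]]:
    """Count each issue across all profiles, most frequent first."""
    counts: Counter = Counter()
    for metrics in profiles.values():
        counts.update(flag_issues(metrics))
    return counts.most_common()


# Made-up profiles for three kernels.
profiles = {
    "kernel_a": {"l2_hit_rate_pct": 30.0, "achieved_occupancy_pct": 80.0},
    "kernel_b": {"l2_hit_rate_pct": 40.0, "achieved_occupancy_pct": 45.0},
    "kernel_c": {"l2_hit_rate_pct": 90.0, "achieved_occupancy_pct": 70.0},
}
ranking = rank_issues(profiles)
```

With these made-up numbers, the low L2 hit rate affects two kernels and low occupancy affects one, so the former ranks first.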

Practical Implications

Faster debugging cycles – Developers can paste an Nsight Compute report into KEET and instantly receive a concise “why‑is‑it‑slow” paragraph, cutting down manual inspection.
LLM‑assisted code refactoring – By feeding KEET’s output into code‑generation models (e.g., GitHub Copilot, Claude), teams can get more accurate, performance‑aware suggestions, reducing trial‑and‑error loops.
Continuous integration – KEET can be scripted into CI pipelines to automatically flag performance regressions and suggest fixes before code lands in production.
Education & onboarding – New GPU programmers can learn performance‑tuning concepts faster, as KEET translates low‑level metrics into the high‑level reasoning they need to understand.
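
The continuous-integration idea could look roughly like the following sketch, which compares per-kernel durations against a stored baseline and returns a nonzero exit code on regression. The function names, the 5% tolerance, and the duration dictionaries are hypothetical; how KEET would actually be wired into a pipeline is not described in this summary.

```python
from typing import Dict


def find_regressions(
    baseline_us: Dict[str, float],
    current_us: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Map each regressed kernel to its relative slowdown.

    A kernel counts as regressed when its duration grew by more than
    `tolerance` (a fraction) relative to the baseline. Kernels missing
    from `current_us` are skipped.
    """
    regressions = {}
    for kernel, base in baseline_us.items():
        current = current_us.get(kernel)
        if current is None or base <= 0:
            continue
        slowdown = (current - base) / base
        if slowdown > tolerance:
            regressions[kernel] = slowdown
    return regressions


def ci_exit_code(regressions: Dict[str, float]) -> int:
    """Return 1 so a CI job fails when any kernel regressed, else 0."""
    for kernel, slowdown in sorted(regressions.items()):
        print(f"REGRESSION {kernel}: {slowdown:.1%} slower than baseline")
    return 1 if regressions else 0


# Made-up durations in microseconds.
baseline = {"gemm": 100.0, "reduce": 50.0}
current = {"gemm": 112.0, "reduce": 51.0}
found = find_regressions(baseline, current)
```

A CI step would pass `ci_exit_code(found)` to `sys.exit`, and the flagged kernels' profiles could then be handed to an explanation step for a diagnosis.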

Limitations & Future Work

Model dependence – The quality of explanations hinges on the underlying LLM; smaller or less‑capable models may produce vague or incorrect advice.
Profile coverage – KEET currently focuses on Nsight Compute data from NVIDIA GPUs; extending to AMD ROCm or Intel oneAPI profiling tools would broaden its applicability.
Static rule set – Recommendations are drawn from a curated list of known CUDA optimizations; the system may miss novel or architecture‑specific tricks that fall outside this knowledge base.
Scalability of parsing – Extremely large profiles (e.g., from multi‑kernel workloads) can strain the token limits of current LLM APIs, requiring chunking strategies.
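
One simple chunking strategy is to pack report lines greedily under a token budget. The sketch below assumes a crude estimate of about four characters per token, which is a rule of thumb rather than any particular tokenizer's behavior; it is not the strategy used by KEET.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def chunk_lines(lines: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily group lines so each chunk stays within `max_tokens`.

    Line order is preserved. A single line that exceeds the budget on
    its own still gets a chunk to itself rather than being split.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        cost = approx_tokens(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current = []
            used = 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Five made-up report lines of 40 characters each (about 10 tokens apiece).
report_lines = ["x" * 40 for _ in range(5)]
chunks = chunk_lines(report_lines, max_tokens=25)
```

Each chunk can then be summarized separately and the partial summaries merged, at the cost of losing cross-chunk context.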

Future research directions include integrating reinforcement learning to let the agent “test” suggested changes on the fly, expanding the knowledge base with community‑contributed optimization patterns, and supporting cross‑vendor profiling formats.

Authors

Joshua H. Davis
Klaudiusz Rydzy
Srinivasan Ramesh
Aadit Nilay
Daniel Nichols
Swapna Raj
Nikhil Jain
Abhinav Bhatele

Paper Information

arXiv ID: 2605.04467v1
Categories: cs.PF, cs.DC
Published: May 6, 2026
