Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?
Source: Dev.to
Introduction
In the previous post I introduced GPU Flight, a lightweight CUDA observability tool that acts like a flight recorder for your GPU. It collects system metrics, device capabilities, and per‑kernel events.
Today we focus on one specific metric that GPU Flight captures: occupancy. Occupancy is a key indicator of GPU performance, yet it is often misunderstood.
What Is Occupancy?
A GPU is organized into Streaming Multiprocessors (SMs). Each SM can run many threads simultaneously—not by context‑switching like a CPU, but by truly executing them in parallel. The scheduling unit on an SM is a warp: a group of 32 threads that execute the same instruction in lockstep.
An SM has a fixed warp budget (e.g., 48 warps on a typical Ampere GPU).
If you launch a kernel with blocks of 256 threads (8 warps per block), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. Anything that consumes additional resources—registers, shared memory—may reduce the number of blocks that fit, leaving some warp slots idle.
occupancy = active warps / maximum warps per SM
- 1.0 → every warp slot is used.
- 0.5 → only half of the SM’s warp slots are occupied, leaving half of its capacity idle.
How GPU Flight Measures Occupancy
GPU Flight records occupancy automatically for every kernel launch—no code changes required. Enable it with enableKernelDetails: true and the information appears in the log:
{
"type": "kernel_event",
"name": "_Z18block_reduce_naivePKfPfi",
"occupancy": 0.833333,
"num_regs": 16,
"static_shared_bytes": 16384,
"dyn_shared_bytes": 0,
"block": "(256,1,1)",
"grid": "(16384,1,1)",
"max_active_blocks": 5
}
Under the hood, GPU Flight calls cudaOccupancyMaxActiveBlocksPerMultiprocessor at kernel launch time to obtain max_active_blocks, multiplies by the warps per block, and divides by the SM’s warp budget to compute occupancy. This happens inside a CUPTI callback on the host, so it adds no overhead to kernel execution itself.
An occupancy of 0.833333 tells you that the kernel only fills 5 out of the possible 6 concurrent blocks on each SM—some compute capacity is left unused.
Per‑Resource Occupancy Breakdown
To pinpoint the limiting factor, GPU Flight now provides a per‑resource occupancy breakdown and automatically identifies the limiting resource.
Example Kernel (Naïve Block Reduction)
// block_reduce_naive.cu
__global__ void block_reduce_naive(const float* in, float* out, int n) {
    __shared__ float smem[4096]; // 16 KB – statically reserved
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    // Load one element per thread into shared memory
    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}
// Launch with the configuration from the log above:
block_reduce_naive<<<16384, 256>>>(d_in, d_out, N);
GPU Flight flags a problem immediately:
{
"occupancy": 0.833333,
"reg_occupancy": 1.0,
"smem_occupancy": 0.833333,
"warp_occupancy": 1.0,
"block_occupancy": 1.0,
"limiting_resource": "shared_mem"
}
- Each *_occupancy field answers: “If only this constraint existed, what would occupancy be?”
- limiting_resource indicates the actual bottleneck. Here, shared memory matches the overall occupancy, while registers, warps, and block count are not limiting.
Why Shared Memory Is the Culprit
__shared__ float smem[4096] reserves 16 KB of static shared memory per block, regardless of how much the kernel actually uses. With 256 threads per block, the reduction only touches indices 0 … 255 (1 KB), but the full 16 KB remains locked for the block’s lifetime. This excessive reservation prevents the SM from scheduling the maximum number of concurrent blocks.
Optimizing with Dynamic Shared Memory
Switch to dynamic shared memory, which is sized at launch time:
// block_reduce_optimized.cu
__global__ void block_reduce_optimized(const float* in, float* out, int n) {
    extern __shared__ float smem[]; // size supplied at launch
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}
// Launch with 1 KB of dynamic shared memory per block:
block_reduce_optimized<<<16384, 256, 256 * sizeof(float)>>>(d_in, d_out, N);
Now the shared‑memory footprint drops from 16 KB to 1 KB per block (16× smaller), allowing the SM to host all 6 concurrent blocks.
GPU Flight confirms the improvement:
{
"occupancy": 1.0,
"limiting_resource": "warps"
}
When limiting_resource becomes warps, it means full occupancy—every SM warp slot is filled and shared memory is no longer a bottleneck.
Takeaway
- Occupancy is a concise metric that reveals how well an SM’s compute resources are utilized.
- A single occupancy number can hide the underlying cause; per‑resource breakdowns expose the true limiter (registers, shared memory, or block count).
- Using dynamic shared memory (or otherwise reducing static allocations) often unlocks higher occupancy and better performance.
GPU Flight automates the detection and reporting of these issues, letting you focus on fixing the kernel rather than manually crunching hardware limits.