[Paper] Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
Source: arXiv - 2602.10262v1
Overview
The paper dives into the performance quirks of AMD’s newest MI300A accelerator, which bundles CDNA3 GPUs, high‑bandwidth memory, FP8 matrix cores, an Asynchronous Compute Engine (ACE), and 2:4 structured sparsity. By running a suite of micro‑benchmarks, the authors expose how these features behave in real‑world HPC and AI workloads and offer concrete guidance for getting the most out of them.
Key Contributions
- First‑ever execution‑centric profiling of FP8 matrix‑core pipelines on the MI300A, revealing occupancy limits and latency/throughput trade‑offs.
- Quantitative analysis of ACE concurrency, showing how multiple compute streams share resources, where fairness breaks down, and how to tune launch parameters for optimal overlap.
- System‑level study of 2:4 structured sparsity, demonstrating context‑dependent speed‑ups (up to ~2×) and the conditions under which sparsity hurts performance.
- Case‑study evaluations on transformer‑style kernels, mixed‑precision GEMMs, and concurrent workloads that map the micro‑benchmark insights to end‑to‑end application behavior.
- Practical scheduling heuristics (occupancy‑aware launch sizing, ACE throttling thresholds, sparsity enablement rules) that can be directly baked into compilers or runtime systems.
Methodology
- Micro‑benchmark suite – The authors built tiny kernels that isolate each hardware feature:
- FP8 matrix‑core kernels varying tile size, thread‑block count, and data layout.
- ACE tests that launch up to 8 independent streams with controllable dependency chains (a minimal launch sketch follows this list).
- Structured‑sparsity kernels that toggle the 2:4 mask on/off for different matrix shapes and densities.
- Instrumentation – AMD’s ROCm profiling stack (rocprof, roctx) captured:
- Core occupancy, wavefront launch latency, and memory‑traffic metrics.
- ACE queue depth, stall cycles, and cross‑stream interference.
- Effective FLOP counts vs. theoretical peaks for sparse vs. dense execution.
- Workload mapping – The micro‑benchmarks were then embedded into three representative workloads:
- Transformer attention (FP8‑dominant, heavy matrix multiplies).
- Mixed‑precision GEMM (FP16 + FP8, typical in training pipelines).
- Concurrent inference (multiple independent requests sharing the same GPU).
- Statistical analysis – Each experiment was repeated 30+ times to capture variance, and the authors used regression to model how occupancy, ACE depth, and sparsity ratio affect throughput and latency.
Results & Findings
| Feature | Key Result | Observation |
|---|---|---|
| FP8 matrix cores | Peak occupancy ≈ 85 % (beyond this, wavefront stalls rise sharply) | Small tile sizes (64×64) give the best utilization; larger tiles waste compute due to register pressure. |
| ACE concurrency | Up to 4 streams achieve near‑linear throughput; > 4 streams cause > 15 % fairness loss | ACE throttles when total wavefront count exceeds ~12 k; a “soft limit” of 4‑5 concurrent kernels gives the best balance of latency and fairness. |
| 2:4 structured sparsity | Speed‑up ranges from 1.2× (dense‑ish matrices) to 2.0× (≥ 70 % zero‑pattern compliance) | Sparsity benefits vanish for irregular shapes or when the mask forces extra padding; the overhead of mask handling can offset gains. |
| Transformer case study | End‑to‑end latency ↓ 23 % with FP8 + ACE (4 streams) + sparsity enabled | The combined effect of all three features matches the micro‑benchmark predictions, confirming the model’s applicability. |
| Mixed‑precision GEMM | Throughput ↑ 1.8× vs. FP16‑only when using FP8 matrix cores and occupancy‑aware launch | Properly sizing the kernel to stay under the occupancy ceiling is critical; otherwise performance regresses to FP16 levels. |
| Concurrent inference | Latency variance reduced by 30 % using ACE‑aware scheduling | By capping concurrent streams at 4 and staggering launches, tail latency becomes far more predictable. |
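
To illustrate the 2:4 pattern behind the sparsity row above, here is a small host‑side C++ sketch that enforces the pattern and measures the zero‑pattern compliance the paper’s ≥ 70 % threshold refers to. The function names and pruning rule (drop the two smallest magnitudes per group of four) are illustrative assumptions, not the authors’ implementation.

```cpp
// Hypothetical host-side 2:4 sparsity helpers (not the paper's code).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Zero the two smallest-magnitude values in every group of four,
// producing the 2:4 pattern that sparse matrix cores consume.
void enforce_2to4(std::vector<float>& m) {
    for (size_t g = 0; g + 4 <= m.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(m[g + a]) < std::fabs(m[g + b]);
        });
        m[g + idx[0]] = 0.0f;   // drop the two smallest entries
        m[g + idx[1]] = 0.0f;
    }
}

// Fraction of 4-wide groups already satisfying 2:4 (>= 2 zeros).
double compliance(const std::vector<float>& m) {
    size_t ok = 0, groups = m.size() / 4;
    for (size_t g = 0; g < groups; ++g) {
        int zeros = 0;
        for (int j = 0; j < 4; ++j) zeros += (m[4 * g + j] == 0.0f);
        ok += (zeros >= 2);
    }
    return groups ? double(ok) / groups : 0.0;
}

int main() {
    std::vector<float> w = {0.9f, 0.01f, -0.7f, 0.02f,  0.0f, 0.0f, 1.1f, -0.3f};
    printf("compliance before: %.2f\n", compliance(w)); // second group already 2:4
    enforce_2to4(w);
    printf("compliance after:  %.2f\n", compliance(w)); // 1.00
    return 0;
}
```

A weight matrix whose compliance is already high loses little accuracy to the mask; the paper’s finding is that the ≥ 70 % regime is where the hardware speed‑up reliably outweighs mask‑handling overhead.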
Practical Implications
- Kernel developers should target 64×64 or 128×128 FP8 tiles and keep active wavefronts below ~10 k to stay in the “sweet spot” of matrix‑core occupancy (a sketch of this sizing heuristic follows this list).
- Runtime systems (e.g., ROCm, TensorRT, PyTorch XLA) can embed a simple heuristic: if total pending kernels > 4, delay new launches or split work to avoid ACE fairness collapse.
- Compilers can automatically enable 2:4 structured sparsity for layers that naturally produce ≥ 70 % zero patterns (e.g., post‑pruning transformers) and insert padding only when the shape aligns with the hardware mask.
- Scheduler designers for multi‑tenant GPU nodes can use the paper’s occupancy‑aware model to predict tail latency and allocate resources more deterministically, which is crucial for serving large‑scale inference workloads.
- Mixed‑precision training pipelines can replace FP16 GEMMs with FP8 matrix‑core calls, gaining up to 2× throughput without sacrificing model accuracy (the authors verified this on a BERT‑base fine‑tune).
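
As a concrete reading of the first two bullets, the following sketch encodes the occupancy budget and the concurrency cap as launch‑sizing helpers. The constants mirror the paper’s reported thresholds (~10 k wavefronts, 4 concurrent kernels); the function names and chunking policy are assumptions for illustration, not the paper’s runtime.

```cpp
// Hypothetical occupancy-aware launch-sizing heuristic (assumptions noted above).
#include <cstdio>

constexpr int  kWaveSize      = 64;     // CDNA3 wavefront width
constexpr long kWaveBudget    = 10000;  // paper's ~10k "sweet spot" ceiling
constexpr int  kMaxConcurrent = 4;      // paper's ACE soft limit

// Total wavefronts a launch would create.
long wavefronts(long blocks, long threads_per_block) {
    long waves_per_block = (threads_per_block + kWaveSize - 1) / kWaveSize;
    return blocks * waves_per_block;
}

// Largest grid chunk that stays under the wavefront budget.
long max_blocks_per_launch(long threads_per_block) {
    long waves_per_block = (threads_per_block + kWaveSize - 1) / kWaveSize;
    return kWaveBudget / waves_per_block;
}

// Gate new launches on current queue depth to avoid ACE fairness collapse.
bool may_launch(int pending_kernels) { return pending_kernels < kMaxConcurrent; }

int main() {
    long blocks = 4096, tpb = 256;
    printf("requested waves:  %ld\n", wavefronts(blocks, tpb));      // 16384
    printf("blocks per chunk: %ld\n", max_blocks_per_launch(tpb));   // 2500
    printf("launch at depth 5? %s\n", may_launch(5) ? "yes" : "no"); // no
    return 0;
}
```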
Overall, the findings give developers a concrete checklist for when and how to enable each of MI300A’s advanced features, turning a “black‑box” accelerator into a set of tunable performance knobs.
Limitations & Future Work
- The study focuses on micro‑benchmarks and three specific workloads; broader AI models (e.g., diffusion, graph neural nets) may exhibit different sparsity patterns or memory footprints.
- Power and thermal constraints were not measured; sustained high occupancy could trigger throttling on longer runs.
- The authors note that future ROCm releases may expose finer‑grained ACE controls, which could shift the optimal concurrency thresholds.
- Extending the methodology to multi‑node MI300A clusters (Infinity Fabric interconnect) and evaluating communication‑compute overlap would be a natural next step.
Bottom line: This paper demystifies the MI300A’s newest hardware tricks and equips developers with actionable rules to squeeze out the best performance for next‑gen HPC and AI workloads.
Authors
- Aaron Jarmusch
- Connor Vitz
- Sunita Chandrasekaran
Paper Information
- arXiv ID: 2602.10262v1
- Categories: cs.DC, cs.AR
- Published: February 10, 2026