[Paper] Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
Source: arXiv - 2602.10262v1
Overview
The paper dives into the performance quirks of AMD’s newest MI300A accelerator, which bundles CDNA3 GPUs, high‑bandwidth memory, FP8 matrix cores, an Asynchronous Compute Engine (ACE), and 2:4 structured sparsity. By running a suite of micro‑benchmarks, the authors expose how these features behave in real‑world HPC and AI workloads and offer concrete guidance for getting the most out of them.
Key Contributions
- First‑ever execution‑centric profiling of FP8 matrix‑core pipelines on the MI300A, revealing occupancy limits and latency/throughput trade‑offs.
- Quantitative analysis of ACE concurrency, showing how multiple compute streams share resources, where fairness breaks down, and how to tune launch parameters for optimal overlap.
- System‑level study of 2:4 structured sparsity, demonstrating context‑dependent speed‑ups (up to ~2×) and the conditions under which sparsity hurts performance.
- Case‑study evaluations on transformer‑style kernels, mixed‑precision GEMMs, and concurrent workloads that map the micro‑benchmark insights to end‑to‑end application behavior.
- Practical scheduling heuristics (occupancy‑aware launch sizing, ACE throttling thresholds, sparsity enablement rules) that can be directly baked into compilers or runtime systems.
Methodology
- Micro‑benchmark suite – The authors built tiny kernels that isolate each hardware feature:
- FP8 matrix‑core kernels varying tile size, thread‑block count, and data layout.
- ACE tests that launch up to 8 independent streams with controllable dependency chains (a minimal launch sketch follows this list).
- Structured‑sparsity kernels that toggle the 2:4 mask on/off for different matrix shapes and densities.
- Instrumentation – AMD’s ROCm profiling stack (rocprof, roctx) captured:
- Core occupancy, wavefront launch latency, and memory‑traffic metrics.
- ACE queue depth, stall cycles, and cross‑stream interference.
- Effective FLOP counts vs. theoretical peaks for sparse vs. dense execution.
- Workload mapping – The micro‑benchmarks were then embedded into three representative workloads:
- Transformer attention (FP8‑dominant, heavy matrix multiplies).
- Mixed‑precision GEMM (FP16 + FP8, typical in training pipelines).
- Concurrent inference (multiple independent requests sharing the same GPU).
- Statistical analysis – Each experiment was repeated 30+ times to capture variance, and the authors used regression to model how occupancy, ACE depth, and sparsity ratio affect throughput and latency.
Results & Findings
| Feature | Key Result | Observation |
|---|---|---|
| FP8 matrix cores | Peak occupancy ≈ 85 % (beyond this, wavefront stalls rise sharply) | Small tile sizes (64×64) give the best utilization; larger tiles waste compute due to register pressure. |
| ACE concurrency | Up to 4 streams achieve near‑linear throughput; > 4 streams cause > 15 % fairness loss | ACE throttles when total wavefront count exceeds ~12 k; a “soft limit” of 4‑5 concurrent kernels gives the best balance of latency and fairness. |
| 2:4 structured sparsity | Speed‑up ranges from 1.2× (dense‑ish matrices) to 2.0× (≥ 70 % zero‑pattern compliance) | Sparsity benefits vanish for irregular shapes or when the mask forces extra padding; the overhead of mask handling can offset gains. |
| Transformer case study | End‑to‑end latency ↓ 23 % with FP8 + ACE (4 streams) + sparsity enabled | The combined effect of all three features matches the micro‑benchmark predictions, confirming the model’s applicability. |
| Mixed‑precision GEMM | Throughput ↑ 1.8× vs. FP16‑only when using FP8 matrix cores and occupancy‑aware launch | Properly sizing the kernel to stay under the occupancy ceiling is critical; otherwise performance regresses to FP16 levels. |
| Concurrent inference | Latency variance reduced by 30 % using ACE‑aware scheduling | By capping concurrent streams at 4 and staggering launches, tail latency becomes far more predictable. |
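
To illustrate the 2:4 pattern behind the sparsity row above, here is a small host‑side C++ sketch that enforces the pattern and measures the zero‑pattern compliance the paper’s ≥ 70 % threshold refers to. The function names and pruning rule (drop the two smallest magnitudes per group of four) are illustrative assumptions, not the authors’ implementation.

```cpp
// Hypothetical host-side 2:4 sparsity helpers (not the paper's code).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Zero the two smallest-magnitude values in every group of four,
// producing the 2:4 pattern that sparse matrix cores consume.
void enforce_2to4(std::vector<float>& m) {
    for (size_t g = 0; g + 4 <= m.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(m[g + a]) < std::fabs(m[g + b]);
        });
        m[g + idx[0]] = 0.0f;   // drop the two smallest entries
        m[g + idx[1]] = 0.0f;
    }
}

// Fraction of 4-wide groups already satisfying 2:4 (>= 2 zeros).
double compliance(const std::vector<float>& m) {
    size_t ok = 0, groups = m.size() / 4;
    for (size_t g = 0; g < groups; ++g) {
        int zeros = 0;
        for (int j = 0; j < 4; ++j) zeros += (m[4 * g + j] == 0.0f);
        ok += (zeros >= 2);
    }
    return groups ? double(ok) / groups : 0.0;
}

int main() {
    std::vector<float> w = {0.9f, 0.01f, -0.7f, 0.02f,  0.0f, 0.0f, 1.1f, -0.3f};
    printf("compliance before: %.2f\n", compliance(w)); // second group already 2:4
    enforce_2to4(w);
    printf("compliance after:  %.2f\n", compliance(w)); // 1.00
    return 0;
}
```

A weight matrix whose compliance is already high loses little accuracy to the mask; the paper’s finding is that the ≥ 70 % regime is where the hardware speed‑up reliably outweighs mask‑handling overhead.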
Practical Implications
- Kernel developers should target 64×64 or 128×128 FP8 tiles and keep active wavefronts below ~10 k to stay in the “sweet spot” of matrix‑core occupancy (a sketch of this sizing heuristic follows this list).
- Runtime systems (e.g., ROCm, TensorRT, PyTorch XLA) can embed a simple heuristic: if total pending kernels > 4, delay new launches or split work to avoid ACE fairness collapse.
- Compilers can automatically enable 2:4 structured sparsity for layers that naturally produce ≥ 70 % zero patterns (e.g., post‑pruning transformers) and insert padding only when the shape aligns with the hardware mask.
- Scheduler designers for multi‑tenant GPU nodes can use the paper’s occupancy‑aware model to predict tail latency and allocate resources more deterministically, which is crucial for serving large‑scale inference workloads.
- Mixed‑precision training pipelines can replace FP16 GEMMs with FP8 matrix‑core calls, gaining up to 2× throughput without sacrificing model accuracy (the authors verified this on a BERT‑base fine‑tune).
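
As a concrete reading of the first two bullets, the following sketch encodes the occupancy budget and the concurrency cap as launch‑sizing helpers. The constants mirror the paper’s reported thresholds (~10 k wavefronts, 4 concurrent kernels); the function names and chunking policy are assumptions for illustration, not the paper’s runtime.

```cpp
// Hypothetical occupancy-aware launch-sizing heuristic (assumptions noted above).
#include <cstdio>

constexpr int  kWaveSize      = 64;     // CDNA3 wavefront width
constexpr long kWaveBudget    = 10000;  // paper's ~10k "sweet spot" ceiling
constexpr int  kMaxConcurrent = 4;      // paper's ACE soft limit

// Total wavefronts a launch would create.
long wavefronts(long blocks, long threads_per_block) {
    long waves_per_block = (threads_per_block + kWaveSize - 1) / kWaveSize;
    return blocks * waves_per_block;
}

// Largest grid chunk that stays under the wavefront budget.
long max_blocks_per_launch(long threads_per_block) {
    long waves_per_block = (threads_per_block + kWaveSize - 1) / kWaveSize;
    return kWaveBudget / waves_per_block;
}

// Gate new launches on current queue depth to avoid ACE fairness collapse.
bool may_launch(int pending_kernels) { return pending_kernels < kMaxConcurrent; }

int main() {
    long blocks = 4096, tpb = 256;
    printf("requested waves:  %ld\n", wavefronts(blocks, tpb));      // 16384
    printf("blocks per chunk: %ld\n", max_blocks_per_launch(tpb));   // 2500
    printf("launch at depth 5? %s\n", may_launch(5) ? "yes" : "no"); // no
    return 0;
}
```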
Overall, the findings give developers a concrete checklist for when and how to enable each of MI300A’s advanced features, turning a “black‑box” accelerator into a set of tunable performance knobs.
Limitations & Future Work
- The study focuses on micro‑benchmarks and three specific workloads; broader AI models (e.g., diffusion, graph neural nets) may exhibit different sparsity patterns or memory footprints.
- Power and thermal constraints were not measured; sustained high occupancy could trigger throttling on longer runs.
- The authors note that future ROCm releases may expose finer‑grained ACE controls, which could shift the optimal concurrency thresholds.
- Extending the methodology to multi‑node MI300A clusters (Infinity Fabric interconnect) and evaluating communication‑compute overlap would be a natural next step.
Bottom line: This paper demystifies the MI300A’s newest hardware tricks and equips developers with actionable rules to squeeze out the best performance for next‑gen HPC and AI workloads.
Authors
- Aaron Jarmusch
- Connor Vitz
- Sunita Chandrasekaran
Paper Information
- arXiv ID: 2602.10262v1
- Categories: cs.DC, cs.AR
- Published: February 10, 2026