[Paper] Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A

Published: February 10, 2026 at 03:12 PM EST
5 min read
Source: arXiv


Overview

The paper dives into the performance quirks of AMD’s newest MI300A accelerator, which bundles CDNA3 GPUs, high‑bandwidth memory, FP8 matrix cores, an Asynchronous Compute Engine (ACE), and 2:4 structured sparsity. By running a suite of micro‑benchmarks, the authors expose how these features behave in real‑world HPC and AI workloads and offer concrete guidance for getting the most out of them.

Key Contributions

  • First‑ever execution‑centric profiling of FP8 matrix‑core pipelines on the MI300A, revealing occupancy limits and latency/throughput trade‑offs.
  • Quantitative analysis of ACE concurrency, showing how multiple compute streams share resources, where fairness breaks down, and how to tune launch parameters for optimal overlap.
  • System‑level study of 2:4 structured sparsity, demonstrating context‑dependent speed‑ups (up to ~2×) and the conditions under which sparsity hurts performance.
  • Case‑study evaluations on transformer‑style kernels, mixed‑precision GEMMs, and concurrent workloads that map the micro‑benchmark insights to end‑to‑end application behavior.
  • Practical scheduling heuristics (occupancy‑aware launch sizing, ACE throttling thresholds, sparsity enablement rules) that can be directly baked into compilers or runtime systems.
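The scheduling heuristics listed above can be sketched as a small decision helper. The thresholds below (≈85 % occupancy ceiling, 4-stream ACE soft limit, ≥70 % sparsity compliance) come from the paper's reported findings, but the function name, signature, and structure are illustrative assumptions, not the authors' code:

```python
# Hypothetical sketch of the paper's launch-time heuristics.
# Thresholds are the values reported in the paper; everything else
# (names, return shape) is an assumption for illustration.

OCCUPANCY_CEILING = 0.85        # wavefront stalls rise sharply past ~85 %
ACE_STREAM_SOFT_LIMIT = 4       # fairness degrades beyond 4-5 concurrent streams
SPARSITY_COMPLIANCE_MIN = 0.70  # 2:4 sparsity pays off at >= 70 % compliance

def plan_launch(predicted_occupancy: float,
                active_streams: int,
                sparsity_compliance: float) -> dict:
    """Return launch-time decisions based on the paper's heuristics."""
    return {
        # shrink tile size / thread-block count if we'd exceed the ceiling
        "shrink_tiles": predicted_occupancy > OCCUPANCY_CEILING,
        # defer or split work when the ACE soft limit is already reached
        "defer_launch": active_streams >= ACE_STREAM_SOFT_LIMIT,
        # only enable the 2:4 mask when the layer is compliant enough
        "enable_2_4_sparsity": sparsity_compliance >= SPARSITY_COMPLIANCE_MIN,
    }
```

A runtime or compiler pass could evaluate this once per kernel launch; the point is that all three rules reduce to cheap threshold checks.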

Methodology

  1. Micro‑benchmark suite – The authors built tiny kernels that isolate each hardware feature:

    • FP8 matrix‑core kernels varying tile size, thread‑block count, and data layout.
    • ACE tests that launch up to 8 independent streams with controllable dependency chains.
    • Structured‑sparsity kernels that toggle the 2:4 mask on/off for different matrix shapes and densities.
  2. Instrumentation – AMD’s ROCm profiling stack (rocprof, roctx) captured:

    • Core occupancy, wavefront launch latency, and memory‑traffic metrics.
    • ACE queue depth, stall cycles, and cross‑stream interference.
    • Effective FLOP counts vs. theoretical peaks for sparse vs. dense execution.
  3. Workload mapping – The micro‑benchmarks were then embedded into three representative workloads:

    • Transformer attention (FP8‑dominant, heavy matrix multiplies).
    • Mixed‑precision GEMM (FP16 + FP8, typical in training pipelines).
    • Concurrent inference (multiple independent requests sharing the same GPU).
  4. Statistical analysis – Each experiment was repeated 30+ times to capture variance, and the authors used regression to model how occupancy, ACE depth, and sparsity ratio affect throughput and latency.

Results & Findings

| Feature | Key Metric | Observation |
| --- | --- | --- |
| FP8 matrix cores | Peak occupancy ≈ 85 %; beyond this, wavefront stalls rise sharply | Small tile sizes (64×64) give the best utilization; larger tiles waste compute due to register pressure. |
| ACE concurrency | Up to 4 streams achieve near‑linear throughput; > 4 streams cause > 15 % fairness loss | ACE throttles when the total wavefront count exceeds ~12 k; a "soft limit" of 4‑5 concurrent kernels balances latency and fairness. |
| 2:4 structured sparsity | Speed‑up ranges from 1.2× (dense‑ish matrices) to 2.0× (≥ 70 % zero‑pattern compliance) | Sparsity benefits vanish for irregular shapes or when the mask forces extra padding; mask‑handling overhead can offset the gains. |
| Transformer case study | End‑to‑end latency ↓ 23 % with FP8 + ACE (4 streams) + sparsity enabled | The combined effect of all three features matches the micro‑benchmark predictions, confirming the model's applicability. |
| Mixed‑precision GEMM | Throughput ↑ 1.8× vs. FP16‑only when using FP8 matrix cores and occupancy‑aware launch | Properly sizing the kernel to stay under the occupancy ceiling is critical; otherwise performance regresses to FP16 levels. |
| Concurrent inference | Latency variance reduced by 30 % with ACE‑aware scheduling | Capping concurrent streams at 4 and staggering launches makes tail latency far more predictable. |
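The 2:4 sparsity finding hinges on how much of a matrix already fits the hardware pattern (at most 2 nonzeros per group of 4). A minimal compliance check, written as an illustrative helper rather than the authors' tooling, might look like:

```python
def compliance_2_4(values, group=4, max_nonzero=2):
    """Fraction of length-4 groups that already satisfy the 2:4 pattern
    (at most 2 nonzeros per group). Illustrative sketch, not the paper's code."""
    groups = [values[i:i + group]
              for i in range(0, len(values) - group + 1, group)]
    ok = sum(1 for g in groups if sum(v != 0 for v in g) <= max_nonzero)
    return ok / len(groups) if groups else 0.0

# One flattened row: the first group has 2 nonzeros (compliant),
# the second has 3 (non-compliant), so compliance is 0.5.
row = [1.0, 0.0, 0.0, 2.0, 0.0, 3.0, 4.0, 5.0]
enable_sparsity = compliance_2_4(row) >= 0.70  # the paper's ~70 % threshold
```

Under the paper's results, a row like this one would stay dense, since forcing the mask on a half-compliant layer risks the padding overhead noted in the table.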

Practical Implications

  • Kernel developers should target 64×64 or 128×128 FP8 tiles and keep active wavefronts below ~10 k to stay in the “sweet spot” of matrix‑core occupancy.
  • Runtime systems (e.g., ROCm, TensorRT, PyTorch XLA) can embed a simple heuristic: if total pending kernels > 4, delay new launches or split work to avoid ACE fairness collapse.
  • Compilers can automatically enable 2:4 structured sparsity for layers that naturally produce ≥ 70 % zero patterns (e.g., post‑pruning transformers) and insert padding only when the shape aligns with the hardware mask.
  • Scheduler designers for multi‑tenant GPU nodes can use the paper’s occupancy‑aware model to predict tail latency and allocate resources more deterministically, which is crucial for serving large‑scale inference workloads.
  • Mixed‑precision training pipelines can replace FP16 GEMMs with FP8 matrix‑core calls, gaining up to 2× throughput without sacrificing model accuracy (the authors verified this on a BERT‑base fine‑tune).
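The runtime-side heuristic above (cap pending kernels at 4, delay or stagger the rest) reduces to a small gating queue. This is a toy sketch under the paper's soft limit; the class and method names are assumptions for illustration, not a real ROCm API:

```python
from collections import deque

class AceAwareQueue:
    """Toy launch gate: cap concurrent kernels at the paper's ACE soft
    limit and queue the overflow. Illustrative only; not a ROCm interface."""

    def __init__(self, soft_limit=4):
        self.soft_limit = soft_limit
        self.running = set()
        self.pending = deque()

    def submit(self, kernel_id):
        # Launch immediately while under the soft limit; otherwise queue,
        # which staggers launches and avoids the > 4-stream fairness collapse.
        if len(self.running) < self.soft_limit:
            self.running.add(kernel_id)
            return "launched"
        self.pending.append(kernel_id)
        return "queued"

    def complete(self, kernel_id):
        # On completion, promote the oldest queued kernel, keeping the
        # number of concurrent streams pinned at the soft limit.
        self.running.discard(kernel_id)
        if self.pending:
            self.running.add(self.pending.popleft())
```

A multi-tenant scheduler would layer priorities and per-tenant quotas on top, but the core mechanism, never letting more than 4 streams contend for the ACE, is what the paper credits for the 30 % drop in latency variance.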

Overall, the findings give developers a concrete checklist for when and how to turn on each of MI300A’s advanced features, turning what would be a “black‑box” accelerator into a tunable performance knob.

Limitations & Future Work

  • The study focuses on micro‑benchmarks and three specific workloads; broader AI models (e.g., diffusion, graph neural nets) may exhibit different sparsity patterns or memory footprints.
  • Power and thermal constraints were not measured; sustained high occupancy could trigger throttling on longer runs.
  • The authors note that future ROCm releases may expose finer‑grained ACE controls, which could shift the optimal concurrency thresholds.
  • Extending the methodology to multi‑node MI300A clusters (Infinity Fabric interconnect) and evaluating communication‑compute overlap would be a natural next step.

Bottom line: This paper demystifies the MI300A’s newest hardware tricks and equips developers with actionable rules to squeeze out the best performance for next‑gen HPC and AI workloads.

Authors

  • Aaron Jarmusch
  • Connor Vitz
  • Sunita Chandrasekaran

Paper Information

  • arXiv ID: 2602.10262v1
  • Categories: cs.DC, cs.AR
  • Published: February 10, 2026