[Paper] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Published: March 5, 2026
Source: arXiv - 2603.05451v1

Overview

FlashAttention‑4 tackles the growing performance gap that appears when the newest Blackwell‑based GPUs (e.g., NVIDIA B200/GB200) replace the older Hopper GPUs for running large Transformer models. By co‑designing the attention algorithm and its low‑level kernel pipeline, the authors squeeze out up to 2.7× speed‑up on real‑world workloads while keeping the implementation fully programmable in Python.

Key Contributions

  • Asymmetric‑hardware aware pipeline redesign – new asynchronous MMA (matrix‑multiply‑accumulate) schedules and larger tile sizes that match the doubled tensor‑core throughput of Blackwell GPUs.
  • Software‑emulated exponential & softmax rescaling – replaces costly hardware exponentials with fast, numerically‑stable Python‑level tricks, cutting non‑matmul work.
  • Tensor‑memory & 2‑CTA MMA mode – leverages the new tensor‑memory feature and a two‑CTA (cooperative thread array) mode to slash shared‑memory traffic and eliminate atomic adds in the backward pass.
  • CuTe‑DSL implementation – the entire kernel stack is written in NVIDIA’s CuTe domain‑specific language, embedded in Python, delivering 20–30× faster compile times versus traditional C++ template kernels while preserving full expressivity.
  • Comprehensive performance evaluation – demonstrates up to 1.3× speed‑up over cuDNN 9.13 and 2.7× over Triton on B200 GPUs, hitting 1613 TFLOP/s (≈ 71 % of peak) in BF16.

Methodology

  1. Profiling the new hardware – The team measured how Blackwell GPUs scale each functional unit (tensor cores, shared memory, exponential units) and identified the new bottlenecks.
  2. Algorithm‑kernel co‑design
    • Pipeline redesign: They split the attention computation into three stages (load, compute, store) and made each stage fully asynchronous, allowing the tensor cores to stay busy while data moves in the background.
    • Larger tiles: By increasing the tile size to match the higher tensor‑core bandwidth, each MMA operation processes more data per launch, reducing kernel launch overhead.
  3. Softmax rescaling tricks – Instead of invoking the GPU’s exponential unit for every element, they compute a scaling factor once per tile and apply it with cheap element‑wise multiplications, dramatically lowering the number of exponentials.
  4. Tensor‑memory & 2‑CTA mode – The new tensor‑memory allows direct reads/writes of intermediate results without staging in shared memory. The 2‑CTA mode lets two cooperative thread blocks share a single MMA engine, halving the number of atomic adds needed for gradient accumulation.
  5. Implementation in CuTe‑DSL – The authors expressed the whole pipeline in CuTe, a Python‑embedded DSL that generates highly‑tuned CUDA kernels on the fly, cutting compile time from minutes to seconds.
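The tile-wise rescaling idea in steps 2–3 can be sketched in plain NumPy. This is a conceptual model of the online-softmax algorithm that FlashAttention kernels build on, not the CuTe implementation; the tile size and function names are illustrative:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """Attention computed one key/value tile at a time, with one
    running max/sum rescale per tile (online softmax)."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale              # score tile
        new_max = np.maximum(row_max, s.max(axis=1))
        # one correction factor per row per tile, applied with
        # cheap element-wise multiplies rather than fresh exponentials
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max
    return out / row_sum[:, None]
```

Because every partial result is rescaled by the same running factor, the tiled loop reproduces the full softmax-attention output exactly while only ever materializing one tile of scores at a time.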

Results & Findings

| Kernel | Speed-up | TFLOP/s | Utilization |
|---|---|---|---|
| cuDNN 9.13 (B200) | 1.0× | 600 | ~26 % |
| Triton (B200) | 1.0× | 590 | ~25 % |
| FlashAttention-4 (BF16) | 1.3× (vs cuDNN) | 1613 | 71 % |
| FlashAttention-4 (BF16) | 2.7× (vs Triton) | 1613 | 71 % |
  • Forward pass: The asynchronous pipeline keeps tensor cores saturated, eliminating idle cycles caused by memory stalls.
  • Backward pass: The 2‑CTA MMA mode reduces atomic‑add contention, giving a ~30 % boost over FlashAttention‑3’s backward implementation.
  • Compile time: CuTe‑DSL generated kernels in ≈2 seconds, whereas a comparable C++ template implementation required ≈45 seconds on the same machine.

Overall, FlashAttention‑4 reaches 71 % of the theoretical BF16 peak on Blackwell GPUs—a substantial leap compared to prior attention kernels that hovered around 30–40 %.
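The utilization figure is easy to sanity-check: 1613 TFLOP/s at 71 % of peak implies a dense BF16 peak of roughly 1613 / 0.71 ≈ 2.27 PFLOP/s per GPU:

```python
achieved_tflops = 1613      # measured BF16 throughput on B200
utilization = 0.71          # reported fraction of peak
implied_peak = achieved_tflops / utilization
print(round(implied_peak))  # 2272 TFLOP/s dense BF16
```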

Practical Implications

  • Faster inference & training for LLMs – Developers can shave milliseconds off each token generation step, translating into lower latency for chatbots, code assistants, and retrieval‑augmented generation services.
  • Cost savings on cloud GPU instances – Higher throughput means fewer GPU‑hours for the same amount of work, directly reducing operational expenses for AI SaaS providers.
  • Ease of integration – Because the kernel is exposed through a Python API (via CuTe), existing PyTorch or JAX pipelines can swap in the new attention implementation without rewriting low‑level CUDA code.
  • Future‑proofing – The asymmetric‑hardware aware design anticipates further scaling trends (e.g., even faster tensor cores) and can be adapted to upcoming GPU generations with minimal changes.
  • Open‑source potential – If released, the CuTe‑DSL source could become a reference for other performance‑critical kernels (e.g., vision transformers, graph attention networks) that need to adapt to hardware asymmetry.

Limitations & Future Work

  • Hardware specificity – The optimizations exploit Blackwell‑specific features (tensor memory, 2‑CTA MMA). Porting to other architectures (e.g., AMD GPUs) would require a fresh co‑design effort.
  • Numerical stability trade‑offs – The software‑emulated exponential/rescaling introduces extra rounding steps; while the authors report negligible impact on model quality, edge‑case verification is still needed for extreme‑scale training.
  • Backward‑pass focus – The paper concentrates on the backward pass for gradient accumulation; extending the same pipeline ideas to other memory‑intensive ops (e.g., layer‑norm) remains an open avenue.
  • Scalability to multi‑GPU – Experiments are limited to single‑GPU performance. Integrating FlashAttention‑4 into distributed training frameworks (e.g., ZeRO, DeepSpeed) and measuring end‑to‑end scaling is future work.
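The stability concern above centers on the rescaled exponentials. A minimal sketch of the standard max-subtraction rescaling that the tile-wise softmax trick builds on (the paper's software-emulated exponential itself is not shown here):

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)            # overflows to inf for large logits
    return e / e.sum()

def rescaled_softmax(x):
    # subtract the max first: every exp argument is <= 0, so nothing
    # overflows, and the shared factor exp(-max) cancels in the ratio
    e = np.exp(x - x.max())
    return e / e.sum()

small = np.array([1.0, 2.0, 3.0])
assert np.allclose(naive_softmax(small), rescaled_softmax(small))

big = np.array([1000.0, 1001.0, 1002.0])
print(rescaled_softmax(big))  # finite and sums to 1
# naive_softmax(big) would yield inf/inf = nan
```

The rescaled form is mathematically identical but introduces one extra subtraction and multiplication per element, which is the kind of additional rounding the authors argue has negligible effect in practice.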

FlashAttention‑4 shows that a tight algorithm‑kernel co‑design, tuned for the quirks of the latest GPU generation, can deliver dramatic speedups without sacrificing developer ergonomics. For anyone building or deploying large Transformer models on modern NVIDIA hardware, the paper offers a concrete, ready‑to‑use path to higher performance and lower cost.

Authors

  • Ted Zadouri
  • Markus Hoehnerbach
  • Jay Shah
  • Timmy Liu
  • Vijay Thakkar
  • Tri Dao

Paper Information

  • arXiv ID: 2603.05451v1
  • Categories: cs.CL
  • Published: March 5, 2026