[Paper] DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Source: arXiv - 2601.21824v1
Overview
Training large language models (LLMs) at scale demands reproducible results, but deterministic attention kernels (especially in the backward pass) can cut throughput by up to 38 % relative to their faster, non‑deterministic counterparts. This paper introduces DASH (Deterministic Attention Scheduling for High‑throughput), a set of scheduling strategies that reorganize the compute and gradient‑reduction steps of the deterministic backward pass, delivering up to a 1.28× speedup over the deterministic baseline while preserving exact numerical reproducibility.
Key Contributions
- Formal DAG model: Cast the deterministic attention backward pass as a directed‑acyclic‑graph scheduling problem, enabling systematic analysis of pipeline stalls.
- Two scheduling strategies:
  - Descending Q‑Tile Iteration – a reversed query‑block traversal that reduces idle time in causal attention.
  - Shift Scheduling – a provably optimal schedule (within the DAG model) that minimizes stalls for both full and causal masks.
- Empirical validation: Demonstrated up to 1.28× speed‑up on NVIDIA H800 GPUs across a range of LLM sizes, narrowing the deterministic‑vs‑non‑deterministic gap.
- Open‑source implementation: Released the codebase (https://github.com/SJTU-Liquid/deterministic-FA3) for easy integration with existing FlashAttention‑3 pipelines.
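As a rough illustration of the DAG formulation in the first contribution, the critical path of a dependency graph can be computed with a memoized longest‑path traversal. The node names, costs, and edges below are a hypothetical two‑tile sketch, not taken from the paper:

```python
def critical_path(succ, cost):
    """Length of the longest dependency chain in a DAG.

    succ: dict mapping each node to its list of successors
    cost: dict mapping each node to its execution time
    """
    memo = {}

    def longest_from(node):
        # Longest-path length starting at `node` (memoized DFS).
        if node not in memo:
            memo[node] = cost[node] + max(
                (longest_from(s) for s in succ.get(node, [])), default=0
            )
        return memo[node]

    return max(longest_from(n) for n in cost)

# Toy causal backward pass with two query tiles: c(q,k) computes the
# contribution of tile (q,k); r(q,k) is its serialized reduction step.
cost = {"c00": 1, "c10": 1, "c11": 1, "r00": 1, "r10": 1, "r11": 1}
succ = {
    "c00": ["r00"],   # each compute feeds its own reduction
    "c10": ["r10"],
    "c11": ["r11"],
    "r10": ["r11"],   # reductions into the same accumulator are serialized
}
print(critical_path(succ, cost))  # chain c10 -> r10 -> r11, length 3
```

In this abstraction, pipeline stalls show up as slack between the critical path and the work available per streaming multiprocessor, which is what the paper's schedules aim to close.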
Methodology
- Backward pass decomposition: The authors break down deterministic attention into three phases—query/key/value (QKV) matmul, attention‑score computation, and gradient reduction—and map the data dependencies onto a DAG.
- Critical‑path analysis: By measuring the longest dependency chain, they identify where the pipeline stalls (mostly during serialized gradient reductions).
- Schedule design:
  - Descending Q‑Tile Iteration processes query tiles from the last to the first, allowing earlier tiles to start reduction while later tiles are still computing, thus overlapping work.
  - Shift Scheduling introduces a systematic offset (a “shift”) between compute and reduction steps, aligning them so that each GPU SM (streaming multiprocessor) stays busy throughout the backward pass.
- Implementation: Both strategies are integrated into the existing FlashAttention‑3 kernel stack with minimal code changes, preserving the same memory layout and numerical guarantees.
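To see why traversal order matters at all, the following toy simulation (a simplified cost model of our own construction, not the paper's) assigns one worker per KV tile, forces the per‑query‑tile dQ adds into a fixed k order for determinism, and counts how long workers stall waiting on their predecessor's add:

```python
def simulate(n_tiles, descending):
    """Toy pipeline model of a deterministic causal attention backward pass.

    Worker k handles KV tile k and computes contributions (q, k) for all
    q >= k, one time unit each. The add into dQ(q) must follow worker
    k-1's add for the same q (fixed order gives bitwise determinism).
    Returns (makespan, total stall time summed over workers).
    """
    add_time = {}      # (q, k) -> time the serialized dQ add completes
    total_stall = 0.0
    for k in range(n_tiles):
        qs = range(n_tiles - 1, k - 1, -1) if descending else range(k, n_tiles)
        t = 0.0
        for q in qs:
            t += 1.0                     # compute contribution (q, k)
            if k > 0:                    # wait for predecessor's add to dQ(q)
                wait = max(0.0, add_time[(q, k - 1)] - t)
                total_stall += wait
                t += wait
            add_time[(q, k)] = t         # the add itself is treated as free
    return max(add_time.values()), total_stall

print(simulate(4, descending=False))  # ascending q: (4.0, 6.0) - workers stall
print(simulate(4, descending=True))   # descending q: (4.0, 0.0) - no stalls
```

In this model, reversing the query traversal lines each worker up behind its predecessor so the serialized adds are always ready, which mirrors the intuition behind Descending Q‑Tile Iteration; the paper's Shift Scheduling generalizes this alignment with an offset that is provably optimal within the DAG model.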
Results & Findings
| Configuration | Baseline (deterministic FA3) | DASH (best schedule) | Speed‑up |
|---|---|---|---|
| Full‑mask, 70B model, H800 | 1.00× (reference) | 1.22× | +22 % |
| Causal‑mask, 13B model, H800 | 1.00× | 1.28× | +28 % |
| Mixed‑precision, 30B model | 1.00× | 1.15× | +15 % |
- Throughput gap between deterministic and non‑deterministic attention shrank from ~38 % to under 20 % in most tested scenarios.
- Memory overhead remained unchanged; the schedules only reshuffle existing operations.
- Numerical reproducibility was fully retained—bit‑wise identical gradients compared to the original deterministic implementation.
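Bitwise reproducibility of the kind reported above can be verified by fingerprinting raw gradient bytes. The helper below is a generic sketch (not the paper's tooling), and the last line shows why reduction order matters in floating point in the first place:

```python
import hashlib

import numpy as np

def grad_fingerprint(grads):
    """SHA-256 over raw bytes: equal digests mean bitwise-identical gradients."""
    h = hashlib.sha256()
    for g in grads:
        h.update(np.ascontiguousarray(g).tobytes())
    return h.hexdigest()

# Two runs of a fixed-order reduction produce identical fingerprints...
run_a = [np.cumsum(np.linspace(0.1, 1.0, 10))]
run_b = [np.cumsum(np.linspace(0.1, 1.0, 10))]
assert grad_fingerprint(run_a) == grad_fingerprint(run_b)

# ...while merely reassociating a float sum can change the bits,
# which is exactly what nondeterministic atomic accumulation does:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```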
Practical Implications
- Faster reproducible training pipelines: Teams that need exact reproducibility (e.g., for regulatory compliance, scientific benchmarking, or debugging) can now adopt deterministic attention without paying the full performance penalty.
- Lower hardware costs: Recovering up to ~28 % of throughput translates directly into fewer GPU‑hours for large‑scale LLM pre‑training, cutting cloud spend.
- Drop‑in integration: Since DASH builds on top of FlashAttention‑3, developers can swap in the new kernels with a single library update, preserving existing model code and optimizer logic.
- Enables more aggressive checkpointing: Faster backward passes free up time for additional reproducible checkpoints or gradient‑accumulation steps, improving training stability for massive models.
Limitations & Future Work
- GPU‑specific tuning: The current evaluation targets the NVIDIA H800 (a Hopper‑class GPU); performance gains on other hardware (e.g., AMD Instinct accelerators or newer NVIDIA generations) remain to be quantified.
- Mask‑type coverage: While full and causal masks are addressed, exotic attention masks (e.g., block‑sparse or rotary‑position‑based) may require custom schedule extensions.
- Theoretical optimality bound: Shift Scheduling is optimal within the DAG abstraction; real‑world factors like memory bandwidth contention could still leave room for further improvements.
- Future directions: Extending the DAG model to multi‑node distributed training, exploring adaptive schedule selection based on runtime profiling, and integrating with other deterministic kernels (e.g., optimizer updates).
Authors
- Xinwei Qiang
- Hongmin Chen
- Shixuan Sun
- Jingwen Leng
- Xin Liu
- Minyi Guo
Paper Information
- arXiv ID: 2601.21824v1
- Categories: cs.LG, cs.DC
- Published: January 29, 2026