[Paper] DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Source: arXiv - 2601.21824v1
Overview
Training large language models (LLMs) at scale demands reproducible results, but deterministic attention kernels (especially in the backward pass) can cut throughput by up to 38 % relative to their faster, non‑deterministic counterparts. This paper introduces DASH (Deterministic Attention Scheduling for High‑throughput), a set of scheduling strategies that reorganize the compute and gradient‑reduction steps of the deterministic backward pass, delivering up to a 1.28× speedup over the deterministic baseline while preserving exact numerical reproducibility.
Key Contributions
- Formal DAG model: Cast the deterministic attention backward pass as a directed‑acyclic‑graph scheduling problem, enabling systematic analysis of pipeline stalls.
- Two scheduling strategies:
  - Descending Q‑Tile Iteration – a reversed query‑block traversal that reduces idle time in causal attention.
  - Shift Scheduling – a provably optimal schedule (within the DAG model) that minimizes stalls for both full and causal masks.
- Empirical validation: Demonstrated up to 1.28× speed‑up on NVIDIA H800 GPUs across a range of LLM sizes, narrowing the deterministic‑vs‑non‑deterministic gap.
- Open‑source implementation: Released the codebase (https://github.com/SJTU-Liquid/deterministic-FA3) for easy integration with existing FlashAttention‑3 pipelines.
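As a rough illustration of the DAG formulation in the first contribution, the critical path of a dependency graph can be computed with a memoized longest‑path traversal. The node names, costs, and edges below are a hypothetical two‑tile sketch, not taken from the paper:

```python
def critical_path(succ, cost):
    """Length of the longest dependency chain in a DAG.

    succ: dict mapping each node to its list of successors
    cost: dict mapping each node to its execution time
    """
    memo = {}

    def longest_from(node):
        # Longest-path length starting at `node` (memoized DFS).
        if node not in memo:
            memo[node] = cost[node] + max(
                (longest_from(s) for s in succ.get(node, [])), default=0
            )
        return memo[node]

    return max(longest_from(n) for n in cost)

# Toy causal backward pass with two query tiles: c(q,k) computes the
# contribution of tile (q,k); r(q,k) is its serialized reduction step.
cost = {"c00": 1, "c10": 1, "c11": 1, "r00": 1, "r10": 1, "r11": 1}
succ = {
    "c00": ["r00"],   # each compute feeds its own reduction
    "c10": ["r10"],
    "c11": ["r11"],
    "r10": ["r11"],   # reductions into the same accumulator are serialized
}
print(critical_path(succ, cost))  # chain c10 -> r10 -> r11, length 3
```

In this abstraction, pipeline stalls show up as slack between the critical path and the work available per streaming multiprocessor, which is what the paper's schedules aim to close.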
Methodology
- Backward pass decomposition: The authors break down deterministic attention into three phases—query/key/value (QKV) matmul, attention‑score computation, and gradient reduction—and map the data dependencies onto a DAG.
- Critical‑path analysis: By measuring the longest dependency chain, they identify where the pipeline stalls (mostly during serialized gradient reductions).
- Schedule design:
  - Descending Q‑Tile Iteration processes query tiles from the last to the first, allowing earlier tiles to start reduction while later tiles are still computing, thus overlapping work.
  - Shift Scheduling introduces a systematic offset (a “shift”) between compute and reduction steps, aligning them so that each GPU SM (streaming multiprocessor) stays busy throughout the backward pass.
- Implementation: Both strategies are integrated into the existing FlashAttention‑3 kernel stack with minimal code changes, preserving the same memory layout and numerical guarantees.
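To see why traversal order matters at all, the following toy simulation (a simplified cost model of our own construction, not the paper's) assigns one worker per KV tile, forces the per‑query‑tile dQ adds into a fixed k order for determinism, and counts how long workers stall waiting on their predecessor's add:

```python
def simulate(n_tiles, descending):
    """Toy pipeline model of a deterministic causal attention backward pass.

    Worker k handles KV tile k and computes contributions (q, k) for all
    q >= k, one time unit each. The add into dQ(q) must follow worker
    k-1's add for the same q (fixed order gives bitwise determinism).
    Returns (makespan, total stall time summed over workers).
    """
    add_time = {}      # (q, k) -> time the serialized dQ add completes
    total_stall = 0.0
    for k in range(n_tiles):
        qs = range(n_tiles - 1, k - 1, -1) if descending else range(k, n_tiles)
        t = 0.0
        for q in qs:
            t += 1.0                     # compute contribution (q, k)
            if k > 0:                    # wait for predecessor's add to dQ(q)
                wait = max(0.0, add_time[(q, k - 1)] - t)
                total_stall += wait
                t += wait
            add_time[(q, k)] = t         # the add itself is treated as free
    return max(add_time.values()), total_stall

print(simulate(4, descending=False))  # ascending q: (4.0, 6.0) - workers stall
print(simulate(4, descending=True))   # descending q: (4.0, 0.0) - no stalls
```

In this model, reversing the query traversal lines each worker up behind its predecessor so the serialized adds are always ready, which mirrors the intuition behind Descending Q‑Tile Iteration; the paper's Shift Scheduling generalizes this alignment with an offset that is provably optimal within the DAG model.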
Results & Findings
| Configuration | Baseline (deterministic FA3) | DASH (best schedule) | Speed‑up |
|---|---|---|---|
| Full‑mask, 70B model, H800 | 1.00× (reference) | 1.22× | +22 % |
| Causal‑mask, 13B model, H800 | 1.00× | 1.28× | +28 % |
| Mixed‑precision, 30B model | 1.00× | 1.15× | +15 % |
- Throughput gap between deterministic and non‑deterministic attention shrank from ~38 % to under 20 % in most tested scenarios.
- Memory overhead remained unchanged; the schedules only reshuffle existing operations.
- Numerical reproducibility was fully retained—bit‑wise identical gradients compared to the original deterministic implementation.
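Bitwise reproducibility of the kind reported above can be verified by fingerprinting raw gradient bytes. The helper below is a generic sketch (not the paper's tooling), and the last line shows why reduction order matters in floating point in the first place:

```python
import hashlib

import numpy as np

def grad_fingerprint(grads):
    """SHA-256 over raw bytes: equal digests mean bitwise-identical gradients."""
    h = hashlib.sha256()
    for g in grads:
        h.update(np.ascontiguousarray(g).tobytes())
    return h.hexdigest()

# Two runs of a fixed-order reduction produce identical fingerprints...
run_a = [np.cumsum(np.linspace(0.1, 1.0, 10))]
run_b = [np.cumsum(np.linspace(0.1, 1.0, 10))]
assert grad_fingerprint(run_a) == grad_fingerprint(run_b)

# ...while merely reassociating a float sum can change the bits,
# which is exactly what nondeterministic atomic accumulation does:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```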
Practical Implications
- Faster reproducible training pipelines: Teams that need exact reproducibility (e.g., for regulatory compliance, scientific benchmarking, or debugging) can now adopt deterministic attention without paying the full performance penalty.
- Lower hardware costs: Recovering up to ~28 % of throughput translates directly into fewer GPU‑hours for large‑scale LLM pre‑training, cutting cloud spend.
- Drop‑in integration: Since DASH builds on top of FlashAttention‑3, developers can swap in the new kernels with a single library update, preserving existing model code and optimizer logic.
- Enables more aggressive checkpointing: Faster backward passes free up time for additional reproducible checkpoints or gradient‑accumulation steps, improving training stability for massive models.
Limitations & Future Work
- GPU‑specific tuning: The current evaluation targets the NVIDIA H800 (a Hopper‑class GPU); performance gains on other hardware (e.g., AMD Instinct accelerators or newer NVIDIA generations) remain to be quantified.
- Mask‑type coverage: While full and causal masks are addressed, exotic attention masks (e.g., block‑sparse or rotary‑position‑based) may require custom schedule extensions.
- Theoretical optimality bound: Shift Scheduling is optimal within the DAG abstraction; real‑world factors like memory bandwidth contention could still leave room for further improvements.
- Future directions: Extending the DAG model to multi‑node distributed training, exploring adaptive schedule selection based on runtime profiling, and integrating with other deterministic kernels (e.g., optimizer updates).
Authors
- Xinwei Qiang
- Hongmin Chen
- Shixuan Sun
- Jingwen Leng
- Xin Liu
- Minyi Guo
Paper Information
- arXiv ID: 2601.21824v1
- Categories: cs.LG, cs.DC
- Published: January 29, 2026