[Paper] DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Published: January 29, 2026 at 10:10 AM EST
Source: arXiv - 2601.21824v1

Overview

Training large language models (LLMs) at scale demands reproducible results, but deterministic attention kernels, especially in the backward pass, can cut throughput by up to 38% compared with their faster, non-deterministic counterparts. This paper introduces DASH (Deterministic Attention Scheduling for High-throughput), a set of scheduling strategies that reorganize the compute and gradient-reduction steps of the deterministic backward pass, recovering much of the lost performance (up to a 1.28× speedup over the deterministic baseline) while preserving exact numerical reproducibility.
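
The root of the determinism/throughput trade-off is that floating-point addition is not associative, so the order in which partial gradients are accumulated changes the low-order bits of the result. Non-deterministic kernels let atomic operations accumulate in whatever order the hardware happens to schedule them; deterministic kernels must fix that order, which serializes the reduction. A minimal Python illustration of the underlying numerical issue (a toy, not the paper's kernel):

```python
def reduce_in_order(xs):
    # Fixed left-to-right accumulation: the order every run will reproduce.
    total = 0.0
    for x in xs:
        total += x
    return total

vals = [0.1, 0.2, 0.3]
left_to_right = reduce_in_order(vals)        # (0.1 + 0.2) + 0.3
right_to_left = reduce_in_order(vals[::-1])  # (0.3 + 0.2) + 0.1
print(left_to_right == right_to_left)        # False: order changes the rounding
```

Because only a fixed accumulation order yields bit-identical results across runs, a deterministic kernel cannot simply let every SM add its partial result whenever it finishes, and that serialization is where the throughput penalty comes from.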

Key Contributions

  • Formal DAG model: Cast the deterministic attention backward pass as a directed‑acyclic‑graph scheduling problem, enabling systematic analysis of pipeline stalls.
  • Two scheduling strategies:
    1. Descending Q‑Tile Iteration – a reversed query‑block traversal that reduces idle time in causal attention.
    2. Shift Scheduling – a provably optimal schedule (within the DAG model) that minimizes stalls for both full‑mask and causal masks.
  • Empirical validation: Demonstrated up to 1.28× speed‑up on NVIDIA H800 GPUs across a range of LLM sizes, narrowing the deterministic‑vs‑non‑deterministic gap.
  • Open‑source implementation: Released the codebase (https://github.com/SJTU-Liquid/deterministic-FA3) for easy integration with existing FlashAttention‑3 pipelines.
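
The DAG formulation can be pictured with a small toy model. Below is a hypothetical dependency graph for one tile's backward step (the node names and unit costs are our illustrative assumptions, not the paper's exact graph); the critical path, i.e. the longest dependency chain, lower-bounds the latency any schedule can achieve:

```python
from functools import lru_cache

# Hypothetical dependencies for one tile: score computation feeds the dV/dK/dQ
# matmuls, each followed by a reduction; one reduction is serialized after
# another to model a deterministic accumulation order.
edges = {
    "scores":    [],
    "dV":        ["scores"],
    "dK":        ["scores"],
    "dQ":        ["scores"],
    "reduce_dV": ["dV"],
    "reduce_dK": ["dK"],
    "reduce_dQ": ["dQ", "reduce_dK"],  # serialized after dK's reduction
}
cost = {node: 1 for node in edges}  # unit cost per step (an assumption)

@lru_cache(maxsize=None)
def longest_path_to(node):
    # Length of the longest dependency chain ending at `node`.
    deps = edges[node]
    return cost[node] + (max(map(longest_path_to, deps)) if deps else 0)

critical = max(longest_path_to(n) for n in edges)
print(critical)  # 4: scores -> dK -> reduce_dK -> reduce_dQ
```

Any schedule that keeps nodes off the critical path busy while this chain executes achieves the DAG's lower bound; stalls appear exactly where a resource sits idle waiting on the chain.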

Methodology

  1. Backward pass decomposition: The authors break down deterministic attention into three phases—query/key/value (QKV) matmul, attention‑score computation, and gradient reduction—and map the data dependencies onto a DAG.
  2. Critical‑path analysis: By measuring the longest dependency chain, they identify where the pipeline stalls (mostly during serialized gradient reductions).
  3. Schedule design:
    • Descending Q‑Tile Iteration processes query tiles from the last to the first, allowing earlier tiles to start reduction while later tiles are still computing, thus overlapping work.
    • Shift Scheduling introduces a systematic offset (a “shift”) between compute and reduction steps, aligning them so that each GPU SM (streaming multiprocessor) stays busy throughout the backward pass.
  4. Implementation: Both strategies are integrated into the existing FlashAttention‑3 kernel stack with minimal code changes, preserving the same memory layout and numerical guarantees.
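
The benefit of overlapping compute and reduction can be sketched with a generic two-stage pipeline model (a deliberate simplification, not the paper's actual Shift Schedule): each tile needs a compute step on the SMs followed by a reduction step on a serialized reduction path, and "shifting" tile t's reduction under tile t+1's compute hides most of the reduction cost:

```python
# Toy two-stage pipeline model with assumed per-tile costs c (compute) and
# r (reduction); these numbers are illustrative, not measured.
def makespan_sequential(n, c=2, r=1):
    # No overlap: each tile's reduction waits for its compute, and the next
    # tile's compute waits for the previous reduction.
    return n * (c + r)

def makespan_shifted(n, c=2, r=1):
    # Shifted schedule: tile t's reduction overlaps tile t+1's compute.
    # Reduction t starts at max(compute_done_t, reduction_done_{t-1}).
    compute_done, reduce_done = 0, 0
    for _ in range(n):
        compute_done += c
        reduce_done = max(compute_done, reduce_done) + r
    return reduce_done

print(makespan_sequential(8))  # 24
print(makespan_shifted(8))     # 17
```

In this toy model the shifted schedule approaches n·c + r instead of n·(c + r); the hard part, and the paper's contribution, is constructing such an overlap while keeping the reduction order, and hence the numerics, fixed.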

Results & Findings

| Configuration | Baseline (deterministic FA3) | DASH (best schedule) | Speed-up |
| --- | --- | --- | --- |
| Full-mask, 70B model, H800 | 1.00× (reference) | 1.22× | +22% |
| Causal-mask, 13B model, H800 | 1.00× | 1.28× | +28% |
| Mixed-precision, 30B model | 1.00× | 1.15× | +15% |
  • Throughput gap between deterministic and non‑deterministic attention shrank from ~38 % to under 20 % in most tested scenarios.
  • Memory overhead remained unchanged; the schedules only reshuffle existing operations.
  • Numerical reproducibility was fully retained—bit‑wise identical gradients compared to the original deterministic implementation.
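
As a back-of-envelope sanity check (our arithmetic, not numbers taken from the paper's tables): if the deterministic baseline runs at roughly 62% of non-deterministic throughput (a 38% penalty), a 1.28× speedup lifts it to about 0.79×, leaving a gap in the same ballpark as the figure reported above:

```python
# Back-of-envelope check with assumed round numbers, not the paper's data.
nondeterministic = 1.00
deterministic = nondeterministic * (1 - 0.38)   # 0.62x throughput
with_dash = deterministic * 1.28                # best reported speedup
remaining_gap = 1 - with_dash / nondeterministic

print(f"throughput with DASH: {with_dash:.2f}x")   # ~0.79x
print(f"remaining gap: {remaining_gap:.1%}")       # ~21%
```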

Practical Implications

  • Faster reproducible training pipelines: Teams that need exact reproducibility (e.g., for regulatory compliance, scientific benchmarking, or debugging) can now adopt deterministic attention without paying the full performance penalty.
  • Lower hardware costs: Recovering up to ~28 % of lost throughput translates directly into fewer GPU‑hours for large‑scale LLM pre‑training, cutting cloud spend.
  • Drop‑in integration: Since DASH builds on top of FlashAttention‑3, developers can swap in the new kernels with a single library update, preserving existing model code and optimizer logic.
  • Enables more aggressive checkpointing: Faster backward passes free up time for additional reproducible checkpoints or gradient‑accumulation steps, improving training stability for massive models.
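
To make the cost argument concrete, here is a hypothetical calculation (the 100,000 GPU-hour budget is an assumed figure, not from the paper): a throughput speedup of S shrinks the GPU-hours needed for a fixed amount of training by a factor of 1/S, i.e. a saving of 1 - 1/S:

```python
# Hypothetical cost sketch; the budget and speedup here are illustrative.
def gpu_hours_saved(baseline_gpu_hours, speedup):
    # Fixed workload: time scales as 1/speedup, so savings = 1 - 1/speedup.
    return baseline_gpu_hours * (1 - 1 / speedup)

baseline = 100_000  # assumed GPU-hours for a pre-training run
saved = gpu_hours_saved(baseline, 1.22)
print(f"saved: {saved:,.0f} GPU-hours ({saved / baseline:.1%})")  # ~18% saved
```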

Limitations & Future Work

  • GPU‑specific tuning: The current evaluation targets the NVIDIA H800 (a Hopper‑class GPU); performance gains on other architectures (e.g., AMD Instinct accelerators or other NVIDIA generations) remain to be quantified.
  • Mask‑type coverage: While full and causal masks are addressed, more exotic attention masks (e.g., block‑sparse or sliding‑window) may require custom schedule extensions.
  • Theoretical optimality bound: Shift Scheduling is optimal within the DAG abstraction; real‑world factors like memory bandwidth contention could still leave room for further improvements.
  • Future directions: Extending the DAG model to multi‑node distributed training, exploring adaptive schedule selection based on runtime profiling, and integrating with other deterministic kernels (e.g., optimizer updates).

Authors

  • Xinwei Qiang
  • Hongmin Chen
  • Shixuan Sun
  • Jingwen Leng
  • Xin Liu
  • Minyi Guo

Paper Information

  • arXiv ID: 2601.21824v1
  • Categories: cs.LG, cs.DC
  • Published: January 29, 2026