[Paper] DynamiQ: Accelerating Gradient Synchronization using Compressed Multi-hop All-reduce
Source: arXiv - 2602.08923v1
Overview
Training massive deep‑learning models today relies on multi‑hop all‑reduce to aggregate gradients across many GPUs or nodes. As models grow, the network traffic for this step becomes a critical bottleneck. DynamiQ proposes a new gradient‑compression scheme that is specially tuned for the multi‑hop reduction pattern, delivering up to 34 % speed‑ups while preserving almost the full precision of BF16 training.
Key Contributions
- Compression scheme for partial sums – a quantization format that accurately represents values that have already been summed a few times along the reduction tree.
- Fused decompress‑accumulate‑recompress kernel – a single CUDA kernel that decompresses incoming data, adds it to the local gradient buffer, and recompresses the result, minimizing memory traffic and kernel launch overhead.
- Integration with PyTorch DDP & NCCL P2P – a drop‑in replacement for the default all‑reduce path that works with existing PyTorch Distributed Data Parallel (DDP) codebases.
- Comprehensive evaluation – experiments on several large language models (LLMs), vision models, and NLP tasks show consistent gains over Omni‑Reduce, THC, and the MXFP4/6/8 standards.
- Near‑baseline accuracy – DynamiQ retains ≥ 99.9 % of the BF16 validation performance, a first among evaluated compression methods at comparable speedups.
Methodology
- Understanding multi‑hop aggregation – In a typical all‑reduce tree, each node receives partially summed gradients from its children, adds its own contribution, and forwards the new sum upward. Traditional quantizers treat every value as a fresh gradient, ignoring that many entries have already been summed, which leads to larger quantization error.
- Designing a “partial‑sum‑aware” quantizer – DynamiQ models the statistical distribution of these intermediate sums and allocates more bits to the most significant range while using stochastic rounding for the tail. This yields a compact representation (e.g., 4‑bit or 6‑bit) that still captures the accumulated magnitude.
- Fused kernel implementation – A custom CUDA kernel performs three steps in one pass:
  1. Decompress the incoming compressed buffer into FP16 registers.
  2. Accumulate it with the local gradient buffer (still in FP16/BF16).
  3. Recompress the updated buffer back to the DynamiQ format before sending it onward.
  By keeping everything on-chip, the kernel eliminates extra memory copies and reduces latency.
- System integration – The authors extended PyTorch’s DistributedDataParallel (DDP) backend to invoke the DynamiQ kernel over NCCL’s peer‑to‑peer (P2P) primitives, preserving the existing programming model for developers.
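The methodology above can be sketched in plain Python: a stochastic-rounding quantizer stands in for DynamiQ's partial-sum-aware format, and a chain reduction performs the three fused-kernel steps (decompress, accumulate, recompress) at each hop. All names here are illustrative assumptions, not the paper's implementation; the actual format additionally models the statistical distribution of intermediate sums and runs inside a CUDA kernel.

```python
import math
import random

def quantize(values, bits, rng):
    """Quantize floats to signed codes of the given bit-width using a shared
    scale and stochastic rounding (unbiased in expectation). A minimal
    stand-in for DynamiQ's partial-sum-aware format."""
    qmax = 2 ** (bits - 1) - 1                         # e.g. 7 for 4-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    codes = []
    for v in values:
        x = v / scale
        lo = math.floor(x)
        # Round up with probability equal to the fractional part.
        codes.append(lo + (1 if rng.random() < x - lo else 0))
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def reduce_chain(grads, bits=8, seed=0):
    """Multi-hop reduction over a chain of workers. Each hop performs the
    three steps the fused kernel merges: decompress the incoming partial
    sum, accumulate the local gradient, recompress before forwarding."""
    rng = random.Random(seed)
    codes, scale = quantize(grads[0], bits, rng)       # first worker compresses
    for local in grads[1:]:
        partial = dequantize(codes, scale)             # 1. decompress
        summed = [p + g for p, g in zip(partial, local)]  # 2. accumulate
        codes, scale = quantize(summed, bits, rng)     # 3. recompress
    return dequantize(codes, scale)                    # final sum at the root
```

Because stochastic rounding is unbiased, re-quantizing at every hop adds variance but no systematic drift; the point of a partial-sum-aware format is to keep that variance small for values that have already been summed.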
Results & Findings
| Benchmark | Baseline (BF16) | Best prior method | DynamiQ (speedup) | Accuracy relative to BF16 |
|---|---|---|---|---|
| LLaMA‑7B (text) | 1.00× | Omni‑Reduce (1.12×) | 1.34× | 99.92 % |
| ResNet‑152 (vision) | 1.00× | MXFP6 (1.08×) | 1.22× | 99.95 % |
| BERT‑large (NLP) | 1.00× | THC (1.10×) | 1.28× | 99.94 % |
- Across all tested models, DynamiQ consistently outperformed the strongest competitor by 5–15 % in throughput.
- The quantization bit-width ranged from 4 to 8 bits per value, yet the fused kernel kept the per‑iteration overhead under 2 µs on an 8‑GPU DGX‑A100 cluster.
- Accuracy degradation was negligible; in every case the final validation loss differed by less than 0.001 from the full‑precision run.
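As a sanity check on the bandwidth savings behind these throughput gains, the per-hop gradient payload can be estimated with back-of-the-envelope arithmetic. The figures below ignore per-block scale metadata and protocol overhead, which the real format carries, so they are an upper bound on the compression benefit:

```python
def payload_bytes(num_params: int, bits_per_value: int) -> int:
    """Gradient payload per hop, assuming a uniform bit-width and no
    metadata (an assumption; the actual format adds some overhead)."""
    return num_params * bits_per_value // 8

params = 7_000_000_000                 # roughly the LLaMA-7B benchmark
bf16 = payload_bytes(params, 16)       # baseline BF16: 14 GB per hop
q4 = payload_bytes(params, 4)          # 4-bit DynamiQ: 3.5 GB per hop
assert bf16 // q4 == 4                 # 4x less data on the wire
```

A 4× traffic reduction does not translate into a 4× end-to-end speedup because compute and compression overheads remain, which is consistent with the 1.22×–1.34× figures in the table.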
Practical Implications
- Faster training cycles – Teams can shave hours (or days) off large‑scale experiments without buying additional hardware.
- Cost savings on cloud GPU clusters – Reduced network traffic translates to lower inter‑node bandwidth charges and allows higher node density per rack.
- Ease of adoption – Because DynamiQ plugs into PyTorch DDP, existing codebases need only a few lines of configuration to enable the new all‑reduce path.
- Future‑proofing for emerging hardware – The approach works with NCCL P2P and can be ported to other collective libraries (e.g., RCCL, MPI), making it suitable for upcoming GPU interconnects (NVLink‑3, AMD Infinity Fabric).
- Potential for mixed‑precision pipelines – DynamiQ’s quantizer can be combined with optimizer‑state compression techniques, further reducing overall memory and communication footprints.
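As an illustration of the "few lines of configuration" claim, a DynamiQ-style path could plausibly be attached through PyTorch's DDP communication-hook API. Note the hedge: `register_comm_hook` and `GradBucket` are real PyTorch APIs, but the `dynamiq` module and all of its functions below are hypothetical names invented for this wiring sketch, not a published package, so this fragment is not runnable as-is.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def dynamiq_hook(state, bucket: dist.GradBucket) -> torch.futures.Future:
    # Replace the default all-reduce on each gradient bucket with a
    # compressed multi-hop reduction. `dynamiq.*` calls are hypothetical.
    compressed = dynamiq.compress(bucket.buffer())      # assumed API
    fut = dynamiq.multihop_allreduce(compressed)        # assumed API
    # The hook must return a future that resolves to the reduced tensor.
    return fut.then(lambda f: dynamiq.decompress(f.value()))

model = DDP(my_model)  # my_model: any existing nn.Module
model.register_comm_hook(state=None, hook=dynamiq_hook)
```

This is the same extension point PyTorch uses for its built-in gradient-compression hooks (e.g., PowerSGD), which is why adoption in existing DDP codebases can stay this small.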
Limitations & Future Work
- Hardware dependency – The current implementation relies on CUDA‑specific kernels; a pure‑CPU or ROCm version is not yet available.
- Fixed compression levels – While DynamiQ supports several bit‑widths, dynamic adaptation (e.g., per‑layer or per‑iteration bitrate) was not explored.
- Scalability beyond 8‑node clusters – Experiments stopped at 8‑node setups; the authors note that ultra‑large clusters may expose new synchronization patterns that require additional tuning.
- Future directions include extending the quantizer to optimizer states, integrating with emerging collective APIs (e.g., NCCL‑2.20’s hierarchical all‑reduce), and exploring learned quantization parameters that adapt during training.
Authors
- Wenchen Han
- Shay Vargaftik
- Michael Mitzenmacher
- Ran Ben Basat
Paper Information
- arXiv ID: 2602.08923v1
- Categories: cs.LG, cs.DC, cs.NI
- Published: February 9, 2026