[Paper] Trivance: Latency-Optimal AllReduce by Shortcutting Multiport Networks

Published: February 19, 2026
4 min read
Source: arXiv (2602.17254v1)

Overview

AllReduce is the workhorse collective primitive that underpins distributed training of massive machine‑learning models. The new paper “Trivance: Latency‑Optimal AllReduce by Shortcutting Multiport Networks” introduces an algorithm that keeps the theoretical minimum number of communication steps, ⌈log₃ n⌉, while dramatically cutting the amount of traffic that traverses the network. In practice this means faster gradient aggregation on the high‑performance torus‑style interconnects (e.g., Google’s TPUv4) that power today’s largest AI workloads.
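To see why a ternary step bound matters, a short helper (a sketch, not code from the paper) compares the ⌈log₃ n⌉ round count against the ⌈log₂ n⌉ rounds of a classic single‑port recursive‑doubling scheme:

```python
def min_rounds(n, radix):
    """Minimum number of communication steps when each step lets
    every node combine data with (radix - 1) partners, so the set
    of partials each node holds grows by a factor of `radix`."""
    rounds, reach = 0, 1
    while reach < n:
        reach *= radix
        rounds += 1
    return rounds

# On 4096 nodes: a binary scheme needs 12 rounds,
# a ternary (multiport) scheme needs only 8.
print(min_rounds(4096, 2), min_rounds(4096, 3))  # -> 12 8
```

An integer loop is used instead of `math.log` to avoid floating‑point rounding at exact powers of the radix.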

Key Contributions

  • Latency‑optimal algorithm with reduced congestion: Trivance matches Bruck’s log₃ n step bound but lowers per‑step traffic by a factor of three.
  • Dual‑port exploitation: The method uses both transmission ports of a bidirectional ring simultaneously, “shortcutting” the distance data travels in each round.
  • Joint reductions: By merging reduction operations across the two directions, Trivance further trims network load without adding extra steps.
  • Extension to multidimensional torus topologies: The design naturally scales to 2‑D and 3‑D torus networks, preserving latency benefits while staying bandwidth‑optimal for large messages.
  • Empirical validation: Experiments on synthetic and real‑world torus fabrics show 5‑30 % speed‑ups over the best existing latency‑optimal schemes for messages up to 8 MiB, and competitive performance up to 32 MiB (2‑D) and 128 MiB (3‑D).

Methodology

  1. Modeling the network: The authors treat a bidirectional ring (the basic building block of a torus) as having two independent send ports—one clockwise, one counter‑clockwise.
  2. Step‑wise shortcutting: In each communication round, every node simultaneously forwards three “chunks” of data: one to each neighbor and one that skips a neighbor, effectively tripling the distance covered per step.
  3. Joint reduction operation: Instead of reducing a single chunk per direction, nodes combine the two incoming partial results into a single reduction, halving the number of reduction operations that need to be performed later.
  4. Recursive composition for torus: A multidimensional torus is decomposed into independent rings along each dimension. Trivance runs on each ring, and the results are merged across dimensions using a carefully ordered schedule that avoids extra hops.
  5. Evaluation framework: The authors implement Trivance in a message‑passing simulator calibrated with real TPUv4 bandwidth/latency numbers, and compare against Bruck’s algorithm, the Swing family, and bandwidth‑optimal ring‑based AllReduce.
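To make the round structure concrete, here is a toy single‑process simulation of a recursive‑tripling AllReduce, an illustrative stand‑in for steps 1–3 rather than the paper's exact port schedule: in round k every node combines partial sums with the two nodes whose base‑3 index differs in digit k, so all nodes hold the global sum after ⌈log₃ n⌉ rounds.

```python
def ternary_allreduce(values):
    """Simulate a recursive-tripling AllReduce (sum) over n nodes.

    Assumes n is a power of 3. In each round, nodes are grouped in
    triples by one base-3 digit of their index; every node adds in
    the partials of its two digit-siblings. Returns the per-node
    results and the number of rounds taken.
    """
    n = len(values)
    partial = list(values)      # each node's running partial sum
    rounds, step = 0, 1
    while step < n:
        nxt = list(partial)
        for i in range(n):
            # first member of node i's triple at this base-3 digit
            base = i - ((i // step) % 3) * step
            for d in range(3):
                j = base + d * step
                if j != i:
                    nxt[i] += partial[j]   # fold in sibling's partial
        partial = nxt
        step *= 3
        rounds += 1
    return partial, rounds

result, rounds = ternary_allreduce(list(range(9)))  # 9 = 3^2 nodes
# every node ends with sum(0..8) = 36, after 2 rounds
```

Trivance's contribution is in how these logical exchanges are mapped onto the two physical ports of each ring so that the per‑step traffic stays low; this sketch only shows the data‑flow pattern and step count.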

Results & Findings

| Message size | 2‑D Torus (⌈log₃ n⌉ steps) | 3‑D Torus (⌈log₃ n⌉ steps) | Speed‑up vs. Bruck | Speed‑up vs. Swing |
|---|---|---|---|---|
| 1 MiB | 5 % | 6 % | 5 % | 7 % |
| 8 MiB | 22 % | 24 % | 22 % | 25 % |
| 32 MiB (high‑bw) | 18 % (bandwidth‑optimal) | 20 % (bandwidth‑optimal) | 18 % | 21 % |
| 128 MiB (3‑D) | n/a | 30 % | 30 % | 33 % |

  • Latency advantage: Trivance always finishes in ⌈log₃ n⌉ rounds, the theoretical minimum for any latency‑optimal AllReduce.
  • Congestion reduction: The per‑step traffic is roughly one‑third of Bruck’s, leading to less queuing on the limited‑bisection torus links.
  • Bandwidth parity: For large messages the algorithm’s throughput matches that of bandwidth‑optimal ring AllReduce, confirming that the shortcutting does not sacrifice raw data‑move capacity.
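The small‑message/large‑message crossover behind these numbers can be illustrated with a standard alpha‑beta cost model (T = alpha per step + beta per byte). The link parameters below are hypothetical, and the per‑step traffic assumed for the latency‑optimal scheme is a deliberate upper bound (Trivance's whole point is to cut it), but the crossover shape is the same:

```python
def ring_allreduce_cost(n, msg_bytes, alpha, beta):
    # bandwidth-optimal ring: reduce-scatter + all-gather,
    # 2(n-1) steps, each moving msg_bytes/n per link
    return alpha * 2 * (n - 1) + beta * 2 * (n - 1) * msg_bytes / n

def latency_optimal_cost(n, msg_bytes, alpha, beta):
    # ceil(log3 n) steps; pessimistically assume each step
    # moves the full message (Trivance lowers this traffic)
    rounds, reach = 0, 1
    while reach < n:
        reach *= 3
        rounds += 1
    return alpha * rounds + beta * rounds * msg_bytes

alpha = 1e-6          # 1 us per step (hypothetical)
beta = 1 / 100e9      # 100 GB/s links (hypothetical)
small, large = 1 << 10, 64 << 20   # 1 KiB vs. 64 MiB

# For 64 nodes, the latency-optimal scheme wins at 1 KiB,
# while the bandwidth-optimal ring wins at 64 MiB.
```

Even this pessimistic traffic assumption shows the latency‑optimal scheme winning for small messages; lowering the per‑step traffic, as Trivance does, pushes the crossover toward larger messages, consistent with the 32 MiB / 128 MiB figures above.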

Practical Implications

  • Faster distributed training loops: Gradient aggregation becomes quicker, shaving milliseconds off each training step—critical when scaling to thousands of accelerators.
  • Better utilization of existing hardware: No new silicon is required; the algorithm simply re‑orders how existing ports are used, making it a drop‑in replacement for current AllReduce libraries (e.g., NCCL, XLA).
  • Energy savings: Reduced network congestion translates to fewer retransmissions and lower link utilization, cutting power consumption in large data‑center clusters.
  • Applicability beyond AI: Any workload that relies on collective reductions (e.g., distributed graph analytics, scientific simulations) can benefit from the same latency‑optimal shortcutting technique.

Limitations & Future Work

  • Assumes symmetric bidirectional ports: The gains hinge on both directions being able to transmit simultaneously; asymmetric or half‑duplex links would diminish the advantage.
  • Message‑size sweet spot: While Trivance shines for sub‑32 MiB messages, for very large payloads the classic bandwidth‑optimal ring may still be preferable due to simpler scheduling.
  • Hardware‑specific tuning: The current evaluation uses a simulated TPUv4 torus; real‑world performance on other fabrics (e.g., InfiniBand, Ethernet‑based clusters) needs empirical verification.
  • Future directions: Extending the shortcutting concept to heterogeneous topologies (e.g., hierarchical fat‑trees), integrating with adaptive collective libraries that auto‑select the best algorithm per workload, and exploring fault‑tolerant variants for unreliable links.

Authors

  • Anton Juerss
  • Vamsi Addanki
  • Stefan Schmid

Paper Information

  • arXiv ID: 2602.17254v1
  • Categories: cs.DC, cs.NI
  • Published: February 19, 2026
