[Paper] Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication
Source: arXiv - 2602.22756v1
Overview
All‑to‑all communication among GPUs is a major performance choke point in modern deep‑learning clusters. This paper introduces a dynamic hierarchical Birkhoff‑von Neumann (BvN) decomposition that exploits the two‑level topology of today’s GPU farms (fast intra‑server links, slower inter‑server networks) to schedule traffic more efficiently and keep the network stable under realistic workloads.
Key Contributions
- Two‑tier scheduling framework – first balances traffic inside each server, then performs a server‑level BvN decomposition, and finally refines each server‑level matching into per‑GPU matchings.
- Dynamic hierarchical BvN algorithm – reduces the combinatorial explosion of a flat GPU‑level decomposition, cutting runtime from O(G³) to roughly O(S·G²) (G = GPUs per server, S = number of servers).
- Integration with Dynamic Frame Sizing (DFS) – adapts the length of scheduling frames on‑the‑fly, guaranteeing queue stability for admissible Poisson traffic.
- Theoretical guarantees – provable stability (bounded queues) under any traffic matrix that satisfies the network’s capacity region.
- Extensive simulation study – shows up to 40 % lower mean frame length and substantially lower latency for hotspot patterns that concentrate traffic on a few servers.
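The building block underlying these contributions is the classic BvN theorem: any doubly stochastic matrix is a convex combination of permutation matrices. A minimal flat (non-hierarchical) decomposition can be sketched as below; this is a generic textbook construction, not the paper's algorithm, and all function names are illustrative.

```python
# Sketch of a flat Birkhoff-von Neumann (BvN) decomposition: repeatedly
# find a perfect matching in the support of the residual matrix, peel off
# that permutation with the largest feasible weight, and stop when the
# residual is numerically zero. Names are illustrative, not from the paper.

def _find_matching(support):
    """Kuhn's augmenting-path algorithm on an n x n 0/1 support matrix.
    Returns match[col] = row for a perfect matching, or None if none exists."""
    n = len(support)
    match = [-1] * n  # match[col] = row currently assigned to that column

    def try_row(r, seen):
        for c in range(n):
            if support[r][c] and not seen[c]:
                seen[c] = True
                if match[c] == -1 or try_row(match[c], seen):
                    match[c] = r
                    return True
        return False

    for r in range(n):
        if not try_row(r, [False] * n):
            return None
    return match

def bvn_decompose(matrix, eps=1e-9):
    """Decompose a doubly stochastic matrix into weighted permutations.
    Returns a list of (weight, perm) pairs with perm[row] = col."""
    n = len(matrix)
    residual = [row[:] for row in matrix]
    terms = []
    while True:
        support = [[1 if residual[r][c] > eps else 0 for c in range(n)]
                   for r in range(n)]
        match = _find_matching(support)
        if match is None:
            break  # residual is (numerically) zero: decomposition complete
        perm = [0] * n
        for c, r in enumerate(match):
            perm[r] = c
        # largest weight that keeps the residual non-negative
        weight = min(residual[r][perm[r]] for r in range(n))
        for r in range(n):
            residual[r][perm[r]] -= weight
        terms.append((weight, perm))
    return terms
```

Because each iteration zeroes at least one matrix entry, the loop terminates after at most n² permutations; the hierarchical scheme in the paper avoids running this directly over all GPUs, which is where its complexity savings come from.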
Methodology
- Model the traffic matrix as a bipartite graph where each GPU needs to send a certain amount of data to every other GPU.
- Frame‑based scheduling: time is divided into frames; at each frame boundary the scheduler decides which links are active.
- Local intra‑server balancing: simple linear operations (e.g., redistributing packets among NICs) equalize the load across GPUs inside the same server while keeping the total amount that the server must send/receive unchanged.
- Hierarchical BvN decomposition:
  - Server‑level – treat each server as a “super‑node” and decompose the aggregated server‑to‑server demand into a convex combination of permutation matrices (matchings).
  - GPU‑level refinement – each server‑level matching is further broken down into concrete GPU‑to‑GPU matchings using the already‑balanced intra‑server traffic.
- Dynamic Frame Sizing (DFS): the frame length is adjusted based on the current backlog, ensuring that the scheduler never falls behind the incoming Poisson arrivals.
- Stability analysis: using queueing theory, the authors prove that if the arrival rates lie within the network’s capacity region, the queues remain bounded.
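The aggregation and refinement steps above can be sketched concretely. The code below is a simplified illustration under two stated assumptions, not the paper's exact procedure: intra-server traffic has already been balanced, and a server-level matching is expanded into GPU-level matchings by simple index rotation (round-robin). All names are hypothetical.

```python
# Two-tier sketch: (1) aggregate per-GPU demand into a server-level
# matrix for the server-level BvN step; (2) expand one server-level
# matching sigma into G conflict-free GPU-to-GPU matchings by rotating
# GPU indices, so every pair (GPU g on s) -> (GPU h on sigma(s)) is
# served exactly once across the G sub-slots. Illustrative only.

def aggregate(demand, S, G):
    """Sum per-GPU demand into an S x S server-level matrix.
    demand[(s, g)][(t, h)] = traffic from GPU g of server s to GPU h of server t."""
    agg = [[0.0] * S for _ in range(S)]
    for (s, _g), row in demand.items():
        for (t, _h), d in row.items():
            agg[s][t] += d
    return agg

def refine(sigma, S, G):
    """Expand one server-level matching (sigma[s] = destination server)
    into G GPU-level matchings via round-robin index rotation."""
    frames = []
    for k in range(G):
        # within one sub-slot, every source GPU targets a distinct
        # destination GPU on its matched server (conflict-free)
        frames.append({(s, g): (sigma[s], (g + k) % G)
                       for s in range(S) for g in range(G)})
    return frames
```

The rotation trick is valid precisely because the intra-server balancing step equalizes per-GPU load; with unbalanced traffic, the refinement would instead need a weighted GPU-level decomposition.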
Results & Findings
| Scenario | Metric | Flat BvN (baseline) | Hierarchical BvN (proposed) |
|---|---|---|---|
| Uniform traffic | Mean frame length | 1.0 × baseline | 0.78× |
| Server‑localized hotspot (80 % of traffic originates from 2 out of 8 servers) | Mean frame length | 1.0 × baseline | 0.60× |
| Varying Poisson arrival rates (up to 90 % of capacity) | Queue size (max) | grows unbounded | stable |
| Per‑frame decomposition | Computation time | 12 ms | 3 ms |
- Latency reduction: shorter frames translate directly into lower end‑to‑end communication latency, especially for workloads with bursty, skewed traffic.
- Scalability: the hierarchical approach scales linearly with the number of servers, making it practical for clusters with dozens of servers and hundreds of GPUs.
- Robustness: the DFS component automatically shrinks frames when queues build up, preventing congestion collapse.
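The frame-sizing behaviour in the last bullet can be illustrated with a standard rule from frame-based BvN scheduling (a hedged sketch, not necessarily the paper's exact update): set the next frame length to the maximum row or column sum of the backlog matrix, since a BvN-style schedule of that length can serve every queued packet.

```python
# Sketch of a dynamic-frame-sizing rule: the next frame length equals the
# maximum "line sum" (row or column total) of the backlog matrix. Any
# non-negative matrix with maximum line sum T can be covered by
# permutation matchings using T slots, so a frame of this length can
# drain the entire backlog. Illustrative, not the paper's exact rule.
import math

def next_frame_length(backlog):
    """backlog[i][j] = packets queued from source i to destination j."""
    n = len(backlog)
    row_sums = [sum(backlog[i]) for i in range(n)]
    col_sums = [sum(backlog[i][j] for i in range(n)) for j in range(n)]
    return math.ceil(max(row_sums + col_sums, default=0))
```

Under this rule the frame grows only as fast as the worst-loaded port, which is the mechanism that keeps queues bounded for admissible arrival rates.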
Practical Implications
- Deep‑learning frameworks (e.g., PyTorch Distributed, TensorFlow) can plug the scheduler into their NCCL or custom collective back‑ends to achieve faster all‑to‑all ops without hardware changes.
- Cloud GPU providers can improve utilization of existing inter‑connects (e.g., NVLink intra‑node, Ethernet/InfiniBand inter‑node) by reshaping traffic on‑the‑fly, leading to higher throughput per dollar.
- System‑software developers gain a concrete algorithmic recipe for hierarchical traffic shaping that can be implemented in the kernel driver or as a user‑space daemon with minimal overhead.
- Performance engineers now have a provably stable scheduling primitive that can be combined with other collective optimizations (e.g., pipelined reductions, compression) for end‑to‑end training speedups.
Limitations & Future Work
- Assumes Poisson arrivals and known traffic matrices at frame boundaries; real workloads may exhibit more complex burst patterns.
- Simulation‑only validation: the paper evaluates the algorithm in a simulated environment; a production‑grade implementation on actual GPU clusters is needed to confirm real‑world gains.
- Extension to heterogeneous fabrics (different NIC speeds, mixed‑precision links) is not covered and could affect the decomposition’s optimality.
- Future research directions include adaptive learning of traffic patterns, integration with compression techniques, and hardware support (e.g., NIC offload) to further reduce scheduler latency.
Authors
- Yen-Chieh Wu
- Cheng-Shang Chang
- Duan-Shin Lee
- H. Jonathan Chao
Paper Information
- arXiv ID: 2602.22756v1
- Categories: cs.NI, cs.DC
- Published: February 26, 2026