[Paper] FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion
Source: arXiv - 2512.22036v1
Overview
The paper introduces FUSCO, a new communication library designed to speed up the data‑shuffling step that dominates training and inference for large‑scale Mixture‑of‑Experts (MoE) models. By fusing the data‑layout transformation with the actual network transfer, FUSCO cuts the communication overhead that can consume more than half of the total runtime in existing systems.
Key Contributions
- Transformation‑Communication Fusion: Detects the mismatch between MoE’s expert‑major layout and the device‑major layout required by GPU communication primitives, and merges the layout conversion with the send/receive operations.
- Pipelined Communication Engine: Executes the fused operations along a multi‑hop path, keeping the network and compute pipelines busy and avoiding idle periods.
- Lightweight Planning & Load Balancing: Generates a compact communication plan that removes redundant transfers and spreads traffic evenly across devices, preventing bottlenecks.
- Performance Gains: Demonstrates up to 3.84× speedup over NCCL and 2.01× over DeepEP on synthetic benchmarks, and 1.10–1.39× end‑to‑end speedups on real MoE training workloads.
- Open‑Source Prototype: Provides a drop‑in replacement for existing MoE frameworks, requiring only minimal code changes.
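The layout mismatch behind the first contribution can be made concrete with a small sketch (illustrative only; the expert-to-device placement and routing values below are assumed, not taken from the paper):

```python
# Illustrative sketch of the expert-major vs. device-major mismatch.
# Assumed setup: 4 experts over 2 devices, experts 0-1 on device 0 and
# experts 2-3 on device 1; routing[i] is the expert chosen for token i.

routing = [2, 0, 3, 0, 1, 2]
experts_per_device = 2

# Expert-major order: all tokens for expert 0, then expert 1, and so on.
expert_major = sorted(range(len(routing)), key=lambda t: routing[t])

# Device-major order, as a GPU all-to-all primitive expects: tokens
# grouped by the device that hosts their expert.
device_major = sorted(range(len(routing)),
                      key=lambda t: routing[t] // experts_per_device)

print(expert_major)   # [1, 3, 4, 0, 5, 2]
print(device_major)   # [1, 3, 4, 0, 2, 5] -- a different permutation,
                      # so a reorder is needed before communication
```

The two orderings agree within a device but differ across devices, which is exactly the gap that the fused transformation closes during the send itself.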
Methodology
- Layout Analysis: FUSCO first inspects the token‑to‑expert routing map produced by the MoE gating network. This map reveals a fine‑grained “expert‑major” ordering (all tokens for expert E₁, then E₂, …).
- Fusion Planning: Instead of first rearranging tensors to a “device‑major” order (as NCCL does) and then sending them, FUSCO builds a fusion plan that tells each GPU how to slice its local buffer, transform the slice on‑the‑fly, and directly push it to the target device.
- Pipelined Engine: The plan is executed by a lightweight runtime that overlaps three stages: (a) local data extraction, (b) transformation (e.g., transpose/reshape), and (c) network send/receive. Because these stages are streamed, the GPU can continue processing the next slice while the previous one is in flight.
- Load Balancing: The engine monitors per‑device traffic and dynamically redistributes small “spill‑over” chunks to under‑utilized links, ensuring no single NIC becomes a hotspot.
- Integration: FUSCO exposes the same API surface as NCCL/DeepEP, so existing MoE codebases (e.g., DeepSpeed, Megatron‑LM) can swap the backend without rewriting the routing logic.
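As a rough sketch of the fusion-planning step (hypothetical helper name and layout assumptions, not the paper's API), the per-device send slices can be derived directly from the routing map, so each GPU gathers and pushes a slice on the fly instead of first materializing a full device-major copy:

```python
from collections import defaultdict

def build_fusion_plan(routing, experts_per_device):
    """Map each destination device to the local token indices bound for it.

    Each (device, indices) entry describes one slice that a pipelined
    engine could gather, transform, and send while the previous slice is
    still in flight -- no intermediate device-major buffer is built.
    """
    plan = defaultdict(list)
    for token, expert in enumerate(routing):
        plan[expert // experts_per_device].append(token)
    return dict(plan)

# Same assumed setup as before: 4 experts, 2 per device.
plan = build_fusion_plan([2, 0, 3, 0, 1, 2], experts_per_device=2)
print(plan)   # {1: [0, 2, 5], 0: [1, 3, 4]}
```

A real implementation would emit these slices to GPU gather kernels and RDMA sends; the sketch only shows how the plan falls out of the routing map in a single linear pass.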
Results & Findings
| Benchmark | Baseline (NCCL) | DeepEP | FUSCO | Speedup vs. NCCL | Speedup vs. DeepEP |
|---|---|---|---|---|---|
| Synthetic expert‑shuffle (64 GPUs) | 1.24 s | 0.62 s | 0.32 s | 3.84× | 2.01× |
| GPT‑3‑style MoE training (128 GPUs) | 1.87 s/step | 1.71 s/step | 1.48 s/step | 1.26× | 1.16× |
| Inference first‑token latency (32 GPUs) | 12.4 ms | 11.6 ms | 10.8 ms | 1.15× | 1.07× |
- Communication dominates: Profiling shows that shuffling accounts for ≈55 % of total step time with NCCL, dropping to ≈30 % with FUSCO.
- Scalability: Gains increase with the number of experts and GPUs, because the fusion eliminates an O(N²) data‑reordering cost that would otherwise grow quadratically with device count.
- Minimal overhead: The planning phase adds < 0.5 ms even for the largest runs, confirming the “lightweight” claim.
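The end-to-end speedup columns in the table follow directly from the reported per-step times; a quick arithmetic check for the training and inference rows:

```python
# Reproducing the table's end-to-end speedup columns from the reported times.
nccl_train, deepep_train, fusco_train = 1.87, 1.71, 1.48   # s/step, 128 GPUs
nccl_inf, deepep_inf, fusco_inf = 12.4, 11.6, 10.8         # ms, 32 GPUs

print(round(nccl_train / fusco_train, 2))    # 1.26x vs. NCCL (training)
print(round(deepep_train / fusco_train, 2))  # 1.16x vs. DeepEP (training)
print(round(nccl_inf / fusco_inf, 2))        # 1.15x vs. NCCL (inference)
print(round(deepep_inf / fusco_inf, 2))      # 1.07x vs. DeepEP (inference)
```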
Practical Implications
- Faster MoE training cycles: Teams can iterate on larger expert counts or deeper models without being throttled by communication, shortening the research iteration cycle.
- Lower cloud cost: Reducing per‑step runtime translates directly into fewer GPU‑hours, which is especially valuable under pay‑as‑you‑go cloud pricing.
- Improved inference latency: Real‑time services (e.g., large‑scale language model APIs) benefit from the reduced first‑token latency, making MoE‑based models viable for latency‑sensitive applications.
- Drop‑in adoption: Because FUSCO mimics the NCCL API, existing pipelines (DeepSpeed, FairScale, Megatron‑LM) can integrate it with a single library swap, lowering the barrier for production deployment.
- Hardware‑agnostic gains: The approach works on any GPU with standard RDMA/NCCL support, so it can be leveraged on both on‑premise clusters and major cloud platforms.
Limitations & Future Work
- Assumes static routing per step: FUSCO’s planning relies on a fixed token‑to‑expert map for the duration of a training step. Highly dynamic gating (e.g., per‑token changes mid‑step) would require re‑planning, incurring extra cost.
- Focus on GPU‑to‑GPU links: The current prototype does not address heterogeneous environments (e.g., CPU‑offload or TPU clusters), which may need different transformation kernels.
- Memory overhead: The fusion engine keeps temporary buffers for on‑the‑fly transposes; on memory‑constrained GPUs this could limit the maximum batch size.
- Future directions: Extending the fusion concept to other collective operations (all‑reduce, broadcast) in MoE pipelines, exploring adaptive planning that reacts to runtime traffic patterns, and integrating with emerging interconnects (e.g., NVLink‑C2C, InfiniBand HDR) for even higher throughput.
Authors
- Zhuoran Zhu
- Chunyang Zhu
- Hao Lin
- Xu Fu
- Yiming Zhou
- Quanlu Zhang
- Zhenhua Li
- Feng Qian
- Chao Yu
- Boxun Li
- Guohao Dai
- Yu Wang
Paper Information
- arXiv ID: 2512.22036v1
- Categories: cs.DC
- Published: December 26, 2025