[Paper] FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion
Source: arXiv - 2512.22036v1
Overview
The paper introduces FUSCO, a new communication library designed to speed up the data‑shuffling step that dominates training and inference for large‑scale Mixture‑of‑Experts (MoE) models. By fusing the data‑layout transformation with the actual network transfer, FUSCO cuts the communication overhead that can consume more than half of the total runtime in existing systems.
Key Contributions
- Transformation‑Communication Fusion: Detects the mismatch between MoE’s expert‑major layout and the device‑major layout required by GPU communication primitives, and merges the layout conversion with the send/receive operations.
- Pipelined Communication Engine: Executes the fused operations along a multi‑hop path, keeping the network and compute pipelines busy and avoiding idle periods.
- Lightweight Planning & Load Balancing: Generates a compact communication plan that removes redundant transfers and spreads traffic evenly across devices, preventing bottlenecks.
- Performance Gains: Demonstrates up to 3.84× speedup over NCCL and 2.01× over DeepEP on synthetic benchmarks, and 1.10–1.39× end‑to‑end speedups on real MoE training workloads.
- Open‑Source Prototype: Provides a drop‑in replacement for existing MoE frameworks, requiring only minimal code changes.
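The layout mismatch behind the first contribution can be made concrete with a small sketch (illustrative only; the expert-to-device placement and routing values below are assumed, not taken from the paper):

```python
# Illustrative sketch of the expert-major vs. device-major mismatch.
# Assumed setup: 4 experts over 2 devices, experts 0-1 on device 0 and
# experts 2-3 on device 1; routing[i] is the expert chosen for token i.

routing = [2, 0, 3, 0, 1, 2]
experts_per_device = 2

# Expert-major order: all tokens for expert 0, then expert 1, and so on.
expert_major = sorted(range(len(routing)), key=lambda t: routing[t])

# Device-major order, as a GPU all-to-all primitive expects: tokens
# grouped by the device that hosts their expert.
device_major = sorted(range(len(routing)),
                      key=lambda t: routing[t] // experts_per_device)

print(expert_major)   # [1, 3, 4, 0, 5, 2]
print(device_major)   # [1, 3, 4, 0, 2, 5] -- a different permutation,
                      # so a reorder is needed before communication
```

The two orderings agree within a device but differ across devices, which is exactly the gap that the fused transformation closes during the send itself.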
Methodology
- Layout Analysis: FUSCO first inspects the token‑to‑expert routing map produced by the MoE gating network. This map reveals a fine‑grained “expert‑major” ordering (all tokens for expert E₁, then E₂, …).
- Fusion Planning: Instead of first rearranging tensors to a “device‑major” order (as NCCL does) and then sending them, FUSCO builds a fusion plan that tells each GPU how to slice its local buffer, transform the slice on‑the‑fly, and directly push it to the target device.
- Pipelined Engine: The plan is executed by a lightweight runtime that overlaps three stages: (a) local data extraction, (b) transformation (e.g., transpose/reshape), and (c) network send/receive. Because these stages are streamed, the GPU can continue processing the next slice while the previous one is in flight.
- Load Balancing: The engine monitors per‑device traffic and dynamically redistributes small “spill‑over” chunks to under‑utilized links, ensuring no single NIC becomes a hotspot.
- Integration: FUSCO exposes the same API surface as NCCL/DeepEP, so existing MoE codebases (e.g., DeepSpeed, Megatron‑LM) can swap the backend without rewriting the routing logic.
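As a rough sketch of the fusion-planning step (hypothetical helper name and layout assumptions, not the paper's API), the per-device send slices can be derived directly from the routing map, so each GPU gathers and pushes a slice on the fly instead of first materializing a full device-major copy:

```python
from collections import defaultdict

def build_fusion_plan(routing, experts_per_device):
    """Map each destination device to the local token indices bound for it.

    Each (device, indices) entry describes one slice that a pipelined
    engine could gather, transform, and send while the previous slice is
    still in flight -- no intermediate device-major buffer is built.
    """
    plan = defaultdict(list)
    for token, expert in enumerate(routing):
        plan[expert // experts_per_device].append(token)
    return dict(plan)

# Same assumed setup as before: 4 experts, 2 per device.
plan = build_fusion_plan([2, 0, 3, 0, 1, 2], experts_per_device=2)
print(plan)   # {1: [0, 2, 5], 0: [1, 3, 4]}
```

A real implementation would emit these slices to GPU gather kernels and RDMA sends; the sketch only shows how the plan falls out of the routing map in a single linear pass.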
Results & Findings
| Benchmark | Baseline (NCCL) | DeepEP | FUSCO | Speedup vs. NCCL | Speedup vs. DeepEP |
|---|---|---|---|---|---|
| Synthetic expert‑shuffle (64 GPUs) | 1.24 s | 0.62 s | 0.32 s | 3.84× | 2.01× |
| GPT‑3‑style MoE training (128 GPUs) | 1.87 s/step | 1.71 s/step | 1.48 s/step | 1.26× | 1.16× |
| Inference first‑token latency (32 GPUs) | 12.4 ms | 11.6 ms | 10.8 ms | 1.15× | 1.07× |
- Communication dominates: Profiling shows that shuffling accounts for ≈55 % of total step time with NCCL, dropping to ≈30 % with FUSCO.
- Scalability: Gains increase with the number of experts and GPUs, because the fusion eliminates an O(N²) data‑reordering cost that would otherwise grow quadratically with device count.
- Minimal overhead: The planning phase adds < 0.5 ms even for the largest runs, confirming the “lightweight” claim.
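The end-to-end speedup columns in the table follow directly from the reported per-step times; a quick arithmetic check for the training and inference rows:

```python
# Reproducing the table's end-to-end speedup columns from the reported times.
nccl_train, deepep_train, fusco_train = 1.87, 1.71, 1.48   # s/step, 128 GPUs
nccl_inf, deepep_inf, fusco_inf = 12.4, 11.6, 10.8         # ms, 32 GPUs

print(round(nccl_train / fusco_train, 2))    # 1.26x vs. NCCL (training)
print(round(deepep_train / fusco_train, 2))  # 1.16x vs. DeepEP (training)
print(round(nccl_inf / fusco_inf, 2))        # 1.15x vs. NCCL (inference)
print(round(deepep_inf / fusco_inf, 2))      # 1.07x vs. DeepEP (inference)
```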
Practical Implications
- Faster MoE training cycles: Teams can iterate on larger expert counts or deeper models without being throttled by communication, shortening the research iteration cycle.
- Lower cloud cost: Reducing per‑step runtime translates directly into fewer GPU‑hours, which is especially valuable under pay‑as‑you‑go cloud pricing.
- Improved inference latency: Real‑time services (e.g., large‑scale language model APIs) benefit from the reduced first‑token latency, making MoE‑based models viable for latency‑sensitive applications.
- Drop‑in adoption: Because FUSCO mimics the NCCL API, existing pipelines (DeepSpeed, FairScale, Megatron‑LM) can integrate it with a single library swap, lowering the barrier for production deployment.
- Hardware‑agnostic gains: The approach works on any GPU with standard RDMA/NCCL support, so it can be leveraged on both on‑premise clusters and major cloud platforms.
Limitations & Future Work
- Assumes static routing per step: FUSCO’s planning relies on a fixed token‑to‑expert map for the duration of a training step. Highly dynamic gating (e.g., per‑token changes mid‑step) would require re‑planning, incurring extra cost.
- Focus on GPU‑to‑GPU links: The current prototype does not address heterogeneous environments (e.g., CPU‑offload or TPU clusters), which may need different transformation kernels.
- Memory overhead: The fusion engine keeps temporary buffers for on‑the‑fly transposes; on memory‑constrained GPUs this could limit the maximum batch size.
- Future directions: Extending the fusion concept to other collective operations (all‑reduce, broadcast) in MoE pipelines, exploring adaptive planning that reacts to runtime traffic patterns, and integrating with emerging interconnects (e.g., NVLink‑C2C, InfiniBand HDR) for even higher throughput.
Authors
- Zhuoran Zhu
- Chunyang Zhu
- Hao Lin
- Xu Fu
- Yiming Zhou
- Quanlu Zhang
- Zhenhua Li
- Feng Qian
- Chao Yu
- Boxun Li
- Guohao Dai
- Yu Wang
Paper Information
- arXiv ID: 2512.22036v1
- Categories: cs.DC
- Published: December 26, 2025