[Paper] Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems

Published: February 10, 2026 at 07:24 AM EST
4 min read
Source: arXiv - 2602.09721v1

Overview

The paper investigates Attention‑FFN Disaggregation (AFD)—a new way to split the attention and feed‑forward network (FFN) components of modern Mixture‑of‑Experts (MoE) models across hardware resources. By extending the classic roofline model to the communication domain, the authors show when AFD can actually beat the traditional Expert Parallelism (EP) approach and when it falls short.

Key Contributions

  • Extended Roofline Analysis: Introduces a communication‑aware roofline model that links interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU).
  • Identification of a “Dead Zone”: Shows that on typical clusters, adding more FFN instances does not raise HFU because the workload becomes limited by scale‑out bandwidth, not compute.
  • Imbalance Quantification: Demonstrates that AFD’s node‑level scaling suffers higher load‑imbalance penalties than EP’s more flexible batch‑wise expert assignment.
  • Hardware‑Model Sweet Spots: Pinpoints the conditions (e.g., Superpod‑class interconnects, coarse‑grained experts, lower sparsity) where AFD can outperform EP.
  • Practical Guidance: Provides a decision framework for engineers to decide whether to adopt AFD based on their hardware topology and model characteristics.
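The core idea of the extended roofline analysis can be sketched in a few lines: attainable throughput is the minimum of peak compute and what the interconnect can feed, given the workload's arithmetic intensity. This is a minimal illustration of the concept, not the paper's model; the function name and example numbers are assumptions.

```python
def attainable_tflops(peak_tflops: float,
                      net_bandwidth_gbps: float,
                      arithmetic_intensity: float) -> float:
    """Roofline extended to the network: throughput is capped either by
    peak compute or by how fast the fabric can deliver operands.

    arithmetic_intensity: FLOPs performed per byte moved over the network.
    """
    bandwidth_gbs = net_bandwidth_gbps / 8                          # Gbit/s -> GB/s
    comm_bound_tflops = bandwidth_gbs * arithmetic_intensity / 1e3  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, comm_bound_tflops)

# Illustrative numbers (assumed): a 989 TFLOPS accelerator behind a 100 Gbps
# link, with 200 FLOPs of useful work per byte shipped.
print(attainable_tflops(989.0, 100.0, 200.0))  # -> 2.5 (severely bandwidth-bound)
```

Even with a nominally fast 100 Gbps link, the communication roofline caps delivered compute at a tiny fraction of peak, which is exactly the regime where HFU collapses.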

Methodology

  1. Modeling Layer: The authors augment the classic roofline model with a communication roofline that captures the cost of moving expert activations across nodes.
  2. Synthetic Benchmarks: They run a suite of MoE workloads (varying expert count, sparsity, and granularity) on a range of cluster configurations—from commodity Ethernet to high‑speed InfiniBand “Superpod” setups.
  3. Metrics Tracked:
    • Arithmetic Intensity (FLOPs per byte transferred) for attention vs. FFN paths.
    • Hardware FLOPS Utilization (HFU) – the fraction of peak compute actually used.
    • Imbalance Penalty – extra time incurred when some nodes finish early while others are still processing experts.
  4. Comparative Experiments: Each workload is executed under both AFD and EP, keeping the total number of parameters constant, to isolate the effect of the disaggregation strategy.
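The three tracked metrics have standard definitions, which can be written down directly. The formulas below are the common textbook forms, assumed here rather than quoted from the paper; the straggler-based imbalance definition in particular is an illustrative choice.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs of useful work per byte transferred over the interconnect."""
    return flops / bytes_moved

def hfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Hardware FLOPS Utilization: fraction of peak compute actually used."""
    return achieved_flops_per_s / peak_flops_per_s

def imbalance_penalty(per_node_times: list[float]) -> float:
    """Extra time fraction caused by stragglers: slowest node vs. the mean."""
    mean = sum(per_node_times) / len(per_node_times)
    return max(per_node_times) / mean - 1.0

# One straggler among four nodes adds roughly 14% to the step time.
print(imbalance_penalty([1.0, 1.0, 1.0, 1.2]))
```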

Results & Findings

| Scenario | HFU (AFD) vs. HFU (EP) | Bandwidth Bottleneck? | Imbalance Penalty |
| --- | --- | --- | --- |
| Standard 10‑GbE cluster, fine‑grained experts | ≈ 0.45 vs. 0.62 | Yes – FFN traffic saturates link | ↑ 15 % |
| Superpod (100 Gbps) with coarse‑grained experts | 0.78 vs. 0.71 | No – bandwidth ample | ↓ 5 % |
| High sparsity (≥ 80 %) on any hardware | HFU drops for both; AFD loses its edge | Yes – less useful data per transfer | ↑ 20 % |
  • Dead Zone: As the number of FFN instances grows, HFU plateaus because the interconnect cannot feed data fast enough; each added instance spends a shrinking fraction of its time computing while communication latency stays fixed.
  • Imbalance: AFD’s static node‑level expert assignment leads to stragglers, whereas EP can dynamically rebalance batches, reducing idle time.
  • When AFD Wins: Only on systems with very high interconnect bandwidth and models where each expert processes a relatively large chunk of data (coarse granularity, low sparsity).
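The dead-zone behavior can be reproduced with a toy model: if the shared scale-out fabric has a fixed capacity, throughput plateaus once it saturates, and every additional FFN instance only dilutes HFU. All constants below are invented placeholders, not measurements from the paper.

```python
FABRIC_GBS = 12.5          # total scale-out bandwidth, GB/s (assumed)
BYTES_PER_TOKEN = 1.0e-5   # activation GB shipped per token (assumed)
TPS_PER_INSTANCE = 5.0e5   # tokens/s one FFN instance can compute (assumed)

def throughput_and_hfu(n_ffn: int) -> tuple[float, float]:
    """Return (tokens/s, HFU) for a cluster with n_ffn FFN instances."""
    comm_cap = FABRIC_GBS / BYTES_PER_TOKEN   # tokens/s the fabric can feed
    compute_cap = n_ffn * TPS_PER_INSTANCE    # tokens/s the GPUs could do
    tps = min(comm_cap, compute_cap)
    return tps, tps / compute_cap             # HFU: used / provisioned compute

for n in (1, 2, 4, 8):
    tps, h = throughput_and_hfu(n)
    print(n, tps, round(h, 4))
```

Under these assumed numbers the fabric saturates at 4 instances: throughput stops growing while HFU falls from 1.0 toward zero, which is the plateau-then-waste pattern the paper calls the dead zone.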

Practical Implications

  • Hardware Procurement: Teams planning to run massive MoE models should prioritize interconnect bandwidth (e.g., 100 Gbps+ InfiniBand) if they want to exploit AFD. Investing in faster NICs may yield more performance than simply adding more GPUs.
  • Model Design: Architects can deliberately design experts to be coarser (larger hidden dimensions, fewer experts) when targeting AFD‑friendly hardware, trading off some sparsity for better throughput.
  • Scheduler Enhancements: Existing cluster schedulers can incorporate the paper’s imbalance metrics to decide whether to allocate a job to an AFD‑optimized node pool or fall back to EP.
  • Cost‑Benefit Analysis: For cloud providers, offering “AFD‑ready” instance types (high‑speed fabric + balanced GPU‑to‑CPU ratio) could enable premium pricing for customers with suitable MoE workloads.
  • Software Stack: Frameworks (e.g., PyTorch, TensorFlow) may expose a switch to enable AFD mode, automatically selecting the appropriate communication primitives based on detected bandwidth.
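The decision framework implied by the guidance above amounts to checking whether a deployment sits in AFD's sweet spot. The rule and thresholds below are invented placeholders for illustration; the paper's calibrated conditions would replace them in practice.

```python
def choose_strategy(interconnect_gbps: float,
                    expert_granularity: str,   # "coarse" or "fine"
                    sparsity: float) -> str:
    """Pick AFD only in the sweet spot the results point to: fast fabric,
    coarse-grained experts, and lower sparsity. Thresholds are assumptions."""
    if (interconnect_gbps >= 100
            and expert_granularity == "coarse"
            and sparsity < 0.8):
        return "AFD"
    return "EP"  # default: Expert Parallelism

print(choose_strategy(100, "coarse", 0.5))  # AFD: Superpod-class conditions
print(choose_strategy(10, "fine", 0.9))     # EP: slow link, fine experts
```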

Limitations & Future Work

  • Scope of Benchmarks: Experiments focus on a limited set of MoE configurations; ultra‑large models (hundreds of billions of parameters) remain untested.
  • Static Expert Placement: The current AFD implementation assumes a fixed mapping of experts to nodes, which exacerbates imbalance; dynamic placement strategies could mitigate this.
  • Energy Considerations: The study does not evaluate power efficiency, an important factor for large‑scale deployments.
  • Future Directions: The authors suggest exploring hybrid schemes that combine AFD’s disaggregation with EP’s dynamic batching, as well as extending the communication roofline to heterogeneous clusters (CPU‑GPU‑TPU mixes).

Authors

  • Guowei Liu
  • Hongming Li
  • Yaning Guo
  • Yongxi Lyu
  • Mo Zhou
  • Yi Liu
  • Zhaogeng Li
  • Yanpeng Wang

Paper Information

  • arXiv ID: 2602.09721v1
  • Categories: cs.DC
  • Published: February 10, 2026
