[Paper] LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device proc...