[Paper] MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
Source: arXiv - 2511.21431v1
Overview
Training massive Mixture‑of‑Experts (MoE) models is hitting a hard wall: dynamic token routing creates severe load imbalance across experts, which in turn inflates peak GPU activation memory. MemFine introduces a memory‑aware, fine‑grained scheduling system that slices both token streams and expert workloads into smaller “chunks,” allowing the trainer to recompute only what is needed while staying within the memory limits of typical GPUs. The result is a more scalable MoE training pipeline that runs on hardware that previously ran out of memory.
Key Contributions
- Chunk‑Based Decomposition: Breaks down token distribution and expert computation into manageable pieces, enabling selective recomputation.
- Theoretical Memory Model: Provides a closed‑form expression to predict memory consumption and guide the scheduler’s decisions in real time.
- Dynamic Scheduling Algorithm: Optimizes the trade‑off between memory savings and compute throughput on the fly, without manual tuning.
- Empirical Gains: Demonstrates up to 48 % reduction in activation memory and a 4.4 % throughput boost over baseline full‑recomputation methods.
- Hardware‑Friendly Design: Works on commodity GPUs with limited memory, removing the need for exotic hardware or massive batch‑size reductions.
Methodology
- Token & Expert Chunking – Instead of treating the entire batch as a monolith, MemFine partitions the incoming token stream and the set of experts into smaller chunks. Each chunk can be processed independently, which limits the peak memory needed for any single operation.
- Chunked Recomputation – For layers whose activations would normally be stored for the backward pass, MemFine selectively discards them and recomputes only the necessary chunks during back‑propagation. This is guided by the memory model to ensure that the recomputation cost does not outweigh the memory savings (a minimal sketch of this chunked dispatch‑and‑recompute pattern appears after this list).
- Memory Model‑Driven Scheduler – A lightweight analytical model estimates the memory footprint of each candidate chunk configuration. The scheduler then picks the configuration that satisfies the GPU memory budget while maximizing throughput, and the decision is refreshed each training step to adapt to the ever‑changing token routing patterns of MoE (a toy version of this decision rule is also sketched below).
- Implementation Details – Integrated into popular deep‑learning frameworks (e.g., PyTorch) as a drop‑in replacement for the standard MoE dispatcher, requiring only minimal user‑side code changes.
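
To make the chunked dispatch‑and‑recompute idea concrete, here is a minimal PyTorch sketch under simplifying assumptions (top‑1 routing, a fixed chunk size, and a per‑expert recompute flag). The function and parameter names are illustrative, not MemFine's actual API.

```python
# Hypothetical sketch of chunked expert computation with selective
# recomputation, loosely following the MemFine description above
# (not the authors' code). Assumes PyTorch and top-1 routing.
import torch
from torch.utils.checkpoint import checkpoint


def chunked_expert_forward(expert, tokens, chunk_size, recompute):
    """Run one expert over its routed tokens in fixed-size chunks.

    When `recompute` is True, each chunk's forward pass is wrapped in
    activation checkpointing, so its intermediate activations are dropped
    after the forward pass and recomputed during backward.
    """
    outputs = []
    for chunk in torch.split(tokens, chunk_size, dim=0):
        if recompute:
            out = checkpoint(expert, chunk, use_reentrant=False)
        else:
            out = expert(chunk)
        outputs.append(out)
    return torch.cat(outputs, dim=0)


def moe_layer_forward(experts, hidden, assignments, chunk_size, recompute_plan):
    """Dispatch tokens to experts and process each expert chunk by chunk.

    `assignments[i]` is the expert chosen for token i, and `recompute_plan[e]`
    says whether expert e's chunks should be recomputed; both are
    simplifications for illustration.
    """
    output = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):
        idx = (assignments == e).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue  # no tokens routed to this expert in this step
        routed = hidden[idx]
        output[idx] = chunked_expert_forward(expert, routed, chunk_size,
                                             recompute_plan[e])
    return output
```

Wrapping each chunk in activation checkpointing keeps at most one chunk's intermediate activations live per expert, which is where the activation‑memory savings come from; the cost is one extra forward pass per recomputed chunk during the backward pass.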
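
The scheduler's decision rule can similarly be illustrated with a toy closed‑form memory estimate. The formula below is an assumed simplification for illustration only, not the paper's model: it treats peak activation memory as whatever is live at once and greedily picks the largest chunk size that fits the budget, since larger chunks amortize recomputation overhead better.

```python
# Toy, assumed memory model and chunk-size selection rule for illustration;
# MemFine's actual closed-form model and scheduler are more detailed.

def estimated_peak_activation_bytes(tokens_per_expert, chunk_size,
                                    bytes_per_token_act, recompute):
    """Rough peak-activation estimate for one expert's workload.

    Without recomputation, activations for every routed token stay resident;
    with chunked recomputation, only one chunk's activations are live at a
    time (saved chunk inputs are ignored here for simplicity).
    """
    live = min(chunk_size, tokens_per_expert) if recompute else tokens_per_expert
    return live * bytes_per_token_act


def choose_chunk_size(tokens_per_expert, bytes_per_token_act, memory_budget,
                      candidate_sizes=(512, 1024, 2048, 4096)):
    """Pick the largest candidate chunk size whose estimate fits the budget.

    Larger chunks mean fewer recomputation launches and better throughput,
    so we prefer the biggest size that still respects the memory budget.
    """
    best = None
    for size in sorted(candidate_sizes):
        est = estimated_peak_activation_bytes(tokens_per_expert, size,
                                              bytes_per_token_act,
                                              recompute=True)
        if est <= memory_budget:
            best = size
    return best  # None: even the smallest chunk exceeds the budget


# Example: 100k tokens routed to a "hot" expert, 4 KiB of activations per
# token, and a 256 MiB activation budget for that expert -> picks 4096.
print(choose_chunk_size(100_000, 4 * 1024, 256 * 1024 * 1024))
```

In MemFine the analogous decision is refreshed every training step, because the number of tokens routed to each expert changes with the routing pattern.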
Results & Findings
| Metric | Baseline (full recompute) | MemFine |
|---|---|---|
| Activation memory (peak) | 100 % | 52 % (≈ 48 % reduction) |
| Training throughput (tokens/s) | 100 % | 104.4 % (≈ 4.4 % gain) |
| Model accuracy (GLUE benchmark) | 78.2 % | 78.0 % (negligible drop) |
- Memory Savings: By cutting activation memory almost in half, models that previously crashed on 16 GB GPUs now train stably.
- Throughput: The fine‑grained scheduling adds only a tiny recomputation overhead, which is outweighed by the reduced memory‑induced stalls.
- Accuracy: Because MemFine recomputes the discarded activations with exact forward passes, the training computation is mathematically unchanged and final model quality is essentially unaffected.
Practical Implications
- Cost‑Effective Scaling: Companies can push MoE models to billions of parameters without buying expensive multi‑GPU servers; a single 16‑32 GB GPU suffices for many workloads.
- Simpler DevOps: No need to manually tune expert capacities or batch sizes to fit memory—MemFine’s scheduler handles it automatically, reducing engineering overhead.
- Broader Accessibility: Researchers and startups with limited hardware budgets can experiment with state‑of‑the‑art MoE architectures that were previously out of reach.
- Integration Path: Since MemFine plugs into existing MoE libraries, developers can adopt it with a few lines of code, gaining immediate memory benefits without rewriting model logic.
Limitations & Future Work
- Recomputation Overhead on Very Large Batches: While the scheduler mitigates it, extremely large batch sizes can still cause noticeable recomputation latency.
- Hardware Diversity: The current evaluation focuses on NVIDIA GPUs; extending the approach to TPUs or AMD GPUs may require additional tuning.
- Dynamic Expert Count: MemFine assumes a static number of experts per layer; handling models that add/remove experts during training is an open challenge.
- Future Directions: The authors plan to explore adaptive chunk sizes based on runtime profiling, combine MemFine with mixed‑precision training, and open‑source a more generalized scheduler API for broader community adoption.
Authors
- Lu Zhao
- Rong Shi
- Shaoqing Zhang
- Yueqiang Chen
- Baoguo He
- Hongfeng Sun
- Ziqing Yin
- Shangchao Su
- Zhiyan Cui
- Liang Dong
- Xiyuan Li
- Lingbin Wang
- Jianwei He
- Jiesong Ma
- Weikang Huang
- Jianglei Tong
- Dongdong Gao
- Jian Zhang
- Hong Tian
Paper Information
- arXiv ID: 2511.21431v1
- Categories: cs.DC
- Published: November 26, 2025