[Paper] MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
Source: arXiv - 2511.21431v1
Overview
Training massive Mixture‑of‑Experts (MoE) models is hitting a hard wall: dynamic token routing creates severe load imbalance across experts, which in turn inflates peak GPU activation memory. MemFine introduces a memory‑aware, fine‑grained scheduling system that slices both token streams and expert workloads into smaller “chunks,” allowing the trainer to recompute only what is needed while staying within the memory limits of typical GPUs. The result is a more scalable MoE training pipeline that runs on hardware that previously ran out of memory.
Key Contributions
- Chunk‑Based Decomposition: Breaks down token distribution and expert computation into manageable pieces, enabling selective recomputation.
- Theoretical Memory Model: Provides a closed‑form expression to predict memory consumption and guide the scheduler’s decisions in real time.
- Dynamic Scheduling Algorithm: Optimizes the trade‑off between memory savings and compute throughput on the fly, without manual tuning.
- Empirical Gains: Demonstrates up to 48 % reduction in activation memory and a 4.4 % throughput boost over baseline full‑recomputation methods.
- Hardware‑Friendly Design: Works on commodity GPUs with limited memory, removing the need for exotic hardware or massive batch‑size reductions.
Methodology
- Token & Expert Chunking – Instead of treating the entire batch as a monolith, MemFine partitions the incoming token stream and the set of experts into smaller chunks. Each chunk can be processed independently, which limits the peak memory needed for any single operation.
- Chunked Recomputation – For layers whose activations would normally be stored for the backward pass, MemFine selectively discards them and recomputes only the necessary chunks during back‑propagation. This is guided by the memory model to ensure that the recomputation cost does not outweigh the memory savings (a minimal sketch of this chunked dispatch‑and‑recompute pattern appears after this list).
- Memory Model‑Driven Scheduler – A lightweight analytical model estimates the memory footprint of each candidate chunk configuration. The scheduler then picks the configuration that satisfies the GPU memory budget while maximizing throughput, and the decision is refreshed each training step to adapt to the ever‑changing token routing patterns of MoE (a toy version of this decision rule is also sketched below).
- Implementation Details – Integrated into popular deep‑learning frameworks (e.g., PyTorch) as a drop‑in replacement for the standard MoE dispatcher, requiring only minimal user‑side code changes.
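
To make the chunked dispatch‑and‑recompute idea concrete, here is a minimal PyTorch sketch under simplifying assumptions (top‑1 routing, a fixed chunk size, and a per‑expert recompute flag). The function and parameter names are illustrative, not MemFine's actual API.

```python
# Hypothetical sketch of chunked expert computation with selective
# recomputation, loosely following the MemFine description above
# (not the authors' code). Assumes PyTorch and top-1 routing.
import torch
from torch.utils.checkpoint import checkpoint


def chunked_expert_forward(expert, tokens, chunk_size, recompute):
    """Run one expert over its routed tokens in fixed-size chunks.

    When `recompute` is True, each chunk's forward pass is wrapped in
    activation checkpointing, so its intermediate activations are dropped
    after the forward pass and recomputed during backward.
    """
    outputs = []
    for chunk in torch.split(tokens, chunk_size, dim=0):
        if recompute:
            out = checkpoint(expert, chunk, use_reentrant=False)
        else:
            out = expert(chunk)
        outputs.append(out)
    return torch.cat(outputs, dim=0)


def moe_layer_forward(experts, hidden, assignments, chunk_size, recompute_plan):
    """Dispatch tokens to experts and process each expert chunk by chunk.

    `assignments[i]` is the expert chosen for token i, and `recompute_plan[e]`
    says whether expert e's chunks should be recomputed; both are
    simplifications for illustration.
    """
    output = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):
        idx = (assignments == e).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue  # no tokens routed to this expert in this step
        routed = hidden[idx]
        output[idx] = chunked_expert_forward(expert, routed, chunk_size,
                                             recompute_plan[e])
    return output
```

Wrapping each chunk in activation checkpointing keeps at most one chunk's intermediate activations live per expert, which is where the activation‑memory savings come from; the cost is one extra forward pass per recomputed chunk during the backward pass.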
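
The scheduler's decision rule can similarly be illustrated with a toy closed‑form memory estimate. The formula below is an assumed simplification for illustration only, not the paper's model: it treats peak activation memory as whatever is live at once and greedily picks the largest chunk size that fits the budget, since larger chunks amortize recomputation overhead better.

```python
# Toy, assumed memory model and chunk-size selection rule for illustration;
# MemFine's actual closed-form model and scheduler are more detailed.

def estimated_peak_activation_bytes(tokens_per_expert, chunk_size,
                                    bytes_per_token_act, recompute):
    """Rough peak-activation estimate for one expert's workload.

    Without recomputation, activations for every routed token stay resident;
    with chunked recomputation, only one chunk's activations are live at a
    time (saved chunk inputs are ignored here for simplicity).
    """
    live = min(chunk_size, tokens_per_expert) if recompute else tokens_per_expert
    return live * bytes_per_token_act


def choose_chunk_size(tokens_per_expert, bytes_per_token_act, memory_budget,
                      candidate_sizes=(512, 1024, 2048, 4096)):
    """Pick the largest candidate chunk size whose estimate fits the budget.

    Larger chunks mean fewer recomputation launches and better throughput,
    so we prefer the biggest size that still respects the memory budget.
    """
    best = None
    for size in sorted(candidate_sizes):
        est = estimated_peak_activation_bytes(tokens_per_expert, size,
                                              bytes_per_token_act,
                                              recompute=True)
        if est <= memory_budget:
            best = size
    return best  # None: even the smallest chunk exceeds the budget


# Example: 100k tokens routed to a "hot" expert, 4 KiB of activations per
# token, and a 256 MiB activation budget for that expert -> picks 4096.
print(choose_chunk_size(100_000, 4 * 1024, 256 * 1024 * 1024))
```

In MemFine the analogous decision is refreshed every training step, because the number of tokens routed to each expert changes with the routing pattern.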
Results & Findings
| Metric | Baseline (full recompute) | MemFine |
|---|---|---|
| Activation memory (peak) | 100 % | 52 % (≈ 48 % reduction) |
| Training throughput (tokens/s) | 100 % | 104.4 % (≈ 4.4 % gain) |
| Model accuracy (GLUE benchmark) | 78.2 % | 78.0 % (negligible drop) |
- Memory Savings: By cutting activation memory almost in half, models that previously crashed on 16 GB GPUs now train stably.
- Throughput: The fine‑grained scheduling adds only a tiny recomputation overhead, which is outweighed by the reduced memory‑induced stalls.
- Accuracy: Because MemFine recomputes the discarded activations with exact forward passes, the training computation is mathematically unchanged and final model quality is essentially unaffected.
Practical Implications
- Cost‑Effective Scaling: Companies can push MoE models to billions of parameters without buying expensive multi‑GPU servers; a single 16‑32 GB GPU suffices for many workloads.
- Simpler DevOps: No need to manually tune expert capacities or batch sizes to fit memory—MemFine’s scheduler handles it automatically, reducing engineering overhead.
- Broader Accessibility: Researchers and startups with limited hardware budgets can experiment with state‑of‑the‑art MoE architectures that were previously out of reach.
- Integration Path: Since MemFine plugs into existing MoE libraries, developers can adopt it with a few lines of code, gaining immediate memory benefits without rewriting model logic.
Limitations & Future Work
- Recomputation Overhead on Very Large Batches: While the scheduler mitigates it, extremely large batch sizes can still cause noticeable recomputation latency.
- Hardware Diversity: The current evaluation focuses on NVIDIA GPUs; extending the approach to TPUs or AMD GPUs may require additional tuning.
- Dynamic Expert Count: MemFine assumes a static number of experts per layer; handling models that add/remove experts during training is an open challenge.
- Future Directions: The authors plan to explore adaptive chunk sizes based on runtime profiling, combine MemFine with mixed‑precision training, and open‑source a more generalized scheduler API for broader community adoption.
Authors
- Lu Zhao
- Rong Shi
- Shaoqing Zhang
- Yueqiang Chen
- Baoguo He
- Hongfeng Sun
- Ziqing Yin
- Shangchao Su
- Zhiyan Cui
- Liang Dong
- Xiyuan Li
- Lingbin Wang
- Jianwei He
- Jiesong Ma
- Weikang Huang
- Jianglei Tong
- Dongdong Gao
- Jian Zhang
- Hong Tian
Paper Information
- arXiv ID: 2511.21431v1
- Categories: cs.DC
- Published: November 26, 2025