[Paper] OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Source: arXiv - 2512.03927v1
Overview
The paper introduces OD‑MoE, an inference framework that lets Mixture‑of‑Experts (MoE) language models run on small edge devices without a dedicated GPU‑resident cache for expert weights. By loading experts on demand from a pool of distributed nodes and predicting which experts will be needed ahead of time, OD‑MoE cuts GPU memory requirements to roughly a third of a fully cached deployment while retaining about three‑quarters of its decoding throughput.
Key Contributions
- Cache‑free expert loading: Eliminates the traditional GPU‑resident expert cache, enabling MoE inference on GPUs with < 1 GB memory.
- Distributed on‑demand loading: Parallelizes both expert transfer and computation across multiple edge nodes, so the next expert is ready just before it is needed.
- Ultra‑accurate emulative predictor: A lightweight predictor forecasts expert activations several layers in advance with 99.94 % accuracy, far outperforming prior offloading schemes.
- Comprehensive benchmark: Shows OD‑MoE reaches ~75 % of the decoding throughput of a fully cached MoE while using only one‑third of the GPU memory, validated on a 10‑node testbed.
Methodology
- System Architecture – OD‑MoE treats a cluster of edge devices as a shared memory fabric. Each node holds a slice of the total expert pool in its CPU memory.
- Parallel Loading & Execution – While the current layer’s experts are being computed on the GPU, a background thread streams the next‑needed experts from the remote nodes to the GPU. As soon as an expert finishes its forward pass, it is evicted, freeing space for the upcoming one.
- Emulative Prediction Engine – Instead of waiting for the routing decision at each layer, OD‑MoE runs a tiny “emulator” that mimics the routing logic ahead of time (e.g., 2‑3 layers ahead). Using only the input token embeddings and the cheap‑to‑compute routing logits, the emulator predicts the exact set of experts that will be activated later.
- Just‑In‑Time (JIT) Scheduling – The predictor’s output drives a scheduler that assigns each upcoming expert to the node that can deliver it fastest, balancing network latency and GPU memory pressure.
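The summary above does not spell out the scheduler’s cost model, so the sketch below is a minimal illustration under simple assumptions: each candidate node holding the needed expert is scored by its measured latency plus an estimated transfer time, inflated by transfers already queued on it, and the cheapest node wins. `NodeStats`, `transfer_cost_ms`, and `pick_source_node` are illustrative names, not APIs from the paper.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-node state the scheduler might track (illustrative, not from the paper)."""
    node_id: int
    latency_ms: float       # measured round-trip latency to this node
    bandwidth_mbps: float   # recent transfer bandwidth observed from this node
    pending_transfers: int  # expert transfers already queued on this node

def transfer_cost_ms(expert_size_mb: float, node: NodeStats) -> float:
    """Rough estimate of when `node` could deliver one expert of the given size."""
    one_transfer_ms = expert_size_mb * 8.0 / node.bandwidth_mbps * 1000.0  # MB -> Mbit -> ms
    return node.latency_ms + one_transfer_ms * (1 + node.pending_transfers)

def pick_source_node(expert_id: int,
                     replicas: dict[int, list[NodeStats]],
                     expert_size_mb: float) -> NodeStats:
    """Choose the replica node expected to deliver `expert_id` soonest."""
    return min(replicas[expert_id], key=lambda n: transfer_cost_ms(expert_size_mb, n))
```

A real scheduler would also gate prefetches on free GPU memory (the memory‑pressure half of the bullet above) and keep refreshing its latency and bandwidth estimates as transfers complete.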
The whole pipeline is designed to be non‑blocking: GPU compute never stalls waiting for data, and network traffic is overlapped with model execution, as sketched below.
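To make the overlap concrete, here is a minimal decode‑loop sketch under the same assumptions. `predict_experts`, `route`, `fetch_expert`, and `expert_forward` are placeholders for the emulative predictor, the real router, the network transfer, and the expert computation; `REPLICAS` and `EXPERT_SIZE_MB` are placeholder globals describing where each expert lives and how large it is; `pick_source_node` comes from the scheduler sketch above. None of these names is taken from the paper.

```python
import threading
from queue import Queue

LOOKAHEAD = 2  # layers of predictor lookahead (the paper reports 2-3 layers)

def prefetch_worker(requests: Queue, gpu_buffer: dict) -> None:
    """Background thread: stream requested expert weights from remote nodes to the GPU."""
    while True:
        item = requests.get()
        if item is None:                       # shutdown sentinel
            return
        layer, expert_id = item
        node = pick_source_node(expert_id, REPLICAS, EXPERT_SIZE_MB)  # scheduler sketch above
        gpu_buffer[(layer, expert_id)] = fetch_expert(node, layer, expert_id)

def decode_step(hidden_state, num_layers: int):
    """One token's forward pass: compute layer L while layer L + LOOKAHEAD is in flight."""
    requests: Queue = Queue()
    gpu_buffer: dict = {}
    threading.Thread(target=prefetch_worker, args=(requests, gpu_buffer), daemon=True).start()

    # Warm-up: request the experts predicted for the first LOOKAHEAD layers.
    for layer in range(min(LOOKAHEAD, num_layers)):
        for expert_id in predict_experts(hidden_state, layer):
            requests.put((layer, expert_id))

    for layer in range(num_layers):
        # Overlap: queue transfers for the layer LOOKAHEAD steps ahead of the one computed now.
        if layer + LOOKAHEAD < num_layers:
            for expert_id in predict_experts(hidden_state, layer + LOOKAHEAD):
                requests.put((layer + LOOKAHEAD, expert_id))

        # Compute with the experts the real router selects; a mis-prediction (or a transfer
        # that has not landed yet) falls back to a blocking fetch, which is rare at the
        # reported 99.94 % prediction accuracy.
        for expert_id in route(hidden_state, layer):
            weights = gpu_buffer.pop((layer, expert_id), None)
            if weights is None:
                node = pick_source_node(expert_id, REPLICAS, EXPERT_SIZE_MB)
                weights = fetch_expert(node, layer, expert_id)
            hidden_state = expert_forward(weights, hidden_state)
            del weights                        # drop the reference so GPU memory is reclaimed

    requests.put(None)                         # stop the prefetch thread
    return hidden_state
```

A production implementation would presumably use pinned host buffers and a dedicated copy stream rather than a plain Python thread, but the control flow above captures the overlap the paper describes: prediction runs ahead, transfers ride in the background, and each expert is dropped as soon as its forward pass completes.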
Results & Findings
| Metric | OD‑MoE | Prior Offloading (e.g., DeepSpeed‑MoE) | Fully Cached MoE |
|---|---|---|---|
| Expert activation prediction accuracy | 99.94 % | ~85 % | N/A (no prediction needed) |
| Decoding throughput (relative to fully cached) | 0.75× | 0.45× | 1× |
| GPU memory usage (relative to fully cached) | ≈ 0.33× | 0.5× | 1× |
| Minimum GPU memory to run MoE | < 1 GB | ~2 GB | > 3 GB |
Key takeaways
- The predictor’s near‑perfect accuracy means almost no mis‑predicted expert loads, avoiding costly rollbacks (a rough estimate follows after this list).
- Overlapping transfer and compute recovers most of the speed lost by not caching experts.
- Memory savings are dramatic, opening MoE deployment to commodity edge GPUs (e.g., Jetson Nano, RTX 3050).
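To put the 99.94 % figure in perspective, here is a back‑of‑the‑envelope estimate; the layer count and routing top‑k are illustrative assumptions, not figures from the paper, and the accuracy is treated as per‑selection. With a 24‑layer MoE using top‑2 routing, each decoded token triggers 48 expert selections, so the expected number of mis‑predicted loads per token is about 48 × 0.0006 ≈ 0.03, i.e., roughly one blocking fetch every 35 tokens.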
Practical Implications
- Edge AI services: Developers can now host sophisticated LLM‑style assistants on low‑cost IoT gateways, enabling on‑device privacy‑preserving inference without cloud round‑trips.
- Scalable inference farms: A fleet of cheap edge nodes can collectively serve a large MoE model, reducing reliance on expensive, high‑memory GPU servers.
- Dynamic workload balancing: The JIT scheduler can be extended to consider power budgets or network congestion, making OD‑MoE suitable for mobile or battery‑operated devices.
- Simplified deployment pipelines: No need to pre‑select “popular” experts for caching; the system automatically learns activation patterns at runtime, lowering engineering overhead.
Limitations & Future Work
- Network dependency: Performance degrades if inter‑node bandwidth or latency spikes; the paper assumes a high‑speed LAN.
- Predictor overhead: Although lightweight, the emulative predictor adds extra compute that may become noticeable on ultra‑low‑power CPUs.
- Scalability beyond 10 nodes: Experiments stop at ten nodes; larger clusters could introduce scheduling complexity and contention.
- Model types: The study focuses on MoE‑based LLMs; whether the same on‑demand loading carries over to other sparsely activated architectures and MoE variants with different routing (e.g., top‑1‑routed Switch Transformers) remains an open question.
Future research directions include adaptive bandwidth‑aware scheduling, integration with heterogeneous accelerators (TPU, NPU), and extending the predictor to handle dynamic routing policies that evolve during fine‑tuning.
Authors
- Liujianfu Wang
- Yuyang Du
- Yuchen Pan
- Soung Chang Liew
- Jiacheng Liu
- Kexin Chen
Paper Information
- arXiv ID: 2512.03927v1
- Categories: cs.DC
- Published: December 3, 2025