[Paper] OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Source: arXiv - 2512.03927v1
Overview
The paper introduces OD‑MoE, an inference framework that lets Mixture‑of‑Experts (MoE) language models run on small edge devices without a dedicated GPU‑resident cache for expert weights. By loading experts on demand from a pool of distributed nodes and predicting which experts will be needed ahead of time, OD‑MoE cuts GPU memory requirements to roughly a third of a fully cached deployment while retaining about three‑quarters of its decoding throughput.
Key Contributions
- Cache‑free expert loading: Eliminates the traditional GPU‑resident expert cache, enabling MoE inference on GPUs with < 1 GB memory.
- Distributed on‑demand loading: Parallelizes both expert transfer and computation across multiple edge nodes, so the next expert is ready just before it is needed.
- Ultra‑accurate emulative predictor: A lightweight predictor forecasts expert activations several layers in advance with 99.94 % accuracy, far outperforming prior offloading schemes.
- Comprehensive benchmark: Shows OD‑MoE reaches ~75 % of the decoding throughput of a fully cached MoE while using only one‑third of the GPU memory, validated on a 10‑node testbed.
Methodology
- System Architecture – OD‑MoE treats a cluster of edge devices as a shared memory fabric. Each node holds a slice of the total expert pool in its CPU memory.
- Parallel Loading & Execution – While the current layer’s experts are being computed on the GPU, a background thread streams the next‑needed experts from the remote nodes to the GPU. As soon as an expert finishes its forward pass, it is evicted, freeing space for the upcoming one.
- Emulative Prediction Engine – Instead of waiting for the routing decision at each layer, OD‑MoE runs a tiny “emulator” that mimics the routing logic ahead of time (e.g., 2‑3 layers ahead). Using only the input token embeddings and the cheap‑to‑compute routing logits, the emulator predicts the exact set of experts that will be activated later.
- Just‑In‑Time (JIT) Scheduling – The predictor’s output drives a scheduler that assigns each upcoming expert to the node that can deliver it fastest, balancing network latency and GPU memory pressure.
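The summary above does not spell out the scheduler’s cost model, so the sketch below is a minimal illustration under simple assumptions: each candidate node holding the needed expert is scored by its measured latency plus an estimated transfer time, inflated by transfers already queued on it, and the cheapest node wins. `NodeStats`, `transfer_cost_ms`, and `pick_source_node` are illustrative names, not APIs from the paper.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-node state the scheduler might track (illustrative, not from the paper)."""
    node_id: int
    latency_ms: float       # measured round-trip latency to this node
    bandwidth_mbps: float   # recent transfer bandwidth observed from this node
    pending_transfers: int  # expert transfers already queued on this node

def transfer_cost_ms(expert_size_mb: float, node: NodeStats) -> float:
    """Rough estimate of when `node` could deliver one expert of the given size."""
    one_transfer_ms = expert_size_mb * 8.0 / node.bandwidth_mbps * 1000.0  # MB -> Mbit -> ms
    return node.latency_ms + one_transfer_ms * (1 + node.pending_transfers)

def pick_source_node(expert_id: int,
                     replicas: dict[int, list[NodeStats]],
                     expert_size_mb: float) -> NodeStats:
    """Choose the replica node expected to deliver `expert_id` soonest."""
    return min(replicas[expert_id], key=lambda n: transfer_cost_ms(expert_size_mb, n))
```

A real scheduler would also gate prefetches on free GPU memory (the memory‑pressure half of the bullet above) and keep refreshing its latency and bandwidth estimates as transfers complete.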
The whole pipeline is designed to be non‑blocking: GPU compute never stalls waiting for data, and network traffic is overlapped with model execution, as sketched below.
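To make the overlap concrete, here is a minimal decode‑loop sketch under the same assumptions. `predict_experts`, `route`, `fetch_expert`, and `expert_forward` are placeholders for the emulative predictor, the real router, the network transfer, and the expert computation; `REPLICAS` and `EXPERT_SIZE_MB` are placeholder globals describing where each expert lives and how large it is; `pick_source_node` comes from the scheduler sketch above. None of these names is taken from the paper.

```python
import threading
from queue import Queue

LOOKAHEAD = 2  # layers of predictor lookahead (the paper reports 2-3 layers)

def prefetch_worker(requests: Queue, gpu_buffer: dict) -> None:
    """Background thread: stream requested expert weights from remote nodes to the GPU."""
    while True:
        item = requests.get()
        if item is None:                       # shutdown sentinel
            return
        layer, expert_id = item
        node = pick_source_node(expert_id, REPLICAS, EXPERT_SIZE_MB)  # scheduler sketch above
        gpu_buffer[(layer, expert_id)] = fetch_expert(node, layer, expert_id)

def decode_step(hidden_state, num_layers: int):
    """One token's forward pass: compute layer L while layer L + LOOKAHEAD is in flight."""
    requests: Queue = Queue()
    gpu_buffer: dict = {}
    threading.Thread(target=prefetch_worker, args=(requests, gpu_buffer), daemon=True).start()

    # Warm-up: request the experts predicted for the first LOOKAHEAD layers.
    for layer in range(min(LOOKAHEAD, num_layers)):
        for expert_id in predict_experts(hidden_state, layer):
            requests.put((layer, expert_id))

    for layer in range(num_layers):
        # Overlap: queue transfers for the layer LOOKAHEAD steps ahead of the one computed now.
        if layer + LOOKAHEAD < num_layers:
            for expert_id in predict_experts(hidden_state, layer + LOOKAHEAD):
                requests.put((layer + LOOKAHEAD, expert_id))

        # Compute with the experts the real router selects; a mis-prediction (or a transfer
        # that has not landed yet) falls back to a blocking fetch, which is rare at the
        # reported 99.94 % prediction accuracy.
        for expert_id in route(hidden_state, layer):
            weights = gpu_buffer.pop((layer, expert_id), None)
            if weights is None:
                node = pick_source_node(expert_id, REPLICAS, EXPERT_SIZE_MB)
                weights = fetch_expert(node, layer, expert_id)
            hidden_state = expert_forward(weights, hidden_state)
            del weights                        # drop the reference so GPU memory is reclaimed

    requests.put(None)                         # stop the prefetch thread
    return hidden_state
```

A production implementation would presumably use pinned host buffers and a dedicated copy stream rather than a plain Python thread, but the control flow above captures the overlap the paper describes: prediction runs ahead, transfers ride in the background, and each expert is dropped as soon as its forward pass completes.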
Results & Findings
| Metric | OD‑MoE | Prior Offloading (e.g., DeepSpeed‑MoE) | Fully Cached MoE |
|---|---|---|---|
| Expert activation prediction accuracy | 99.94 % | ~85 % | N/A (no prediction needed) |
| Decoding throughput (relative to fully cached) | 0.75× | 0.45× | 1× |
| GPU memory usage (relative to fully cached) | ≈ 0.33× | 0.5× | 1× |
| Minimum GPU memory to run MoE | < 1 GB | ~2 GB | > 3 GB |
Key takeaways
- The predictor’s near‑perfect accuracy means almost no mis‑predicted expert loads, avoiding costly rollbacks (a rough estimate follows after this list).
- Overlapping transfer and compute recovers most of the speed lost by not caching experts.
- Memory savings are dramatic, opening MoE deployment to commodity edge GPUs (e.g., Jetson Nano, RTX 3050).
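To put the 99.94 % figure in perspective, here is a back‑of‑the‑envelope estimate; the layer count and routing top‑k are illustrative assumptions, not figures from the paper, and the accuracy is treated as per‑selection. With a 24‑layer MoE using top‑2 routing, each decoded token triggers 48 expert selections, so the expected number of mis‑predicted loads per token is about 48 × 0.0006 ≈ 0.03, i.e., roughly one blocking fetch every 35 tokens.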
Practical Implications
- Edge AI services: Developers can now host sophisticated LLM‑style assistants on low‑cost IoT gateways, enabling on‑device privacy‑preserving inference without cloud round‑trips.
- Scalable inference farms: A fleet of cheap edge nodes can collectively serve a large MoE model, reducing reliance on expensive, high‑memory GPU servers.
- Dynamic workload balancing: The JIT scheduler can be extended to consider power budgets or network congestion, making OD‑MoE suitable for mobile or battery‑operated devices.
- Simplified deployment pipelines: No need to pre‑select “popular” experts for caching; the system automatically learns activation patterns at runtime, lowering engineering overhead.
Limitations & Future Work
- Network dependency: Performance degrades if inter‑node bandwidth or latency spikes; the paper assumes a high‑speed LAN.
- Predictor overhead: Although lightweight, the emulative predictor adds extra compute that may become noticeable on ultra‑low‑power CPUs.
- Scalability beyond 10 nodes: Experiments stop at ten nodes; larger clusters could introduce scheduling complexity and contention.
- Model types: The study focuses on MoE‑based LLMs; whether the same on‑demand loading carries over to other sparsely activated architectures and MoE variants with different routing (e.g., top‑1‑routed Switch Transformers) remains an open question.
Future research directions include adaptive bandwidth‑aware scheduling, integration with heterogeneous accelerators (TPU, NPU), and extending the predictor to handle dynamic routing policies that evolve during fine‑tuning.
Authors
- Liujianfu Wang
- Yuyang Du
- Yuchen Pan
- Soung Chang Liew
- Jiacheng Liu
- Kexin Chen
Paper Information
- arXiv ID: 2512.03927v1
- Categories: cs.DC
- Published: December 3, 2025