[Paper] A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
Source: arXiv - 2601.03992v1
Overview
The paper presents a new inference framework that makes Mixture‑of‑Experts (MoE) models run efficiently on edge devices equipped with GPU‑NDP (Near‑Data Processing) architectures. By tackling load imbalance, GPU under‑utilization, and the profiling cost of expert pre‑fetching, the authors cut end‑to‑end latency by more than 2×, opening the door to sophisticated AI workloads on power‑constrained edge hardware.
Key Contributions
- Tensor‑parallel MoE inference: Leverages otherwise under‑exploited tensor parallelism to split each expert's large weight matrices across several NDP units, enabling efficient low‑batch edge inference.
- Load‑balancing‑aware scheduler: Dynamically assigns expert computations between the GPU and NDP cores, maximizing overall hardware utilization despite the irregular expert activation patterns.
- Dataset‑free pre‑fetching: Introduces a lightweight, statistics‑driven strategy that predicts and loads the most‑likely experts ahead of time, eliminating expensive profiling passes.
- Comprehensive evaluation: Demonstrates an average 2.41× (peak 2.56×) end‑to‑end speed‑up on real‑world MoE models compared with the best existing edge GPU‑NDP baselines.
Methodology
Tensor Parallelism for Experts
- Traditional MoE inference runs each selected expert on a single NDP unit, leading to idle resources when only a few experts fire.
- The authors partition the weight matrix of each expert across multiple NDP cores (similar to model parallelism in large language models). This allows many NDP units to collaborate on a single expert, keeping them busy even when batch sizes are tiny—a common scenario on edge devices. A minimal sketch of this sharding appears below.
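To make the partitioning concrete, here is a minimal NumPy sketch of splitting one expert's feed‑forward weights across several NDP units and summing their partial results. The shapes, the ReLU activation, the `num_ndp_units` parameter, and the plain matrix multiplies are illustrative assumptions, not the authors' kernels; a real runtime would replace the final sum with an all‑reduce across units.

```python
import numpy as np

def shard_expert(w_in, w_out, num_ndp_units):
    """Split one expert's FFN weights across NDP units.

    w_in:  (hidden, ffn) up-projection weight
    w_out: (ffn, hidden) down-projection weight
    Column-shard w_in and row-shard w_out so each NDP unit holds one slice.
    """
    w_in_shards = np.array_split(w_in, num_ndp_units, axis=1)
    w_out_shards = np.array_split(w_out, num_ndp_units, axis=0)
    return list(zip(w_in_shards, w_out_shards))

def run_expert_tensor_parallel(x, shards):
    """Each NDP unit computes a partial FFN output; summing the partials
    plays the role of the all-reduce a real GPU-NDP runtime would perform."""
    partials = []
    for w_in_s, w_out_s in shards:
        h = np.maximum(x @ w_in_s, 0.0)  # per-shard up-projection + ReLU (assumed activation)
        partials.append(h @ w_out_s)     # per-shard partial down-projection
    return np.sum(partials, axis=0)

# Tiny usage example: one token, one expert, split across 4 NDP units.
hidden, ffn, units = 1024, 4096, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden))
shards = shard_expert(rng.standard_normal((hidden, ffn)),
                      rng.standard_normal((ffn, hidden)), units)
y = run_expert_tensor_parallel(x, shards)  # shape (1, hidden)
```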
Scheduling Algorithm
- The scheduler first profiles the expert activation distribution for a given input batch (using a cheap runtime histogram).
- It then solves a lightweight bin‑packing problem that maps expert shards to NDP units and the GPU, aiming to equalize the compute load while respecting memory constraints.
- The schedule is recomputed only when the activation pattern changes significantly, keeping overhead low. A toy version of this mapping is sketched below.
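The sketch below illustrates the load‑balancing idea with a greedy longest‑processing‑time heuristic for the bin‑packing step. The cost model (activation count × per‑shard cost), the device list, and the omission of memory constraints and device‑specific throughput are simplifying assumptions for illustration; the paper's actual formulation may differ.

```python
import heapq
from collections import Counter

def schedule_shards(activation_hist, shard_costs, devices):
    """Greedily assign (expert, shard) work items to devices so that the
    estimated per-device load stays as even as possible.

    activation_hist: Counter mapping expert_id -> how often it fired
    shard_costs:     dict mapping (expert_id, shard_id) -> cost estimate (e.g. FLOPs)
    devices:         list of device names, e.g. ["gpu", "ndp0", "ndp1", ...]
    """
    # Heaviest work items first (longest-processing-time heuristic).
    items = sorted(
        ((activation_hist[e] * cost, (e, s)) for (e, s), cost in shard_costs.items()),
        reverse=True,
    )
    # Min-heap of (current_load, device): always place the next item on the
    # least-loaded device.
    loads = [(0.0, d) for d in devices]
    heapq.heapify(loads)
    placement = {}
    for cost, item in items:
        load, dev = heapq.heappop(loads)
        placement[item] = dev
        heapq.heappush(loads, (load + cost, dev))
    return placement

# Usage: 4 experts with 2 shards each, one GPU plus four NDP units.
hist = Counter({0: 12, 1: 3, 2: 7, 3: 1})
costs = {(e, s): 1.0 for e in range(4) for s in range(2)}
plan = schedule_shards(hist, costs, ["gpu", "ndp0", "ndp1", "ndp2", "ndp3"])
```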
Dataset‑Free Pre‑fetching
- Instead of running a full data‑driven profiling phase, the system maintains a running count of how often each expert is selected.
- Frequently accessed experts are proactively copied into the NDP’s local memory before inference begins, reducing the “cold‑start” latency for those experts (see the sketch below).
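A minimal sketch of the frequency‑based policy follows. The class name, the `prefetch_budget` and `decay` parameters, and the exact update rule are illustrative placeholders rather than the paper's interface; only the core idea of an online, dataset‑free running count comes from the source.

```python
from collections import defaultdict

class ExpertPrefetcher:
    """Dataset-free prefetch policy: keep a decayed running count of expert
    selections and pin the most frequent experts in NDP local memory."""

    def __init__(self, prefetch_budget=2, decay=0.99):
        self.counts = defaultdict(float)
        self.prefetch_budget = prefetch_budget  # how many experts fit in NDP SRAM
        self.decay = decay                      # slowly forget stale statistics

    def observe(self, selected_experts):
        """Update the running counts with the experts the router just chose."""
        for e in self.counts:
            self.counts[e] *= self.decay
        for e in selected_experts:
            self.counts[e] += 1.0

    def experts_to_prefetch(self):
        """Return the expert IDs worth copying into NDP memory before the next step."""
        ranked = sorted(self.counts, key=self.counts.get, reverse=True)
        return ranked[: self.prefetch_budget]

# Usage: update the statistics after each routing decision, then prefetch.
prefetcher = ExpertPrefetcher(prefetch_budget=2)
for step_selection in [[0, 3], [0, 1], [0, 3], [2, 3]]:
    prefetcher.observe(step_selection)
hot = prefetcher.experts_to_prefetch()  # the two most frequently selected experts
```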
Implementation Details
- Built on top of a CUDA‑compatible GPU‑NDP prototype where each NDP core has a small on‑chip SRAM and a dedicated compute pipeline.
- Uses standard CUDA kernels for the GPU side and custom micro‑kernels for the NDP side, orchestrated via a lightweight runtime library; a rough dispatch sketch follows below.
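As a rough illustration of how such a runtime might tie the pieces together, the sketch below routes each scheduled shard to either a GPU kernel or an NDP micro‑kernel. The function names (`launch_gpu_kernel`, `launch_ndp_kernel`), the `placement` format, and the dispatch interface are hypothetical, not the prototype's actual API.

```python
from collections import defaultdict

def dispatch(placement, expert_inputs, launch_gpu_kernel, launch_ndp_kernel):
    """Route each scheduled (expert, shard) work item to its assigned device.

    placement:     dict mapping (expert_id, shard_id) -> device name ("gpu" or "ndpK")
    expert_inputs: dict mapping expert_id -> activations routed to that expert
    The two launch_* callables stand in for the prototype's real CUDA kernels
    and NDP micro-kernels.
    """
    partial_outputs = defaultdict(list)
    for (expert_id, shard_id), device in placement.items():
        x = expert_inputs[expert_id]
        if device == "gpu":
            out = launch_gpu_kernel(expert_id, shard_id, x)
        else:
            out = launch_ndp_kernel(device, expert_id, shard_id, x)
        partial_outputs[expert_id].append(out)
    # The per-expert partials would then be reduced (summed), as in the
    # tensor-parallel sketch above.
    return partial_outputs
```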
Results & Findings
| Metric | Baseline (state‑of‑the‑art) | Proposed Framework |
|---|---|---|
| End‑to‑end speed‑up (average) | 1.00× (reference) | 2.41× |
| End‑to‑end speed‑up (peak) | – | 2.56× |
| GPU utilization (during expert compute) | ~35 % | ~78 % |
| NDP unit load variance (std. dev.) | High (imbalanced) | Low (balanced) |
| Pre‑fetch overhead | Requires full dataset profiling | Negligible (online stats) |
The experiments span several MoE configurations (2–8 experts, hidden sizes 1K–4K) and realistic edge workloads (speech recognition, recommendation). The framework consistently reduces the tail latency that typically bottlenecks edge AI services.
Practical Implications
- Edge AI services (e.g., voice assistants, on‑device recommendation) can now run larger, more accurate MoE models without sacrificing response time.
- Developer tooling: The scheduling library can be integrated into existing inference stacks (TensorRT, ONNX Runtime) to automatically exploit NDP hardware, abstracting away the complexity of tensor parallelism.
- Hardware design guidance: Shows that modest on‑chip memory in NDP units combined with a smart scheduler yields outsized performance gains, informing next‑generation edge GPU‑NDP chip designs.
- Cost & power savings: Higher hardware utilization translates to lower idle power draw, extending battery life for mobile and IoT devices that host AI workloads.
Limitations & Future Work
- The current scheduler assumes a relatively static expert activation distribution; rapid shifts in input domains may require more frequent rescheduling, adding overhead.
- Experiments are limited to a prototype NDP platform; scaling to commercial edge GPUs with different memory hierarchies may expose new bottlenecks.
- The pre‑fetching strategy relies on a simple frequency count; more sophisticated predictive models (e.g., reinforcement learning) could further reduce miss rates.
- Future work includes extending the framework to support dynamic MoE routing (where experts are chosen on‑the‑fly) and exploring cross‑device scheduling for multi‑edge scenarios.
Authors
- Qi Wu
- Chao Fang
- Jiayuan Chen
- Ye Lin
- Yueqi Zhang
- Yichuan Bai
- Yuan Du
- Li Du
Paper Information
- arXiv ID: 2601.03992v1
- Categories: cs.DC, cs.AI
- Published: January 7, 2026