[Paper] DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
Source: arXiv - 2602.03495v1
Overview
Mixture‑of‑Experts (MoE) models let large language models (LLMs) grow their capacity without a proportional rise in compute, but they also balloon the parameter count—making them hard to run on a typical desktop PC. DALI (Workload‑Aware Offloading) tackles this by intelligently splitting MoE experts between the GPU and the host CPU, and by prefetching and caching them in a way that respects the dynamic workload each expert sees during inference. The result is a noticeable speed‑up for both the initial prompt processing (prefill) and the token‑by‑token generation (decoding) on commodity hardware.
Key Contributions
- Dynamic expert placement – Formulates CPU/GPU expert assignment as a 0‑1 integer optimization problem and solves it at runtime with a fast greedy algorithm, eliminating static‑assignment load imbalance.
- Residual‑based prefetching – Uses inter‑layer residual activations to predict which experts will become “hot” in the next step, dramatically improving prefetch accuracy.
- Workload‑aware cache replacement – Introduces a GPU cache policy that exploits temporal correlations in expert usage, boosting cache hit rates compared with naïve LRU/LFU schemes.
- Comprehensive evaluation – Demonstrates up to 2–3× speed‑ups on a range of MoE models (e.g., Switch‑Transformer, GLaM) across both prefill and decoding phases on a standard PC (single GPU + CPU).
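The 0‑1 placement problem can plausibly be written as follows; the symbols are assumptions for illustration and the paper's exact objective may differ. Let $x_e \in \{0,1\}$ mark expert $e$ for GPU residence, $\ell_e$ its current token load, $s_e$ its memory footprint, and $t_e^{\mathrm{cpu}}, t_e^{\mathrm{gpu}}$ its per‑token execution times on each device:

```latex
\max_{x \in \{0,1\}^{E}} \; \sum_{e=1}^{E} x_e\, \ell_e \left( t_e^{\mathrm{cpu}} - t_e^{\mathrm{gpu}} \right)
\quad \text{subject to} \quad \sum_{e=1}^{E} x_e\, s_e \le M_{\mathrm{GPU}}
```

Under this reading, the fast greedy solver amounts to the classic density ordering for a knapsack‑style problem: place experts in decreasing order of saved time per byte until the GPU memory budget $M_{\mathrm{GPU}}$ is exhausted.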
Methodology
- Profiling the workload – During inference, DALI monitors the number of tokens routed to each expert per layer, building a lightweight “expert load vector.”
- Greedy assignment – The load vector feeds a 0‑1 integer program that decides, for the current step, which experts should live on the GPU (fast but memory‑limited) and which on the CPU (large but slower). A greedy heuristic picks the highest‑impact experts for the GPU while respecting memory caps, and re‑evaluates each step.
- Residual‑based prefetching – Instead of guessing based on past token counts alone, DALI looks at the residuals (the difference between the current activation and the layer’s mean) to infer which experts will receive a surge of tokens next. Those experts are prefetched from CPU RAM to GPU memory ahead of time.
- Cache replacement policy – The GPU cache tracks recent expert activations and, using a simple temporal‑correlation score, evicts experts that are unlikely to be needed soon, rather than the least‑recently‑used.
- Integration with existing runtimes – DALI is built as a thin wrapper around popular PyTorch‑based MoE libraries, requiring only minimal code changes from the user.
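The greedy assignment step above can be sketched in a few lines of Python. The function name, inputs, and the load‑per‑byte ranking are assumptions for illustration, not the paper's exact heuristic:

```python
def greedy_placement(loads, sizes, gpu_budget):
    """Pick experts for the GPU, highest expected benefit per byte first.

    loads      -- tokens routed to each expert in the current step
    sizes      -- parameter bytes per expert
    gpu_budget -- free GPU memory in bytes
    Returns the set of expert indices placed on the GPU; the rest
    stay in host RAM.
    """
    # Rank experts by load per byte of GPU memory they would occupy.
    order = sorted(range(len(loads)),
                   key=lambda e: loads[e] / sizes[e],
                   reverse=True)
    on_gpu, used = set(), 0
    for e in order:
        if used + sizes[e] <= gpu_budget:   # respect the memory cap
            on_gpu.add(e)
            used += sizes[e]
    return on_gpu
```

Because the load vector changes every step, re‑running this over a few dozen experts costs microseconds, which is consistent with the paper's claim that placement is re‑evaluated at each step.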
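One way the residual‑based predictor could work is to score the next layer's experts directly on the current residual activation and prefetch the top‑k. The function, shapes, and top‑k rule below are assumptions for this sketch, not the authors' implementation:

```python
def predict_hot_experts(residual, router_weights, top_k=2):
    """Hypothetical residual-based prefetch predictor.

    residual       -- current residual-stream activation (list of floats)
    router_weights -- per-expert gating rows for the next layer
                      (num_experts x hidden, list of lists)
    Returns the indices of the top_k experts to prefetch from CPU RAM
    into GPU memory before the next layer executes.
    """
    # logits[e] = <router_weights[e], residual>: a cheap proxy for how
    # many tokens expert e is about to receive.
    logits = [sum(w * r for w, r in zip(row, residual))
              for row in router_weights]
    ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    return set(ranked[:top_k])
```

The key idea this mirrors from the paper is that the residual stream carries enough signal about upcoming routing decisions to beat predictors based on past token counts alone.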
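The workload‑aware eviction policy can be illustrated with a recency‑weighted usage score: experts whose recent activations correlate with the current workload keep high scores, while cold experts decay toward eviction. The exponential decay and scoring below are assumptions, not the paper's exact temporal‑correlation metric:

```python
class WorkloadAwareCache:
    """Illustrative GPU expert cache scored by recent usage, not raw LRU."""

    def __init__(self, capacity, decay=0.8):
        self.capacity = capacity
        self.decay = decay
        self.score = {}          # expert id -> recency-weighted usage score

    def access(self, expert):
        """Record that `expert` was needed this step, evicting if full."""
        # Decay every cached expert's score, so stale activity fades.
        for e in self.score:
            self.score[e] *= self.decay
        if expert not in self.score and len(self.score) >= self.capacity:
            # Evict the expert least likely to be reused soon.
            victim = min(self.score, key=self.score.get)
            del self.score[victim]
        # Boost the expert that was actually used this step.
        self.score[expert] = self.score.get(expert, 0.0) + 1.0
```

Unlike LRU, a single stale touch does not keep an expert resident: an expert used heavily two steps ago can outrank one touched once last step, which is the temporal‑correlation behavior the paper credits for the higher hit rate.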
Results & Findings
| Model / Setting | Baseline latency (e.g., DeepSpeed‑MoE) | DALI latency | Speed‑up (Prefill) | Speed‑up (Decoding) |
|---|---|---|---|---|
| Switch‑Transformer (8B) | 12 ms / token | 6 ms / token | 2.0× | 2.3× |
| GLaM (64B) | 28 ms / token | 11 ms / token | 2.5× | 2.8× |
| Varying GPU memory (8 GB → 4 GB) | Degrades sharply | Remains stable (thanks to dynamic placement) | – | – |
- Load balance: CPU/GPU utilization converges to ~45 %/55 % (vs. 70 %/30 % in static schemes).
- Prefetch accuracy: Residual‑based method hits > 90 % of needed experts, compared to ~65 % for naïve token‑count predictors.
- Cache hit rate: Workload‑aware policy improves GPU cache hit from ~40 % to ~70 %.
Overall, DALI reduces end‑to‑end latency by up to 3× on a single‑GPU desktop while keeping memory footprints within typical consumer GPU limits.
Practical Implications
- Desktop‑level LLM serving – Developers can now host MoE‑based LLMs (e.g., for code completion, chatbots) on a laptop or workstation without needing multi‑GPU clusters.
- Cost‑effective inference – Companies can cut cloud GPU expenses by offloading a large chunk of the model to host RAM, using DALI’s smart placement to keep performance acceptable.
- Framework integration – Because DALI plugs into existing PyTorch MoE pipelines, it can be adopted in projects that already use Hugging Face Transformers, DeepSpeed, or Megatron‑LM with minimal refactoring.
- Edge‑to‑cloud hybrid deployments – The same workload‑aware principles could be extended to scenarios where part of the model lives on a remote server and part on an edge device, optimizing bandwidth and latency.
Limitations & Future Work
- CPU bottleneck on very high‑throughput workloads – When the CPU becomes saturated (e.g., ultra‑low latency serving), DALI’s dynamic placement may still leave the GPU under‑utilized.
- Heuristic nature of greedy assignment – While fast, the greedy algorithm is not guaranteed to find the global optimum; more sophisticated solvers could improve placement at the cost of overhead.
- Model‑specific tuning – The residual‑based predictor was tuned on Switch‑Transformer and GLaM; other MoE variants may require calibration.
- Future directions suggested by the authors include:
  - Extending DALI to multi‑GPU setups.
  - Exploring learned placement policies via reinforcement learning.
  - Integrating bandwidth‑aware prefetching for remote‑memory offloading scenarios.
Authors
- Zeyu Zhu
- Gang Li
- Peisong Wang
- Zitao Mo
- Minnan Pei
- Zhuoran Song
- Xiaoyao Liang
- Jian Cheng
Paper Information
- arXiv ID: 2602.03495v1
- Categories: cs.DC, cs.LG
- Published: February 3, 2026