[Paper] DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
Source: arXiv - 2602.03495v1
Overview
Mixture‑of‑Experts (MoE) models let large language models (LLMs) grow their capacity without a proportional rise in compute, but they also balloon the parameter count—making them hard to run on a typical desktop PC. DALI (Workload‑Aware Offloading) tackles this by intelligently splitting MoE experts between the GPU and the host CPU, and by prefetching and caching them in a way that respects the dynamic workload each expert sees during inference. The result is a noticeable speed‑up for both the initial prompt processing (prefill) and the token‑by‑token generation (decoding) on commodity hardware.
Key Contributions
- Dynamic expert placement – Formulates CPU/GPU expert assignment as a 0‑1 integer optimization problem and solves it at runtime with a fast greedy algorithm, eliminating static‑assignment load imbalance.
- Residual‑based prefetching – Uses inter‑layer residual activations to predict which experts will become “hot” in the next step, dramatically improving prefetch accuracy.
- Workload‑aware cache replacement – Introduces a GPU cache policy that exploits temporal correlations in expert usage, boosting cache hit rates compared with naïve LRU/LFU schemes.
- Comprehensive evaluation – Demonstrates up to 2–3× speed‑ups on a range of MoE models (e.g., Switch‑Transformer, GLaM) across both prefill and decoding phases on a standard PC (single GPU + CPU).
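The 0‑1 placement problem can plausibly be written as follows; the symbols are assumptions for illustration and the paper's exact objective may differ. Let $x_e \in \{0,1\}$ mark expert $e$ for GPU residence, $\ell_e$ its current token load, $s_e$ its memory footprint, and $t_e^{\mathrm{cpu}}, t_e^{\mathrm{gpu}}$ its per‑token execution times on each device:

```latex
\max_{x \in \{0,1\}^{E}} \; \sum_{e=1}^{E} x_e\, \ell_e \left( t_e^{\mathrm{cpu}} - t_e^{\mathrm{gpu}} \right)
\quad \text{subject to} \quad \sum_{e=1}^{E} x_e\, s_e \le M_{\mathrm{GPU}}
```

Under this reading, the fast greedy solver amounts to the classic density ordering for a knapsack‑style problem: place experts in decreasing order of saved time per byte until the GPU memory budget $M_{\mathrm{GPU}}$ is exhausted.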
Methodology
- Profiling the workload – During inference, DALI monitors the number of tokens routed to each expert per layer, building a lightweight “expert load vector.”
- Greedy assignment – The load vector feeds a 0‑1 integer program that decides, for the current step, which experts should live on the GPU (fast but memory‑limited) and which on the CPU (large but slower). A greedy heuristic picks the highest‑impact experts for the GPU while respecting memory caps, and re‑evaluates each step.
- Residual‑based prefetching – Instead of guessing based on past token counts alone, DALI looks at the residuals (the difference between the current activation and the layer’s mean) to infer which experts will receive a surge of tokens next. Those experts are prefetched from CPU RAM to GPU memory ahead of time.
- Cache replacement policy – The GPU cache tracks recent expert activations and, using a simple temporal‑correlation score, evicts experts that are unlikely to be needed soon, rather than the least‑recently‑used.
- Integration with existing runtimes – DALI is built as a thin wrapper around popular PyTorch‑based MoE libraries, requiring only minimal code changes from the user.
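The greedy assignment step above can be sketched in a few lines of Python. The function name, inputs, and the load‑per‑byte ranking are assumptions for illustration, not the paper's exact heuristic:

```python
def greedy_placement(loads, sizes, gpu_budget):
    """Pick experts for the GPU, highest expected benefit per byte first.

    loads      -- tokens routed to each expert in the current step
    sizes      -- parameter bytes per expert
    gpu_budget -- free GPU memory in bytes
    Returns the set of expert indices placed on the GPU; the rest
    stay in host RAM.
    """
    # Rank experts by load per byte of GPU memory they would occupy.
    order = sorted(range(len(loads)),
                   key=lambda e: loads[e] / sizes[e],
                   reverse=True)
    on_gpu, used = set(), 0
    for e in order:
        if used + sizes[e] <= gpu_budget:   # respect the memory cap
            on_gpu.add(e)
            used += sizes[e]
    return on_gpu
```

Because the load vector changes every step, re‑running this over a few dozen experts costs microseconds, which is consistent with the paper's claim that placement is re‑evaluated at each step.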
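One way the residual‑based predictor could work is to score the next layer's experts directly on the current residual activation and prefetch the top‑k. The function, shapes, and top‑k rule below are assumptions for this sketch, not the authors' implementation:

```python
def predict_hot_experts(residual, router_weights, top_k=2):
    """Hypothetical residual-based prefetch predictor.

    residual       -- current residual-stream activation (list of floats)
    router_weights -- per-expert gating rows for the next layer
                      (num_experts x hidden, list of lists)
    Returns the indices of the top_k experts to prefetch from CPU RAM
    into GPU memory before the next layer executes.
    """
    # logits[e] = <router_weights[e], residual>: a cheap proxy for how
    # many tokens expert e is about to receive.
    logits = [sum(w * r for w, r in zip(row, residual))
              for row in router_weights]
    ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    return set(ranked[:top_k])
```

The key idea this mirrors from the paper is that the residual stream carries enough signal about upcoming routing decisions to beat predictors based on past token counts alone.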
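The workload‑aware eviction policy can be illustrated with a recency‑weighted usage score: experts whose recent activations correlate with the current workload keep high scores, while cold experts decay toward eviction. The exponential decay and scoring below are assumptions, not the paper's exact temporal‑correlation metric:

```python
class WorkloadAwareCache:
    """Illustrative GPU expert cache scored by recent usage, not raw LRU."""

    def __init__(self, capacity, decay=0.8):
        self.capacity = capacity
        self.decay = decay
        self.score = {}          # expert id -> recency-weighted usage score

    def access(self, expert):
        """Record that `expert` was needed this step, evicting if full."""
        # Decay every cached expert's score, so stale activity fades.
        for e in self.score:
            self.score[e] *= self.decay
        if expert not in self.score and len(self.score) >= self.capacity:
            # Evict the expert least likely to be reused soon.
            victim = min(self.score, key=self.score.get)
            del self.score[victim]
        # Boost the expert that was actually used this step.
        self.score[expert] = self.score.get(expert, 0.0) + 1.0
```

Unlike LRU, a single stale touch does not keep an expert resident: an expert used heavily two steps ago can outrank one touched once last step, which is the temporal‑correlation behavior the paper credits for the higher hit rate.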
Results & Findings
| Model / Setting | Baseline latency (e.g., DeepSpeed‑MoE) | DALI latency | Speed‑up (Prefill) | Speed‑up (Decoding) |
|---|---|---|---|---|
| Switch‑Transformer (8B) | 12 ms / token | 6 ms / token | 2.0× | 2.3× |
| GLaM (64B) | 28 ms / token | 11 ms / token | 2.5× | 2.8× |
| Varying GPU memory (8 GB → 4 GB) | Degrades sharply | Remains stable (thanks to dynamic placement) | – | – |
- Load balance: CPU/GPU utilization converges to ~45 %/55 % (vs. 70 %/30 % in static schemes).
- Prefetch accuracy: Residual‑based method hits > 90 % of needed experts, compared to ~65 % for naïve token‑count predictors.
- Cache hit rate: Workload‑aware policy improves GPU cache hit from ~40 % to ~70 %.
Overall, DALI reduces end‑to‑end latency by up to 3× on a single‑GPU desktop while keeping memory footprints within typical consumer GPU limits.
Practical Implications
- Desktop‑level LLM serving – Developers can now host MoE‑based LLMs (e.g., for code completion, chatbots) on a laptop or workstation without needing multi‑GPU clusters.
- Cost‑effective inference – Companies can cut cloud GPU expenses by offloading a large chunk of the model to host RAM, using DALI’s smart placement to keep performance acceptable.
- Framework integration – Because DALI plugs into existing PyTorch MoE pipelines, it can be adopted in projects that already use Hugging Face Transformers, DeepSpeed, or Megatron‑LM with minimal refactoring.
- Edge‑to‑cloud hybrid deployments – The same workload‑aware principles could be extended to scenarios where part of the model lives on a remote server and part on an edge device, optimizing bandwidth and latency.
Limitations & Future Work
- CPU bottleneck on very high‑throughput workloads – When the CPU becomes saturated (e.g., ultra‑low latency serving), DALI’s dynamic placement may still leave the GPU under‑utilized.
- Heuristic nature of greedy assignment – While fast, the greedy algorithm is not guaranteed to find the global optimum; more sophisticated solvers could improve placement at the cost of overhead.
- Model‑specific tuning – The residual‑based predictor was tuned on Switch‑Transformer and GLaM; other MoE variants may require calibration.
- Future directions suggested by the authors include:
  - Extending DALI to multi‑GPU setups.
  - Exploring learned placement policies via reinforcement learning.
  - Integrating bandwidth‑aware prefetching for remote‑memory offloading scenarios.
Authors
- Zeyu Zhu
- Gang Li
- Peisong Wang
- Zitao Mo
- Minnan Pei
- Zhuoran Song
- Xiaoyao Liang
- Jian Cheng
Paper Information
- arXiv ID: 2602.03495v1
- Categories: cs.DC, cs.LG
- Published: February 3, 2026