[Paper] DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs

Published: February 3, 2026 at 08:11 AM EST
4 min read
Source: arXiv


Overview

Mixture‑of‑Experts (MoE) models let large language models (LLMs) grow their capacity without a proportional rise in compute, but they also balloon the parameter count—making them hard to run on a typical desktop PC. DALI (Workload‑Aware Offloading) tackles this by intelligently splitting MoE experts between the GPU and the host CPU, and by prefetching and caching them in a way that respects the dynamic workload each expert sees during inference. The result is a noticeable speed‑up for both the initial prompt processing (prefill) and the token‑by‑token generation (decoding) on commodity hardware.

Key Contributions

  • Dynamic expert placement – Formulates CPU/GPU expert assignment as a 0‑1 integer optimization problem and solves it at runtime with a fast greedy algorithm, eliminating static‑assignment load imbalance.
  • Residual‑Based Prefetching – Uses inter‑layer residual activations to predict which experts will become “hot” in the next step, dramatically improving prefetch accuracy.
  • Workload‑Aware Cache Replacement – Introduces a GPU cache policy that exploits temporal correlations in expert usage, boosting cache hit rates compared with naïve LRU/LFU schemes.
  • Comprehensive evaluation – Demonstrates up to 2×‑3× speed‑ups on a range of MoE models (e.g., Switch‑Transformer, GLaM) across both prefill and decoding phases on a standard PC (single GPU + CPU).
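The dynamic-placement contribution can be illustrated with a minimal greedy sketch. The function name, cost model, and parameters below are illustrative assumptions, not the paper's actual formulation: each expert's GPU-residency "gain" is estimated as its token load times the CPU-vs-GPU per-token cost gap, and experts are admitted to the GPU in descending gain order until the memory budget is spent.

```python
def greedy_placement(loads, expert_size, gpu_budget,
                     gpu_cost_per_token=1.0, cpu_cost_per_token=5.0):
    """Assign each expert to 'gpu' or 'cpu' for the current step.

    loads: tokens routed to each expert this step (the "expert load vector").
    expert_size: memory footprint of one expert, in the same units as
    gpu_budget. Per-token costs are illustrative placeholders.
    """
    # Estimated latency saved by keeping expert e on the GPU this step.
    savings = [(load * (cpu_cost_per_token - gpu_cost_per_token), e)
               for e, load in enumerate(loads)]
    placement = {e: "cpu" for e in range(len(loads))}
    used = 0.0
    # Greedy: take the highest-impact experts first, respecting the cap.
    for gain, e in sorted(savings, reverse=True):
        if gain <= 0:
            break
        if used + expert_size <= gpu_budget:
            placement[e] = "gpu"
            used += expert_size
    return placement
```

Because the load vector changes every step, the assignment is recomputed at runtime, which is what lets DALI avoid the load imbalance of static expert placement.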

Methodology

  1. Profiling the workload – During inference, DALI monitors the number of tokens routed to each expert per layer, building a lightweight “expert load vector.”
  2. Greedy assignment – The load vector feeds a 0‑1 integer program that decides, for the current step, which experts should live on the GPU (fast but memory‑limited) and which on the CPU (large but slower). A greedy heuristic picks the highest‑impact experts for the GPU while respecting memory caps, and re‑evaluates each step.
  3. Residual‑Based prefetching – Instead of guessing based on past token counts alone, DALI inspects the inter‑layer residual activations (the residual stream passed between layers) to infer which experts will receive a surge of tokens next. Those experts are prefetched from CPU RAM to GPU memory ahead of time.
  4. Cache replacement policy – The GPU cache tracks recent expert activations and, using a simple temporal‑correlation score, evicts experts that are unlikely to be needed soon, rather than the least‑recently‑used.
  5. Integration with existing runtimes – DALI is built as a thin wrapper around popular PyTorch‑based MoE libraries, requiring only minimal code changes from the user.
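Step 3 above can be sketched as a small selection routine. How the per-expert affinity scores are derived from the residual stream is the paper's contribution; here they are simply given as input, and the function name and parameters are hypothetical:

```python
import heapq

def experts_to_prefetch(residual_scores, resident, k=2):
    """Pick the predicted-hot experts for the next layer.

    residual_scores: per-expert affinity computed from the inter-layer
    residual stream (assumed precomputed here).
    resident: set of experts already in GPU memory.
    Returns the top-k predicted experts not yet resident, so their
    CPU-to-GPU copies can be issued before the layer needs them.
    """
    hot = heapq.nlargest(k, range(len(residual_scores)),
                         key=residual_scores.__getitem__)
    return [e for e in hot if e not in resident]
```

In a real pipeline the returned list would be handed to an asynchronous copy engine so the transfer overlaps with computation of the current layer.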
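The cache policy of step 4 can likewise be sketched with a decayed activation score standing in for the paper's temporal-correlation measure. The class name, decay factor, and scoring rule are illustrative assumptions:

```python
class WorkloadAwareCache:
    """Sketch of a GPU expert cache that evicts by a decayed activation
    score (a proxy for temporal correlation) instead of plain LRU."""

    def __init__(self, capacity, decay=0.8):
        self.capacity = capacity   # max experts resident on the GPU
        self.decay = decay         # how quickly old activations fade
        self.score = {}            # expert -> decayed activation count
        self.resident = set()

    def access(self, expert):
        """Record an activation; return the evicted expert, if any."""
        # Fade every tracked score, then credit the accessed expert.
        for e in self.score:
            self.score[e] *= self.decay
        self.score[expert] = self.score.get(expert, 0.0) + 1.0
        if expert in self.resident:
            return None            # cache hit, nothing evicted
        evicted = None
        if len(self.resident) >= self.capacity:
            # Evict the resident expert least likely to be reused soon.
            evicted = min(self.resident, key=lambda e: self.score.get(e, 0.0))
            self.resident.remove(evicted)
        self.resident.add(expert)
        return evicted
```

Under bursty expert reuse, a recently-hot expert keeps a high score even after a few idle steps, so it survives eviction where strict LRU would drop it.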

Results & Findings

| Model / Setting | Baseline (e.g., DeepSpeed‑MoE) | DALI | Speed‑up (Prefill) | Speed‑up (Decoding) |
|---|---|---|---|---|
| Switch‑Transformer (8B) | 12 ms / token | 6 ms / token | 2.0× | 2.3× |
| GLaM (64B) | 28 ms / token | 11 ms / token | 2.5× | 2.8× |
| Varying GPU memory (8 GB → 4 GB) | Degrades sharply | Remains stable (thanks to dynamic placement) | | |
  • Load balance: CPU/GPU utilization converges to ~45 %/55 % (vs. 70 %/30 % in static schemes).
  • Prefetch accuracy: Residual‑based method hits > 90 % of needed experts, compared to ~65 % for naïve token‑count predictors.
  • Cache hit rate: Workload‑aware policy improves GPU cache hit from ~40 % to ~70 %.

Overall, DALI reduces end‑to‑end latency by up to roughly 3× on a single‑GPU desktop while keeping memory footprints within typical consumer GPU limits.

Practical Implications

  • Desktop‑level LLM serving – Developers can now host MoE‑based LLMs (e.g., for code completion, chatbots) on a laptop or workstation without needing multi‑GPU clusters.
  • Cost‑effective inference – Companies can cut cloud GPU expenses by offloading a large chunk of the model to host RAM, using DALI’s smart placement to keep performance acceptable.
  • Framework integration – Because DALI plugs into existing PyTorch MoE pipelines, it can be adopted in projects that already use Hugging Face Transformers, DeepSpeed, or Megatron‑LM with minimal refactoring.
  • Edge‑to‑cloud hybrid deployments – The same workload‑aware principles could be extended to scenarios where part of the model lives on a remote server and part on an edge device, optimizing bandwidth and latency.

Limitations & Future Work

  • CPU bottleneck on very high‑throughput workloads – When the CPU becomes saturated (e.g., ultra‑low latency serving), DALI’s dynamic placement may still leave the GPU under‑utilized.
  • Heuristic nature of greedy assignment – While fast, the greedy algorithm is not guaranteed to find the global optimum; more sophisticated solvers could improve placement at the cost of overhead.
  • Model‑specific tuning – The residual‑based predictor was tuned on Switch‑Transformer and GLaM; other MoE variants may require calibration.
  • Future directions suggested by the authors include:
    1. Extending DALI to multi‑GPU setups.
    2. Exploring learned placement policies via reinforcement learning.
    3. Integrating bandwidth‑aware prefetching for remote‑memory offloading scenarios.

Authors

  • Zeyu Zhu
  • Gang Li
  • Peisong Wang
  • Zitao Mo
  • Minnan Pei
  • Zhuoran Song
  • Xiaoyao Liang
  • Jian Cheng

Paper Information

  • arXiv ID: 2602.03495v1
  • Categories: cs.DC, cs.LG
  • Published: February 3, 2026
  • PDF: Download PDF
