[Paper] Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
Source: arXiv - 2512.16473v1
Overview
Large Language Models (LLMs) have become the backbone of many AI-driven products, but running them on a typical desktop or laptop remains impractical because of their massive memory and compute requirements. Mixture-of-Experts (MoE) architectures cut down compute by activating only a few “expert” sub-networks per token, yet even the most memory-efficient MoE models still outgrow the VRAM of consumer GPUs. This paper introduces a CPU-GPU collaborative inference framework that keeps a hot cache of experts on the GPU and falls back to the CPU on a cache miss, substantially reducing data-movement latency and making MoE-based LLMs usable on memory-limited machines.
Key Contributions
- GPU‑resident expert cache: A lightweight caching layer that stores the most frequently used expert weights on the GPU, turning many inference steps into cache‑hits and avoiding costly weight transfers.
- CPU‑driven miss handling: When an expert is not in the GPU cache, the CPU fetches it, runs the computation using highly‑parallel multithreading, and optionally pushes it into the cache for future reuse.
- Unified scheduling runtime: A scheduler that dynamically decides, per token, whether to run on the GPU (cache hit) or offload to the CPU (cache miss), balancing latency and throughput; a minimal sketch of this dispatch follows this list.
- Open‑source implementation: The full prototype (including the cache manager, scheduler, and integration with popular MoE libraries) is released on GitHub, enabling reproducibility and community extensions.
- Empirical validation on consumer hardware: Benchmarks on a 16 GB RTX 3060 paired with an 8‑core CPU show up to 2.3× speed‑up over naïve CPU‑only inference and 1.6× speed‑up over a baseline GPU‑offload approach, while keeping memory usage within the GPU’s limits.
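To make the hit-or-offload decision concrete, here is a minimal sketch under simplifying assumptions: routing selects a single expert per token, experts are plain random weight matrices, the GPU cache is an ordinary dictionary, and a Python thread pool stands in for the paper's multithreaded CPU kernels. None of the names below come from the released implementation.

```python
# Illustrative sketch of the per-token hit/miss dispatch (not the paper's code).
from concurrent.futures import ThreadPoolExecutor
import torch

D, NUM_EXPERTS, CACHE_K = 256, 16, 4
device = "cuda" if torch.cuda.is_available() else "cpu"

cpu_experts = {e: torch.randn(D, D) for e in range(NUM_EXPERTS)}    # all experts in host RAM
gpu_cache = {e: cpu_experts[e].to(device) for e in range(CACHE_K)}  # a few "hot" experts in VRAM
cpu_pool = ThreadPoolExecutor(max_workers=8)                        # stand-in for the CPU miss path

def schedule(x: torch.Tensor, expert_id: int):
    """Cache hit: run on the GPU immediately. Cache miss: hand off to the CPU pool."""
    if expert_id in gpu_cache:
        return (x.to(device) @ gpu_cache[expert_id]).cpu()          # hit: no weight transfer
    return cpu_pool.submit(lambda: x @ cpu_experts[expert_id])      # miss: returns a Future

hit = schedule(torch.randn(1, D), expert_id=1)     # tensor, computed with cached weights
miss = schedule(torch.randn(1, D), expert_id=9)    # Future; the GPU is free to keep going
print(hit.shape, miss.result().shape)
```

Returning a Future on the miss path mirrors the framework's idea that the GPU can continue with cache-hit tokens while the CPU finishes the missed expert.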
Methodology
- Expert Profiling: The system first profiles the MoE model on a representative workload to identify which experts are accessed most often (e.g., based on token distribution).
- Cache Construction: The top-K experts (with K chosen so that their weights fit into the available GPU memory) are pre-loaded onto the GPU at startup. The cache is implemented as a simple hash map keyed by expert ID; a sketch of this cache and its replacement policy appears after this list.
- Dynamic Scheduling: During inference, each token's routing decision (which expert(s) to activate) is checked against the cache.
  - Cache-hit: The expert's weights are already on the GPU; the token is processed there with minimal latency.
  - Cache-miss: The request is handed to the CPU, which loads the expert weights from main memory, runs the matrix multiply using OpenMP-based parallelism, and returns the result. Optionally, the miss can trigger a cache replacement (e.g., LRU) to keep the most useful experts on the GPU.
- Synchronization & Overlap: While the CPU works on a miss, the GPU can continue processing subsequent tokens that hit the cache, overlapping compute and data movement to hide latency.
- Evaluation Setup: Experiments were run on a single‑request inference scenario (the most common pattern for chat‑style applications) using popular MoE LLMs (e.g., Switch‑Transformer‑7B) and compared against three baselines: pure CPU, pure GPU with full offload, and a naïve CPU‑GPU offload without caching.
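The profiling, cache-construction, and LRU-replacement steps above can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: each expert is a single weight matrix, `profile` is a precomputed activation count from the offline profiling run, the class name `ExpertLRUCache` is hypothetical, and the CPU/GPU overlap described above is omitted for brevity.

```python
# Sketch of profiling-driven preload plus LRU replacement (illustrative only).
from collections import Counter, OrderedDict
import torch

class ExpertLRUCache:
    """Keeps the K most useful experts resident on the GPU, evicting by recency."""

    def __init__(self, cpu_weights, profile, capacity, device):
        self.cpu_weights = cpu_weights           # expert_id -> weights in host RAM
        self.capacity = capacity
        self.device = device
        self.cache = OrderedDict()               # expert_id -> weights in VRAM, LRU order
        # Cache construction: preload the top-K experts from the profiling pass.
        for eid, _ in Counter(profile).most_common(capacity):
            self.cache[eid] = cpu_weights[eid].to(device)

    def lookup(self, eid):
        """Return GPU-resident weights on a hit (refreshing recency), else None."""
        if eid in self.cache:
            self.cache.move_to_end(eid)
            return self.cache[eid]
        return None

    def admit(self, eid):
        """After a miss, copy the expert to the GPU, evicting the least recently used one."""
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)       # drop the LRU entry; its VRAM can be reused
        self.cache[eid] = self.cpu_weights[eid].to(self.device)

def run_expert(x, eid, cache):
    w = cache.lookup(eid)
    if w is not None:                            # cache hit: compute on the GPU
        return (x.to(cache.device) @ w).cpu()
    y = x @ cache.cpu_weights[eid]               # cache miss: multithreaded CPU GEMM
    cache.admit(eid)                             # optional admission for future reuse
    return y

# Usage: 8 experts, an activation-count profile from a warm-up run, room for 3 on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
weights = {e: torch.randn(128, 128) for e in range(8)}
profile = {0: 40, 1: 35, 2: 30, 3: 5, 4: 4, 5: 3, 6: 2, 7: 1}
cache = ExpertLRUCache(weights, profile, capacity=3, device=device)
print(run_expert(torch.randn(1, 128), 6, cache).shape)   # miss, then expert 6 is cached
```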
Results & Findings
| Configuration | Peak GPU Memory | Avg. Latency per Token | Throughput (tokens/s) |
|---|---|---|---|
| Pure CPU | < 2 GB | 28 ms | 35 |
| Pure GPU (offload) | 12 GB (full model) | 19 ms | 52 |
| CPU‑GPU offload (no cache) | 8 GB | 15 ms | 66 |
| Proposed CPU‑GPU collaborative (cache‑hit 68 %) | 8 GB | 11 ms | 84 |
- Cache hit rate stabilizes around 65‑70 % after a short warm‑up, confirming that a relatively small subset of experts dominates inference for typical prompts.
- Latency reduction is primarily due to eliminating weight transfer for hits; CPU‑only computation remains competitive thanks to multithreaded BLAS kernels.
- Scalability: When the GPU memory budget is reduced further (e.g., 6 GB), the system gracefully degrades by caching fewer experts, still outperforming the naïve offload baseline.
Practical Implications
- Deploy LLMs on laptops or edge servers: Developers can now run MoE‑based models that would otherwise exceed GPU VRAM, opening up on‑device AI assistants, code‑completion tools, and localized inference for privacy‑sensitive applications.
- Cost‑effective scaling: Enterprises can serve more concurrent users with a mix of modest GPUs and CPUs rather than investing in high‑end A100‑class hardware.
- Framework integration: The cache-aware scheduler can be wrapped around existing PyTorch or TensorFlow MoE libraries, meaning minimal code changes for teams already using these stacks (see the sketch after this list).
- Energy savings: Offloading to the CPU for infrequent experts reduces GPU idle time and can lower overall power consumption, an attractive metric for data‑center operators.
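As an illustration of what such integration could look like (a hypothetical wrapper, not the project's released API), an existing per-expert module can be wrapped so that its forward pass uses a GPU copy while the expert is cached and otherwise falls back to the CPU master copy:

```python
# Hypothetical cache-aware wrapper around one PyTorch expert module (illustrative only).
import copy
import torch
import torch.nn as nn

class CacheAwareExpert(nn.Module):
    """Keeps a CPU master copy of one expert and, optionally, a cached GPU copy."""

    def __init__(self, expert: nn.Module, device: str):
        super().__init__()
        self.cpu_expert = expert.cpu()   # master copy lives in host memory
        self.device = device
        self.gpu_expert = None           # populated only while this expert is cached

    def load_to_gpu(self):
        self.gpu_expert = copy.deepcopy(self.cpu_expert).to(self.device)

    def evict(self):
        self.gpu_expert = None           # drop the GPU copy; VRAM can be reclaimed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.gpu_expert is not None:                  # cache hit
            return self.gpu_expert(x.to(self.device)).cpu()
        return self.cpu_expert(x)                        # cache miss: CPU fallback

# Usage with a toy expert; a real MoE layer would wrap each expert in its expert list.
device = "cuda" if torch.cuda.is_available() else "cpu"
expert = CacheAwareExpert(nn.Linear(256, 256), device)
expert.load_to_gpu()                                     # treat this expert as "hot"
print(expert(torch.randn(2, 256)).shape)
```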
Limitations & Future Work
- Single‑request focus: The current design optimizes latency for one request at a time; batch inference scenarios (common in serving APIs) may need a different scheduling strategy.
- Cache eviction policy: The paper uses a simple LRU scheme; more sophisticated policies (e.g., learning‑based prediction of expert popularity) could boost hit rates further.
- CPU bottleneck on very large batches: If many cache misses occur simultaneously, the CPU may become a performance choke point; future work could explore heterogeneous offload to multiple CPUs or dedicated inference accelerators.
- Generalization to other MoE variants: The evaluation is limited to a few Switch‑Transformer models; extending the framework to newer sparsely‑gated architectures (e.g., GLaM, Mixtral) will validate broader applicability.
Overall, the CPU‑GPU collaborative inference framework offers a pragmatic pathway to bring powerful MoE‑based LLMs to everyday hardware, turning memory constraints from a show‑stopper into a manageable engineering challenge.
Authors
- En-Ming Huang
- Li-Shang Lin
- Chun-Yi Lee
Paper Information
- arXiv ID: 2512.16473v1
- Categories: cs.DC
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16473v1