[Paper] Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

Published: December 21, 2025 at 05:27 AM EST
4 min read
Source: arXiv - 2512.18674v1

Overview

The paper introduces Remoe, a system that makes inference with large Mixture‑of‑Experts (MoE) language models cheap and fast in a serverless environment. By cleverly splitting work between GPUs, CPUs, and on‑demand serverless functions, Remoe cuts both memory pressure and compute cost—key pain points when serving bursty LLM workloads.

Key Contributions

  • Heterogeneous execution model – non‑expert (dense) layers run on GPUs while expert layers run on CPUs; rarely‑used experts are offloaded to separate serverless functions.
  • Similar‑Prompt Search (SPS) – a lightweight algorithm that predicts which experts will fire for a new request by measuring semantic similarity to previously seen prompts.
  • Main‑Model Pre‑allocation (MMP) – a worst‑case memory estimator that guarantees service‑level objectives (SLOs) without over‑provisioning.
  • Joint memory‑replica optimizer – formulates the placement and replication problem as a Lagrangian dual and solves it with a Longest Processing Time (LPT) heuristic, balancing latency, cost, and memory usage.
  • Prototype on Kubernetes – end‑to‑end implementation evaluated on several LLM benchmarks, showing up to 57 % cost reduction and 47 % lower cold‑start latency versus prior approaches.

Methodology

1. System Partitioning

  • The main (dense) part of the MoE model stays on a GPU, leveraging its high throughput for matrix multiplications.
  • Each expert (a relatively small feed‑forward sub‑network) is assigned to a CPU core; because experts are activated sparsely, CPU memory is sufficient.
  • Experts that are rarely selected (based on historical activation frequencies) are packaged into independent serverless functions (e.g., AWS Lambda, Azure Functions). When needed, the function is invoked on the fly, keeping the resident memory footprint tiny. A minimal sketch of this three‑tier split follows below.
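
The following Python sketch illustrates the idea of the three‑tier placement under simple assumptions; the function names, the activation‑frequency threshold, and the data layout are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of a Remoe-style three-tier placement.
# Dense layers stay on the GPU; experts are split between resident CPU workers
# and on-demand serverless functions based on historical activation frequency.

from dataclasses import dataclass, field

@dataclass
class PlacementPlan:
    gpu_layers: list = field(default_factory=list)          # dense (non-expert) layers
    cpu_experts: list = field(default_factory=list)          # frequently used experts, kept resident
    serverless_experts: list = field(default_factory=list)   # rarely used experts, invoked on demand

def plan_placement(dense_layers, expert_ids, activation_counts, total_requests,
                   hot_threshold=0.05):
    """activation_counts: expert_id -> number of requests that activated it.
    Experts activated in more than hot_threshold of requests stay on CPU."""
    plan = PlacementPlan(gpu_layers=list(dense_layers))
    for eid in expert_ids:
        freq = activation_counts.get(eid, 0) / max(total_requests, 1)
        if freq >= hot_threshold:
            plan.cpu_experts.append(eid)
        else:
            plan.serverless_experts.append(eid)
    return plan

# Example: 8 experts, two of which dominate the traffic.
plan = plan_placement(
    dense_layers=["attn_0", "attn_1"],
    expert_ids=list(range(8)),
    activation_counts={0: 900, 1: 850, 2: 30, 3: 12, 4: 5, 5: 2, 6: 1, 7: 0},
    total_requests=1000,
)
print(plan.cpu_experts)         # [0, 1]
print(plan.serverless_experts)  # [2, 3, 4, 5, 6, 7]
```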

2. Predicting Expert Activation (SPS)

  • For an incoming prompt, Remoe computes a short embedding (e.g., using a lightweight encoder).
  • It then searches a cache of recent prompts for the most semantically similar ones and reuses their expert‑selection pattern.
  • This prediction is fast (sub‑millisecond) and accurate enough to pre‑warm the required serverless functions (see the sketch below).
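
A minimal sketch of the Similar‑Prompt Search idea is shown below: embed the incoming prompt, find the most similar cached prompt by cosine similarity, and reuse its expert‑selection pattern. The encoder, cache layout, and similarity threshold are assumptions for illustration, not the paper's implementation.

```python
# Illustrative SPS-style cache: nearest cached prompt (by cosine similarity)
# supplies the predicted expert set used to pre-warm serverless functions.

import numpy as np

class SPSCache:
    def __init__(self):
        self.embeddings = []   # unit-normalized embedding vectors of past prompts
        self.expert_sets = []  # expert IDs activated for the corresponding prompt

    def add(self, embedding, expert_set):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.expert_sets.append(set(expert_set))

    def predict(self, embedding, min_similarity=0.8):
        """Return the expert set of the most similar cached prompt, or None."""
        if not self.embeddings:
            return None
        query = embedding / np.linalg.norm(embedding)
        sims = np.stack(self.embeddings) @ query   # cosine similarities
        best = int(np.argmax(sims))
        return self.expert_sets[best] if sims[best] >= min_similarity else None

# Usage: pre-warm the predicted experts before the MoE layers run.
cache = SPSCache()
cache.add(np.random.rand(384), expert_set=[0, 3, 5])
predicted = cache.predict(np.random.rand(384))
if predicted is not None:
    pass  # e.g., invoke/pre-warm the serverless functions for these experts
```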

3. Memory Guarantees (MMP)

  • The authors derive a worst‑case bound on how many experts could be active simultaneously for any request.
  • Using this bound, they pre‑allocate GPU/CPU memory so that SLOs (e.g., 95th‑percentile latency < X ms) are met without over‑allocating resources (see the sketch below).
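
To make the worst‑case reasoning concrete, here is a small sketch of the kind of bound MMP relies on, under assumed numbers: with top‑k routing, a batch of B tokens can activate at most min(E, B·k) distinct experts per MoE layer, which upper‑bounds the memory to reserve. The formula and values below are illustrative, not taken from the paper.

```python
# Worst-case memory bound sketch for Main-Model Pre-allocation (MMP).
# All parameters below are hypothetical example values.

def worst_case_memory_gb(num_moe_layers, experts_per_layer, top_k,
                         max_batch_tokens, expert_size_gb, dense_size_gb):
    # At most min(E, B*k) distinct experts can fire in one layer for a batch.
    active_per_layer = min(experts_per_layer, max_batch_tokens * top_k)
    expert_memory = num_moe_layers * active_per_layer * expert_size_gb
    return dense_size_gb + expert_memory

# Example: 16 MoE layers, 8 experts each, top-2 routing, worst case of 4 tokens
# in flight, 0.25 GB per expert, 6 GB for the dense part.
print(worst_case_memory_gb(16, 8, 2, 4, 0.25, 6.0))  # 6 + 16 * 8 * 0.25 = 38.0 GB
```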

4. Optimization Framework

  • The placement problem (which expert runs on CPU vs. serverless) and the replication factor (how many copies of each expert to keep warm) are formulated as a joint optimization problem and relaxed via its Lagrangian dual.
  • Solving the dual yields marginal costs for each decision; the LPT heuristic then schedules experts to workers to minimize the makespan (overall latency), as sketched below.
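
The sketch below shows the classic Longest Processing Time heuristic referenced here: sort experts by estimated processing time and repeatedly assign the longest remaining one to the least‑loaded worker. The per‑expert cost model is an assumed input; in the paper it comes out of the dual solution.

```python
# LPT (Longest Processing Time) scheduling sketch: balance expert workloads
# across workers to minimize the makespan.

import heapq

def lpt_schedule(expert_costs, num_workers):
    """expert_costs: expert_id -> estimated processing time.
    Returns (assignment: worker -> [expert_ids], makespan)."""
    heap = [(0.0, w) for w in range(num_workers)]   # (current load, worker index)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Place the longest jobs first, each on the currently least-loaded worker.
    for eid, cost in sorted(expert_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(eid)
        heapq.heappush(heap, (load + cost, w))

    makespan = max(load for load, _ in heap)
    return assignment, makespan

assignment, makespan = lpt_schedule({0: 7.0, 1: 5.0, 2: 4.0, 3: 4.0, 4: 2.0}, num_workers=2)
print(assignment, makespan)  # {0: [0, 3], 1: [1, 2, 4]} with makespan 11.0
```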

Results & Findings

| Metric | Baseline (state‑of‑the‑art) | Remoe |
| --- | --- | --- |
| Inference cost (per 1 M tokens) | $0.112 | $0.048 (‑57 %) |
| Cold‑start latency | 210 ms | 112 ms (‑47 %) |
| Peak GPU memory usage | 22 GB | 13 GB (‑41 %) |
| 99th‑percentile latency | 420 ms | 298 ms (‑29 %) |

  • The cost savings stem mainly from moving the bulk of expert parameters off the GPU and only loading them on demand.
  • SPS correctly predicts the active expert set for > 92 % of queries, which keeps the extra serverless invocation overhead negligible.
  • The LPT‑based scheduler achieves near‑optimal makespan compared to an exhaustive search (within 5 % on average).

Practical Implications

  • Serverless‑first LLM services – Companies can now expose MoE‑based chatbots or code generators without maintaining a fleet of GPU‑heavy VMs; most of the workload runs on cheap CPUs or pay‑as‑you‑go functions.
  • Cost‑effective burst handling – During traffic spikes, Remoe scales out serverless experts instantly, avoiding the need to over‑provision GPU capacity for rare queries.
  • Simplified DevOps – The memory‑preallocation guarantees make it easier to set SLOs in CI/CD pipelines; developers can rely on deterministic latency budgets.
  • Edge‑aware deployments – Because experts can be placed on any compute node, a similar pattern could be used for edge‑cloud hybrid inference where bandwidth is limited.

For developers, the key takeaway is that you no longer need to choose between “fast but expensive GPU inference” and “cheap but slow dense models.” Remoe offers a middle ground that leverages existing serverless platforms and standard Kubernetes tooling.

Limitations & Future Work

  • Prediction accuracy trade‑off – SPS may mis‑predict expert sets for highly novel prompts, leading to extra serverless cold starts.
  • CPU‑bound expert execution – While CPUs are sufficient for most experts, extremely large expert networks could saturate CPU cores, requiring further profiling.
  • Vendor lock‑in – The prototype relies on Kubernetes and specific serverless runtimes; portability to other orchestration systems needs validation.
  • Security & isolation – Offloading experts to shared serverless functions raises concerns about model leakage; future work could explore encrypted execution or TEEs.

The authors suggest extending the optimizer to handle multi‑tenant scenarios and exploring adaptive SPS models that learn from mis‑predictions in real time.

Authors

  • Wentao Liu
  • Yuhao Hu
  • Ruiting Zhou
  • Baochun Li
  • Ne Wang

Paper Information

  • arXiv ID: 2512.18674v1
  • Categories: cs.DC, cs.AI
  • Published: December 21, 2025
  • PDF: Download PDF