[Paper] FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Published: April 29, 2026

Source: arXiv - 2604.26881v1

Overview

The paper introduces FaaSMoE, a novel way to serve large Mixture‑of‑Experts (MoE) models on Function‑as‑a‑Service (FaaS) platforms. By turning each expert into a stateless serverless function, the authors close the gap between the memory needed for the whole model and the much smaller memory actually used during inference—especially when many tenants share the same service.

Key Contributions

  • Serverless MoE Architecture: Decouples MoE control (routing) from execution (expert inference) by deploying experts as independent FaaS functions.
  • Multi‑Tenant Support: Enables multiple users to share the same MoE deployment while each tenant only triggers the experts it needs, achieving “scale‑to‑zero” for idle experts.
  • Configurable Expert Granularity: Allows grouping several experts into a single function to reduce cold‑start overhead, giving operators a knob to balance elasticity against latency (see the configuration sketch after this list).
  • Prototype Implementation: Built on an open‑source edge‑oriented FaaS runtime, demonstrating feasibility without relying on proprietary cloud services.
  • Empirical Evaluation: Shows that serving the Qwen1.5‑MoE model (2.7 B activated parameters) with FaaSMoE consumes < 33 % of the resources required by a traditional full‑model deployment under realistic multi‑tenant workloads.
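
As a concrete illustration of the granularity knob, here is a minimal sketch; the `ExpertFunctionConfig` and `plan_functions` names and the contiguous‑grouping scheme are hypothetical assumptions, since this summary does not show FaaSMoE's actual configuration interface.

```python
# Hypothetical sketch of the elasticity-vs-latency knob: partition N experts
# into serverless functions of `group_size` experts each. group_size=1 gives
# per-expert scale-to-zero; larger groups amortize cold-start cost.
from dataclasses import dataclass

@dataclass
class ExpertFunctionConfig:
    """One serverless function and the MoE experts packaged inside it."""
    function_name: str
    expert_ids: list[int]

def plan_functions(num_experts: int, group_size: int) -> list[ExpertFunctionConfig]:
    return [
        ExpertFunctionConfig(
            function_name=f"expert-fn-{start // group_size}",
            expert_ids=list(range(start, min(start + group_size, num_experts))),
        )
        for start in range(0, num_experts, group_size)
    ]

# 60 experts at one per function -> 60 deployable units (max elasticity);
# grouped 4 per function -> 15 units with fewer, cheaper cold starts.
assert len(plan_functions(60, 1)) == 60
assert len(plan_functions(60, 4)) == 15
```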

Methodology

  1. Model Partitioning – The MoE model’s experts are extracted and packaged as independent serverless functions. Each function is stateless and can be instantiated on demand.
  2. Control Plane – A lightweight router (itself running as a FaaS function) receives inference requests, decides which experts a given input needs, and triggers the corresponding expert functions (see the sketch after this list).
  3. Execution Plane – Expert functions run in isolated containers, load only their own weights, perform the forward pass, and return results to the router.
  4. Granularity Tuning – The authors experiment with two extremes: one‑expert‑per‑function (max elasticity) and multi‑expert‑per‑function (lower cold‑start cost).
  5. Benchmark Setup – They deploy the system on an edge‑focused FaaS platform, simulate multiple tenants issuing requests, and compare against a baseline where the whole MoE model stays resident in memory. Metrics include CPU/memory usage, request latency, and cold‑start frequency.
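
To make the control/execution split concrete, here is a minimal sketch assuming HTTP‑triggered functions; the gateway URL scheme, payload fields, and the toy expert body are illustrative assumptions, not the paper's actual runtime API.

```python
# Sketch of steps 1-3: a router function that invokes only the experts the
# gate selected, and a stateless expert function body. Endpoint names and
# payload fields are hypothetical.
import requests

EXPERT_URL = "http://faas-gateway/function/expert-fn-{gid}"  # hypothetical

def route(hidden_state: list[float],
          gate_weights: dict[int, float],
          group_size: int = 1) -> list[float]:
    """Control plane: trigger only the gate-selected experts and return
    their gate-weighted sum. Experts that are never selected are never
    invoked, which is what enables scale-to-zero."""
    combined = [0.0] * len(hidden_state)
    for expert_id, weight in gate_weights.items():
        gid = expert_id // group_size  # which function hosts this expert
        resp = requests.post(
            EXPERT_URL.format(gid=gid),
            json={"expert_id": expert_id, "hidden_state": hidden_state},
            timeout=30,
        )
        out = resp.json()["output"]  # execution-plane result
        combined = [c + weight * o for c, o in zip(combined, out)]
    return combined

def expert_handler(payload: dict) -> dict:
    """Execution plane: stateless function body. On a cold start it would
    load only its own experts' weights; a toy transform stands in for the
    real expert FFN here."""
    return {"output": [2.0 * x for x in payload["hidden_state"]]}
```

Because the router only ever calls the functions the gate picked, an idle expert consumes nothing; the extra network hop per expert invocation is the routing overhead the paper lists as a limitation.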

Results & Findings

| Metric | Full‑Model Baseline | FaaSMoE (1 expert/function) | FaaSMoE (Grouped) |
| --- | --- | --- | --- |
| Avg. CPU usage | 100 % (full model) | 28 % | 32 % |
| Avg. memory usage | 8 GB (entire model) | 2.4 GB | 2.8 GB |
| 95th‑percentile latency | 120 ms | 180 ms (cold starts) | 150 ms |
| Scale‑to‑zero for idle experts | — | 0 % resource use | 0 % resource use |
  • Resource Efficiency: Both FaaSMoE configurations cut overall CPU and memory consumption to roughly one‑third of the baseline (see the back‑of‑envelope illustration after this list).
  • Latency Trade‑off: Grouping experts reduces cold‑start latency at a modest increase in memory usage, giving operators a practical tuning point.
  • Multi‑Tenant Isolation: Tenants’ workloads do not interfere because each expert runs in its own sandboxed function; idle experts automatically release resources.
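
As a back‑of‑envelope illustration (assumed numbers, not a calculation from the paper), the table is consistent with resident memory being the shared non‑expert weights plus only the currently warm experts; the shared‑weight size and warm fraction below are assumptions chosen to reproduce the 2.4 GB figure.

```python
# Back-of-envelope: with scale-to-zero, resident memory is the shared
# (non-expert) weights plus only the experts that are currently warm.
full_model_gb = 8.0    # baseline memory from the table above
shared_gb = 1.0        # ASSUMED attention/embedding weight size
warm_fraction = 0.20   # ASSUMED share of experts warm under the workload

expert_total_gb = full_model_gb - shared_gb
resident_gb = shared_gb + warm_fraction * expert_total_gb
print(f"{resident_gb:.1f} GB resident")  # 2.4 GB, matching the table
```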

Practical Implications

  • Cost Savings: Cloud providers (or edge operators) can price MoE inference more cheaply because resources are consumed only for the experts a request actually activates.
  • Scalable SaaS AI Platforms: Companies offering AI APIs can host many customers on the same physical hardware without over‑provisioning memory for every possible expert.
  • Edge Deployment: The prototype’s edge‑oriented FaaS runtime shows that even resource‑constrained environments (e.g., IoT gateways) can host large MoE models by pulling in experts on demand.
  • Simplified Ops: Stateless expert functions fit naturally into CI/CD pipelines; updates to a single expert can be rolled out without redeploying the whole model.
  • Flexibility for Model Designers: Researchers can experiment with larger MoE configurations knowing that serving costs scale with active experts rather than total model size.

Limitations & Future Work

  • Cold‑Start Overhead: While grouping mitigates it, the latency penalty for spawning many tiny functions remains a concern for latency‑sensitive applications.
  • Routing Overhead: The control plane adds an extra network hop; optimizing routing logic or co‑locating router and experts could reduce latency.
  • Stateful Expert Needs: Some advanced MoE variants require expert state (e.g., cache, batch statistics); the current stateless design would need extensions.
  • Generalization to Other FaaS Platforms: The evaluation uses a specific open‑source edge FaaS; reproducing results on major cloud FaaS offerings (AWS Lambda, Azure Functions) may surface platform‑specific constraints.
  • Security & Isolation: Multi‑tenant isolation is handled at the container level, but deeper security guarantees (e.g., side‑channel resistance) are left for future investigation.

Overall, FaaSMoE opens a promising path for serving massive MoE models efficiently, turning the “activate‑few‑experts” advantage into real‑world resource savings for developers and cloud operators alike.

Authors

  • Minghe Wang
  • Trever Schirmer
  • Mohammadreza Malekabbasi
  • David Bermbach

Paper Information

  • arXiv ID: 2604.26881v1
  • Categories: cs.DC, cs.LG
  • Published: April 29, 2026