[Paper] Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
Source: arXiv
Overview
Edge‑AI devices are becoming the backbone of many IoT services, but their on‑chip AI accelerators (e.g., Google’s Edge TPU) have very limited memory. When a neural‑network model exceeds the accelerator’s RAM, the runtime must constantly swap model fragments between the host CPU and the TPU, adding tens or hundreds of milliseconds of latency.
SwapLess is a new system that dynamically decides:
- How much of a model should run on the CPU vs. the TPU
- How many CPU cores to allocate
It makes these decisions even when several applications (tenants) share the same edge node. By continuously adapting them, SwapLess dramatically reduces inference latency without requiring developers to rewrite their models.
Key Contributions
- Adaptive collaborative inference – Introduces a runtime that jointly schedules CPU and TPU work, automatically selecting the optimal split point in the neural‑network graph.
- Analytic queue‑theoretic model – Derives a lightweight mathematical model that predicts end‑to‑end latency for any CPU/TPU partition, accounting for both intra‑model swapping (within a single request) and inter‑model swapping (across concurrent tenants).
- Online optimization loop – Implements a low‑overhead controller that monitors request arrival rates and tenant mixes, then re‑configures the partition point and CPU‑core allocation in real time.
- Multi‑tenant awareness – Extends the model to handle heterogeneous workloads (different models, batch sizes, QoS targets) sharing the same Edge TPU, something prior single‑tenant solutions ignore.
- Prototype and evaluation – Deploys SwapLess on commodity Edge‑TPU hardware (Coral Dev Board) and demonstrates up to 63.8 % latency reduction for a single tenant and 77.4 % for multi‑tenant mixes compared with the stock Edge TPU compiler.
Methodology
1. Profiling Phase
- Run each supported model on the CPU and on the TPU separately to collect baseline service‑time distributions (how long each layer takes on each processor).
- Measure the cost of moving a tensor slice between host memory and TPU memory for different tensor sizes (a minimal timing sketch follows below).
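A minimal sketch of this style of per-stage timing, written in plain Python rather than the paper's actual tooling; `profile_stage` and the dummy workload are illustrative stand-ins for running the first k layers on the CPU, the remaining layers on the TPU, or a host-to-TPU tensor copy:

```python
import time
import statistics

def profile_stage(run_once, warmup=5, trials=50):
    """Time one inference stage repeatedly; return (mean, stdev) in milliseconds."""
    for _ in range(warmup):
        run_once()                          # discard warm-up runs (caches, DMA setup)
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)

if __name__ == "__main__":
    # Placeholder stage: stands in for the CPU portion of a model, the TPU
    # portion, or a tensor copy across the CPU-TPU boundary.
    dummy_stage = lambda: sum(i * i for i in range(10_000))
    mean_ms, std_ms = profile_stage(dummy_stage)
    print(f"service time: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```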
2. Queueing Model Construction
- Treat each incoming inference request as a job that traverses two service stations:
  - CPU stage – processes the first k layers.
  - TPU stage – processes the remaining layers.
- Insert “swap” delays whenever a tensor must cross the CPU‑TPU boundary.
- For multi-tenant scenarios, model each tenant's request stream as an independent Poisson arrival process, sharing the same CPU cores and TPU resource (the resulting per-request latency decomposition is written out below).
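Putting these pieces together, the end-to-end latency of a request for a given split can be written as a sum of waiting, service, and boundary-crossing terms. The notation below is ours, not the paper's, but it reflects the structure described above:

```latex
% Our notation (not the paper's): W = queueing delay, S = service time,
% D = tensor-swap delay at the CPU-TPU boundary, k = split point, c = CPU cores.
T(k, c) = W_{\mathrm{cpu}}(k, c) + S_{\mathrm{cpu}}(k)
        + D_{\mathrm{swap}}(k)
        + W_{\mathrm{tpu}}(k) + S_{\mathrm{tpu}}(k)
```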
3. Analytic Latency Estimation
- Using the measured service‑time statistics, compute the expected waiting time at each station via classic M/M/c queue formulas.
- Combine waiting times and swap overheads to obtain a closed-form estimate of end-to-end response time for any partition point k and any allocation of CPU cores, as sketched below.
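The waiting-time term of an M/M/c station is the textbook Erlang C expression, which is cheap enough to evaluate inside a controller. The sketch below is a generic implementation under that assumption, not the paper's code; `predicted_latency`, the single-server treatment of the TPU, and the `swap_delay` placeholder for the profiled transfer cost are simplifying choices of ours:

```python
from math import factorial

def mmc_wait(lam, mu, c):
    """Mean queueing delay of an M/M/c station (Erlang C).
    lam: arrival rate, mu: per-server service rate, c: number of servers."""
    rho = lam / (c * mu)
    if rho >= 1.0:
        return float("inf")                          # unstable: queue grows without bound
    a = lam / mu                                     # offered load in Erlangs
    p0 = 1.0 / (sum(a**n / factorial(n) for n in range(c))
                + a**c / (factorial(c) * (1 - rho)))
    erlang_c = (a**c / (factorial(c) * (1 - rho))) * p0   # probability of waiting
    return erlang_c / (c * mu - lam)

def predicted_latency(lam, cpu_rate, tpu_rate, cpu_cores, swap_delay):
    """End-to-end estimate: CPU wait + service, swap delay, TPU wait + service.
    Rates are requests per unit time, so the result is in the same time unit."""
    cpu = mmc_wait(lam, cpu_rate, cpu_cores) + 1.0 / cpu_rate
    tpu = mmc_wait(lam, tpu_rate, 1) + 1.0 / tpu_rate      # TPU modeled as one server
    return cpu + swap_delay + tpu
```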
4. Online Optimization Loop
- Continuously monitor the current request arrival rates and the observed latency.
- Plug the live rates into the analytic model, enumerate a small set of candidate partitions (e.g., k = {2, 4, 6, …}) and CPU‑core allocations.
- Select the configuration that minimizes the predicted latency, then re-configure the inference pipeline on-the-fly (the Edge TPU compiler supports dynamic graph re-partitioning); a sketch of the enumeration appears below.
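A sketch of what that enumeration could look like, reusing `predicted_latency` from the previous block; `choose_config` is our illustrative name, and the per-split service rates and swap delays are assumed to come from the profiling phase:

```python
def choose_config(lam, tpu_rate, rates_by_split, swap_by_split, max_cores=4):
    """Enumerate (split point k, CPU cores) pairs and return the pair with the
    lowest predicted end-to-end latency at the current arrival rate lam.
    rates_by_split / swap_by_split map each candidate k to its profiled CPU
    service rate and boundary swap delay.
    Assumes predicted_latency from the previous sketch is in scope."""
    best = None
    for k, cpu_rate in rates_by_split.items():
        for cores in range(1, max_cores + 1):
            est = predicted_latency(lam, cpu_rate, tpu_rate, cores,
                                    swap_by_split[k])
            if best is None or est < best[0]:
                best = (est, k, cores)
    return best   # (predicted latency, split point, CPU cores)
```

Because only a handful of candidate configurations are scored against a closed-form estimate, the search itself stays very cheap, which is consistent with the under-2 ms re-partitioning overhead reported below.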
5. Implementation
- Built a lightweight controller in C++ that runs on the host CPU, communicates with the Edge TPU via the Coral API, and exposes a simple REST endpoint for client applications (an example client call is shown after this list).
- Integrated with TensorFlow Lite models; no changes to model architecture are required.
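Since clients interact only with a REST endpoint, integration reduces to an ordinary HTTP call. The endpoint URL and JSON fields below are hypothetical, because the summary does not specify the actual API; the point is only that no model-level changes are needed on the client side:

```python
import requests  # generic HTTP client; any HTTP library works here

# Hypothetical endpoint and payload shape -- the real SwapLess API may differ.
resp = requests.post(
    "http://edge-node.local:8080/infer",
    json={"model": "resnet50", "input_id": "frame-0001"},
    timeout=1.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. class labels or scores, depending on the model
```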
Results & Findings
| Scenario | Baseline (Edge TPU compiler) | SwapLess | Latency Reduction |
|---|---|---|---|
| Single‑tenant (ResNet‑50, 1 req/s) | 210 ms | 76 ms | 63.8 % |
| Multi‑tenant (ResNet‑50 + MobileNet‑V2, 3 req/s total) | 340 ms | 77 ms | 77.4 % |
| High request burst (5 req/s) | 420 ms | 150 ms | 64 % |
| Varying CPU core count (1‑4 cores) | – | – | Adaptive allocation yields up to 2× throughput gain vs. static 1‑core allocation |
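The reduction percentages follow directly from the reported latencies:

```latex
\frac{210 - 76}{210} \approx 63.8\%, \qquad
\frac{340 - 77}{340} \approx 77.4\%, \qquad
\frac{420 - 150}{420} \approx 64\%
```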
- Swap overhead dominates when the model is split too deep into the TPU (many large tensors must be swapped).
- CPU overload becomes the bottleneck when the split is too shallow (most layers run on the CPU).
- SwapLess automatically finds the sweet spot, even as the tenant mix shifts from compute‑heavy (ResNet) to lightweight (MobileNet).
- The decision engine adds < 2 ms of overhead per re‑partition—negligible compared with the overall latency savings.
Practical Implications
- Faster edge inference for latency‑sensitive apps – Real‑time video analytics, autonomous drones, and safety‑critical sensor processing can meet tighter response‑time SLAs without upgrading hardware.
- Higher device utilization – By squeezing more models onto a single Edge TPU, operators can host multiple services on the same physical node, reducing deployment cost and simplifying fleet management.
- Dynamic workload handling – In environments where request rates fluctuate (e.g., smart‑city cameras that see bursts of activity), SwapLess automatically re‑balances CPU/TPU work, avoiding manual retuning.
- Zero‑code migration path – Developers keep their existing TensorFlow Lite models; the only change is to link against the SwapLess runtime. This lowers the barrier to adoption for startups and OEMs.
- Potential for other accelerators – The queue‑theoretic framework is generic; it could be ported to other memory‑constrained AI chips (e.g., NVIDIA Jetson Nano, Intel Movidius) to achieve similar latency gains.
Limitations & Future Work
- Model‑size assumption – SwapLess currently requires that the entire model fit in host memory; extremely large models that exceed RAM are not addressed.
- Static profiling – The analytic model relies on offline profiling of each layer; dynamic changes in CPU frequency or thermal throttling could degrade prediction accuracy.
- Limited to TensorFlow Lite – Although the concepts are transferable, the prototype only supports TFLite graphs. Extending support to PyTorch Mobile or ONNX Runtime will broaden applicability.
- Queueing‑model simplifications – The M/M/c assumption may not capture bursty traffic patterns perfectly; future work could explore more sophisticated stochastic models or machine‑learning‑based predictors.
- Security & isolation – Multi‑tenant inference raises concerns about side‑channel leakage; integrating sandboxing or secure enclaves is an open research direction.
SwapLess demonstrates that smart, adaptive collaboration between CPUs and edge‑AI accelerators can unlock substantial performance gains on already‑deployed hardware. For developers building latency‑critical IoT services, it offers a practical pathway to squeeze more inference work out of the same edge node, paving the way for richer, more responsive on‑device intelligence.
Authors
- Nathan Ng
- Walid A. Hanafy
- Prashanthi Kadambi
- Balachandra Sunil
- Ayush Gupta
- David Irwin
- Yogesh Simmhan
- Prashant Shenoy
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.17808v1 |
| Categories | cs.DC, cs.PF |
| Published | February 19, 2026 |