[Paper] Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
Source: arXiv
Overview
Edge‑AI devices are becoming the backbone of many IoT services, but their on‑chip AI accelerators (e.g., Google’s Edge TPU) have very limited memory. When a neural‑network model exceeds the accelerator’s RAM, the runtime must constantly swap model fragments between the host CPU and the TPU, adding tens or hundreds of milliseconds of latency.
SwapLess is a new system that dynamically decides:
- How much of a model should run on the CPU vs. the TPU
- How many CPU cores to allocate
It makes these decisions even when several applications (tenants) share the same edge node. By continuously adapting them, SwapLess dramatically reduces inference latency without requiring developers to rewrite their models.
Key Contributions
- Adaptive collaborative inference – Introduces a runtime that jointly schedules CPU and TPU work, automatically selecting the optimal split point in the neural‑network graph.
- Analytic queue‑theoretic model – Derives a lightweight mathematical model that predicts end‑to‑end latency for any CPU/TPU partition, accounting for both intra‑model swapping (within a single request) and inter‑model swapping (across concurrent tenants).
- Online optimization loop – Implements a low‑overhead controller that monitors request arrival rates and tenant mixes, then re‑configures the partition point and CPU‑core allocation in real time.
- Multi‑tenant awareness – Extends the model to handle heterogeneous workloads (different models, batch sizes, QoS targets) sharing the same Edge TPU, something prior single‑tenant solutions ignore.
- Prototype and evaluation – Deploys SwapLess on commodity Edge‑TPU hardware (Coral Dev Board) and demonstrates up to 63.8 % latency reduction for a single tenant and 77.4 % for multi‑tenant mixes compared with the stock Edge TPU compiler.
Methodology
1. Profiling Phase
- Run each supported model on the CPU and on the TPU separately to collect baseline service‑time distributions (how long each layer takes on each processor).
- Measure the cost of moving a tensor slice between host memory and TPU memory for different tensor sizes (a minimal timing sketch follows below).
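A minimal sketch of this style of per-stage timing, written in plain Python rather than the paper's actual tooling; `profile_stage` and the dummy workload are illustrative stand-ins for running the first k layers on the CPU, the remaining layers on the TPU, or a host-to-TPU tensor copy:

```python
import time
import statistics

def profile_stage(run_once, warmup=5, trials=50):
    """Time one inference stage repeatedly; return (mean, stdev) in milliseconds."""
    for _ in range(warmup):
        run_once()                          # discard warm-up runs (caches, DMA setup)
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)

if __name__ == "__main__":
    # Placeholder stage: stands in for the CPU portion of a model, the TPU
    # portion, or a tensor copy across the CPU-TPU boundary.
    dummy_stage = lambda: sum(i * i for i in range(10_000))
    mean_ms, std_ms = profile_stage(dummy_stage)
    print(f"service time: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```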
2. Queueing Model Construction
- Treat each incoming inference request as a job that traverses two service stations:
  - CPU stage – processes the first k layers.
  - TPU stage – processes the remaining layers.
- Insert “swap” delays whenever a tensor must cross the CPU‑TPU boundary.
- For multi-tenant scenarios, model each tenant's request stream as an independent Poisson arrival process, sharing the same CPU cores and TPU resource (the resulting per-request latency decomposition is written out below).
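Putting these pieces together, the end-to-end latency of a request for a given split can be written as a sum of waiting, service, and boundary-crossing terms. The notation below is ours, not the paper's, but it reflects the structure described above:

```latex
% Our notation (not the paper's): W = queueing delay, S = service time,
% D = tensor-swap delay at the CPU-TPU boundary, k = split point, c = CPU cores.
T(k, c) = W_{\mathrm{cpu}}(k, c) + S_{\mathrm{cpu}}(k)
        + D_{\mathrm{swap}}(k)
        + W_{\mathrm{tpu}}(k) + S_{\mathrm{tpu}}(k)
```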
3. Analytic Latency Estimation
- Using the measured service‑time statistics, compute the expected waiting time at each station via classic M/M/c queue formulas.
- Combine waiting times and swap overheads to obtain a closed-form estimate of end-to-end response time for any partition point k and any allocation of CPU cores, as sketched below.
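The waiting-time term of an M/M/c station is the textbook Erlang C expression, which is cheap enough to evaluate inside a controller. The sketch below is a generic implementation under that assumption, not the paper's code; `predicted_latency`, the single-server treatment of the TPU, and the `swap_delay` placeholder for the profiled transfer cost are simplifying choices of ours:

```python
from math import factorial

def mmc_wait(lam, mu, c):
    """Mean queueing delay of an M/M/c station (Erlang C).
    lam: arrival rate, mu: per-server service rate, c: number of servers."""
    rho = lam / (c * mu)
    if rho >= 1.0:
        return float("inf")                          # unstable: queue grows without bound
    a = lam / mu                                     # offered load in Erlangs
    p0 = 1.0 / (sum(a**n / factorial(n) for n in range(c))
                + a**c / (factorial(c) * (1 - rho)))
    erlang_c = (a**c / (factorial(c) * (1 - rho))) * p0   # probability of waiting
    return erlang_c / (c * mu - lam)

def predicted_latency(lam, cpu_rate, tpu_rate, cpu_cores, swap_delay):
    """End-to-end estimate: CPU wait + service, swap delay, TPU wait + service.
    Rates are requests per unit time, so the result is in the same time unit."""
    cpu = mmc_wait(lam, cpu_rate, cpu_cores) + 1.0 / cpu_rate
    tpu = mmc_wait(lam, tpu_rate, 1) + 1.0 / tpu_rate      # TPU modeled as one server
    return cpu + swap_delay + tpu
```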
4. Online Optimization Loop
- Continuously monitor the current request arrival rates and the observed latency.
- Plug the live rates into the analytic model, enumerate a small set of candidate partitions (e.g., k = {2, 4, 6, …}) and CPU‑core allocations.
- Select the configuration that minimizes the predicted latency, then re-configure the inference pipeline on-the-fly (the Edge TPU compiler supports dynamic graph re-partitioning); a sketch of the enumeration appears below.
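A sketch of what that enumeration could look like, reusing `predicted_latency` from the previous block; `choose_config` is our illustrative name, and the per-split service rates and swap delays are assumed to come from the profiling phase:

```python
def choose_config(lam, tpu_rate, rates_by_split, swap_by_split, max_cores=4):
    """Enumerate (split point k, CPU cores) pairs and return the pair with the
    lowest predicted end-to-end latency at the current arrival rate lam.
    rates_by_split / swap_by_split map each candidate k to its profiled CPU
    service rate and boundary swap delay.
    Assumes predicted_latency from the previous sketch is in scope."""
    best = None
    for k, cpu_rate in rates_by_split.items():
        for cores in range(1, max_cores + 1):
            est = predicted_latency(lam, cpu_rate, tpu_rate, cores,
                                    swap_by_split[k])
            if best is None or est < best[0]:
                best = (est, k, cores)
    return best   # (predicted latency, split point, CPU cores)
```

Because only a handful of candidate configurations are scored against a closed-form estimate, the search itself stays very cheap, which is consistent with the under-2 ms re-partitioning overhead reported below.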
5. Implementation
- Built a lightweight controller in C++ that runs on the host CPU, communicates with the Edge TPU via the Coral API, and exposes a simple REST endpoint for client applications (an example client call is shown after this list).
- Integrated with TensorFlow Lite models; no changes to model architecture are required.
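Since clients interact only with a REST endpoint, integration reduces to an ordinary HTTP call. The endpoint URL and JSON fields below are hypothetical, because the summary does not specify the actual API; the point is only that no model-level changes are needed on the client side:

```python
import requests  # generic HTTP client; any HTTP library works here

# Hypothetical endpoint and payload shape -- the real SwapLess API may differ.
resp = requests.post(
    "http://edge-node.local:8080/infer",
    json={"model": "resnet50", "input_id": "frame-0001"},
    timeout=1.0,
)
resp.raise_for_status()
print(resp.json())  # e.g. class labels or scores, depending on the model
```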
Results & Findings
| Scenario | Baseline (Edge TPU compiler) | SwapLess | Latency Reduction |
|---|---|---|---|
| Single‑tenant (ResNet‑50, 1 req/s) | 210 ms | 76 ms | 63.8 % |
| Multi‑tenant (ResNet‑50 + MobileNet‑V2, 3 req/s total) | 340 ms | 77 ms | 77.4 % |
| High request burst (5 req/s) | 420 ms | 150 ms | 64 % |
| Varying CPU core count (1‑4 cores) | – | – | Adaptive allocation yields up to 2× throughput gain vs. static 1‑core allocation |
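The reduction percentages follow directly from the reported latencies:

```latex
\frac{210 - 76}{210} \approx 63.8\%, \qquad
\frac{340 - 77}{340} \approx 77.4\%, \qquad
\frac{420 - 150}{420} \approx 64\%
```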
- Swap overhead dominates when the model is split too deep into the TPU (many large tensors must be swapped).
- CPU overload becomes the bottleneck when the split is too shallow (most layers run on the CPU).
- SwapLess automatically finds the sweet spot, even as the tenant mix shifts from compute‑heavy (ResNet) to lightweight (MobileNet).
- The decision engine adds < 2 ms of overhead per re‑partition—negligible compared with the overall latency savings.
Practical Implications
- Faster edge inference for latency‑sensitive apps – Real‑time video analytics, autonomous drones, and safety‑critical sensor processing can meet tighter response‑time SLAs without upgrading hardware.
- Higher device utilization – By squeezing more models onto a single Edge TPU, operators can host multiple services on the same physical node, reducing deployment cost and simplifying fleet management.
- Dynamic workload handling – In environments where request rates fluctuate (e.g., smart‑city cameras that see bursts of activity), SwapLess automatically re‑balances CPU/TPU work, avoiding manual retuning.
- Zero‑code migration path – Developers keep their existing TensorFlow Lite models; the only change is to link against the SwapLess runtime. This lowers the barrier to adoption for startups and OEMs.
- Potential for other accelerators – The queue‑theoretic framework is generic; it could be ported to other memory‑constrained AI chips (e.g., NVIDIA Jetson Nano, Intel Movidius) to achieve similar latency gains.
Limitations & Future Work
- Model‑size assumption – SwapLess currently requires that the entire model fit in host memory; extremely large models that exceed RAM are not addressed.
- Static profiling – The analytic model relies on offline profiling of each layer; dynamic changes in CPU frequency or thermal throttling could degrade prediction accuracy.
- Limited to TensorFlow Lite – Although the concepts are transferable, the prototype only supports TFLite graphs. Extending support to PyTorch Mobile or ONNX Runtime will broaden applicability.
- Queueing‑model simplifications – The M/M/c assumption may not capture bursty traffic patterns perfectly; future work could explore more sophisticated stochastic models or machine‑learning‑based predictors.
- Security & isolation – Multi‑tenant inference raises concerns about side‑channel leakage; integrating sandboxing or secure enclaves is an open research direction.
SwapLess demonstrates that smart, adaptive collaboration between CPUs and edge‑AI accelerators can unlock substantial performance gains on already‑deployed hardware. For developers building latency‑critical IoT services, it offers a practical pathway to squeeze more inference work out of the same edge node, paving the way for richer, more responsive on‑device intelligence.
Authors
- Nathan Ng
- Walid A. Hanafy
- Prashanthi Kadambi
- Balachandra Sunil
- Ayush Gupta
- David Irwin
- Yogesh Simmhan
- Prashant Shenoy
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.17808v1 |
| Categories | cs.DC, cs.PF |
| Published | February 19, 2026 |