[Paper] Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
Source: arXiv - 2512.01357v1
Overview
Serverless deployments of large language models (LLMs) promise “pay‑as‑you‑go” AI services by sharing GPU resources across many users. In practice, however, the cold‑start latency—especially the time spent loading a model into GPU memory—can be prohibitive, growing linearly with model size. Tangram tackles this bottleneck by re‑using idle GPU memory and scheduling workloads with GPU‑affinity awareness, cutting model‑load times dramatically and making serverless LLMs viable for real‑world workloads.
Key Contributions
- Unified GPU memory pool that lets multiple models share tensor‑level parameter storage, eliminating redundant copies.
- On‑demand KV‑cache allocation that dynamically provisions attention cache memory only when needed, freeing space for other models.
- GPU‑affinity‑aware scheduler that places incoming inference requests on GPUs that already hold the required parameters, maximizing reuse.
- Prototype implementation integrated with a popular serverless inference framework, showing up to 6.2× faster model loading and 23‑55% lower Time‑to‑First‑Token (TTFT) compared with existing solutions.
Methodology
Tangram’s design revolves around three practical ideas that are easy to grasp even without deep systems expertise:
1. Memory Pooling Across Models
- Instead of loading each model’s weights into a fresh GPU allocation, Tangram creates a global pool of GPU memory.
- When a new model is requested, Tangram checks whether the model’s weight tensors already exist in the pool (e.g., layers shared among similar models) and reuses them directly, avoiding a full copy from host RAM (a minimal sketch follows this list).
2. Lazy KV‑Cache Allocation
- The key‑value (KV) cache used by transformer attention grows with the length of generated text.
- Tangram allocates this cache on demand per request and releases it as soon as generation finishes, freeing space for other models waiting to load (see the second sketch below).
3. Affinity‑Aware Scheduling
- The runtime tracks which GPUs currently host which parameter tensors.
- When a request arrives, the scheduler prefers a GPU that already holds the needed tensors (high “affinity”), reducing the amount of data that must be transferred over PCIe (see the scheduler sketch below).
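The paper’s code is not reproduced here, but the core of idea 1 can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration (WeightPool and get_or_load are assumed names, not Tangram’s API), under the assumption that shared layers can be identified by hashing their raw weight bytes.

```python
# Minimal sketch of a shared GPU weight pool (idea 1). The class and method
# names (WeightPool, get_or_load) are hypothetical, not Tangram's actual API.
import hashlib
from typing import Dict

import torch


class WeightPool:
    """Caches weight tensors on one device, keyed by a content hash, so that
    models sharing layers reuse a single resident copy instead of each
    re-copying the same bytes from host RAM."""

    def __init__(self, device: str = "cuda:0" if torch.cuda.is_available() else "cpu"):
        self.device = device
        self._tensors: Dict[str, torch.Tensor] = {}

    @staticmethod
    def _key(name: str, cpu_tensor: torch.Tensor) -> str:
        # Hash the raw bytes; identical shared layers map to the same entry.
        digest = hashlib.sha1(cpu_tensor.numpy().tobytes()).hexdigest()
        return f"{name}:{digest}"

    def get_or_load(self, name: str, cpu_tensor: torch.Tensor) -> torch.Tensor:
        key = self._key(name, cpu_tensor)
        if key not in self._tensors:  # pool miss: copy host -> device once
            self._tensors[key] = cpu_tensor.to(self.device, non_blocking=True)
        return self._tensors[key]     # pool hit: reuse the existing copy


# Two "models" that share a layer pay the host-to-device transfer only once.
pool = WeightPool()
shared = torch.randn(4096, 4096)
w_a = pool.get_or_load("base.layer0.weight", shared)
w_b = pool.get_or_load("base.layer0.weight", shared.clone())  # same bytes
assert w_a.data_ptr() == w_b.data_ptr()
```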
The prototype plugs into an existing serverless inference stack (e.g., NVIDIA Triton or a custom function‑as‑a‑service layer) and intercepts the model‑load step to apply these techniques transparently.
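Ideas 2 and 3 can be sketched the same way. The snippet below combines a toy affinity-aware placement decision with a request-scoped KV-cache allocation; AffinityScheduler and kv_cache are assumed names for illustration, and a production scheduler would also have to weigh GPU load and free memory, which this sketch omits.

```python
# Illustrative sketch of ideas 2 and 3; class and function names here
# (AffinityScheduler, kv_cache) are hypothetical, not Tangram's actual API.
from contextlib import contextmanager
from typing import List, Set

import torch


class AffinityScheduler:
    """Tracks which parameter tensors are resident on each GPU and routes a
    request to the GPU that already holds the most of what it needs."""

    def __init__(self, num_gpus: int):
        self.resident: List[Set[str]] = [set() for _ in range(num_gpus)]

    def record_load(self, gpu: int, tensor_keys: Set[str]) -> None:
        # Called after weights are placed on `gpu` (e.g., by the weight pool).
        self.resident[gpu] |= tensor_keys

    def pick_gpu(self, needed: Set[str]) -> int:
        # Highest affinity = most needed tensors already resident, so the
        # fewest bytes cross PCIe on a cold start.
        return max(range(len(self.resident)),
                   key=lambda g: len(self.resident[g] & needed))


@contextmanager
def kv_cache(batch: int, heads: int, max_len: int, head_dim: int, device: str):
    """Allocate the attention KV cache only for one request's lifetime
    (idea 2) and release it afterwards so other models can use the space."""
    k = torch.empty(batch, heads, max_len, head_dim, device=device, dtype=torch.float16)
    v = torch.empty_like(k)
    try:
        yield k, v
    finally:
        del k, v
        if device.startswith("cuda"):
            torch.cuda.empty_cache()  # hand freed blocks back to the allocator


# Usage: route a request for a fine-tuned variant that shares one layer
# with a model already resident on GPU 0, then run it with a scoped cache.
sched = AffinityScheduler(num_gpus=2)
sched.record_load(0, {"base.layer0.weight", "base.layer1.weight"})
gpu = sched.pick_gpu({"base.layer0.weight", "ft.layer1.weight"})  # -> 0
dev = f"cuda:{gpu}" if torch.cuda.is_available() else "cpu"
with kv_cache(batch=1, heads=32, max_len=2048, head_dim=128, device=dev) as (k, v):
    pass  # decode here; the cache is freed when the block exits
```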
Results & Findings
| Metric | Baseline (state‑of‑the‑art) | Tangram | Improvement |
|---|---|---|---|
| Model load time (e.g., 13B‑parameter model) | 3.2 s | 0.52 s | ~6.2× faster |
| Time‑to‑First‑Token (cold start) | 1.8 s | 0.8 s | 23‑55 % reduction |
| GPU memory utilization (average) | 78 % | 92 % | Higher packing efficiency |
| Throughput under mixed‑model workload | 120 req/s | 158 req/s | ~30 % more requests served |
The experiments span a range of model sizes (7B‑30B parameters) and realistic serverless workloads (bursty request patterns). Tangram consistently reduces the cold‑start penalty without sacrificing inference latency once the model is loaded.
Practical Implications
- Lower Cost for Serverless AI – Faster loading means less GPU idle time, translating directly into lower per‑request billing for cloud providers and their customers.
- Higher Availability – Applications that previously suffered from “cold‑start spikes” (e.g., chatbots, code assistants) can deliver sub‑second first‑token responses even after periods of inactivity.
- Simplified Multi‑Model Hosting – Data‑science teams can expose many fine‑tuned variants of a base LLM on the same GPU cluster without manually managing memory partitions.
- Edge‑Ready Deployments – On devices with limited GPU memory (e.g., Jetson, RTX‑mobile), Tangram’s pooling and lazy cache can enable on‑demand loading of multiple compact LLMs, opening new use‑cases in robotics and AR.
Developers can adopt Tangram’s concepts by integrating its memory‑pool API or by mimicking its affinity‑aware scheduler in existing serverless platforms.
Limitations & Future Work
- Model Compatibility – Tangram assumes that models share a common architecture (e.g., same transformer block layout). Heterogeneous architectures (e.g., encoder‑decoder vs. decoder‑only) reduce reuse opportunities.
- GPU Interconnect Overheads – In multi‑GPU nodes, moving tensors between GPUs still incurs PCIe/NVLink latency; the current prototype does not fully exploit peer‑to‑peer transfers.
- Security Isolation – Sharing memory across tenants raises isolation concerns; the authors note the need for lightweight encryption or sandboxing mechanisms.
- Scalability to Hundreds of Models – While the pool works well for a modest set of models, managing metadata for thousands of variants may become a bottleneck.
Future research directions include extending Tangram to heterogeneous accelerator pools (e.g., CPU‑GPU‑TPU), adding secure memory enclaves for multi‑tenant safety, and exploring predictive pre‑loading based on request patterns to further shrink cold‑start latency.
Authors
- Wenbin Zhu
- Zhaoyan Shen
- Zili Shao
- Hongjun Dai
- Feng Chen
Paper Information
- arXiv ID: 2512.01357v1
- Categories: cs.DC, cs.AI, cs.AR
- Published: December 1, 2025