[Paper] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Source: arXiv - 2511.20172v1
Overview
Large language models (LLMs) keep growing in parameter count, and serving them with long context windows puts massive pressure on GPU memory. The paper introduces Beluga, a CXL‑based memory architecture that lets GPUs and CPUs share a huge, low‑latency memory pool for the KV‑Cache, the data structure that stores attention keys and values during inference. By moving away from RDMA‑based disaggregated memory, Beluga delivers near‑local memory performance while keeping the programming model simple.
Key Contributions
- CXL‑enabled shared memory pool: Demonstrates how GPUs can perform native load/store operations over a CXL switch, eliminating the need for custom RDMA protocols (a minimal host‑side mapping sketch follows this list).
- Design guidelines for CXL switches: Provides a systematic characterization of commercial CXL switch performance and derives practical rules for building scalable memory systems.
- Beluga‑KVCache system: Implements a KV‑Cache manager that leverages the shared pool, achieving up to an 89.6% reduction in Time‑to‑First‑Token (TTFT) and 7.35× higher throughput in the popular vLLM inference engine.
- Prototype and evaluation: Builds a working prototype on off‑the‑shelf hardware and validates the latency/throughput gains against state‑of‑the‑art RDMA solutions.
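To ground the first contribution, here is a minimal host‑side sketch of what load/store access to a CXL‑attached pool can look like, assuming the pool is exposed to Linux as a DAX device; the device path, mapping size, and helper name are hypothetical, and the GPU‑side mapping that Beluga performs over the switch is not shown.

```python
# Minimal sketch: map a CXL-backed memory region and touch it with ordinary
# loads/stores. The device path, size, and function name are illustrative
# assumptions, not details taken from the paper.
import mmap

import numpy as np

CXL_DAX_PATH = "/dev/dax0.0"   # hypothetical DAX device backed by the CXL pool
MAP_BYTES = 1 << 30            # map 1 GiB of the pool

def map_cxl_region(path: str = CXL_DAX_PATH, length: int = MAP_BYTES) -> np.ndarray:
    """Map a slice of the CXL memory pool and expose it as a flat byte array."""
    with open(path, "r+b") as f:
        buf = mmap.mmap(f.fileno(), length, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
    # Indexing this NumPy view turns into plain CPU loads/stores against
    # CXL memory: no RDMA verbs, queue pairs, or message passing involved.
    return np.frombuffer(buf, dtype=np.uint8)

if __name__ == "__main__":
    pool = map_cxl_region()
    pool[:8] = list(range(8))  # store
    print(pool[:8])            # load
```

The point of the sketch is the programming model: once mapped, the pool is just memory, which is the property that lets Beluga avoid the custom RDMA protocols mentioned above.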
Methodology
- Hardware platform: The authors assemble a testbed consisting of GPUs, CPUs, and a commercial CXL switch that connects a large DRAM pool (tens of terabytes).
- Micro‑benchmarking: They run a suite of latency and bandwidth tests to understand how CXL behaves under different access patterns (random vs. sequential, small vs. large transfers).
- Guideline extraction: From the measurements they derive rules—e.g., keep request sizes above 256 KB to amortize switch overhead, batch KV‑Cache updates to reduce contention, and pin frequently accessed pages to avoid page‑fault penalties.
- System design: Using these guidelines, they build Beluga‑KVCache, a software layer that maps KV‑Cache entries directly into the shared CXL memory and exposes a simple API to the inference engine (an illustrative sketch follows this list).
- Evaluation: They integrate Beluga‑KVCache with vLLM, a high‑performance LLM serving framework, and compare against an RDMA‑based disaggregated‑memory baseline across several model sizes (7B–65B) and context lengths (up to 32K tokens).
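The Beluga‑KVCache API itself is not reproduced in this summary, so the following is only a rough sketch, under stated assumptions, of the kind of manager the last two bullets describe: KV blocks are placed into the mapped CXL region by a simple bump allocator, and writes are batched until at least 256 KB is queued, per the request‑size guideline above. The class and method names (CXLKVCacheManager, put_kv, get_kv) are hypothetical.

```python
# Illustrative KV-Cache manager over a CXL-mapped pool (see the mapping sketch
# above). Only the guidelines it encodes -- >=256 KB transfers and batched
# updates -- come from the summary; the structure itself is an assumption.
from dataclasses import dataclass

import numpy as np

MIN_TRANSFER_BYTES = 256 * 1024  # amortize CXL switch overhead (guideline)

@dataclass
class KVBlockMeta:
    offset: int  # byte offset of the block inside the pool
    length: int  # block size in bytes

class CXLKVCacheManager:
    def __init__(self, pool: np.ndarray):
        self.pool = pool      # flat byte view of the mapped CXL region
        self.cursor = 0       # simple bump allocator over the pool
        self.index: dict[tuple, KVBlockMeta] = {}  # (request_id, layer) -> block
        self.pending: list[tuple[tuple, np.ndarray]] = []
        self.pending_bytes = 0

    def put_kv(self, key: tuple, kv_block: np.ndarray) -> None:
        """Queue a KV block; flush once enough bytes accumulate to reach the
        minimum transfer size."""
        self.pending.append((key, np.ascontiguousarray(kv_block)))
        self.pending_bytes += kv_block.nbytes
        if self.pending_bytes >= MIN_TRANSFER_BYTES:
            self.flush()

    def flush(self) -> None:
        """Copy all queued blocks into consecutive slots of the pool."""
        for key, block in self.pending:
            flat = block.view(np.uint8).reshape(-1)
            meta = KVBlockMeta(self.cursor, flat.nbytes)
            self.pool[meta.offset:meta.offset + meta.length] = flat
            self.index[key] = meta
            self.cursor += meta.length
        self.pending.clear()
        self.pending_bytes = 0

    def get_kv(self, key: tuple, dtype, shape) -> np.ndarray:
        """Read a block back from the pool into a typed array."""
        meta = self.index[key]
        raw = self.pool[meta.offset:meta.offset + meta.length]
        return raw.view(dtype).reshape(shape).copy()
```

A production layer would also need eviction, handling of the single‑writer assumption noted under Limitations, and a GPU‑side read path; those are deliberately left out of this sketch.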
Results & Findings
| Metric | RDMA baseline | Beluga‑KVCache |
|---|---|---|
| TTFT (time to first token) | 1.00 s (normalized) | 0.10 s (−89.6%) |
| Throughput (tokens/s) | 1× | 7.35× |
| Average KV‑Cache access latency | ~2.3 µs (network + CPU) | ~0.3 µs (near‑local) |
| Scalability (number of GPUs) | Degrades beyond 4 GPUs (network saturation) | Scales linearly up to 8 GPUs (CXL bandwidth sufficient) |
The data show that moving KV‑Cache storage onto a CXL‑backed pool cuts the critical path latency by an order of magnitude and unlocks much higher token‑per‑second rates, especially for long‑context workloads where the cache dominates memory traffic.
Practical Implications
- LLM SaaS providers can serve each request more cheaply and responsively: the throughput gains lower cost per token, while the shorter time‑to‑first‑token directly improves user experience.
- Hardware architects get a concrete reference design for integrating CXL switches into GPU‑centric AI servers, making it easier to provision terabytes of “GPU‑accessible” memory without over‑provisioning HBM.
- Framework developers (e.g., PyTorch, TensorFlow, vLLM) can adopt the Beluga‑KVCache API to offload KV‑Cache handling to a shared pool, simplifying memory management code and reducing the need for custom RDMA layers.
- Edge and on‑prem deployments that cannot afford massive GPU memory can still serve large‑context LLMs by attaching a modest CXL memory module, extending the life of existing GPU fleets.
Limitations & Future Work
- Hardware availability: The prototype relies on a commercial CXL switch that is still early‑stage; broader adoption may be limited until the ecosystem matures.
- Cache coherence: The current design assumes a single writer per KV‑Cache segment; extending to fully coherent multi‑writer scenarios would need additional protocol support.
- Software integration overhead: While the paper shows impressive gains in vLLM, integrating Beluga‑KVCache into other frameworks may require non‑trivial engineering effort.
- Future directions: The authors suggest exploring hierarchical CXL pools (e.g., combining local HBM, local DRAM, and remote CXL memory), adaptive placement policies for KV‑Cache entries (a toy policy is sketched below), and tighter coupling with features of newer CXL revisions, such as memory pooling and device‑to‑device communication.
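To make the adaptive‑placement direction a bit more concrete, here is a toy tiering policy that prefers faster tiers for recently used KV blocks and spills colder blocks toward the CXL pool; the tiers, capacities, and thresholds are invented for illustration and are not a design from the paper.

```python
# Toy placement policy for a hierarchical pool (HBM -> local DRAM -> CXL).
# All capacities and thresholds below are invented for illustration.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HBM = 0         # on-GPU memory, fastest, smallest
    LOCAL_DRAM = 1  # host memory
    CXL_POOL = 2    # shared CXL-attached pool, largest

# Hypothetical per-tier capacities in bytes.
CAPACITY = {Tier.HBM: 8 << 30, Tier.LOCAL_DRAM: 64 << 30, Tier.CXL_POOL: 8 << 40}

@dataclass
class BlockInfo:
    nbytes: int
    last_access: float  # timestamp of the most recent access

def choose_tier(block: BlockInfo, used: dict, now: float) -> Tier:
    """Prefer the fastest tier a block's recency earns; demote if it is full."""
    idle = now - block.last_access
    preferred = (Tier.HBM if idle < 1.0
                 else Tier.LOCAL_DRAM if idle < 30.0
                 else Tier.CXL_POOL)
    # Walk downward from the preferred tier until one has room.
    for tier in Tier:
        if tier.value < preferred.value:
            continue
        if used.get(tier, 0) + block.nbytes <= CAPACITY[tier]:
            return tier
    return Tier.CXL_POOL  # fall back to the largest tier
```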
Authors
- Xinjun Yang
- Qingda Hu
- Junru Li
- Feifei Li
- Yuqi Zhou
- Yicong Zhu
- Qiuru Lin
- Jian Dai
- Yang Kong
- Jiayu Zhang
- Guoqiang Xu
- Qiang Liu
Paper Information
- arXiv ID: 2511.20172v1
- Categories: cs.DC, cs.AI
- Published: November 25, 2025