[Paper] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Source: arXiv - 2511.20172v1
Overview
Large language models (LLMs) keep growing in parameter count, and serving them with long context windows puts massive pressure on GPU memory. The paper introduces Beluga, a CXL‑based memory architecture that lets GPUs and CPUs share a huge, low‑latency memory pool for the KV‑Cache, the data structure that stores attention keys and values during inference. By moving away from RDMA‑based disaggregated memory, Beluga delivers near‑local memory performance while keeping the programming model simple.
Key Contributions
- CXL‑enabled shared memory pool: Demonstrates how GPUs can perform native load/store operations over a CXL switch, eliminating the need for custom RDMA protocols (a minimal host‑side mapping sketch follows this list).
- Design guidelines for CXL switches: Provides a systematic characterization of commercial CXL switch performance and derives practical rules for building scalable memory systems.
- Beluga‑KVCache system: Implements a KV‑Cache manager that leverages the shared pool, achieving up to an 89.6% reduction in Time‑to‑First‑Token (TTFT) and 7.35× higher throughput in the popular vLLM inference engine.
- Prototype and evaluation: Builds a working prototype on off‑the‑shelf hardware and validates the latency/throughput gains against state‑of‑the‑art RDMA solutions.
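To ground the first contribution, here is a minimal host‑side sketch of what load/store access to a CXL‑attached pool can look like, assuming the pool is exposed to Linux as a DAX device; the device path, mapping size, and helper name are hypothetical, and the GPU‑side mapping that Beluga performs over the switch is not shown.

```python
# Minimal sketch: map a CXL-backed memory region and touch it with ordinary
# loads/stores. The device path, size, and function name are illustrative
# assumptions, not details taken from the paper.
import mmap

import numpy as np

CXL_DAX_PATH = "/dev/dax0.0"   # hypothetical DAX device backed by the CXL pool
MAP_BYTES = 1 << 30            # map 1 GiB of the pool

def map_cxl_region(path: str = CXL_DAX_PATH, length: int = MAP_BYTES) -> np.ndarray:
    """Map a slice of the CXL memory pool and expose it as a flat byte array."""
    with open(path, "r+b") as f:
        buf = mmap.mmap(f.fileno(), length, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
    # Indexing this NumPy view turns into plain CPU loads/stores against
    # CXL memory: no RDMA verbs, queue pairs, or message passing involved.
    return np.frombuffer(buf, dtype=np.uint8)

if __name__ == "__main__":
    pool = map_cxl_region()
    pool[:8] = list(range(8))  # store
    print(pool[:8])            # load
```

The point of the sketch is the programming model: once mapped, the pool is just memory, which is the property that lets Beluga avoid the custom RDMA protocols mentioned above.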
Methodology
- Hardware platform: The authors assemble a testbed consisting of GPUs, CPUs, and a commercial CXL switch that connects a large DRAM pool (tens of terabytes).
- Micro‑benchmarking: They run a suite of latency and bandwidth tests to understand how CXL behaves under different access patterns (random vs. sequential, small vs. large transfers).
- Guideline extraction: From the measurements they derive rules—e.g., keep request sizes above 256 KB to amortize switch overhead, batch KV‑Cache updates to reduce contention, and pin frequently accessed pages to avoid page‑fault penalties.
- System design: Using these guidelines, they build Beluga‑KVCache, a software layer that maps KV‑Cache entries directly into the shared CXL memory and exposes a simple API to the inference engine (an illustrative sketch follows this list).
- Evaluation: They integrate Beluga‑KVCache with vLLM, a high‑performance LLM serving framework, and compare against an RDMA‑based disaggregated‑memory baseline across several model sizes (7B–65B) and context lengths (up to 32K tokens).
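The Beluga‑KVCache API itself is not reproduced in this summary, so the following is only a rough sketch, under stated assumptions, of the kind of manager the last two bullets describe: KV blocks are placed into the mapped CXL region by a simple bump allocator, and writes are batched until at least 256 KB is queued, per the request‑size guideline above. The class and method names (CXLKVCacheManager, put_kv, get_kv) are hypothetical.

```python
# Illustrative KV-Cache manager over a CXL-mapped pool (see the mapping sketch
# above). Only the guidelines it encodes -- >=256 KB transfers and batched
# updates -- come from the summary; the structure itself is an assumption.
from dataclasses import dataclass

import numpy as np

MIN_TRANSFER_BYTES = 256 * 1024  # amortize CXL switch overhead (guideline)

@dataclass
class KVBlockMeta:
    offset: int  # byte offset of the block inside the pool
    length: int  # block size in bytes

class CXLKVCacheManager:
    def __init__(self, pool: np.ndarray):
        self.pool = pool      # flat byte view of the mapped CXL region
        self.cursor = 0       # simple bump allocator over the pool
        self.index: dict[tuple, KVBlockMeta] = {}  # (request_id, layer) -> block
        self.pending: list[tuple[tuple, np.ndarray]] = []
        self.pending_bytes = 0

    def put_kv(self, key: tuple, kv_block: np.ndarray) -> None:
        """Queue a KV block; flush once enough bytes accumulate to reach the
        minimum transfer size."""
        self.pending.append((key, np.ascontiguousarray(kv_block)))
        self.pending_bytes += kv_block.nbytes
        if self.pending_bytes >= MIN_TRANSFER_BYTES:
            self.flush()

    def flush(self) -> None:
        """Copy all queued blocks into consecutive slots of the pool."""
        for key, block in self.pending:
            flat = block.view(np.uint8).reshape(-1)
            meta = KVBlockMeta(self.cursor, flat.nbytes)
            self.pool[meta.offset:meta.offset + meta.length] = flat
            self.index[key] = meta
            self.cursor += meta.length
        self.pending.clear()
        self.pending_bytes = 0

    def get_kv(self, key: tuple, dtype, shape) -> np.ndarray:
        """Read a block back from the pool into a typed array."""
        meta = self.index[key]
        raw = self.pool[meta.offset:meta.offset + meta.length]
        return raw.view(dtype).reshape(shape).copy()
```

A production layer would also need eviction, handling of the single‑writer assumption noted under Limitations, and a GPU‑side read path; those are deliberately left out of this sketch.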
Results & Findings
| Metric | RDMA baseline | Beluga‑KVCache |
|---|---|---|
| TTFT (time to first token) | 1.00 s (normalized) | 0.10 s (−89.6%) |
| Throughput (tokens/s) | 1× | 7.35× |
| Average KV‑Cache access latency | ~2.3 µs (network + CPU) | ~0.3 µs (near‑local) |
| Scalability (number of GPUs) | Degrades beyond 4 GPUs (network saturation) | Scales linearly up to 8 GPUs (CXL bandwidth sufficient) |
The data show that moving KV‑Cache storage onto a CXL‑backed pool cuts the critical path latency by an order of magnitude and unlocks much higher token‑per‑second rates, especially for long‑context workloads where the cache dominates memory traffic.
Practical Implications
- LLM SaaS providers can serve each request more cheaply and responsively: the throughput gains lower cost per token, while the shorter time‑to‑first‑token directly improves user experience.
- Hardware architects get a concrete reference design for integrating CXL switches into GPU‑centric AI servers, making it easier to provision terabytes of “GPU‑accessible” memory without over‑provisioning HBM.
- Framework developers (e.g., PyTorch, TensorFlow, vLLM) can adopt the Beluga‑KVCache API to offload KV‑Cache handling to a shared pool, simplifying memory management code and reducing the need for custom RDMA layers.
- Edge and on‑prem deployments that cannot afford massive GPU memory can still serve large‑context LLMs by attaching a modest CXL memory module, extending the life of existing GPU fleets.
Limitations & Future Work
- Hardware availability: The prototype relies on a commercial CXL switch that is still early‑stage; broader adoption may be limited until the ecosystem matures.
- Cache coherence: The current design assumes a single writer per KV‑Cache segment; extending to fully coherent multi‑writer scenarios would need additional protocol support.
- Software integration overhead: While the paper shows impressive gains in vLLM, integrating Beluga‑KVCache into other frameworks may require non‑trivial engineering effort.
- Future directions: The authors suggest exploring hierarchical CXL pools (e.g., combining local HBM, local DRAM, and remote CXL memory), adaptive placement policies for KV‑Cache entries (a toy policy is sketched below), and tighter coupling with features of newer CXL revisions, such as memory pooling and device‑to‑device communication.
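To make the adaptive‑placement direction a bit more concrete, here is a toy tiering policy that prefers faster tiers for recently used KV blocks and spills colder blocks toward the CXL pool; the tiers, capacities, and thresholds are invented for illustration and are not a design from the paper.

```python
# Toy placement policy for a hierarchical pool (HBM -> local DRAM -> CXL).
# All capacities and thresholds below are invented for illustration.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HBM = 0         # on-GPU memory, fastest, smallest
    LOCAL_DRAM = 1  # host memory
    CXL_POOL = 2    # shared CXL-attached pool, largest

# Hypothetical per-tier capacities in bytes.
CAPACITY = {Tier.HBM: 8 << 30, Tier.LOCAL_DRAM: 64 << 30, Tier.CXL_POOL: 8 << 40}

@dataclass
class BlockInfo:
    nbytes: int
    last_access: float  # timestamp of the most recent access

def choose_tier(block: BlockInfo, used: dict, now: float) -> Tier:
    """Prefer the fastest tier a block's recency earns; demote if it is full."""
    idle = now - block.last_access
    preferred = (Tier.HBM if idle < 1.0
                 else Tier.LOCAL_DRAM if idle < 30.0
                 else Tier.CXL_POOL)
    # Walk downward from the preferred tier until one has room.
    for tier in Tier:
        if tier.value < preferred.value:
            continue
        if used.get(tier, 0) + block.nbytes <= CAPACITY[tier]:
            return tier
    return Tier.CXL_POOL  # fall back to the largest tier
```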
Authors
- Xinjun Yang
- Qingda Hu
- Junru Li
- Feifei Li
- Yuqi Zhou
- Yicong Zhu
- Qiuru Lin
- Jian Dai
- Yang Kong
- Jiayu Zhang
- Guoqiang Xu
- Qiang Liu
Paper Information
- arXiv ID: 2511.20172v1
- Categories: cs.DC, cs.AI
- Published: November 25, 2025