[Paper] One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Published: May 5, 2026 at 11:25 PM EDT

Source: arXiv - 2605.04450v1

Overview

The paper introduces HELM, a runtime system that dynamically partitions GPU high‑bandwidth memory (HBM) between two competing caches used by generative recommender models: the embedding hot cache (EMB) and the key‑value (KV) cache. By continuously adapting the memory split and routing requests intelligently, HELM closes a 20‑30 % latency gap that static allocations leave on the table, while keeping throughput intact.

Key Contributions

Joint HBM allocation & request routing – First system to treat EMB and KV caches as a coupled resource rather than optimizing them separately.
Three‑layer PPO controller – A lightweight reinforcement‑learning (proximal policy optimization) controller that combines a frozen base policy, an online residual adapter, and a burst‑aware recovery module, delivering decisions in ~32 µs.
KV‑aware scheduling algorithm – Routes inference requests based on current KV residency, embedding locality, and node load, preventing costly H2D data refills during bursts.
Real‑world evaluation – Demonstrates 24‑38 % P99 latency reduction and >93 % SLO compliance on a 32‑node A100 cluster across steady, trending, and bursty workloads, outperforming state‑of‑the‑art baselines.
Practical latency‑optimality – Stays within 0.024–0.029 of the offline‑optimal EMB/KV memory ratio, i.e., the online split misses the offline optimum by roughly 2.4–2.9 % of HBM.

Methodology

Problem framing – The authors model the EMB/KV memory split as a continuous control problem: the goal is to pick a ratio that minimizes tail latency while respecting the fixed HBM capacity of each GPU.
Three‑layer PPO controller
- Base policy: Trained offline on historical workload traces; frozen during serving to provide a strong prior.
- Residual adapter: A lightweight online learner that fine‑tunes the base decision using recent latency feedback, allowing rapid adaptation to workload shifts.
- Burst‑aware recovery: Detects sudden traffic spikes and temporarily overrides the residual adapter to avoid over‑reacting to transient noise.
The controller consumes a compact state vector (e.g., recent cache hit rates, request arrival rates, node load) and outputs a new EMB‑to‑KV ratio every few milliseconds.
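To make the layering concrete, here is a minimal Python sketch of how a frozen base decision, an online residual, and a burst override could be composed. It is not the paper's PPO controller: the base policy is replaced by a miss‑rate heuristic, the residual adapter by a hill‑climbing step on P99 feedback, and the burst detector by an arrival‑rate threshold. All names (State, LayeredController, burst_factor, step, max_residual) and all constants are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    emb_hit_rate: float   # embedding hot-cache hit rate in [0, 1]
    kv_hit_rate: float    # KV-cache hit rate in [0, 1]
    arrival_rate: float   # requests per second
    p99_ms: float         # most recent P99 latency feedback


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


class LayeredController:
    """Illustrative layering only: frozen base + online residual + burst override."""

    def __init__(self, burst_factor=2.0, step=0.01, max_residual=0.1):
        self.burst_factor = burst_factor   # assumed burst threshold multiplier
        self.step = step                   # assumed residual step size
        self.max_residual = max_residual   # assumed bound on the correction
        self.residual = 0.0
        self.direction = 1.0
        self.prev_p99 = None
        self.avg_rate = None

    def base_policy(self, s):
        # Stand-in for the frozen, offline-trained policy:
        # give a larger EMB share when the embedding cache misses more.
        emb_miss = 1.0 - s.emb_hit_rate
        kv_miss = 1.0 - s.kv_hit_rate
        total = emb_miss + kv_miss
        return 0.5 if total == 0 else emb_miss / total

    def is_burst(self, s):
        # Flag a burst when arrivals exceed a multiple of the running average.
        if self.avg_rate is None:
            self.avg_rate = s.arrival_rate
            return False
        burst = s.arrival_rate > self.burst_factor * self.avg_rate
        if not burst:
            # Only calm periods update the running average.
            self.avg_rate = 0.9 * self.avg_rate + 0.1 * s.arrival_rate
        return burst

    def update_residual(self, s):
        # Hill climbing: reverse direction if the last move made P99 worse.
        if self.prev_p99 is not None and s.p99_ms > self.prev_p99:
            self.direction = -self.direction
        self.residual = clamp(self.residual + self.direction * self.step,
                              -self.max_residual, self.max_residual)
        self.prev_p99 = s.p99_ms

    def decide(self, s):
        """Return the EMB share of the HBM pool; the KV share is 1 minus this."""
        base = self.base_policy(s)
        if self.is_burst(s):
            # Burst override: fall back to the base prior, leave the residual frozen.
            return clamp(base, 0.05, 0.95)
        self.update_residual(s)
        return clamp(base + self.residual, 0.05, 0.95)
```

Freezing the residual during a burst is one reading of "temporarily overrides the residual adapter"; the actual recovery logic in HELM may differ.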
EMB‑KV‑aware scheduling – When a request arrives, the scheduler checks:
- Whether its required KV entries already reside in a candidate GPU’s KV cache.
- The locality of its embedding vectors (favoring GPUs where embeddings are hot).
- Current load on each node.
It then picks the best GPU, avoiding a costly host‑to‑device (H2D) refill that would otherwise stall the critical path.
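The three checks above can be read as a weighted score per candidate GPU. The sketch below is one hypothetical way to combine them in Python; the GpuView fields, the linear scoring form, and the weights (w_kv, w_emb, w_load) are assumptions for illustration, not the scoring function used by HELM.

```python
from dataclasses import dataclass


@dataclass
class GpuView:
    name: str
    kv_resident: set      # IDs of KV blocks resident in this GPU's KV cache
    hot_embeddings: set   # IDs of embedding rows in this GPU's hot cache
    load: float           # current utilization in [0, 1]


def overlap(needed, present):
    """Fraction of needed items already present (1.0 if nothing is needed)."""
    if not needed:
        return 1.0
    return len(needed & present) / len(needed)


def score(gpu, kv_needed, emb_needed, w_kv=0.5, w_emb=0.3, w_load=0.2):
    # Higher is better: reward KV residency and embedding locality,
    # penalize nodes that are already busy. Weights are made up.
    return (w_kv * overlap(kv_needed, gpu.kv_resident)
            + w_emb * overlap(emb_needed, gpu.hot_embeddings)
            + w_load * (1.0 - gpu.load))


def route(gpus, kv_needed, emb_needed):
    """Pick the candidate GPU with the highest combined score."""
    return max(gpus, key=lambda g: score(g, kv_needed, emb_needed))


if __name__ == "__main__":
    gpus = [
        GpuView("gpu0", {1, 2, 3}, {10}, load=0.9),
        GpuView("gpu1", set(), {10, 11}, load=0.1),
    ]
    # A request whose KV blocks are resident on gpu0 stays there despite the load.
    print(route(gpus, kv_needed={1, 2, 3}, emb_needed={10, 11}).name)
```

With a heavy weight on KV residency, a busy GPU that already holds the request's KV entries can still beat an idle one, which is the behavior that avoids the H2D refill described above.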
Evaluation setup – Experiments use three production‑scale recommendation datasets, a 32‑node NVIDIA A100 cluster, and three workload patterns (steady, trending, burst). Baselines include static memory partitions, separate EMB/KV optimizers, and prior adaptive cache managers.

Results & Findings

Metric	Static best	Prior adaptive	HELM
P99 latency reduction vs. static	–	12–18 %	24–38 %
SLO (99‑th percentile) satisfaction	70–85 %	80–92 %	93.5–99.6 %
Throughput impact	baseline	~‑2 %	~0 % (unchanged)
Decision latency (controller)	N/A	N/A	≈32 µs
Memory‑ratio optimality gap	0.05–0.07	0.03–0.04	0.024–0.029

The optimal EMB/KV split can swing by up to 0.35 (35 % of HBM) when moving from a steady to a bursty regime; HELM tracks this shift in real time.
Naïve reallocation (e.g., moving memory on the fly without scheduling) caused P99 violations in >40 % of bursts; HELM’s joint scheduler eliminated those violations.
Even under extreme bursts, the burst‑aware recovery module kept latency spikes bounded, allowing the system to “snap back” to the optimal ratio within a few milliseconds.
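To put the ratio figures in absolute terms, the short Python sketch below converts a ratio (or a ratio difference) into capacity. The 80 GB pool size is an assumption: the summary does not say which A100 variant was used or how much HBM is left for caches after model weights. The function names are hypothetical.

```python
def split_pool(pool_gb, emb_ratio):
    """Split a cache pool into (EMB, KV) capacities for a given EMB share."""
    if not 0.0 <= emb_ratio <= 1.0:
        raise ValueError("emb_ratio must be in [0, 1]")
    emb_gb = pool_gb * emb_ratio
    return emb_gb, pool_gb - emb_gb


def ratio_delta_to_gb(pool_gb, delta):
    """Capacity that a change (or error) of `delta` in the ratio corresponds to."""
    return pool_gb * delta


if __name__ == "__main__":
    POOL_GB = 80.0  # assumed pool size, not stated in the source
    print(split_pool(POOL_GB, 0.25))
    print(ratio_delta_to_gb(POOL_GB, 0.35))    # steady-to-bursty swing
    print(ratio_delta_to_gb(POOL_GB, 0.024))   # lower end of HELM's gap
    print(ratio_delta_to_gb(POOL_GB, 0.029))   # upper end of HELM's gap
```

Under that assumption, the 0.35 swing moves about 28 GB between the two caches, while HELM's 0.024–0.029 gap corresponds to roughly 1.9–2.3 GB of misallocated memory.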

Practical Implications

Deployable on existing GPU clusters – HELM runs as a thin runtime layer on top of standard inference frameworks (e.g., TensorRT, PyTorch) and requires only metric hooks (cache hit rates, request timestamps).
Cost savings – By cutting tail latency on the same hardware, operators can serve more users per GPU or meet tighter SLAs without adding expensive nodes.
Generalizable pattern – The three‑layer PPO architecture can be repurposed for any scenario where multiple in‑memory structures compete for a fixed accelerator memory budget (e.g., transformer KV caches vs. activation buffers).
Improved user experience – Lower tail latency means faster recommendation refreshes, which may in turn improve click‑through rates and A/B‑test outcomes for product teams.
Simplified ops – The system automatically adapts to workload trends (e.g., seasonal traffic spikes) without manual retuning of cache sizes, reducing the operational burden on MLOps engineers.

Limitations & Future Work

GPU‑specific – HELM is evaluated on NVIDIA A100 GPUs; porting to other accelerators (e.g., AMD Instinct, Intel Xe GPUs) may require re‑training the base policy due to different memory hierarchies.
Model‑structure assumptions – The approach assumes a clear separation between embedding and KV caches; models that fuse these structures or use alternative memory layouts might need custom adaptations.
Training overhead – While the online residual adapter is lightweight, the initial offline training of the base policy still requires a representative workload trace, which may be costly for new services.
Scalability of state collection – Collecting fine‑grained cache statistics at sub‑millisecond granularity could become a bottleneck on extremely large clusters; future work could explore hierarchical or sampled telemetry.
Extending to multi‑tenant scenarios – The current scheduler treats all requests equally; incorporating priority or fairness guarantees across tenants is an open direction.

Overall, HELM showcases how a tightly coupled memory‑allocation and request‑routing strategy can unlock significant latency gains for generative recommender serving, offering a practical blueprint for production teams looking to squeeze more performance out of their GPU fleets.

Authors

Wenjun Yu
Shuguang Han
Amelie Chi Zhou

Paper Information

arXiv ID: 2605.04450v1
Categories: cs.DC, cs.IR, cs.LG
Published: May 6, 2026
