[Paper] One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
Source: arXiv - 2605.04450v1
Overview
The paper introduces HELM, a runtime system that dynamically partitions GPU high‑bandwidth memory (HBM) between two competing caches used by generative recommender models: the embedding hot cache (EMB) and the key‑value (KV) cache. By continuously adapting the memory split and routing requests intelligently, HELM closes a 20‑30 % latency gap that static allocations leave on the table, while keeping throughput intact.
Key Contributions
- Joint HBM allocation & request routing – First system to treat EMB and KV caches as a coupled resource rather than optimizing them separately.
- Three‑layer PPO controller – A lightweight reinforcement‑learning (proximal policy optimization) controller that combines a frozen base policy, an online residual adapter, and a burst‑aware recovery module, delivering decisions in ~32 µs.
- KV‑aware scheduling algorithm – Routes inference requests based on current KV residency, embedding locality, and node load, preventing costly host‑to‑device (H2D) refills during bursts.
- Real‑world evaluation – Demonstrates 24‑38 % P99 latency reduction and >93 % SLO compliance on a 32‑node A100 cluster across steady, trending, and bursty workloads, outperforming state‑of‑the‑art baselines.
- Practical latency‑optimality – The EMB/KV memory ratio HELM selects deviates from the offline‑optimal ratio by only 0.024–0.029.
Methodology
- Problem framing – The authors model the EMB/KV memory split as a continuous control problem: the goal is to pick a ratio that minimizes tail latency while respecting the fixed HBM capacity of each GPU.
- Three‑layer PPO controller
  - Base policy: Trained offline on historical workload traces; frozen during serving to provide a strong prior.
  - Residual adapter: A lightweight online learner that fine‑tunes the base decision using recent latency feedback, allowing rapid adaptation to workload shifts.
  - Burst‑aware recovery: Detects sudden traffic spikes and temporarily overrides the residual adapter to avoid over‑reacting to transient noise.

  The controller consumes a compact state vector (e.g., recent cache hit rates, request arrival rates, node load) and outputs a new EMB‑to‑KV ratio every few milliseconds.
- EMB‑KV‑aware scheduling – When a request arrives, the scheduler checks:
  - Whether its required KV entries already reside in the current GPU’s KV cache.
  - The locality of its embedding vectors (favoring GPUs where embeddings are hot).
  - Current load on each node.

  It then picks the best GPU, avoiding a costly host‑to‑device (H2D) refill that would otherwise stall the critical path.
- Evaluation setup – Experiments use three production‑scale recommendation datasets, a 32‑node NVIDIA A100 cluster, and three workload patterns (steady, trending, burst). Baselines include static memory partitions, separate EMB/KV optimizers, and prior adaptive cache managers.
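To make the layering of the controller concrete, here is a minimal Python sketch of how a frozen prior, an online residual, and a burst guard could be composed into one ratio decision. It is not the paper's implementation: the base policy is reduced to a constant, the residual adapter is a simple hill-climbing rule rather than PPO, and every name and threshold (`ThreeLayerController`, `step`, `max_residual`, `burst_factor`, `split_hbm`) is a hypothetical chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def split_hbm(cache_bytes: int, emb_ratio: float) -> tuple:
    """Split a fixed cache budget into (EMB bytes, KV bytes)."""
    emb = int(cache_bytes * emb_ratio)
    return emb, cache_bytes - emb


@dataclass
class ThreeLayerController:
    # All defaults are illustrative assumptions, not values from the paper.
    base_ratio: float = 0.5      # stand-in for the frozen base policy's output
    step: float = 0.01           # size of one residual adjustment
    max_residual: float = 0.1    # how far the adapter may move away from the prior
    burst_factor: float = 2.0    # rate / recent mean rate that counts as a burst
    residual: float = 0.0
    _direction: float = 1.0
    _last_p99: Optional[float] = None
    _rates: deque = field(default_factory=lambda: deque(maxlen=16))

    def decide(self, p99_ms: float, arrival_rate: float) -> float:
        """Return the EMB share of the cache budget for the next interval."""
        mean_rate = sum(self._rates) / len(self._rates) if self._rates else None
        burst = mean_rate is not None and arrival_rate > self.burst_factor * mean_rate

        if not burst:
            # Normal operation: the residual adapter nudges the prior and
            # reverses direction whenever the last nudge made P99 worse.
            self._rates.append(arrival_rate)
            if self._last_p99 is not None and p99_ms > self._last_p99:
                self._direction = -self._direction
            self.residual += self._direction * self.step
            self.residual = max(-self.max_residual,
                                min(self.max_residual, self.residual))
            self._last_p99 = p99_ms
        # During a burst the adapter is frozen, so transient latency noise
        # does not drag the ratio away from its last stable value.

        return max(0.0, min(1.0, self.base_ratio + self.residual))


ctrl = ThreeLayerController(base_ratio=0.6)
ratio = ctrl.decide(p99_ms=10.0, arrival_rate=100.0)
emb_bytes, kv_bytes = split_hbm(40 * 2**30, ratio)  # 40 GiB budget is an assumption
```

The point of the structure is that the online part can only move the ratio within a bounded band around the offline prior, and the burst guard decides whether it is allowed to move at all.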
Results & Findings
| Metric | Static best | Prior adaptive | HELM |
|---|---|---|---|
| P99 latency reduction vs. static | – | 12–18 % | 24–38 % |
| P99 SLO compliance | 70–85 % | 80–92 % | 93.5–99.6 % |
| Throughput impact | baseline | ≈ −2 % | ≈ 0 % (unchanged) |
| Decision latency (controller) | N/A | N/A | ≈32 µs |
| Memory‑ratio optimality gap | 0.05–0.07 | 0.03–0.04 | 0.024–0.029 |
- The optimal EMB/KV split can swing by up to 0.35 (35 % of HBM) when moving from a steady to a bursty regime; HELM tracks this shift in real time.
- Naïve reallocation (i.e., resizing the caches on the fly without coordinated request scheduling) caused P99 violations in >40 % of bursts; HELM’s joint scheduler eliminated those violations.
- Even under extreme bursts, the burst‑aware recovery module kept latency spikes bounded, allowing the system to return to the optimal ratio within a few milliseconds.
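The role of the scheduler in these results is easier to see with the routing rule from the Methodology written out. The sketch below scores each GPU on the three signals the summary names (KV residency, embedding locality, node load) and picks the highest score. The linear scoring form, the weights, and the names `GpuState`, `score`, and `route` are assumptions for illustration; the summary does not specify how HELM combines the signals.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    name: str
    kv_resident: set      # request/session keys whose KV entries are cached here
    hot_embeddings: set   # embedding IDs currently in this GPU's EMB hot cache
    load: float           # normalized load in [0, 1]


def score(gpu: GpuState, request_key: str, emb_ids: set,
          w_kv: float = 1.0, w_emb: float = 0.5, w_load: float = 0.5) -> float:
    """Higher is better. The weights are hypothetical, not from the paper."""
    kv_hit = 1.0 if request_key in gpu.kv_resident else 0.0
    emb_locality = len(emb_ids & gpu.hot_embeddings) / len(emb_ids) if emb_ids else 0.0
    return w_kv * kv_hit + w_emb * emb_locality - w_load * gpu.load


def route(gpus: list, request_key: str, emb_ids: set) -> GpuState:
    """Pick the GPU with the best combined score for this request."""
    return max(gpus, key=lambda g: score(g, request_key, emb_ids))


gpus = [
    GpuState("gpu0", kv_resident={"u1"}, hot_embeddings={1, 2}, load=0.7),
    GpuState("gpu1", kv_resident=set(), hot_embeddings={1, 2, 3}, load=0.2),
]
print(route(gpus, "u1", {1, 2, 3}).name)  # KV already resident on gpu0
print(route(gpus, "u9", {1, 2, 3}).name)  # no KV anywhere, so locality and load decide
```

With these weights, a resident KV entry outweighs a moderately higher load, which mirrors the stated goal of avoiding an H2D refill on the critical path; a request with no resident KV falls through to whichever GPU has hotter embeddings and more headroom.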
Practical Implications
- Deployable on existing GPU clusters – HELM runs as a thin runtime layer on top of standard inference frameworks (e.g., TensorRT, PyTorch) and requires only metric hooks (cache hit rates, request timestamps).
- Cost savings – By cutting tail latency on the same hardware, operators can serve more users per GPU or meet tighter SLAs without adding expensive nodes.
- Generalizable pattern – The three‑layer PPO architecture can be repurposed for any scenario where multiple in‑memory structures compete for a fixed accelerator memory budget (e.g., transformer KV caches vs. activation buffers).
- Improved user experience – Lower tail latency means faster recommendation refreshes, which can in turn improve click‑through rates and A/B‑test outcomes for product teams.
- Simplified ops – The system automatically adapts to workload trends (e.g., seasonal traffic spikes) without manual retuning of cache sizes, reducing the operational burden on MLOps engineers.
Limitations & Future Work
- GPU‑specific – HELM is evaluated on NVIDIA A100 GPUs; porting to other accelerators (e.g., AMD Instinct, Intel Xe GPUs) may require re‑training the base policy due to different memory hierarchies.
- Model‑structure assumptions – The approach assumes a clear separation between embedding and KV caches; models that fuse these structures or use alternative memory layouts might need custom adaptations.
- Training overhead – While the online residual adapter is lightweight, the initial offline training of the base policy still requires a representative workload trace, which may be costly for new services.
- Scalability of state collection – Collecting fine‑grained cache statistics at sub‑millisecond granularity could become a bottleneck on extremely large clusters; future work could explore hierarchical or sampled telemetry.
- Extending to multi‑tenant scenarios – The current scheduler treats all requests equally; incorporating priority or fairness guarantees across tenants is an open direction.
Overall, HELM showcases how a tightly coupled memory‑allocation and request‑routing strategy can unlock significant latency gains for generative recommender serving, offering a practical blueprint for production teams looking to squeeze more performance out of their GPU fleets.
Authors
- Wenjun Yu
- Shuguang Han
- Amelie Chi Zhou
Paper Information
- arXiv ID: 2605.04450v1
- Categories: cs.DC, cs.IR, cs.LG
- Published: May 6, 2026