[Paper] One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Published: (May 5, 2026 at 11:25 PM EDT)
Source: arXiv - 2605.04450v1

Overview

The paper introduces HELM, a runtime system that dynamically partitions GPU high‑bandwidth memory (HBM) between two competing caches used by generative recommender models: the embedding hot cache (EMB) and the key‑value (KV) cache. By continuously adapting the memory split and routing each request according to KV residency, embedding locality, and node load, HELM closes the 20‑30 % latency gap that static allocations leave open, without reducing throughput.

Key Contributions

  • Joint HBM allocation & request routing – First system to treat EMB and KV caches as a coupled resource rather than optimizing them separately.
  • Three‑layer PPO controller – A lightweight reinforcement‑learning (proximal policy optimization) controller that combines a frozen base policy, an online residual adapter, and a burst‑aware recovery module, delivering decisions in ~32 µs.
  • KV‑aware scheduling algorithm – Routes inference requests based on current KV residency, embedding locality, and node load, preventing costly H2D data refills during bursts.
  • Real‑world evaluation – Demonstrates 24‑38 % P99 latency reduction and >93 % SLO compliance on a 32‑node A100 cluster across steady, trending, and bursty workloads, outperforming state‑of‑the‑art baselines.
  • Practical latency‑optimality – Stays within 0.024–0.029 of the offline‑optimal EMB/KV memory ratio. Because the ratio is expressed as a fraction of HBM, the online decision is off by less than 3 % of capacity.

Methodology

  1. Problem framing – The authors model the EMB/KV memory split as a continuous control problem: the goal is to pick a ratio that minimizes tail latency while respecting the fixed HBM capacity of each GPU.
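
One way to write this framing down, as a sketch rather than the paper's own notation: let $M$ be the per‑GPU HBM budget shared by the two caches, $r \in [0,1]$ the fraction given to EMB, $w_t$ the workload at decision step $t$, and $L_{99}$ the resulting P99 latency. All of these symbols are assumptions introduced here for illustration.

```latex
\begin{aligned}
r_t^{*} &= \arg\min_{r \in [0,1]} \; L_{99}(r;\, w_t) \\
\text{s.t.}\quad M_{\mathrm{EMB}} &= r\,M, \qquad M_{\mathrm{KV}} = (1-r)\,M, \qquad M_{\mathrm{EMB}} + M_{\mathrm{KV}} = M
\end{aligned}
```

Under this notation, the memory‑ratio optimality gap reported for HELM would correspond to $|r_t - r_t^{*}|$, assuming it is measured as an absolute difference between ratios.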

  2. Three‑layer PPO controller

    • Base policy: Trained offline on historical workload traces; frozen during serving to provide a strong prior.
    • Residual adapter: A lightweight online learner that fine‑tunes the base decision using recent latency feedback, allowing rapid adaptation to workload shifts.
    • Burst‑aware recovery: Detects sudden traffic spikes and temporarily overrides the residual adapter to avoid over‑reacting to transient noise.

    The controller consumes a compact state vector (e.g., recent cache hit rates, request arrival rates, node load) and outputs a new EMB‑to‑KV ratio every few milliseconds.
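
The layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `base_policy` is a hand‑written stand‑in for the frozen offline‑trained policy, `ResidualAdapter` uses a simple direction‑flipping update instead of PPO, and `BurstDetector` flags a burst when the arrival rate exceeds a multiple of its running average. All names, thresholds, and step sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    emb_hit_rate: float   # recent embedding-cache hit rate, in [0, 1]
    kv_hit_rate: float    # recent KV-cache hit rate, in [0, 1]
    arrival_rate: float   # requests per second
    node_load: float      # utilization in [0, 1]; unused by the toy base policy


def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def base_policy(s):
    """Stand-in for the frozen base policy: returns the EMB share of HBM.

    Hypothetical heuristic: give more memory to the cache that misses more.
    """
    emb_miss = 1.0 - s.emb_hit_rate
    kv_miss = 1.0 - s.kv_hit_rate
    total = emb_miss + kv_miss
    return 0.5 if total == 0 else emb_miss / total


class ResidualAdapter:
    """Small online correction added to the base decision (assumed rule, not PPO)."""

    def __init__(self, step=0.01, limit=0.1):
        self.delta = 0.0
        self.step = step
        self.limit = limit
        self._direction = 1.0
        self._last_latency = None

    def update(self, p99_latency_ms):
        # Reverse direction if latency got worse since the previous update.
        if self._last_latency is not None and p99_latency_ms > self._last_latency:
            self._direction = -self._direction
        self._last_latency = p99_latency_ms
        self.delta = clamp(self.delta + self._direction * self.step,
                           -self.limit, self.limit)


class BurstDetector:
    """Flags a burst when arrivals exceed `factor` times their running average."""

    def __init__(self, factor=2.0, alpha=0.2):
        self.factor = factor
        self.alpha = alpha
        self.ewma = None

    def is_burst(self, arrival_rate):
        if self.ewma is None:
            self.ewma = arrival_rate
            return False
        burst = arrival_rate > self.factor * self.ewma
        if not burst:
            # Only track the baseline outside bursts so spikes do not inflate it.
            self.ewma = (1 - self.alpha) * self.ewma + self.alpha * arrival_rate
        return burst


def decide_ratio(s, adapter, detector):
    """Compose the three layers into one EMB-share decision in [0, 1]."""
    base = base_policy(s)
    if detector.is_burst(s.arrival_rate):
        # Burst-aware recovery (assumed behavior): ignore the residual during a
        # spike and fall back to the base prior.
        return clamp(base)
    return clamp(base + adapter.delta)
```

A serving loop would call `adapter.update(...)` with each new P99 measurement and `decide_ratio(...)` at every control interval. Keeping the residual bounded by `limit` means the online layer can only nudge the base decision, not replace it.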

  3. EMB‑KV‑aware scheduling – When a request arrives, the scheduler checks:

    • Whether its required KV entries already reside in the current GPU’s KV cache.
    • The locality of its embedding vectors (favoring GPUs where embeddings are hot).
    • Current load on each node.

    It then picks the best GPU, avoiding a costly host‑to‑device (H2D) refill that would otherwise stall the critical path.
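
The three checks above can be combined into a single routing score. The sketch below assumes a weighted linear combination, which the summary does not specify; `GpuNode`, `Request`, `score`, `route`, and the weights are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GpuNode:
    name: str
    kv_resident_sessions: set = field(default_factory=set)  # sessions with KV in HBM
    hot_embedding_ids: set = field(default_factory=set)     # embeddings in the EMB cache
    load: float = 0.0                                        # utilization in [0, 1]


@dataclass
class Request:
    session_id: str
    embedding_ids: list


def score(node, req, w_kv=1.0, w_emb=0.5, w_load=0.5):
    """Higher is better: reward KV residency and embedding locality, penalize load."""
    kv_hit = 1.0 if req.session_id in node.kv_resident_sessions else 0.0
    if req.embedding_ids:
        hits = sum(1 for e in req.embedding_ids if e in node.hot_embedding_ids)
        emb_locality = hits / len(req.embedding_ids)
    else:
        emb_locality = 0.0
    return w_kv * kv_hit + w_emb * emb_locality - w_load * node.load


def route(req, nodes):
    """Pick the node with the highest score for this request."""
    return max(nodes, key=lambda n: score(n, req))


nodes = [
    GpuNode("gpu-a", kv_resident_sessions={"u1"}, hot_embedding_ids={1, 2}, load=0.8),
    GpuNode("gpu-b", hot_embedding_ids={1, 2, 3, 4}, load=0.1),
]
returning = route(Request("u1", [1, 2, 3, 4]), nodes)  # KV already resident on gpu-a
fresh = route(Request("u2", [1, 2, 3, 4]), nodes)      # no KV anywhere
```

With these weights, a resident KV cache outweighs a moderately higher load, so a returning session stays where its KV entries live and avoids an H2D refill, while a fresh session goes to the lightly loaded node with better embedding locality.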

  4. Evaluation setup – Experiments use three production‑scale recommendation datasets, a 32‑node NVIDIA A100 cluster, and three workload patterns (steady, trending, burst). Baselines include static memory partitions, separate EMB/KV optimizers, and prior adaptive cache managers.

Results & Findings

| Metric | Static best | Prior adaptive | HELM |
|---|---|---|---|
| P99 latency reduction vs. static | – | 12–18 % | 24–38 % |
| SLO (99th‑percentile) satisfaction | 70–85 % | 80–92 % | 93.5–99.6 % |
| Throughput impact | baseline | ~‑2 % | ~0 % (unchanged) |
| Decision latency (controller) | N/A | N/A | ≈32 µs |
| Memory‑ratio optimality gap | 0.05–0.07 | 0.03–0.04 | 0.024–0.029 |

  • The optimal EMB/KV split can swing by up to 0.35 (35 % of HBM) when moving from a steady to a bursty regime; HELM tracks this shift in real time.
  • Naïve reallocation (e.g., moving memory on the fly without scheduling) caused P99 violations in >40 % of bursts; HELM’s joint scheduler eliminated those violations.
  • Even under extreme bursts, the burst‑aware recovery module kept latency spikes bounded, allowing the system to “snap back” to the optimal ratio within a few milliseconds.
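
To put the 0.35 swing in concrete terms, the sketch below converts an EMB ratio into byte budgets. The 80 GiB cache budget is an assumption (the summary does not say which A100 variant was used, and in practice model weights and activations also occupy HBM), and the 0.60 and 0.25 endpoints are hypothetical values chosen only because they differ by 0.35.

```python
GIB = 1024 ** 3


def partition_bytes(cache_budget_bytes, emb_ratio):
    """Split a fixed cache budget into (EMB bytes, KV bytes) for a given EMB ratio."""
    if not 0.0 <= emb_ratio <= 1.0:
        raise ValueError("emb_ratio must be in [0, 1]")
    emb = int(cache_budget_bytes * emb_ratio)
    kv = cache_budget_bytes - emb  # the two caches always sum to the budget
    return emb, kv


budget = 80 * GIB                               # assumed budget, not from the paper
steady_emb, steady_kv = partition_bytes(budget, 0.60)  # hypothetical steady split
bursty_emb, bursty_kv = partition_bytes(budget, 0.25)  # hypothetical bursty split
moved_gib = (steady_emb - bursty_emb) / GIB     # memory that changes owner
```

Under these assumptions roughly 28 GiB would have to change hands between the two caches, which suggests why reallocating memory without coordinating request routing can be costly during bursts.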

Practical Implications

  • Deployable on existing GPU clusters – HELM runs as a thin runtime layer on top of standard inference frameworks (e.g., TensorRT, PyTorch) and requires only metric hooks (cache hit rates, request timestamps).
  • Cost savings – By cutting tail latency on the same hardware, operators can serve more users per GPU or meet tighter SLAs without adding expensive nodes.
  • Generalizable pattern – The three‑layer PPO architecture can be repurposed for any scenario where multiple in‑memory structures compete for a fixed accelerator memory budget (e.g., transformer KV caches vs. activation buffers).
  • Improved user experience – Lower tail latency means faster recommendation refreshes, which can in turn improve click‑through rates and A/B‑test outcomes for product teams.
  • Simplified ops – The system automatically adapts to workload trends (e.g., seasonal traffic spikes) without manual retuning of cache sizes, reducing the operational burden on MLOps engineers.

Limitations & Future Work

  • GPU‑specific – HELM is evaluated on NVIDIA A100 GPUs; porting to other accelerators (e.g., AMD Instinct, Intel Xe GPUs) may require re‑training the base policy due to different memory hierarchies.
  • Model‑structure assumptions – The approach assumes a clear separation between embedding and KV caches; models that fuse these structures or use alternative memory layouts might need custom adaptations.
  • Training overhead – While the online residual adapter is lightweight, the initial offline training of the base policy still requires a representative workload trace, which may be costly for new services.
  • Scalability of state collection – Collecting fine‑grained cache statistics at sub‑millisecond granularity could become a bottleneck on extremely large clusters; future work could explore hierarchical or sampled telemetry.
  • Extending to multi‑tenant scenarios – The current scheduler treats all requests equally; incorporating priority or fairness guarantees across tenants is an open direction.

Overall, HELM showcases how a tightly coupled memory‑allocation and request‑routing strategy can unlock significant latency gains for generative recommender serving, offering a practical blueprint for production teams looking to squeeze more performance out of their GPU fleets.

Authors

  • Wenjun Yu
  • Shuguang Han
  • Amelie Chi Zhou

Paper Information

  • arXiv ID: 2605.04450v1
  • Categories: cs.DC, cs.IR, cs.LG
  • Published: May 6, 2026