[Paper] RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
Source: arXiv - 2605.07443v1
Overview
The paper introduces RcLLM, a distributed inference engine designed to make generative recommendation with large language models (LLMs) fast enough for real‑time production use. Conventional prefix KV caching can reuse only the leading portion of a prompt; RcLLM instead slices prompts into reusable blocks and stores them in a tiered, similarity‑aware cache. The authors report 1.31× to 9.51× lower Time‑to‑First‑Token (TTFT) than the best existing prefix‑caching systems, with recommendation quality essentially unchanged.
Key Contributions
- Beyond‑Prefix KV Caching: A novel caching scheme that extracts and reuses any contiguous block of a prompt (user history, item description, etc.), not just the initial prefix.
- Stratified Distributed Storage:
  - User‑history cache – tiny, fully replicated for instant lookup.
  - Item‑catalog cache – massive, sharded across nodes using similarity‑aware placement to keep related items together.
- Affinity‑Based Global Scheduler: Dynamically routes inference requests to the nodes that hold the most relevant cached blocks, maximizing data locality.
- Selective Attention Approximation: Skips redundant quadratic attention on cached blocks and applies a lightweight correction step to keep the model’s output faithful.
- Empirical Validation: On production‑scale datasets, RcLLM achieves 1.31×–9.51× lower Time‑to‑First‑Token (TTFT) than the best existing prefix‑caching systems, with virtually unchanged recommendation accuracy.
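To make the difference between prefix reuse and block reuse concrete, here is a minimal sketch. It is not RcLLM's implementation: the function names (`block_key`, `prefix_hits`, `block_hits`), the content-hash keying, and the example segments are all assumptions for illustration, and plain strings stand in for real KV tensors.

```python
import hashlib

def block_key(segment: str) -> str:
    """Content hash used as a cache key (an assumed keying scheme)."""
    return hashlib.sha256(segment.encode("utf-8")).hexdigest()

def prefix_hits(segments, cached_prompt):
    """Count leading segments shared with a previously cached prompt."""
    hits = 0
    for new, old in zip(segments, cached_prompt):
        if new != old:
            break  # prefix caching stops at the first mismatch
        hits += 1
    return hits

def block_hits(segments, block_cache):
    """Count segments whose key is cached, regardless of position."""
    return sum(1 for s in segments if block_key(s) in block_cache)

# Hypothetical prompts laid out as (user history, item description, instruction).
earlier = ["history of user A", "description of item 42", "Recommend the next item."]
current = ["history of user B", "description of item 42", "Recommend the next item."]

block_cache = {block_key(s) for s in earlier}
reused_by_prefix = prefix_hits(current, earlier)    # 0: the first segment differs
reused_by_block = block_hits(current, block_cache)  # 2: item text and instruction
```

Because the first segment differs, a prefix cache reuses nothing in this example, while a block cache still reuses the item description and the instruction. In a real model, KV entries computed for a block in one context are not identical to those computed in another, which is presumably what the correction step in the Selective Attention Approximation compensates for.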
Methodology
- Prompt Decomposition: Each recommendation request is broken into three logical segments – (a) the user’s interaction history, (b) the candidate item description, and (c) the generative instruction.
- Cache Construction:
  - The user‑history segment is small and highly reused, so it is replicated on every inference node.
  - The item segment is huge (millions of items). Items are embedded, clustered by similarity, and then sharded so that items that often appear together reside on the same node.
- KV‑Cache Retrieval: When a request arrives, the scheduler looks up the needed blocks in the distributed KV store. Cached blocks are inserted directly into the model’s attention memory, bypassing the expensive forward pass for those tokens.
- Selective Attention: For cached blocks the model skips the full self‑attention matrix (O(n²) cost). Instead, it computes a cheap “correction” attention only on the boundary between cached and new tokens, ensuring the context is still correctly integrated.
- Global Scheduling: An affinity‑based router monitors cache hit rates and moves hot items between shards to keep locality high, reducing cross‑node communication.
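The cost saving from reusing cached keys and values can be illustrated with a toy single-head, single-layer attention in pure Python. This is a sketch under stated assumptions, not the paper's kernel: the helper names and the synthetic vectors are hypothetical, and the boundary "correction" attention described above is not reproduced here.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention output for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def full_causal_attention(qs, ks, vs):
    """Recompute every position: n queries, O(n^2) scores overall."""
    return [attend(qs[i], ks[:i + 1], vs[:i + 1]) for i in range(len(qs))]

def attention_with_cached_kv(new_qs, cached_ks, cached_vs, new_ks, new_vs):
    """Only the new tokens issue queries; cached keys/values are reused as-is."""
    ks, vs = cached_ks + new_ks, cached_vs + new_vs
    offset = len(cached_ks)
    return [attend(q, ks[:offset + i + 1], vs[:offset + i + 1])
            for i, q in enumerate(new_qs)]

def toy_vectors(n, dim, seed):
    """Deterministic synthetic vectors (purely illustrative data)."""
    return [[math.sin(seed + 1.7 * i + 0.3 * d) for d in range(dim)]
            for i in range(n)]

qs, ks, vs = toy_vectors(6, 4, 0.1), toy_vectors(6, 4, 0.5), toy_vectors(6, 4, 0.9)
n_cached = 4  # pretend the first four tokens belong to cached blocks

reference = full_causal_attention(qs, ks, vs)[n_cached:]
fast = attention_with_cached_kv(qs[n_cached:], ks[:n_cached], vs[:n_cached],
                                ks[n_cached:], vs[n_cached:])
```

In this toy setting the cached keys and values are exactly the ones a full pass would produce, so `fast` matches `reference` while issuing queries for only the two new tokens. With beyond-prefix reuse the cached entries were computed in a different context, so the match is no longer exact, which is why the method pairs the shortcut with a correction step.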
All of this is orchestrated as a micro‑service that can be dropped into existing LLM serving stacks (e.g., TensorRT‑LLM, vLLM) with minimal code changes.
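The affinity-based routing step can be sketched as follows. This is an assumed scoring rule, not the scheduler from the paper: the `Node` class, the `route` function, the block identifiers, and the queue-length tie-break are all hypothetical. Only item blocks are considered, since the user-history cache is replicated on every node.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

@dataclass
class Node:
    name: str
    cached_blocks: Set[str] = field(default_factory=set)
    queued_requests: int = 0

def route(request_blocks: Iterable[str], nodes: List[Node]) -> Node:
    """Pick the node holding the most requested blocks; break ties by shorter queue."""
    needed = set(request_blocks)

    def affinity(node: Node):
        hits = len(needed & node.cached_blocks)
        return (hits, -node.queued_requests)

    return max(nodes, key=affinity)

# Hypothetical cluster state.
nodes = [
    Node("node-a", {"item:1", "item:7"}, queued_requests=2),
    Node("node-b", {"item:1", "item:2", "item:3"}, queued_requests=5),
    Node("node-c", {"item:9"}, queued_requests=0),
]

chosen = route(["item:1", "item:2", "item:8"], nodes)  # node-b holds two of the three
```

A production scheduler would also need to weigh load more carefully, since always favoring the node with the most hits can overload nodes that hold popular items.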
Results & Findings
| Metric | Baseline (Prefix Cache) | RcLLM | Speed‑up / Change |
|---|---|---|---|
| TTFT (average) | 120 ms | 13 ms – 92 ms | 1.31× – 9.51× |
| Top‑K Recommendation Accuracy (HR@10) | 0.742 | 0.739 | ≈ 0.4 % drop |
| Cache Hit Ratio (user‑history) | 68 % | 100 % (replicated) | – |
| Cache Hit Ratio (item) | 22 % | 55 % (similarity‑aware sharding) | – |
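As a quick sanity check on the table, the ratios implied by its endpoints and the relative accuracy change can be recomputed directly. The variable names are illustrative; the numbers are taken from the table above.

```python
baseline_ttft_ms = 120.0
rcllm_ttft_best_ms, rcllm_ttft_worst_ms = 13.0, 92.0

# Speed-up implied by the two ends of the RcLLM TTFT range.
speedup_max = baseline_ttft_ms / rcllm_ttft_best_ms
speedup_min = baseline_ttft_ms / rcllm_ttft_worst_ms

# Relative change in HR@10.
hr_baseline, hr_rcllm = 0.742, 0.739
relative_drop_pct = (hr_baseline - hr_rcllm) / hr_baseline * 100

print(f"{speedup_min:.2f}x to {speedup_max:.2f}x")
print(f"{relative_drop_pct:.2f}% relative drop in HR@10")
```

The endpoints give roughly 1.30× and 9.23×, close to but not identical to the reported 1.31× to 9.51× range. One plausible reading (an assumption, not stated in the summary) is that the reported range was measured per workload rather than derived from the single average baseline shown in the table.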
Key takeaways
- Latency: The biggest win comes from eliminating repeated attention over long user histories and item texts.
- Accuracy: The selective attention correction keeps the generative output within the noise margin of the baseline.
- Scalability: The system scales linearly with catalog size because item shards are added without reshuffling the entire cache.
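The similarity-aware placement behind the item shards can be sketched with a nearest-centroid rule. This is a stand-in technique chosen for illustration, since the summary does not specify the clustering method: the function names, the two-dimensional embeddings, and the shard centroids are all made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_shards(item_embeddings, shard_centroids):
    """Place each item on the shard whose centroid is most similar to it."""
    placement = {}
    for item_id, embedding in item_embeddings.items():
        placement[item_id] = max(
            shard_centroids,
            key=lambda shard: cosine(embedding, shard_centroids[shard]),
        )
    return placement

# Hypothetical item embeddings and shard centroids.
items = {
    "sneaker-red": [0.9, 0.1],
    "sneaker-blue": [0.8, 0.2],
    "blender": [0.1, 0.9],
    "toaster": [0.2, 0.8],
}
centroids = {"shard-0": [1.0, 0.0], "shard-1": [0.0, 1.0]}

placement = assign_shards(items, centroids)
```

Under this rule, similar items land on the same shard, and adding a shard moves only the items that are closer to the new centroid than to their current one, so the rest of the cache stays in place.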
Practical Implications
- Real‑Time Personalization: E‑commerce and streaming platforms can now serve LLM‑generated product or content recommendations within the sub‑100 ms window required for interactive UI experiences.
- Cost Efficiency: By reusing KV blocks, GPU compute per request drops dramatically, lowering inference cost on cloud GPU fleets.
- Plug‑and‑Play Deployment: RcLLM’s architecture is compatible with existing serving frameworks, meaning teams can adopt it without a full rewrite of their recommendation pipeline.
- Extensibility: The block‑level caching idea can be applied to other LLM‑driven services that involve repetitive context, such as code completion with project‑wide imports or chatbots with long conversation histories.
Limitations & Future Work
- Cold‑Start Items: New items that haven’t been embedded and sharded yet miss the cache, incurring full attention cost until they become hot.
- Cache Management Overhead: The affinity scheduler adds bookkeeping traffic; in extremely high‑throughput scenarios this could become a bottleneck.
- Model‑Specific Tuning: The selective attention correction was tuned for decoder‑only transformers; adapting it to encoder‑decoder or retrieval‑augmented models may need extra research.
- Future Directions: The authors suggest exploring hierarchical caching (e.g., caching at the phrase level), integrating learned cache replacement policies, and extending the system to multi‑modal recommendation (text + image).
Authors
- Zhan Zhao
- Yuxin Wang
- Amelie Chi Zhou
Paper Information
- arXiv ID: 2605.07443v1
- Categories: cs.DC
- Published: May 8, 2026