[Paper] RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

Published: May 8, 2026 at 04:47 AM EDT

Source: arXiv - 2605.07443v1

Overview

The paper introduces RcLLM, a distributed inference engine that makes generative recommendation with large language models (LLMs) fast enough for real‑time production. By moving beyond the classic “prefix‑KV caching” trick, RcLLM slices prompts into reusable blocks and stores them in a tiered, similarity‑aware cache, cutting Time‑to‑First‑Token (TTFT) by 1.31×–9.51× relative to prefix‑caching baselines while keeping recommendation quality nearly intact.

Key Contributions

Beyond‑Prefix KV Caching: A novel caching scheme that extracts and reuses any contiguous block of a prompt (user history, item description, etc.), not just the initial prefix.
Stratified Distributed Storage:
- User‑history cache – tiny, fully replicated for instant lookup.
- Item‑catalog cache – massive, sharded across nodes using similarity‑aware placement to keep related items together.
Affinity‑Based Global Scheduler: Dynamically routes inference requests to the nodes that hold the most relevant cached blocks, maximizing data locality.
Selective Attention Approximation: Skips redundant quadratic attention on cached blocks and applies a lightweight correction step to keep the model’s output faithful.
Empirical Validation: On production‑scale datasets, RcLLM reduces Time‑to‑First‑Token (TTFT) by 1.31×–9.51× relative to the best existing prefix‑caching systems, with virtually unchanged recommendation accuracy.

Methodology

Prompt Decomposition: Each recommendation request is broken into three logical segments – (a) the user’s interaction history, (b) the candidate item description, and (c) the generative instruction.
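
As a rough illustration of the decomposition step, the sketch below represents one request as three labeled segments. The `Segment` class, the `decompose` function, and the sample strings are hypothetical names chosen for this example; the paper's actual data structures are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    kind: str  # "history", "item", or "instruction"
    text: str


def decompose(history: list[str], item_description: str, instruction: str) -> list[Segment]:
    """Split one recommendation request into its three logical segments."""
    return [
        Segment("history", " | ".join(history)),
        Segment("item", item_description),
        Segment("instruction", instruction),
    ]


# Hypothetical request used only for illustration.
segments = decompose(
    history=["viewed: trail shoes", "bought: rain jacket"],
    item_description="Lightweight 2-person tent, 1.8 kg",
    instruction="Would this user like the item? Answer yes or no.",
)
for seg in segments:
    print(seg.kind, "->", seg.text)
```

Keeping the segments separate is what makes each one independently cacheable later on.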
Cache Construction:
- The user‑history segment is small and highly reused, so it is replicated on every inference node.
- The item segment is huge (millions of items). Items are embedded, clustered by similarity, and then sharded so that items that often appear together reside on the same node.
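
One simple way to picture similarity‑aware placement is nearest‑centroid assignment: each shard owns a centroid in embedding space, and an item goes to the shard whose centroid it most resembles. This is only a sketch under that assumption; the toy 2‑D embeddings, the fixed centroids, and the `assign_shards` helper are illustrative and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_shards(item_embeddings, centroids):
    """Map each item id to the index of its most similar centroid (one centroid per shard)."""
    placement = {}
    for item_id, emb in item_embeddings.items():
        placement[item_id] = max(
            range(len(centroids)), key=lambda s: cosine(emb, centroids[s])
        )
    return placement


# Toy embeddings: camping gear points one way, audio gear the other (assumed data).
items = {
    "tent": [0.9, 0.1],
    "sleeping_bag": [0.8, 0.2],
    "headphones": [0.1, 0.9],
    "speaker": [0.2, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]  # assumed output of a prior clustering step

placement = assign_shards(items, centroids)
print(placement)
```

In this toy run the two camping items land on one shard and the two audio items on the other, so a request that mentions related items is more likely to find all of its blocks on one node.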
KV‑Cache Retrieval: When a request arrives, the scheduler looks up the needed blocks in the distributed KV store. Cached blocks are inserted directly into the model’s attention memory, bypassing the expensive forward pass for those tokens.
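
The difference between prefix reuse and block‑level reuse can be sketched with a content‑addressed lookup. The code below is a simplified model, not RcLLM's implementation: `BlockKVCache`, `plan_prefill`, and `prefix_only_reuse` are hypothetical names, token blocks are plain lists of ints, and placeholder strings stand in for real KV tensors.

```python
import hashlib


def block_key(token_ids):
    """Content hash of a token block, used as the cache key."""
    return hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()


class BlockKVCache:
    def __init__(self):
        self._store = {}

    def put(self, token_ids, kv):
        self._store[block_key(token_ids)] = kv

    def get(self, token_ids):
        return self._store.get(block_key(token_ids))


def plan_prefill(blocks, cache):
    """Return (reused, to_compute): indices of blocks found in / missing from the cache."""
    reused, to_compute = [], []
    for i, block in enumerate(blocks):
        (reused if cache.get(block) is not None else to_compute).append(i)
    return reused, to_compute


def prefix_only_reuse(blocks, cache):
    """Simplified prefix caching: reuse stops at the first uncached block."""
    count = 0
    for block in blocks:
        if cache.get(block) is None:
            break
        count += 1
    return count


cache = BlockKVCache()
item_block = [101, 102, 103]
instruction_block = [900, 901]
cache.put(item_block, "kv-for-item")  # placeholder for a KV tensor
cache.put(instruction_block, "kv-for-instruction")

new_history = [7, 8, 9]  # unseen history, so the prompt's prefix misses
blocks = [new_history, item_block, instruction_block]

print(prefix_only_reuse(blocks, cache))
print(plan_prefill(blocks, cache))
```

With an unseen history at the front, the prefix‑only view reuses nothing, while the block‑level view still reuses the item and instruction blocks. Reusing a block outside the context it was computed in is not exact, which is why the selective attention step described next applies a correction.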
Selective Attention: For cached blocks, the model skips recomputing full self‑attention, whose cost grows as O(n²) in sequence length. Instead, it computes a cheap “correction” attention only on the boundary between cached and new tokens, ensuring the context is still correctly integrated.
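
To get a feel for how much attention work caching can remove, the sketch below counts causal query‑key pairs during prefill, treating cached tokens as free. It is a back‑of‑the‑envelope model only: the `attention_pairs` helper and the segment lengths are assumptions, and the cost of the correction step is not modeled because this summary does not specify it.

```python
def attention_pairs(cached_mask):
    """Count causal query-key pairs computed during prefill.

    cached_mask[i] is True when token i's KV entries come from the cache,
    so no query is issued for it. An uncached token at position i attends
    to positions 0..i, which is i + 1 pairs.
    """
    return sum(i + 1 for i, cached in enumerate(cached_mask) if not cached)


n = 1000
full = attention_pairs([False] * n)  # nothing cached: n * (n + 1) / 2 pairs

# Assumed layout: 200 new history tokens, 700 cached item tokens,
# 100 new instruction tokens.
mask = [False] * 200 + [True] * 700 + [False] * 100
partial = attention_pairs(mask)

print(full, partial, round(full / partial, 2))
```

Under these assumed lengths the cached middle block removes most of the quadratic work even though the prompt's prefix is new, which is the case plain prefix caching cannot exploit.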
Global Scheduling: An affinity‑based router monitors cache hit rates and moves hot items between shards to keep locality high, reducing cross‑node communication.
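
A minimal sketch of affinity‑based routing, assuming the scheduler knows which block ids each node currently holds: send the request to the node with the largest overlap. The `route` function, node names, and block ids are hypothetical, and the sketch ignores load balancing and the hot‑item migration mentioned above.

```python
def route(request_blocks, node_blocks):
    """Pick the node that already holds the most of the request's blocks.

    Ties are broken by node name so the choice is deterministic.
    """
    needed = set(request_blocks)
    return max(sorted(node_blocks), key=lambda node: len(needed & node_blocks[node]))


# Assumed cluster state: which cached item blocks live on which node.
node_blocks = {
    "node-a": {"item:1", "item:2"},
    "node-b": {"item:7", "item:8", "item:9"},
}
request = ["item:7", "item:9", "item:2"]

print(route(request, node_blocks))
```

Here `node-b` holds two of the three requested blocks, so only one block has to be fetched across nodes or recomputed.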

All of this is orchestrated as a micro‑service that can be dropped into existing LLM serving stacks (e.g., TensorRT‑LLM, vLLM) with minimal code changes.

Results & Findings

Metric	Baseline (Prefix Cache)	RcLLM	Change
TTFT (average)	120 ms	92 ms – 13 ms	1.31× – 9.51× speed‑up
Top‑K Recommendation Accuracy (HR@10)	0.742	0.739	≈ 0.4 % drop
Cache Hit Ratio (user‑history)	68 %	100 % (replicated)	–
Cache Hit Ratio (item)	22 %	55 % (similarity‑aware sharding)	–
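
The table's figures can be cross‑checked with a few lines of arithmetic. The numbers below are copied from the table; the variable names are illustrative.

```python
baseline_ms = 120.0

# TTFT implied by each end of the reported speed-up range.
ttft = {speedup: baseline_ms / speedup for speedup in (1.31, 9.51)}
for speedup, ms in ttft.items():
    print(f"{speedup}x speed-up -> {ms:.1f} ms")

# Relative HR@10 change between baseline (0.742) and RcLLM (0.739).
hr_drop = (0.742 - 0.739) / 0.742
print(f"HR@10 relative drop: {hr_drop:.2%}")
```

The 1.31× end of the range corresponds to roughly 92 ms and the 9.51× end to roughly 13 ms, and the accuracy change works out to about 0.4 % relative to the baseline.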

Key takeaways

Latency: The biggest win comes from eliminating repeated attention over long user histories and item texts.
Accuracy: The selective attention correction keeps the generative output within the noise margin of the baseline.
Scalability: The system scales linearly with catalog size because item shards are added without reshuffling the entire cache.

Practical Implications

Real‑Time Personalization: E‑commerce and streaming platforms can now serve LLM‑generated product or content recommendations within the sub‑100 ms window required for interactive UI experiences.
Cost Efficiency: By reusing KV blocks, GPU compute per request drops dramatically, lowering inference cost on cloud GPU fleets.
Plug‑and‑Play Deployment: RcLLM’s architecture is compatible with existing serving frameworks, meaning teams can adopt it without a full rewrite of their recommendation pipeline.
Extensibility: The block‑level caching idea can be applied to other LLM‑driven services that involve repetitive context, such as code completion with project‑wide imports or chatbots with long conversation histories.

Limitations & Future Work

Cold‑Start Items: New items that haven’t been embedded and sharded yet miss the cache, incurring full attention cost until they become hot.
Cache Management Overhead: The affinity scheduler adds bookkeeping traffic; in extremely high‑throughput scenarios this could become a bottleneck.
Model‑Specific Tuning: The selective attention correction was tuned for decoder‑only transformers; adapting it to encoder‑decoder or retrieval‑augmented models may need extra research.
Future Directions: The authors suggest exploring hierarchical caching (e.g., caching at the phrase level), integrating learned cache replacement policies, and extending the system to multi‑modal recommendation (text + image).

Authors

Zhan Zhao
Yuxin Wang
Amelie Chi Zhou

Paper Information

arXiv ID: 2605.07443v1
Categories: cs.DC
Published: May 8, 2026
