[Paper] RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

Published: January 4, 2026 at 08:34 PM EST
4 min read
Source: arXiv - 2601.01712v1

Overview

The paper introduces RelayGR, a production‑grade system that lets modern generative recommendation models consume much longer user‑behavior histories without breaking the tight latency budgets of real‑time ranking. By pre‑computing the “user prefix” of a sequence and keeping it hot in high‑bandwidth memory (HBM), RelayGR can serve longer inputs and boost throughput while staying within the strict P99 latency service‑level objectives (SLOs) that govern large‑scale recommendation pipelines.

Key Contributions

  • Cross‑stage prefix pre‑inference: Shows that most tokens in a generative recommendation (GR) sequence are independent of the candidate items, enabling a reusable prefix that can be computed ahead of the final ranking stage (see the sketch after this list).
  • Sequence‑aware trigger: A lightweight admission controller that decides, per request, whether to pre‑infer the prefix based on cache pressure and expected latency impact.
  • Affinity‑aware router: Guarantees that the pre‑inferred prefix and the subsequent ranking request hit the same server instance, eliminating costly remote fetches.
  • Memory‑aware expander: Leverages server‑local DRAM as a secondary cache to capture short‑term reuse across requests while keeping the primary KV cache resident in HBM.
  • Industrial‑scale implementation: Deployed on Huawei Ascend NPUs, demonstrating up to 1.5× longer effective sequence lengths and up to a 3.6× gain in SLO‑compliant throughput.
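
To make the cross‑stage idea concrete, here is a minimal Python sketch rather than the paper's implementation: PREFIX_CACHE, compute_prefix_state, and the toy scoring are hypothetical stand‑ins for HBM‑resident KV tensors and the real GR model, but the relay‑race flow is the same, with the candidate‑agnostic prefix computed once ahead of ranking and reused for every candidate.

```python
# Sketch of cross-stage prefix pre-inference ("relay-race" reuse).
# PREFIX_CACHE, compute_prefix_state, and the scoring below are illustrative
# stand-ins, not RelayGR's API; a real deployment would keep per-layer KV
# tensors for the user-behavior prefix resident in NPU HBM.

import hashlib
from typing import Dict, List, Tuple

# (user_id, prefix_digest) -> precomputed prefix state
PREFIX_CACHE: Dict[Tuple[str, str], List[float]] = {}


def prefix_key(user_id: str, behavior_tokens: List[int]) -> Tuple[str, str]:
    """Identify a candidate-agnostic prefix by user and token content."""
    digest = hashlib.sha1(",".join(map(str, behavior_tokens)).encode()).hexdigest()
    return (user_id, digest)


def compute_prefix_state(behavior_tokens: List[int]) -> List[float]:
    """Stand-in for the expensive forward pass over the user-behavior prefix."""
    return [float(tok) / (i + 1) for i, tok in enumerate(behavior_tokens)]


def pre_infer(user_id: str, behavior_tokens: List[int]) -> None:
    """Leg 1 of the relay: run ahead of ranking and park the prefix in cache."""
    key = prefix_key(user_id, behavior_tokens)
    if key not in PREFIX_CACHE:
        PREFIX_CACHE[key] = compute_prefix_state(behavior_tokens)


def rank(user_id: str, behavior_tokens: List[int], candidates: List[int]) -> List[float]:
    """Leg 2 of the relay: only candidate-specific work stays on the critical path."""
    key = prefix_key(user_id, behavior_tokens)
    prefix = PREFIX_CACHE.get(key)
    if prefix is None:                        # cache miss: fall back to full inference
        prefix = compute_prefix_state(behavior_tokens)
    base = sum(prefix)
    return [base + 0.01 * c for c in candidates]  # toy candidate-specific scoring


pre_infer("user42", [5, 9, 13, 21])
print(rank("user42", [5, 9, 13, 21], candidates=[101, 102, 103]))
```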

Methodology

  1. Problem Framing – The authors profile a typical multi‑stage recommendation flow (retrieval → pre‑processing → fine‑grained ranking) and identify that the ranking stage has only a few tens of milliseconds to run a GR model, forcing a hard cap on input length.
  2. Prefix Isolation – By analyzing token dependencies, they separate the user‑behavior prefix (candidate‑agnostic) from the candidate‑specific suffix. The prefix can be computed once per user session and reused for every candidate examined later.
  3. System Design
    • Trigger monitors request rates and cache occupancy; it flags “at‑risk” requests that would exceed the latency budget if the full sequence were processed on‑the‑fly.
    • Router uses a consistent‑hashing scheme to steer both the pre‑inference job and the later ranking request to the same NPU instance, ensuring the KV cache stays local (see the first sketch after this list).
    • Expander maintains a DRAM‑resident copy of recently used prefixes, allowing fast warm‑up for new ranking instances without re‑computing the prefix (see the second sketch after this list).
  4. Implementation – The pipeline is built on top of the Ascend NPU runtime, exploiting its HBM for the KV cache and integrating with the existing recommendation service stack.
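
The affinity‑aware router can be approximated with an ordinary consistent‑hash ring. The first sketch below is hypothetical (the instance names, virtual‑node count, and route helper are invented for illustration), but it shows the property the design relies on: the pre‑inference job and the later ranking request for the same user land deterministically on the same serving instance, so the cached prefix never has to leave local HBM.

```python
# Sketch of affinity-aware routing with a consistent-hash ring.
# Instance names and parameters are illustrative, not RelayGR's actual router.

import bisect
import hashlib
from typing import List


class ConsistentHashRouter:
    def __init__(self, instances: List[str], vnodes: int = 64):
        self._ring = []  # sorted list of (hash, instance)
        for inst in instances:
            for v in range(vnodes):  # virtual nodes smooth the load distribution
                self._ring.append((self._hash(f"{inst}#{v}"), inst))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id: str) -> str:
        """Both the pre-inference job and the ranking request hash the same key,
        so they land on the same instance and the prefix KV cache stays local."""
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._keys)
        return self._ring[idx][1]


router = ConsistentHashRouter(["npu-0", "npu-1", "npu-2", "npu-3"])
uid = "user42"
assert router.route(uid) == router.route(uid)  # affinity: same instance every time
print("pre-inference and ranking both go to:", router.route(uid))
```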
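
The memory‑aware expander can likewise be pictured as a two‑tier cache: a capacity‑limited "HBM" tier backed by a larger DRAM‑resident LRU. The second sketch below is a toy under that assumption; the tier sizes, method names, and promotion/demotion policy are invented for illustration and are not RelayGR's actual memory manager.

```python
# Toy two-tier prefix cache: a small "HBM" tier backed by a larger DRAM LRU.
# Tier sizes, names, and the promotion/demotion policy are illustrative only.

from collections import OrderedDict
from typing import Optional


class TwoTierPrefixCache:
    def __init__(self, hbm_capacity: int = 2, dram_capacity: int = 8):
        self.hbm = OrderedDict()   # hot tier: fast, scarce (models HBM)
        self.dram = OrderedDict()  # warm tier: larger, slower (models server DRAM)
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def put(self, key: str, prefix_state: list) -> None:
        """New prefixes land in HBM; the LRU victim is demoted to DRAM, not dropped."""
        self.hbm[key] = prefix_state
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_capacity:
            victim, state = self.hbm.popitem(last=False)  # evict least recently used
            self.dram[victim] = state
            self.dram.move_to_end(victim)
            while len(self.dram) > self.dram_capacity:
                self.dram.popitem(last=False)             # DRAM overflow: recompute later

    def get(self, key: str) -> Optional[list]:
        """An HBM hit is the fast path; a DRAM hit is promoted back into HBM (warm-up)."""
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.dram:
            state = self.dram.pop(key)
            self.put(key, state)                          # promote without recomputing
            return state
        return None                                       # full recompute needed


cache = TwoTierPrefixCache()
cache.put("user_a", [0.1, 0.2])
cache.put("user_b", [0.3])
cache.put("user_c", [0.5])                  # "user_a" is demoted from HBM to DRAM
assert cache.get("user_a") == [0.1, 0.2]    # served from DRAM, promoted back to HBM
```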

Results & Findings

| Metric | Baseline (no RelayGR) | RelayGR |
| --- | --- | --- |
| Max usable sequence length (tokens) | ~200 | ~300 (≈ 1.5×) |
| P99 ranking latency (ms) | 28 | ≤ 28 (unchanged) |
| SLO‑compliant throughput (queries/s) | 1.0× (baseline) | up to 3.6× |
| KV‑cache hit rate (prefix) | 0% | 92% (average) |

  • Latency stays within the same P99 bound because the heavy prefix work is moved off the critical path.
  • Throughput scales dramatically as the ranking stage now processes far fewer tokens per request.
  • Cache efficiency is high thanks to the affinity‑aware routing; most ranking requests find their prefix already resident in HBM.

Practical Implications

  • Longer user histories: Developers can feed richer behavioral context into generative recommenders, improving personalization without sacrificing latency.
  • Cost‑effective scaling: By reusing prefixes, the system reduces compute cycles per query, allowing existing hardware to handle higher QPS or to lower power consumption.
  • Simplified model engineering: Teams can keep a single, large GR model rather than maintaining separate “short‑sequence” variants for production.
  • Generalizable pattern: The relay‑race inference concept can be applied to other latency‑sensitive generative tasks (e.g., next‑word prediction, code completion) where a large portion of the input is static across downstream calls.

Limitations & Future Work

  • Cache footprint: Even with HBM, the KV cache for millions of active users can exceed memory limits; the current trigger only approximates optimal eviction.
  • Cold‑start latency: First‑time users still incur the full inference cost; the paper suggests but does not implement a warm‑up predictor.
  • Hardware dependence: The solution is tightly coupled to Ascend NPUs and their HBM architecture; porting to GPUs or CPUs may require redesign of the memory‑aware expander.
  • Extending beyond recommendation: Future research could explore applying the relay‑race paradigm to multimodal generative models or to scenarios with dynamic candidate sets that change rapidly.

RelayGR demonstrates that clever system‑level engineering—splitting static and dynamic parts of a generative model’s input and keeping the static part hot in memory—can unlock the full potential of long‑sequence recommendation models in production. For developers building real‑time AI services, the paper offers a concrete blueprint for balancing model expressiveness with the hard latency guarantees that users expect.

Authors

  • Jiarui Wang
  • Huichao Chai
  • Yuanhang Zhang
  • Zongjin Zhou
  • Wei Guo
  • Xingkun Yang
  • Qiang Tang
  • Bo Pan
  • Jiawei Zhu
  • Ke Cheng
  • Yuting Yan
  • Shulan Wang
  • Yingjie Zhu
  • Zhengfan Yuan
  • Jiaqi Huang
  • Yuhan Zhang
  • Xiaosong Sun
  • Zhinan Zhang
  • Hong Zhu
  • Yongsheng Zhang
  • Tiantian Dong
  • Zhong Xiao
  • Deliang Liu
  • Chengzhou Lu
  • Yuan Sun
  • Zhiyuan Chen
  • Xinming Han
  • Zaizhu Liu
  • Yaoyuan Wang
  • Ziyang Zhang
  • Yong Liu
  • Jinxin Xu
  • Yajing Sun
  • Zhoujun Yu
  • Wenting Zhou
  • Qidong Zhang
  • Zhengyong Zhang
  • Zhonghai Gu
  • Yibo Jin
  • Yongxiang Feng
  • Pengfei Zuo

Paper Information

  • arXiv ID: 2601.01712v1
  • Categories: cs.DC, cs.AI, cs.LG
  • Published: January 5, 2026