[Paper] RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

Published: January 4, 2026 at 08:34 PM EST
4 min read
Source: arXiv - 2601.01712v1

Overview

The paper introduces RelayGR, a production‑grade system that lets modern generative recommendation models consume much longer user‑behavior histories without breaking the tight latency budgets of real‑time ranking. By pre‑computing the “user prefix” of a sequence and keeping it hot in high‑bandwidth memory (HBM), RelayGR can serve longer inputs and boost throughput while staying within the strict P99 latency service‑level objectives (SLOs) that govern large‑scale recommendation pipelines.

Key Contributions

  • Cross‑stage prefix pre‑inference: Shows that most tokens in a generative recommendation (GR) sequence are independent of the candidate items, enabling a reusable prefix that can be computed ahead of the final ranking stage (see the sketch after this list).
  • Sequence‑aware trigger: A lightweight admission controller that decides, per request, whether to pre‑infer the prefix based on cache pressure and expected latency impact.
  • Affinity‑aware router: Guarantees that the pre‑inferred prefix and the subsequent ranking request hit the same server instance, eliminating costly remote fetches.
  • Memory‑aware expander: Leverages server‑local DRAM as a secondary cache to capture short‑term reuse across requests while keeping the primary KV cache resident in HBM.
  • Industrial‑scale implementation: Deployed on Huawei Ascend NPUs, demonstrating up to 1.5× longer effective sequence lengths and up to a 3.6× gain in SLO‑compliant throughput.
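
To make the cross‑stage idea concrete, here is a minimal Python sketch rather than the paper's implementation: PREFIX_CACHE, compute_prefix_state, and the toy scoring are hypothetical stand‑ins for HBM‑resident KV tensors and the real GR model, but the relay‑race flow is the same, with the candidate‑agnostic prefix computed once ahead of ranking and reused for every candidate.

```python
# Sketch of cross-stage prefix pre-inference ("relay-race" reuse).
# PREFIX_CACHE, compute_prefix_state, and the scoring below are illustrative
# stand-ins, not RelayGR's API; a real deployment would keep per-layer KV
# tensors for the user-behavior prefix resident in NPU HBM.

import hashlib
from typing import Dict, List, Tuple

# (user_id, prefix_digest) -> precomputed prefix state
PREFIX_CACHE: Dict[Tuple[str, str], List[float]] = {}


def prefix_key(user_id: str, behavior_tokens: List[int]) -> Tuple[str, str]:
    """Identify a candidate-agnostic prefix by user and token content."""
    digest = hashlib.sha1(",".join(map(str, behavior_tokens)).encode()).hexdigest()
    return (user_id, digest)


def compute_prefix_state(behavior_tokens: List[int]) -> List[float]:
    """Stand-in for the expensive forward pass over the user-behavior prefix."""
    return [float(tok) / (i + 1) for i, tok in enumerate(behavior_tokens)]


def pre_infer(user_id: str, behavior_tokens: List[int]) -> None:
    """Leg 1 of the relay: run ahead of ranking and park the prefix in cache."""
    key = prefix_key(user_id, behavior_tokens)
    if key not in PREFIX_CACHE:
        PREFIX_CACHE[key] = compute_prefix_state(behavior_tokens)


def rank(user_id: str, behavior_tokens: List[int], candidates: List[int]) -> List[float]:
    """Leg 2 of the relay: only candidate-specific work stays on the critical path."""
    key = prefix_key(user_id, behavior_tokens)
    prefix = PREFIX_CACHE.get(key)
    if prefix is None:                        # cache miss: fall back to full inference
        prefix = compute_prefix_state(behavior_tokens)
    base = sum(prefix)
    return [base + 0.01 * c for c in candidates]  # toy candidate-specific scoring


pre_infer("user42", [5, 9, 13, 21])
print(rank("user42", [5, 9, 13, 21], candidates=[101, 102, 103]))
```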

Methodology

  1. Problem Framing – The authors profile a typical multi‑stage recommendation flow (retrieval → pre‑processing → fine‑grained ranking) and identify that the ranking stage has only a few tens of milliseconds to run a GR model, forcing a hard cap on input length.
  2. Prefix Isolation – By analyzing token dependencies, they separate the user‑behavior prefix (candidate‑agnostic) from the candidate‑specific suffix. The prefix can be computed once per user session and reused for every candidate examined later.
  3. System Design
    • Trigger monitors request rates and cache occupancy; it flags “at‑risk” requests that would exceed the latency budget if the full sequence were processed on‑the‑fly.
    • Router uses a consistent‑hashing scheme to steer both the pre‑inference job and the later ranking request to the same NPU instance, ensuring the KV cache stays local (see the first sketch after this list).
    • Expander maintains a DRAM‑resident copy of recently used prefixes, allowing fast warm‑up for new ranking instances without re‑computing the prefix (see the second sketch after this list).
  4. Implementation – The pipeline is built on top of the Ascend NPU runtime, exploiting its HBM for the KV cache and integrating with the existing recommendation service stack.
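
The affinity‑aware router can be approximated with an ordinary consistent‑hash ring. The first sketch below is hypothetical (the instance names, virtual‑node count, and route helper are invented for illustration), but it shows the property the design relies on: the pre‑inference job and the later ranking request for the same user land deterministically on the same serving instance, so the cached prefix never has to leave local HBM.

```python
# Sketch of affinity-aware routing with a consistent-hash ring.
# Instance names and parameters are illustrative, not RelayGR's actual router.

import bisect
import hashlib
from typing import List


class ConsistentHashRouter:
    def __init__(self, instances: List[str], vnodes: int = 64):
        self._ring = []  # sorted list of (hash, instance)
        for inst in instances:
            for v in range(vnodes):  # virtual nodes smooth the load distribution
                self._ring.append((self._hash(f"{inst}#{v}"), inst))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id: str) -> str:
        """Both the pre-inference job and the ranking request hash the same key,
        so they land on the same instance and the prefix KV cache stays local."""
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._keys)
        return self._ring[idx][1]


router = ConsistentHashRouter(["npu-0", "npu-1", "npu-2", "npu-3"])
uid = "user42"
assert router.route(uid) == router.route(uid)  # affinity: same instance every time
print("pre-inference and ranking both go to:", router.route(uid))
```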
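
The memory‑aware expander can likewise be pictured as a two‑tier cache: a capacity‑limited "HBM" tier backed by a larger DRAM‑resident LRU. The second sketch below is a toy under that assumption; the tier sizes, method names, and promotion/demotion policy are invented for illustration and are not RelayGR's actual memory manager.

```python
# Toy two-tier prefix cache: a small "HBM" tier backed by a larger DRAM LRU.
# Tier sizes, names, and the promotion/demotion policy are illustrative only.

from collections import OrderedDict
from typing import Optional


class TwoTierPrefixCache:
    def __init__(self, hbm_capacity: int = 2, dram_capacity: int = 8):
        self.hbm = OrderedDict()   # hot tier: fast, scarce (models HBM)
        self.dram = OrderedDict()  # warm tier: larger, slower (models server DRAM)
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def put(self, key: str, prefix_state: list) -> None:
        """New prefixes land in HBM; the LRU victim is demoted to DRAM, not dropped."""
        self.hbm[key] = prefix_state
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_capacity:
            victim, state = self.hbm.popitem(last=False)  # evict least recently used
            self.dram[victim] = state
            self.dram.move_to_end(victim)
            while len(self.dram) > self.dram_capacity:
                self.dram.popitem(last=False)             # DRAM overflow: recompute later

    def get(self, key: str) -> Optional[list]:
        """An HBM hit is the fast path; a DRAM hit is promoted back into HBM (warm-up)."""
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.dram:
            state = self.dram.pop(key)
            self.put(key, state)                          # promote without recomputing
            return state
        return None                                       # full recompute needed


cache = TwoTierPrefixCache()
cache.put("user_a", [0.1, 0.2])
cache.put("user_b", [0.3])
cache.put("user_c", [0.5])                  # "user_a" is demoted from HBM to DRAM
assert cache.get("user_a") == [0.1, 0.2]    # served from DRAM, promoted back to HBM
```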

Results & Findings

| Metric | Baseline (no RelayGR) | RelayGR |
| --- | --- | --- |
| Max usable sequence length (tokens) | ~200 | ~300 (≈ 1.5×) |
| P99 ranking latency (ms) | 28 | ≤ 28 (unchanged) |
| SLO‑compliant throughput (queries/s) | 1.0× (baseline) | up to 3.6× |
| KV‑cache hit rate (prefix) | 0% | 92% (average) |

  • Latency stays within the same P99 bound because the heavy prefix work is moved off the critical path.
  • Throughput scales dramatically as the ranking stage now processes far fewer tokens per request.
  • Cache efficiency is high thanks to the affinity‑aware routing; most ranking requests find their prefix already resident in HBM.

Practical Implications

  • Longer user histories: Developers can feed richer behavioral context into generative recommenders, improving personalization without sacrificing latency.
  • Cost‑effective scaling: By reusing prefixes, the system reduces compute cycles per query, allowing existing hardware to handle higher QPS or to lower power consumption.
  • Simplified model engineering: Teams can keep a single, large GR model rather than maintaining separate “short‑sequence” variants for production.
  • Generalizable pattern: The relay‑race inference concept can be applied to other latency‑sensitive generative tasks (e.g., next‑word prediction, code completion) where a large portion of the input is static across downstream calls.

Limitations & Future Work

  • Cache footprint: Even with HBM, the KV cache for millions of active users can exceed memory limits; the current trigger only approximates optimal eviction.
  • Cold‑start latency: First‑time users still incur the full inference cost; the paper suggests but does not implement a warm‑up predictor.
  • Hardware dependence: The solution is tightly coupled to Ascend NPUs and their HBM architecture; porting to GPUs or CPUs may require redesign of the memory‑aware expander.
  • Extending beyond recommendation: Future research could explore applying the relay‑race paradigm to multimodal generative models or to scenarios with dynamic candidate sets that change rapidly.

RelayGR demonstrates that clever system‑level engineering—splitting static and dynamic parts of a generative model’s input and keeping the static part hot in memory—can unlock the full potential of long‑sequence recommendation models in production. For developers building real‑time AI services, the paper offers a concrete blueprint for balancing model expressiveness with the hard latency guarantees that users expect.

Authors

  • Jiarui Wang
  • Huichao Chai
  • Yuanhang Zhang
  • Zongjin Zhou
  • Wei Guo
  • Xingkun Yang
  • Qiang Tang
  • Bo Pan
  • Jiawei Zhu
  • Ke Cheng
  • Yuting Yan
  • Shulan Wang
  • Yingjie Zhu
  • Zhengfan Yuan
  • Jiaqi Huang
  • Yuhan Zhang
  • Xiaosong Sun
  • Zhinan Zhang
  • Hong Zhu
  • Yongsheng Zhang
  • Tiantian Dong
  • Zhong Xiao
  • Deliang Liu
  • Chengzhou Lu
  • Yuan Sun
  • Zhiyuan Chen
  • Xinming Han
  • Zaizhu Liu
  • Yaoyuan Wang
  • Ziyang Zhang
  • Yong Liu
  • Jinxin Xu
  • Yajing Sun
  • Zhoujun Yu
  • Wenting Zhou
  • Qidong Zhang
  • Zhengyong Zhang
  • Zhonghai Gu
  • Yibo Jin
  • Yongxiang Feng
  • Pengfei Zuo

Paper Information

  • arXiv ID: 2601.01712v1
  • Categories: cs.DC, cs.AI, cs.LG
  • Published: January 5, 2026