[Paper] RelayLLM: Efficient Reasoning via Collaborative Decoding
Source: arXiv - 2601.05167v1
Overview
The paper RelayLLM tackles a practical pain point in today’s AI pipelines: getting the deep reasoning power of large language models (LLMs) without paying their huge compute bill. By letting a lightweight small language model (SLM) “call” a big model only for the few tokens it genuinely cannot handle, the authors achieve near‑LLM performance while cutting inference compute by more than 98 % relative to a performance‑matched routing baseline.
Key Contributions
- Token‑level collaborative decoding – the SLM decides, on a per‑token basis, whether to generate the next token itself or to issue a special “relay” command that hands control to the LLM.
- Two‑stage training pipeline – a warm‑up phase followed by Group Relative Policy Optimization (GRPO) teaches the SLM to balance autonomy with strategic help‑seeking.
- Empirical validation on six reasoning benchmarks – RelayLLM reaches an average accuracy of 49.52 %, closing most of the gap between the SLM and the LLM.
- Extreme efficiency – the LLM is invoked for only 1.07 % of generated tokens, delivering a 98.2 % reduction in compute cost versus a naïve random router that matches performance.
- Generalizable framework – the relay mechanism can be plugged into any existing SLM/LLM pair without architectural changes to the models themselves.
Methodology
- Architecture – The system consists of an SLM (e.g., a 7B‑parameter model) and a much larger LLM (e.g., GPT‑3.5‑turbo). The SLM runs the main decoding loop. When it predicts a “relay token,” the decoder pauses, sends the current context to the LLM, and inserts the LLM’s next token(s) into the output stream.
- Training Phase 1: Warm‑up – Both models are first fine‑tuned on the target reasoning tasks using standard supervised learning, ensuring they can solve the problems independently.
- Training Phase 2: GRPO – The SLM’s policy (when to emit a relay token) is optimized with a reinforcement‑learning‑style objective. GRPO groups tokens into “critical” and “non‑critical” sets and rewards the SLM for:
- Correctly handling easy tokens on its own (reducing reliance on the LLM).
- Calling the LLM on truly hard tokens (improving overall answer quality).
The loss balances task accuracy, relay frequency, and a penalty for unnecessary LLM calls (an illustrative reward sketch follows this list).
- Inference – At runtime, the SLM generates token by token. If it emits a relay command, the LLM supplies the next token(s); otherwise, the SLM continues autonomously. This fine‑grained hand‑off (sketched in code below) replaces the “all‑or‑nothing” routing used in prior work.
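A minimal sketch of the relay decoding loop described above, assuming the two models are wrapped behind hypothetical callables (`slm_next_token`, `llm_next_tokens`) and that the relay signal is a single reserved token; none of the names below come from the paper.

```python
# Illustrative token-level relay decoding (a sketch, not the authors' code).
# `slm_next_token` and `llm_next_tokens` are hypothetical wrappers around
# whatever SLM/LLM pair is deployed; RELAY_TOKEN and EOS_TOKEN are assumed ids.

RELAY_TOKEN = "<relay>"   # special token the SLM emits to request help
EOS_TOKEN = "<eos>"

def relay_decode(prompt_tokens, slm_next_token, llm_next_tokens, max_tokens=512):
    """Generate a response, handing individual steps to the LLM on relay tokens."""
    context = list(prompt_tokens)
    relayed = 0                            # LLM-generated tokens, for cost accounting
    for _ in range(max_tokens):
        token = slm_next_token(context)    # the SLM drives the main decoding loop
        if token == RELAY_TOKEN:
            hard_tokens = llm_next_tokens(context)   # hand-off: LLM sees the same context
            context.extend(hard_tokens)
            relayed += len(hard_tokens)
            continue                       # control returns to the SLM afterwards
        context.append(token)
        if token == EOS_TOKEN:
            break
    return context, relayed
```

Tracking `relayed` alongside the output makes it straightforward to measure a relay rate comparable to the paper’s 1.07 % figure.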
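The GRPO stage needs a scalar reward per sampled completion. The exact formulation is not given here, so the sketch below simply combines the ingredients named above (task accuracy, relay frequency, and a penalty for LLM calls) with hypothetical weights; in GRPO such rewards would then typically be normalized within a group of completions for the same question to form advantages.

```python
# Illustrative reward shaping for the relay policy (hypothetical weights,
# not the paper's exact objective). `n_relayed` and `n_total` can come from
# the decoding loop sketched above; `is_correct` is the task-level outcome.

def relay_reward(is_correct: bool, n_relayed: int, n_total: int,
                 accuracy_weight: float = 1.0, relay_penalty: float = 0.5) -> float:
    """Reward correct answers while penalizing frequent hand-offs to the LLM."""
    relay_rate = n_relayed / max(n_total, 1)
    return accuracy_weight * float(is_correct) - relay_penalty * relay_rate
```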
Results & Findings
| Setting | SLM‑only Acc. | LLM‑only Acc. | RelayLLM Acc. | % Tokens Relayed |
|---|---|---|---|---|
| Average over 6 benchmarks | ~30 % | ~55 % | 49.52 % | 1.07 % |
- Accuracy boost: RelayLLM consistently outperforms the SLM by roughly 20 percentage points absolute and closes most of the gap to the LLM, leaving only a few points on average.
- Cost savings: Because the LLM is consulted for only ~1 % of tokens, total FLOPs per query drop to roughly 1.8 % of a full LLM run, and the reported cost reduction against the performance‑matched random‑router baseline is 98.2 % (a back‑of‑the‑envelope check follows this list).
- Robustness: Ablation studies show that removing GRPO or limiting the relay token vocabulary degrades both accuracy and efficiency, confirming the importance of the two‑stage training.
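A back‑of‑the‑envelope check of that cost claim, treating the reported 1.07 % relay rate as given and assuming, purely for illustration, that the SLM costs about 0.7 % of the LLM per generated token (this ratio is not from the paper):

```python
# Rough per-query compute relative to running the LLM for every token.
relay_rate = 0.0107      # fraction of tokens supplied by the LLM (reported)
slm_cost_ratio = 0.007   # assumed SLM cost per token relative to the LLM (illustrative)

# Relayed tokens pay full LLM price; all other tokens pay only the SLM price.
relative_cost = relay_rate + (1.0 - relay_rate) * slm_cost_ratio
print(f"Compute per query vs. a full LLM run: {relative_cost:.1%}")  # ~1.8 %
```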
Practical Implications
- Production‑grade AI services – Companies can deploy a cheap SLM at the edge (e.g., in a serverless function) and fall back to a cloud‑hosted LLM only for the hardest reasoning steps, keeping most token generation local and dramatically lowering API costs (see the deployment sketch after this list).
- Developer tooling – IDE assistants, code reviewers, or chatbots can stay responsive by handling most tokens locally and invoking a powerful model only when a “stuck” token is detected (e.g., a complex logical inference).
- Energy‑aware AI – Reducing LLM token usage translates directly into lower power consumption, aligning with sustainability goals for large‑scale inference workloads.
- Modular integration – Since RelayLLM works at the decoding level, it can be added on top of any existing SLM/LLM pair without retraining the underlying language models from scratch, easing adoption.
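As a concrete deployment sketch, the relay callback can simply call a hosted model over HTTP. The endpoint, payload shape, and response format below are placeholders, not any specific provider’s API; the function slots into the `llm_next_tokens` role of the decoding sketch in the Methodology section.

```python
# Illustrative edge/cloud split: the SLM runs locally, and a relay token
# triggers one request to a hosted LLM. Endpoint and payload are placeholders.
import requests

LLM_ENDPOINT = "https://llm.example.com/v1/complete"  # hypothetical URL

def llm_next_tokens(context, max_new_tokens=4, timeout_s=10):
    """Fetch a few continuation tokens from the remote LLM for the current context."""
    response = requests.post(
        LLM_ENDPOINT,
        json={"prompt": "".join(context), "max_tokens": max_new_tokens},
        timeout=timeout_s,
    )
    response.raise_for_status()
    return response.json()["tokens"]  # assumed response shape
```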
Limitations & Future Work
- Relay token design – The current approach relies on a special token vocabulary; extending this to more natural “request” signals (e.g., textual prompts) may improve compatibility with off‑the‑shelf APIs.
- Scalability of GRPO – Training the relay policy with reinforcement learning can be compute‑intensive; future work could explore lighter‑weight imitation‑learning alternatives.
- Generalization to multimodal tasks – The paper focuses on pure text reasoning; applying token‑level relaying to vision‑language or audio‑text pipelines remains an open question.
- Dynamic cost budgeting – Presently the relay frequency is learned implicitly; incorporating explicit cost constraints (e.g., a budget per query) could give developers finer control over spend vs. performance trade‑offs.
Authors
- Chengsong Huang
- Tong Zheng
- Langlin Huang
- Jinyuan Li
- Haolin Liu
- Jiaxin Huang
Paper Information
- arXiv ID: 2601.05167v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: January 8, 2026