[Paper] RelayLLM: Efficient Reasoning via Collaborative Decoding
Source: arXiv - 2601.05167v1
Overview
The paper RelayLLM tackles a practical pain point in today’s AI pipelines: getting the deep reasoning power of large language models (LLMs) without paying their huge compute bill. By letting a lightweight small language model (SLM) “call” a big model only for the few tokens it genuinely cannot handle, the authors achieve near‑LLM performance while cutting inference compute by more than 98 % relative to a performance‑matched routing baseline.
Key Contributions
- Token‑level collaborative decoding – the SLM decides, on a per‑token basis, whether to generate the next token itself or to issue a special “relay” command that hands control to the LLM.
- Two‑stage training pipeline – a warm‑up phase followed by Group Relative Policy Optimization (GRPO) teaches the SLM to balance autonomy with strategic help‑seeking.
- Empirical validation on six reasoning benchmarks – RelayLLM reaches an average accuracy of 49.52 %, closing most of the gap between the SLM and the LLM.
- Extreme efficiency – the LLM is invoked for only 1.07 % of generated tokens, delivering a 98.2 % reduction in compute cost versus a naïve random router that matches performance.
- Generalizable framework – the relay mechanism can be plugged into any existing SLM/LLM pair without architectural changes to the models themselves.
Methodology
- Architecture – The system consists of an SLM (e.g., a 7B‑parameter model) and a much larger LLM (e.g., GPT‑3.5‑turbo). The SLM runs the main decoding loop. When it predicts a “relay token,” the decoder pauses, sends the current context to the LLM, and inserts the LLM’s next token(s) into the output stream.
- Training Phase 1: Warm‑up – Both models are first fine‑tuned on the target reasoning tasks using standard supervised learning, ensuring they can solve the problems independently.
- Training Phase 2: GRPO – The SLM’s policy (when to emit a relay token) is optimized with a reinforcement‑learning‑style objective. GRPO groups tokens into “critical” and “non‑critical” sets and rewards the SLM for:
- Correctly handling easy tokens on its own (reducing reliance on the LLM).
- Calling the LLM on truly hard tokens (improving overall answer quality).
The loss balances task accuracy, relay frequency, and a penalty for unnecessary LLM calls (an illustrative reward sketch follows this list).
- Inference – At runtime, the SLM generates token by token. If it emits a relay command, the LLM supplies the next token(s); otherwise, the SLM continues autonomously. This fine‑grained hand‑off (sketched in code below) replaces the “all‑or‑nothing” routing used in prior work.
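A minimal sketch of the relay decoding loop described above, assuming the two models are wrapped behind hypothetical callables (`slm_next_token`, `llm_next_tokens`) and that the relay signal is a single reserved token; none of the names below come from the paper.

```python
# Illustrative token-level relay decoding (a sketch, not the authors' code).
# `slm_next_token` and `llm_next_tokens` are hypothetical wrappers around
# whatever SLM/LLM pair is deployed; RELAY_TOKEN and EOS_TOKEN are assumed ids.

RELAY_TOKEN = "<relay>"   # special token the SLM emits to request help
EOS_TOKEN = "<eos>"

def relay_decode(prompt_tokens, slm_next_token, llm_next_tokens, max_tokens=512):
    """Generate a response, handing individual steps to the LLM on relay tokens."""
    context = list(prompt_tokens)
    relayed = 0                            # LLM-generated tokens, for cost accounting
    for _ in range(max_tokens):
        token = slm_next_token(context)    # the SLM drives the main decoding loop
        if token == RELAY_TOKEN:
            hard_tokens = llm_next_tokens(context)   # hand-off: LLM sees the same context
            context.extend(hard_tokens)
            relayed += len(hard_tokens)
            continue                       # control returns to the SLM afterwards
        context.append(token)
        if token == EOS_TOKEN:
            break
    return context, relayed
```

Tracking `relayed` alongside the output makes it straightforward to measure a relay rate comparable to the paper’s 1.07 % figure.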
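The GRPO stage needs a scalar reward per sampled completion. The exact formulation is not given here, so the sketch below simply combines the ingredients named above (task accuracy, relay frequency, and a penalty for LLM calls) with hypothetical weights; in GRPO such rewards would then typically be normalized within a group of completions for the same question to form advantages.

```python
# Illustrative reward shaping for the relay policy (hypothetical weights,
# not the paper's exact objective). `n_relayed` and `n_total` can come from
# the decoding loop sketched above; `is_correct` is the task-level outcome.

def relay_reward(is_correct: bool, n_relayed: int, n_total: int,
                 accuracy_weight: float = 1.0, relay_penalty: float = 0.5) -> float:
    """Reward correct answers while penalizing frequent hand-offs to the LLM."""
    relay_rate = n_relayed / max(n_total, 1)
    return accuracy_weight * float(is_correct) - relay_penalty * relay_rate
```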
Results & Findings
| Setting | SLM‑only Acc. | LLM‑only Acc. | RelayLLM Acc. | % Tokens Relayed |
|---|---|---|---|---|
| Average over 6 benchmarks | ~30 % | ~55 % | 49.52 % | 1.07 % |
- Accuracy boost: RelayLLM consistently outperforms the SLM by roughly 20 percentage points absolute and closes most of the gap to the LLM, leaving only a few points on average.
- Cost savings: Because the LLM is consulted for only ~1 % of tokens, total FLOPs per query drop to roughly 1.8 % of a full LLM run, and the reported cost reduction against the performance‑matched random‑router baseline is 98.2 % (a back‑of‑the‑envelope check follows this list).
- Robustness: Ablation studies show that removing GRPO or limiting the relay token vocabulary degrades both accuracy and efficiency, confirming the importance of the two‑stage training.
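A back‑of‑the‑envelope check of that cost claim, treating the reported 1.07 % relay rate as given and assuming, purely for illustration, that the SLM costs about 0.7 % of the LLM per generated token (this ratio is not from the paper):

```python
# Rough per-query compute relative to running the LLM for every token.
relay_rate = 0.0107      # fraction of tokens supplied by the LLM (reported)
slm_cost_ratio = 0.007   # assumed SLM cost per token relative to the LLM (illustrative)

# Relayed tokens pay full LLM price; all other tokens pay only the SLM price.
relative_cost = relay_rate + (1.0 - relay_rate) * slm_cost_ratio
print(f"Compute per query vs. a full LLM run: {relative_cost:.1%}")  # ~1.8 %
```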
Practical Implications
- Production‑grade AI services – Companies can deploy a cheap SLM at the edge (e.g., in a serverless function) and fall back to a cloud‑hosted LLM only for the hardest reasoning steps, keeping most token generation local and dramatically lowering API costs (see the deployment sketch after this list).
- Developer tooling – IDE assistants, code reviewers, or chatbots can stay responsive by handling most tokens locally and invoking a powerful model only when a “stuck” token is detected (e.g., a complex logical inference).
- Energy‑aware AI – Reducing LLM token usage translates directly into lower power consumption, aligning with sustainability goals for large‑scale inference workloads.
- Modular integration – Since RelayLLM works at the decoding level, it can be added on top of any existing SLM/LLM pair without retraining the underlying language models from scratch, easing adoption.
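As a concrete deployment sketch, the relay callback can simply call a hosted model over HTTP. The endpoint, payload shape, and response format below are placeholders, not any specific provider’s API; the function slots into the `llm_next_tokens` role of the decoding sketch in the Methodology section.

```python
# Illustrative edge/cloud split: the SLM runs locally, and a relay token
# triggers one request to a hosted LLM. Endpoint and payload are placeholders.
import requests

LLM_ENDPOINT = "https://llm.example.com/v1/complete"  # hypothetical URL

def llm_next_tokens(context, max_new_tokens=4, timeout_s=10):
    """Fetch a few continuation tokens from the remote LLM for the current context."""
    response = requests.post(
        LLM_ENDPOINT,
        json={"prompt": "".join(context), "max_tokens": max_new_tokens},
        timeout=timeout_s,
    )
    response.raise_for_status()
    return response.json()["tokens"]  # assumed response shape
```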
Limitations & Future Work
- Relay token design – The current approach relies on a special token vocabulary; extending this to more natural “request” signals (e.g., textual prompts) may improve compatibility with off‑the‑shelf APIs.
- Scalability of GRPO – Training the relay policy with reinforcement learning can be compute‑intensive; future work could explore lighter‑weight imitation‑learning alternatives.
- Generalization to multimodal tasks – The paper focuses on pure text reasoning; applying token‑level relaying to vision‑language or audio‑text pipelines remains an open question.
- Dynamic cost budgeting – Presently the relay frequency is learned implicitly; incorporating explicit cost constraints (e.g., a budget per query) could give developers finer control over spend vs. performance trade‑offs.
Authors
- Chengsong Huang
- Tong Zheng
- Langlin Huang
- Jinyuan Li
- Haolin Liu
- Jiaxin Huang
Paper Information
- arXiv ID: 2601.05167v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: January 8, 2026