[Paper] RelayLLM: Efficient Reasoning via Collaborative Decoding

Published: January 8, 2026 at 12:56 PM EST
4 min read
Source: arXiv - 2601.05167v1

Overview

The paper RelayLLM tackles a practical pain point in today’s AI pipelines: getting the deep reasoning power of large language models (LLMs) without paying their full compute bill. By letting a lightweight small language model (SLM) “call” a big model only for the few tokens it genuinely can’t handle, the authors achieve near-LLM performance while cutting inference cost by more than 98 % relative to a performance-matched routing baseline.

Key Contributions

  • Token‑level collaborative decoding – the SLM decides, on a per‑token basis, whether to generate itself or to issue a special “relay” command that hands control to the LLM.
  • Two‑stage training pipeline – a warm‑up phase followed by Group Relative Policy Optimization (GRPO) teaches the SLM to balance autonomy with strategic help‑seeking.
  • Empirical validation on six reasoning benchmarks – RelayLLM reaches an average accuracy of 49.52 %, closing most of the gap between the SLM and the LLM.
  • Extreme efficiency – the LLM is invoked for only 1.07 % of generated tokens, yielding a 98.2 % reduction in compute cost relative to a performance-matched random-routing baseline.
  • Generalizable framework – the relay mechanism can be plugged into any existing SLM/LLM pair without architectural changes to the models themselves.

Methodology

  1. Architecture – The system consists of an SLM (e.g., a 7B‑parameter model) and a much larger LLM (e.g., GPT‑3.5‑turbo). The SLM runs the main decoding loop. When it predicts a “relay token,” the decoder pauses, sends the current context to the LLM, and inserts the LLM’s next token(s) into the output stream.
  2. Training Phase 1: Warm‑up – Both models are first fine‑tuned on the target reasoning tasks using standard supervised learning, ensuring they can solve the problems independently.
  3. Training Phase 2: GRPO – The SLM’s policy (when to emit a relay token) is optimized with a reinforcement‑learning style objective. GRPO groups tokens into “critical” and “non‑critical” sets and rewards the SLM for:
    • Correctly handling easy tokens on its own (reducing reliance on the LLM).
    • Calling the LLM on truly hard tokens (improving overall answer quality).
      The loss balances task accuracy, relay frequency, and a penalty for unnecessary LLM calls (a toy reward sketch follows this list).
  4. Inference – At runtime, the SLM generates token-by-token. If it emits a relay command, the LLM supplies the next token(s); otherwise, the SLM continues autonomously. This fine-grained hand-off (sketched in the first code block below) eliminates the “all-or-nothing” query routing used in prior work.
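
To make the hand-off concrete, here is a minimal sketch of the token-level relay loop described in steps 1 and 4. The function names, the relay-token string, and the generation interfaces are illustrative assumptions, not the paper’s implementation:

```python
# Minimal sketch of token-level collaborative decoding (RelayLLM-style).
# slm_next_token / llm_next_tokens are hypothetical stand-ins for real
# SLM/LLM generation calls; only the relay control flow is the point here.

RELAY_TOKEN = "<relay>"   # special token the SLM is trained to emit (assumed name)
EOS_TOKEN = "<eos>"
MAX_NEW_TOKENS = 512


def slm_next_token(context: list[str]) -> str:
    """Hypothetical SLM call: return the next token for the given context."""
    raise NotImplementedError


def llm_next_tokens(context: list[str], k: int = 1) -> list[str]:
    """Hypothetical LLM call: return the next k tokens for the given context."""
    raise NotImplementedError


def relay_decode(prompt: list[str]) -> tuple[list[str], float]:
    """Run the SLM's decoding loop, handing single steps to the LLM on relay tokens.

    Returns the generated tokens and the fraction of them produced by the LLM.
    """
    context, llm_tokens, new_tokens = list(prompt), 0, 0
    while new_tokens < MAX_NEW_TOKENS and (not context or context[-1] != EOS_TOKEN):
        token = slm_next_token(context)
        if token == RELAY_TOKEN:
            # Hand control to the LLM for the token(s) the SLM is unsure about,
            # then resume autonomous decoding with the SLM.
            supplied = llm_next_tokens(context, k=1)
            context.extend(supplied)
            llm_tokens += len(supplied)
            new_tokens += len(supplied)
        else:
            context.append(token)
            new_tokens += 1
    generated = context[len(prompt):]
    return generated, llm_tokens / max(new_tokens, 1)
```

Because the LLM is only consulted when a relay token is emitted, its share of the per-query cost scales with the relay rate rather than with the output length.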
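
Step 3 can likewise be pictured with a toy reward and a GRPO-style group-relative advantage. The reward terms mirror the description above (accuracy, relay frequency, unnecessary calls), but the weights and the exact shaping are assumptions, not the paper’s loss:

```python
import statistics


def relay_reward(answer_correct: bool, relay_fraction: float,
                 unnecessary_relays: int,
                 acc_weight: float = 1.0, freq_weight: float = 0.3,
                 waste_weight: float = 0.1) -> float:
    """Toy reward: task accuracy minus penalties for relay frequency and for
    LLM calls the SLM could have handled itself (weights are illustrative)."""
    return (acc_weight * float(answer_correct)
            - freq_weight * relay_fraction
            - waste_weight * unnecessary_relays)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each rollout's reward against the mean
    and standard deviation of the group sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


# Example: four sampled rollouts for one prompt.
rewards = [relay_reward(True, 0.01, 0), relay_reward(True, 0.08, 2),
           relay_reward(False, 0.00, 0), relay_reward(False, 0.20, 5)]
print(group_relative_advantages(rewards))
```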

Results & Findings

Benchmark (6 total)   SLM-only Acc.   LLM-only Acc.   RelayLLM Acc.   % Tokens Relayed
Avg.                  ~30 %           ~55 %           49.52 %         1.07 %
  • Accuracy boost: RelayLLM consistently outperforms the SLM-only baseline by roughly 20 percentage points and closes most of the gap to the LLM-only baseline.
  • Cost savings: Because the LLM is consulted for only ~1 % of generated tokens, the per-query compute cost drops by 98.2 % relative to the performance-matched random-router baseline.
  • Robustness: Ablation studies show that removing GRPO or limiting the relay token vocabulary degrades both accuracy and efficiency, confirming the importance of the two‑stage training.
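
To see where the savings come from, a back-of-the-envelope cost model helps: per-query decode cost is roughly the relay fraction at the LLM’s per-token cost plus the remainder at the SLM’s per-token cost. The model sizes below are hypothetical, and per-token cost is assumed proportional to parameter count, which is a simplification:

```python
def relative_decode_cost(relay_fraction: float,
                         slm_params_b: float, llm_params_b: float) -> float:
    """Approximate per-query decode cost relative to running the LLM on every
    token, assuming per-token cost scales with parameter count (a simplification)."""
    slm_share = (1.0 - relay_fraction) * (slm_params_b / llm_params_b)
    llm_share = relay_fraction  # tokens the LLM actually generates
    return slm_share + llm_share


# Hypothetical pairing: a 1.5B SLM relaying 1.07 % of tokens to a 72B LLM.
print(f"{relative_decode_cost(0.0107, 1.5, 72.0):.1%} of full-LLM cost")  # ~3.1 %
```

This sketch ignores the cost of the LLM re-encoding the context on each relay, so real savings depend on caching and batching details.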

Practical Implications

  • Production‑grade AI services – Companies can deploy a cheap SLM at the edge (e.g., on a serverless function) and fall back to a cloud‑hosted LLM only for the hardest reasoning steps, dramatically lowering latency and API costs.
  • Developer tooling – IDE assistants, code reviewers, or chatbots can stay responsive by handling most tokens locally and invoking a powerful model only when a “stuck” token is detected (e.g., a complex logical inference).
  • Energy‑aware AI – Reducing LLM token usage translates directly into lower power consumption, aligning with sustainability goals for large‑scale inference workloads.
  • Modular integration – Since RelayLLM works at the decoding level, it can be added on top of any existing SLM/LLM pair without retraining the underlying language models from scratch, easing adoption.

Limitations & Future Work

  • Relay token design – The current approach relies on a special token vocabulary; extending this to more natural “request” signals (e.g., textual prompts) may improve compatibility with off‑the‑shelf APIs.
  • Scalability of GRPO – Training the relay policy with reinforcement learning can be compute‑intensive; future work could explore lighter‑weight imitation‑learning alternatives.
  • Generalization to multimodal tasks – The paper focuses on pure text reasoning; applying token‑level relaying to vision‑language or audio‑text pipelines remains an open question.
  • Dynamic cost budgeting – Presently the relay frequency is learned implicitly; incorporating explicit cost constraints (e.g., a budget per query) could give developers finer control over spend vs. performance trade‑offs.

Authors

  • Chengsong Huang
  • Tong Zheng
  • Langlin Huang
  • Jinyuan Li
  • Haolin Liu
  • Jiaxin Huang

Paper Information

  • arXiv ID: 2601.05167v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: January 8, 2026
