[Paper] VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Published: June 10, 2026 at 11:45 AM EDT

Source: arXiv - 2606.12243v1

Overview

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/

Key Contributions

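Based on the abstract, the paper contributes:

The observation that many draft tokens rejected under binary accept-or-recompute verification can be verified correctly by a slim submodel derived from the full verifier via intra-model routing.
VIA-SD, a multi-tier speculative decoding framework that routes each draft token to direct acceptance, slim-verifier regeneration, or full-model verification according to its confidence.
An evaluation across four representative tasks and multiple model families reporting rejection rates reduced by 0.10-0.22, 10-20% speedups over strong SD baselines, and 2.5-3x acceleration over non-drafting decoding.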

Methodology

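According to the abstract, VIA-SD keeps the standard draft-then-verify structure but replaces the binary accept-or-recompute decision with three tiers. A lightweight drafter proposes candidate tokens, and each is handled hierarchically: high-confidence tokens are accepted directly, medium-confidence tokens are regenerated by a slim-verifier (a submodel derived from the full verifier via intra-model routing), and uncertain tokens go to full-model verification. The abstract does not specify the confidence criteria or how the routed submodel is constructed; the full paper covers these details.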
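The three-tier flow described in the abstract (direct acceptance, slim-verifier regeneration, full-model verification) can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the per-token confidence score, the threshold values, the `Verifier` callback signature, and the names `Tier`, `Thresholds`, `route`, and `verify_draft` are all hypothetical. Real verification also scores a whole draft block in one parallel forward pass rather than token by token. The sketch borrows the standard speculative decoding rule that processing stops at the first token a verifier changes, because later draft tokens were conditioned on a prefix that is no longer valid.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple

# Hypothetical verifier callback: (accepted prefix, draft token) -> emitted token.
Verifier = Callable[[List[str], str], str]


class Tier(Enum):
    ACCEPT = "accept"  # high confidence: keep the draft token directly
    SLIM = "slim"      # medium confidence: the slim-verifier decides
    FULL = "full"      # uncertain: the full verifier decides


@dataclass
class Thresholds:
    # Hypothetical values; the abstract does not state the actual routing criteria.
    accept: float = 0.9
    slim: float = 0.5


def route(confidence: float, th: Thresholds) -> Tier:
    """Map a draft token's (assumed) confidence score to a verification tier."""
    if confidence >= th.accept:
        return Tier.ACCEPT
    if confidence >= th.slim:
        return Tier.SLIM
    return Tier.FULL


def verify_draft(
    draft: Sequence[str],
    confidences: Sequence[float],
    slim_verify: Verifier,
    full_verify: Verifier,
    th: Thresholds = Thresholds(),
) -> Tuple[List[str], List[Tier]]:
    """Process draft tokens left to right through the three tiers.

    Once a verifier emits a token that differs from the draft, the remaining
    draft tokens are discarded (standard SD semantics, assumed here).
    """
    accepted: List[str] = []
    tiers: List[Tier] = []
    for token, conf in zip(draft, confidences):
        tier = route(conf, th)
        tiers.append(tier)
        if tier is Tier.ACCEPT:
            accepted.append(token)
            continue
        verifier = slim_verify if tier is Tier.SLIM else full_verify
        emitted = verifier(accepted, token)
        accepted.append(emitted)
        if emitted != token:
            break  # draft diverged; the drafter would resume from this prefix
    return accepted, tiers


# Toy verifiers for illustration: the slim verifier agrees with every draft
# token, while the full verifier replaces "dog" with "cat".
def slim(prefix: List[str], tok: str) -> str:
    return tok


def full(prefix: List[str], tok: str) -> str:
    return "cat" if tok == "dog" else tok


out, tiers = verify_draft(
    ["the", "quick", "dog", "jumps"], [0.95, 0.7, 0.3, 0.99], slim, full
)
print(out)                       # ['the', 'quick', 'cat']
print([t.value for t in tiers])  # ['accept', 'slim', 'full']
```

In this sketch, lowering the `slim` threshold routes more tokens to the cheaper slim-verifier and fewer to the full model, which is the kind of saving the abstract describes, at the risk of the slim submodel keeping tokens the full model would have changed.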

Practical Implications

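Because VIA-SD is compatible with existing SD frameworks without modifying their training procedures, it can in principle be layered onto current draft-verify pipelines. Its design aims to cut expensive large-model calls by sending medium-confidence tokens to the slim-verifier instead of the full model; the reported results are 10-20% speedups over strong SD baselines and 2.5-3x acceleration over non-drafting decoding. The authors suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference.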

Authors

Yuchen Xian
Yang He
Yunqiu Xu
Yi Yang

Paper Information

arXiv ID: 2606.12243v1
Categories: cs.CL, cs.AI
Published: June 10, 2026