[Paper] TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Source: arXiv - 2512.03416v1
Overview
The paper presents TokenScale, an autoscaling framework for prefill‑decode (PD) disaggregated serving of large language models (LLMs). By introducing a forward‑looking metric called Token Velocity and a flexible serving primitive named Convertible Decoders, TokenScale reacts to traffic bursts far faster than existing autoscalers, cutting latency SLO violations and reducing compute cost.
Key Contributions
- Token Velocity metric – a unified, fine‑grained indicator that captures work across the prefill, network, and decode stages, acting as an early warning signal for overload.
- Convertible Decoders – a hardware‑aware design that lets decoder GPUs temporarily take on prefill work during spikes, eliminating the warm‑up delay of provisioning new prefill nodes.
- Predictive autoscaling policy – combines Token Velocity with a lightweight controller to scale resources proactively rather than reactively.
- Comprehensive evaluation – real‑world production traces on a GPU cluster show SLO compliance jumps from 50‑88 % to 80‑96 % and cost reductions of 4‑14 % versus state‑of‑the‑art systems (DistServe, BlitzScale, AIBrix).
Methodology
- Metric Design – The authors instrument the PD pipeline to measure how many tokens enter each stage per second. This “token velocity” reflects actual processing pressure, unlike GPU utilization, which lags behind request bursts.
- System Architecture – Decoder GPUs are equipped with a lightweight “conversion” layer that can switch from pure decode mode to a hybrid mode capable of handling prefill batches when needed.
- Autoscaling Controller – A simple threshold‑based controller monitors token velocity. When the velocity exceeds a high‑water mark, it first activates convertible decoders; if the pressure persists, it spins up additional prefill workers. Scaling down follows a low‑water mark with a cooldown period to avoid thrashing (a minimal sketch of this control loop follows this list).
- Experimental Setup – The team replayed production request traces (including bursty traffic patterns) on a 16‑GPU cluster. Baselines included DistServe, BlitzScale, and AIBrix, each configured with its recommended policy. Collected metrics were time to first token (TTFT), time per output token (TPOT), SLO attainment, and total GPU‑hour cost.
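The paper describes the controller at the policy level rather than in code; the Python sketch below illustrates one way the described pieces could fit together: a sliding‑window token‑velocity estimator plus a high/low‑watermark loop that converts decoders first, adds prefill workers only if overload persists, and applies a cooldown on scale‑down. The class names, thresholds, and the `cluster` interface (`convert_decoders_to_prefill`, `add_prefill_worker`, `scale_down_prefill`) are illustrative assumptions, not the authors' implementation.

```python
import time
from collections import deque


class TokenVelocity:
    """Sliding-window rate of tokens entering one pipeline stage (tokens/s).
    Illustrative only; the paper may aggregate stages differently."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self._events = deque()  # (timestamp, token_count)

    def record(self, token_count: int) -> None:
        now = time.monotonic()
        self._events.append((now, token_count))
        self._evict(now)

    def rate(self) -> float:
        now = time.monotonic()
        self._evict(now)
        return sum(count for _, count in self._events) / self.window_s

    def _evict(self, now: float) -> None:
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()


class ThresholdAutoscaler:
    """High/low-watermark controller: convert decoders first, add prefill
    workers only if overload persists, scale down after a cooldown.
    `cluster` is an assumed interface, not part of the paper."""

    def __init__(self, cluster, high_tps: float, low_tps: float,
                 persist_s: float = 5.0, cooldown_s: float = 30.0):
        self.cluster = cluster
        self.high_tps = high_tps        # high-water mark, tokens/s
        self.low_tps = low_tps          # low-water mark, tokens/s
        self.persist_s = persist_s      # how long overload must last
        self.cooldown_s = cooldown_s    # minimum gap between scale-downs
        self._overload_since = None
        self._last_scale_down = 0.0

    def step(self, prefill_velocity: float) -> None:
        now = time.monotonic()
        if prefill_velocity > self.high_tps:
            if self._overload_since is None:
                self._overload_since = now
                # First line of defense: borrow decode GPUs for prefill work.
                self.cluster.convert_decoders_to_prefill()
            elif now - self._overload_since > self.persist_s:
                # Overload persisted: provision dedicated prefill capacity,
                # then wait another persistence window before adding more.
                self.cluster.add_prefill_worker()
                self._overload_since = now
        else:
            self._overload_since = None
            if (prefill_velocity < self.low_tps
                    and now - self._last_scale_down > self.cooldown_s):
                self.cluster.scale_down_prefill()
                self._last_scale_down = now
```

A monitoring loop would call `TokenVelocity.record()` as tokens are admitted to the prefill stage and invoke `ThresholdAutoscaler.step()` on a fixed interval (e.g., every few hundred milliseconds).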
Results & Findings
| Metric | Baselines (DistServe / BlitzScale / AIBrix) | TokenScale |
|---|---|---|
| SLO attainment (TTFT + TPOT) | 50 % – 88 % | 80 % – 96 % |
| Average TTFT | 1.8 s | 1.2 s |
| Average TPOT | 0.45 s/token | 0.33 s/token |
| GPU‑hour cost | baseline | 4 % – 14 % lower |
- Token Velocity reacts within milliseconds to a burst, triggering convertible decoders almost instantly.
- Convertible decoders absorb up to ~30 % of peak traffic without needing to launch new prefill nodes.
- The proactive scaling reduces queue buildup, which directly translates into lower TTFT and TPOT.
Practical Implications
- Lower Latency for End‑Users – Services that expose LLM APIs (e.g., chat assistants, code generation tools) can meet tighter latency SLAs, improving user experience.
- Cost‑Effective Scaling – Cloud operators can run fewer idle prefill instances, relying on the more abundant decoder GPUs for burst handling, which reduces overall GPU‑hour spend.
- Simplified Ops – Token Velocity is easy to instrument and does not require deep hardware counters, making it suitable for heterogeneous clusters (NVIDIA, AMD, or even emerging accelerator fabrics).
- Portability – The convertible decoder concept can be implemented as a software shim on existing inference runtimes (e.g., vLLM, TensorRT‑LLM), enabling incremental adoption without hardware redesign; a hedged sketch of such a shim follows this list.
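To make the shim idea concrete, here is a minimal, hypothetical sketch of a decode worker that can be flipped into a hybrid mode and drain prefill batches between decode steps. The `engine` interface (`run_prefill_batch`, `run_decode_step`) and the queue wiring are assumptions for illustration only and are not the vLLM or TensorRT‑LLM API; a real integration would hook the runtime's scheduler instead.

```python
import queue
import time


class ConvertibleDecoder:
    """Software shim that lets a decode worker temporarily absorb prefill
    batches. `engine` is a hypothetical runtime interface exposing
    run_prefill_batch() and run_decode_step(); it does not correspond to
    any real runtime's API."""

    def __init__(self, engine, prefill_queue: queue.Queue,
                 decode_queue: queue.Queue):
        self.engine = engine
        self.prefill_queue = prefill_queue  # cluster-wide pending prefills
        self.decode_queue = decode_queue    # this worker's decode batches
        self.hybrid_mode = False            # toggled by the autoscaler

    def enable_hybrid(self) -> None:
        """Called when token velocity crosses the high-water mark."""
        self.hybrid_mode = True

    def disable_hybrid(self) -> None:
        """Called once prefill pressure subsides."""
        self.hybrid_mode = False

    def serve_loop(self) -> None:
        while True:
            did_work = False
            # In hybrid mode, interleave one prefill batch between decode
            # steps so in-flight decodes keep making progress.
            if self.hybrid_mode:
                try:
                    batch = self.prefill_queue.get_nowait()
                    state = self.engine.run_prefill_batch(batch)
                    self.decode_queue.put(state)  # hand off for decoding
                    did_work = True
                except queue.Empty:
                    pass
            try:
                self.engine.run_decode_step(self.decode_queue.get_nowait())
                did_work = True
            except queue.Empty:
                pass
            if not did_work:
                time.sleep(0.001)  # idle back-off
```

The design point mirrored from the paper is that conversion is a mode toggle on an already‑warm decode worker, so burst capacity becomes available without the cold‑start delay of provisioning a new prefill node.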
Limitations & Future Work
- Hardware Dependency – Convertible decoders assume decode GPUs have spare compute to absorb prefill work; when decoders are already heavily loaded, the benefit may diminish.
- Metric Sensitivity – Token Velocity thresholds need tuning per model size and batch pattern; an automated calibration routine is not yet provided.
- Multi‑Tenant Scenarios – The paper focuses on a single‑tenant workload; extending the approach to multi‑tenant clusters with fairness guarantees remains an open challenge.
- Future Directions – The authors plan to explore adaptive threshold learning (e.g., reinforcement learning) and to integrate token‑level priority scheduling for mixed‑priority requests.
Authors
- Ruiqi Lai
- Hongrui Liu
- Chengzhi Lu
- Zonghao Liu
- Siyu Cao
- Siyang Shao
- Yixin Zhang
- Luo Mai
- Dmitrii Ustiugov
Paper Information
- arXiv ID: 2512.03416v1
- Categories: cs.DC
- Published: December 3, 2025