[Paper] TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Source: arXiv - 2512.03416v1
Overview
The paper presents TokenScale, an autoscaling framework for prefill‑decode (PD) disaggregated serving of large language models (LLMs). By introducing a forward‑looking metric called Token Velocity and a flexible serving primitive named Convertible Decoders, TokenScale reacts to traffic bursts far faster than existing autoscalers, cutting latency SLO violations and reducing compute cost.
Key Contributions
- Token Velocity metric – a unified, fine‑grained indicator that captures work across the prefill, network, and decode stages, acting as an early warning signal for overload.
- Convertible Decoders – a hardware‑aware design that lets decoder GPUs temporarily take on prefill work during spikes, eliminating the warm‑up delay of provisioning new prefill nodes.
- Predictive autoscaling policy – combines Token Velocity with a lightweight controller to scale resources proactively rather than reactively.
- Comprehensive evaluation – real‑world production traces on a GPU cluster show SLO compliance jumps from 50‑88 % to 80‑96 % and cost reductions of 4‑14 % versus state‑of‑the‑art systems (DistServe, BlitzScale, AIBrix).
Methodology
- Metric Design – The authors instrument the PD pipeline to measure how many tokens enter each stage per second. This “token velocity” reflects actual processing pressure, unlike GPU utilization, which lags behind request bursts.
- System Architecture – Decoder GPUs are equipped with a lightweight “conversion” layer that can switch from pure decode mode to a hybrid mode capable of handling prefill batches when needed.
- Autoscaling Controller – A simple threshold‑based controller monitors token velocity. When the velocity exceeds a high‑water mark, it first activates convertible decoders; if the pressure persists, it spins up additional prefill workers. Scaling down follows a low‑water mark with a cooldown period to avoid thrashing (a minimal sketch of this control loop follows this list).
- Experimental Setup – The team replayed production request traces (including bursty traffic patterns) on a 16‑GPU cluster. Baselines included DistServe, BlitzScale, and AIBrix, each configured with its recommended policy. Collected metrics were time to first token (TTFT), time per output token (TPOT), SLO attainment, and total GPU‑hour cost.
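The paper describes the controller at the policy level rather than in code; the Python sketch below illustrates one way the described pieces could fit together: a sliding‑window token‑velocity estimator plus a high/low‑watermark loop that converts decoders first, adds prefill workers only if overload persists, and applies a cooldown on scale‑down. The class names, thresholds, and the `cluster` interface (`convert_decoders_to_prefill`, `add_prefill_worker`, `scale_down_prefill`) are illustrative assumptions, not the authors' implementation.

```python
import time
from collections import deque


class TokenVelocity:
    """Sliding-window rate of tokens entering one pipeline stage (tokens/s).
    Illustrative only; the paper may aggregate stages differently."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self._events = deque()  # (timestamp, token_count)

    def record(self, token_count: int) -> None:
        now = time.monotonic()
        self._events.append((now, token_count))
        self._evict(now)

    def rate(self) -> float:
        now = time.monotonic()
        self._evict(now)
        return sum(count for _, count in self._events) / self.window_s

    def _evict(self, now: float) -> None:
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()


class ThresholdAutoscaler:
    """High/low-watermark controller: convert decoders first, add prefill
    workers only if overload persists, scale down after a cooldown.
    `cluster` is an assumed interface, not part of the paper."""

    def __init__(self, cluster, high_tps: float, low_tps: float,
                 persist_s: float = 5.0, cooldown_s: float = 30.0):
        self.cluster = cluster
        self.high_tps = high_tps        # high-water mark, tokens/s
        self.low_tps = low_tps          # low-water mark, tokens/s
        self.persist_s = persist_s      # how long overload must last
        self.cooldown_s = cooldown_s    # minimum gap between scale-downs
        self._overload_since = None
        self._last_scale_down = 0.0

    def step(self, prefill_velocity: float) -> None:
        now = time.monotonic()
        if prefill_velocity > self.high_tps:
            if self._overload_since is None:
                self._overload_since = now
                # First line of defense: borrow decode GPUs for prefill work.
                self.cluster.convert_decoders_to_prefill()
            elif now - self._overload_since > self.persist_s:
                # Overload persisted: provision dedicated prefill capacity,
                # then wait another persistence window before adding more.
                self.cluster.add_prefill_worker()
                self._overload_since = now
        else:
            self._overload_since = None
            if (prefill_velocity < self.low_tps
                    and now - self._last_scale_down > self.cooldown_s):
                self.cluster.scale_down_prefill()
                self._last_scale_down = now
```

A monitoring loop would call `TokenVelocity.record()` as tokens are admitted to the prefill stage and invoke `ThresholdAutoscaler.step()` on a fixed interval (e.g., every few hundred milliseconds).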
Results & Findings
| Metric | Baselines (DistServe / BlitzScale / AIBrix) | TokenScale |
|---|---|---|
| SLO attainment (TTFT + TPOT) | 50 % – 88 % | 80 % – 96 % |
| Average TTFT | 1.8 s | 1.2 s |
| Average TPOT | 0.45 s/token | 0.33 s/token |
| GPU‑hour cost | baseline | 4 % – 14 % lower |
- Token Velocity reacts within milliseconds to a burst, triggering convertible decoders almost instantly.
- Convertible decoders absorb up to ~30 % of peak traffic without needing to launch new prefill nodes.
- The proactive scaling reduces queue buildup, which directly translates into lower TTFT and TPOT.
Practical Implications
- Lower Latency for End‑Users – Services that expose LLM APIs (e.g., chat assistants, code generation tools) can meet tighter latency SLAs, improving user experience.
- Cost‑Effective Scaling – Cloud operators can run fewer idle prefill instances, relying on the more abundant decoder GPUs for burst handling, which reduces overall GPU‑hour spend.
- Simplified Ops – Token Velocity is easy to instrument and does not require deep hardware counters, making it suitable for heterogeneous clusters (NVIDIA, AMD, or even emerging accelerator fabrics).
- Portability – The convertible decoder concept can be implemented as a software shim on existing inference runtimes (e.g., vLLM, TensorRT‑LLM), enabling incremental adoption without hardware redesign; a hedged sketch of such a shim follows this list.
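To make the shim idea concrete, here is a minimal, hypothetical sketch of a decode worker that can be flipped into a hybrid mode and drain prefill batches between decode steps. The `engine` interface (`run_prefill_batch`, `run_decode_step`) and the queue wiring are assumptions for illustration only and are not the vLLM or TensorRT‑LLM API; a real integration would hook the runtime's scheduler instead.

```python
import queue
import time


class ConvertibleDecoder:
    """Software shim that lets a decode worker temporarily absorb prefill
    batches. `engine` is a hypothetical runtime interface exposing
    run_prefill_batch() and run_decode_step(); it does not correspond to
    any real runtime's API."""

    def __init__(self, engine, prefill_queue: queue.Queue,
                 decode_queue: queue.Queue):
        self.engine = engine
        self.prefill_queue = prefill_queue  # cluster-wide pending prefills
        self.decode_queue = decode_queue    # this worker's decode batches
        self.hybrid_mode = False            # toggled by the autoscaler

    def enable_hybrid(self) -> None:
        """Called when token velocity crosses the high-water mark."""
        self.hybrid_mode = True

    def disable_hybrid(self) -> None:
        """Called once prefill pressure subsides."""
        self.hybrid_mode = False

    def serve_loop(self) -> None:
        while True:
            did_work = False
            # In hybrid mode, interleave one prefill batch between decode
            # steps so in-flight decodes keep making progress.
            if self.hybrid_mode:
                try:
                    batch = self.prefill_queue.get_nowait()
                    state = self.engine.run_prefill_batch(batch)
                    self.decode_queue.put(state)  # hand off for decoding
                    did_work = True
                except queue.Empty:
                    pass
            try:
                self.engine.run_decode_step(self.decode_queue.get_nowait())
                did_work = True
            except queue.Empty:
                pass
            if not did_work:
                time.sleep(0.001)  # idle back-off
```

The design point mirrored from the paper is that conversion is a mode toggle on an already‑warm decode worker, so burst capacity becomes available without the cold‑start delay of provisioning a new prefill node.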
Limitations & Future Work
- Hardware Dependency – Convertible decoders assume decode GPUs have spare compute to absorb prefill work; when decoders are already heavily loaded, the benefit may diminish.
- Metric Sensitivity – Token Velocity thresholds need tuning per model size and batch pattern; an automated calibration routine is not yet provided.
- Multi‑Tenant Scenarios – The paper focuses on a single‑tenant workload; extending the approach to multi‑tenant clusters with fairness guarantees remains an open challenge.
- Future Directions – The authors plan to explore adaptive threshold learning (e.g., reinforcement learning) and to integrate token‑level priority scheduling for mixed‑priority requests.
Authors
- Ruiqi Lai
- Hongrui Liu
- Chengzhi Lu
- Zonghao Liu
- Siyu Cao
- Siyang Shao
- Yixin Zhang
- Luo Mai
- Dmitrii Ustiugov
Paper Information
- arXiv ID: 2512.03416v1
- Categories: cs.DC
- Published: December 3, 2025