[Paper] Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

Published: December 17, 2025 at 10:45 PM EST
4 min read
Source: arXiv - 2512.16134v1

Overview

The paper tackles a subtle but costly inefficiency in modern large‑language‑model (LLM) serving stacks that split work between data‑parallel (DP) and expert‑parallel (EP) stages. In these “DP+EP” pipelines, sending each request straight to the model creates internal queuing “bubbles” that slow down the time‑to‑first‑token (TTFT)—the latency users feel the most. The authors propose Staggered Batch Scheduling (SBS), a lightweight buffering strategy that deliberately delays requests just enough to assemble well‑packed batches, while also redistributing load across DP replicas. Their production‑grade experiments on an H800‑GPU cluster serving Deepseek‑V3 show 30‑40 % lower TTFT and 15‑20 % higher throughput versus conventional immediate‑dispatch schedulers.

Key Contributions

  • Staggered Batch Scheduling (SBS): a simple “hold‑and‑release” mechanism that buffers incoming queries to create optimal batch sizes for the DP+EP pipeline, eliminating intra‑engine queuing bubbles.
  • Load‑Aware Global Allocation: a dynamic, system‑wide load‑balancing policy that spreads both prefill (prompt processing) and decode (token‑by‑token generation) work across DP replicas, preventing hotspots.
  • Real‑world deployment: integration of SBS and the load‑aware allocator into a production‑grade Deepseek‑V3 serving stack on a 64‑GPU H800 cluster, demonstrating measurable latency and throughput gains.
  • Comprehensive evaluation: extensive micro‑benchmarks and end‑to‑end user‑facing tests that compare SBS against state‑of‑the‑art immediate‑dispatch schedulers under realistic traffic patterns.
  • Open‑source insights: the authors release detailed design diagrams, scheduling algorithms, and profiling scripts to aid reproducibility and adoption.

Methodology

  1. Problem Characterization

    • The authors first profile a typical DP+EP serving pipeline (prefill on DP, expert routing on EP, decode back on DP).
    • They discover that immediate request dispatch creates asynchronous “bubbles”: some DP workers sit idle while waiting for EP stages to finish, inflating TTFT.
  2. Staggered Batch Scheduling (SBS)

    • Incoming requests are placed in a tiny time‑window buffer (e.g., 5–20 ms).
    • When the buffer fills or the window expires, the scheduler forms a batch that matches the optimal tensor shapes for the DP and EP kernels.
    • The batch is then dispatched atomically, guaranteeing that all DP workers start together and eliminating internal queuing (see the buffering sketch after this list).
  3. Load‑Aware Global Allocation

    • The system maintains a global load map of DP replicas (prefill and decode workloads).
    • When forming a batch, the allocator picks the DP replica with the lowest projected load, balancing both phases.
    • The algorithm runs in O(1) per request, making it suitable for high‑throughput environments (see the allocator sketch after this list).
  4. Implementation & Deployment

    • Integrated into the DeepSpeed‑Inference stack, with minimal code changes (~200 LOC).
    • Deployed on an H800 GPU cluster (8 nodes × 8 GPUs = 64 GPUs) serving the Deepseek‑V3 Mixture‑of‑Experts model.
    • Traffic is generated using a realistic mix of short prompts (≤ 64 tokens) and long generation requests (≤ 1024 tokens).
  5. Evaluation

    • Metrics: TTFT, overall latency, throughput (tokens / sec), and GPU utilization.
    • Baselines: an immediate‑dispatch scheduler and a naive “fixed‑size batch” scheduler.
    • Experiments run for 48 h to capture diurnal load variations.
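
To make the hold‑and‑release mechanism of Step 2 concrete, here is a minimal Python sketch of a time‑window batching buffer. It illustrates the general technique rather than the authors' implementation: the `StaggeredBatcher` class, the `window_ms`/`max_batch` parameters, and the `dispatch` callback are assumptions, and a production scheduler would additionally shape batches to match the DP/EP kernel tensor sizes.

```python
import threading
import time
from typing import Any, Callable, List, Optional


class StaggeredBatcher:
    """Illustrative hold-and-release buffer (not the paper's code).

    Requests are held for at most `window_ms`, or until `max_batch` requests
    have accumulated; the whole batch is then handed to `dispatch` at once,
    so all DP workers start on the same, well-packed batch.
    """

    def __init__(self, dispatch: Callable[[List[Any]], None],
                 window_ms: float = 10.0, max_batch: int = 32):
        self.dispatch = dispatch
        self.window_s = window_ms / 1000.0
        self.max_batch = max_batch
        self._buf: List[Any] = []
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def submit(self, request: Any) -> None:
        """Buffer one incoming request; release early if the batch is full."""
        with self._lock:
            self._buf.append(request)
            if len(self._buf) >= self.max_batch:
                self._release_locked()
            elif self._timer is None:
                # First request of a new window: arm the release timer.
                self._timer = threading.Timer(self.window_s, self._on_timeout)
                self._timer.daemon = True
                self._timer.start()

    def _on_timeout(self) -> None:
        with self._lock:
            self._release_locked()

    def _release_locked(self) -> None:
        """Hand the accumulated batch to the dispatcher in one shot (lock held)."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if self._buf:
            batch, self._buf = self._buf, []
            self.dispatch(batch)


if __name__ == "__main__":
    # Usage sketch: print batches instead of driving a real DP+EP pipeline.
    batcher = StaggeredBatcher(dispatch=lambda b: print(f"dispatch {len(b)} requests"),
                               window_ms=10.0, max_batch=4)
    for i in range(10):
        batcher.submit({"prompt_id": i})
        time.sleep(0.002)
    time.sleep(0.05)  # let the final window expire
```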
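
Likewise, the load‑aware allocation of Step 3 can be sketched as a "least projected load" choice over DP replicas. The load fields and weights below are illustrative assumptions; for clarity this sketch scans all replicas per batch, whereas the paper reports an O(1)-per-request algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaLoad:
    """Tracked load for one DP replica (fields are illustrative)."""
    prefill_tokens: int = 0   # outstanding prompt tokens still to prefill
    decode_seqs: int = 0      # sequences currently in the decode phase

    def projected(self) -> float:
        # Weighted proxy for remaining work; the weights are assumptions,
        # not values from the paper.
        return 1.0 * self.prefill_tokens + 50.0 * self.decode_seqs


class LoadAwareAllocator:
    """Send each batch to the DP replica with the lowest projected load."""

    def __init__(self, num_replicas: int):
        self.replicas: List[ReplicaLoad] = [ReplicaLoad() for _ in range(num_replicas)]

    def assign_batch(self, prompt_tokens: int, num_seqs: int) -> int:
        """Return the replica index that should run this batch, and book its load."""
        idx = min(range(len(self.replicas)),
                  key=lambda i: self.replicas[i].projected())
        self.replicas[idx].prefill_tokens += prompt_tokens
        self.replicas[idx].decode_seqs += num_seqs
        return idx

    def finish_prefill(self, idx: int, prompt_tokens: int) -> None:
        self.replicas[idx].prefill_tokens -= prompt_tokens

    def finish_decode(self, idx: int, num_seqs: int) -> None:
        self.replicas[idx].decode_seqs -= num_seqs


# Usage sketch: route a freshly formed batch to the least-loaded replica.
allocator = LoadAwareAllocator(num_replicas=4)
replica = allocator.assign_batch(prompt_tokens=2048, num_seqs=8)
```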

Results & Findings

| Metric | Immediate Dispatch | Fixed‑Size Batch | Staggered Batch (SBS) |
| --- | --- | --- | --- |
| TTFT reduction (vs. immediate dispatch) | – | – | 30 %–40 % |
| Throughput gain (vs. immediate dispatch) | – | – | 15 %–20 % |
| GPU utilization (avg) | 68 % | 73 % | 81 % |
| 99th‑percentile latency | 1.8 s | 1.5 s | 1.1 s |

  • TTFT improves because all DP workers start the prefill together, avoiding the “wait‑for‑expert” stalls that dominate early latency.
  • Throughput rises as the scheduler eliminates idle periods, allowing the EP stage to stay fully occupied.
  • The load‑aware allocator prevents a single DP replica from becoming a bottleneck, especially during decode‑heavy workloads.
  • Profiling shows a 30 % reduction in intra‑engine queue depth, confirming the core hypothesis.

Practical Implications

  • LLM SaaS providers can adopt SBS with modest, scheduler‑level code changes (~200 LOC in the authors' deployment) to cut user‑perceived latency, a key competitive differentiator.
  • Edge‑to‑cloud inference pipelines that already use DP+EP (e.g., MoE models) can benefit without hardware changes—SBS works purely at the scheduler level.
  • Cost efficiency: higher GPU utilization translates to lower per‑token inference cost, enabling cheaper pricing or higher request volumes on the same hardware budget (a rough estimate follows this list).
  • Developer ergonomics: the approach is framework‑agnostic; the authors demonstrate integration with DeepSpeed, but the same ideas apply to TensorRT‑LLM, vLLM, or custom inference servers.
  • Latency‑sensitive applications (chatbots, code assistants, real‑time translation) gain a smoother user experience because the first token appears faster, even under bursty traffic.
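
As a rough back‑of‑the‑envelope estimate (assuming a fixed hardware budget and a fully utilized cluster), per‑token cost scales as 1 / throughput, so the reported 15–20 % throughput gain lowers per‑token cost by roughly 13–17 % (1 − 1/1.15 ≈ 0.13; 1 − 1/1.20 ≈ 0.17).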

Limitations & Future Work

  • Buffering trade‑off: SBS introduces a small, configurable delay (a few milliseconds). In ultra‑low‑latency scenarios (< 5 ms), this could be noticeable.
  • Model‑specific tuning: Optimal buffer window and batch size depend on model size, token length distribution, and hardware; the paper provides heuristics but not an automated tuner.
  • Scalability beyond a single cluster: The current global load map assumes a shared control plane; extending to multi‑region or multi‑cloud deployments would need hierarchical scheduling.
  • Future directions suggested by the authors include:
    • Adaptive window sizing driven by live traffic statistics.
    • Integration with prefetching of expert weights for MoE models.
    • Exploration of reinforcement‑learning‑based schedulers that learn optimal batch formation policies over time.

Authors

  • Jian Tian
  • Shuailong Li
  • Yang Cao
  • Wenbo Cui
  • Minghan Zhu
  • Wenkang Wu
  • Jianming Zhang
  • Yanpeng Wang
  • Zhiwen Xiao
  • Zhenyu Hou
  • Dou Shen

Paper Information

  • arXiv ID: 2512.16134v1
  • Categories: cs.DC, cs.LG
  • Published: December 18, 2025
  • PDF: Download PDF