[Paper] Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference
Source: arXiv - 2512.16134v1
Overview
The paper tackles a subtle but costly inefficiency in modern large‑language‑model (LLM) serving stacks that split work between data‑parallel (DP) and expert‑parallel (EP) stages. In these “DP+EP” pipelines, sending each request straight to the model creates internal queuing “bubbles” that inflate time‑to‑first‑token (TTFT), the latency users feel most. The authors propose Staggered Batch Scheduling (SBS), a lightweight buffering strategy that deliberately delays requests just long enough to assemble well‑packed batches, while also redistributing load across DP replicas. Production‑grade experiments on a 64‑GPU H800 cluster serving Deepseek‑V3 show 30‑40 % lower TTFT and 15‑20 % higher throughput than conventional immediate‑dispatch schedulers.
Key Contributions
- Staggered Batch Scheduling (SBS): a simple “hold‑and‑release” mechanism that buffers incoming queries to create optimal batch sizes for the DP+EP pipeline, eliminating intra‑engine queuing bubbles.
- Load‑Aware Global Allocation: a dynamic, system‑wide load‑balancing policy that spreads both prefill (prompt processing) and decode (token‑by‑token generation) work across DP replicas, preventing hotspots.
- Real‑world deployment: integration of SBS and the load‑aware allocator into a production‑grade Deepseek‑V3 serving stack on a 64‑GPU H800 cluster, demonstrating measurable latency and throughput gains.
- Comprehensive evaluation: extensive micro‑benchmarks and end‑to‑end user‑facing tests that compare SBS against state‑of‑the‑art immediate‑dispatch schedulers under realistic traffic patterns.
- Open‑source insights: the authors release detailed design diagrams, scheduling algorithms, and profiling scripts to aid reproducibility and adoption.
Methodology
- Problem Characterization
  - The authors first profile a typical DP+EP serving pipeline (prefill on DP, expert routing on EP, decode back on DP).
  - They discover that immediate request dispatch creates asynchronous “bubbles”: some DP workers sit idle while waiting for EP stages to finish, inflating TTFT.
- Staggered Batch Scheduling (SBS)
  - Incoming requests are placed in a tiny time‑window buffer (e.g., 5–20 ms).
  - When the buffer fills or the window expires, the scheduler forms a batch that matches the optimal tensor shapes for the DP and EP kernels.
  - The batch is then dispatched atomically, guaranteeing that all DP workers start together and eliminating internal queuing (see the sketch below).
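To make the hold‑and‑release mechanism concrete, the following minimal Python sketch buffers requests for a short window and then releases them as one batch. The queue type, the `dispatch(batch)` callable, and the batch‑size and window thresholds are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a hold-and-release batching loop (illustrative, not the
# paper's code). Assumes a thread-safe request queue and a dispatch(batch)
# callable provided by the serving engine.
import queue
import time

MAX_BATCH_SIZE = 32   # assumed target batch size for the DP+EP kernels
MAX_WAIT_S = 0.010    # assumed 10 ms staggering window

def staggered_batch_loop(requests: queue.Queue, dispatch) -> None:
    """Buffer incoming requests briefly, then release them as one packed batch."""
    while True:
        batch = [requests.get()]                  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                             # window expired: release what we have
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        dispatch(batch)                           # atomic dispatch: all DP workers start together
```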
- Load‑Aware Global Allocation
  - The system maintains a global load map of DP replicas (prefill and decode workloads).
  - When forming a batch, the allocator picks the DP replica with the lowest projected load, balancing both phases (a minimal sketch follows).
  - The algorithm runs in O(1) per request, making it suitable for high‑throughput environments.
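A minimal sketch of the replica‑selection step is shown below. The per‑replica counters and the load weighting are assumptions for illustration, and a linear scan over replicas is used for clarity even though the paper describes an O(1)‑per‑request allocator.

```python
# Minimal sketch of load-aware replica selection (illustrative assumptions,
# not the paper's data structures).
from dataclasses import dataclass, field

@dataclass
class ReplicaLoad:
    replica_id: int
    prefill_tokens: int = 0   # queued prompt tokens awaiting prefill
    decode_seqs: int = 0      # active sequences in the decode phase

    def projected_load(self) -> float:
        # Assumed weighting: prefill counted per token, decode per active sequence.
        return self.prefill_tokens + 8.0 * self.decode_seqs

@dataclass
class GlobalLoadMap:
    replicas: list[ReplicaLoad] = field(default_factory=list)

    def pick_replica(self, new_prompt_tokens: int) -> ReplicaLoad:
        """Pick the least-loaded DP replica and book the new prefill work on it."""
        target = min(self.replicas, key=lambda r: r.projected_load())
        target.prefill_tokens += new_prompt_tokens
        return target
```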
- Implementation & Deployment
  - Integrated into the DeepSpeed‑Inference stack with minimal code changes (~200 LOC).
  - Deployed on an H800 GPU cluster (8 × 8 = 64 GPUs) serving the Deepseek‑V3 model.
  - Traffic is generated using a realistic mix of short prompts (≤ 64 tokens) and long generation requests (≤ 1024 tokens); a workload‑generation sketch follows.
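For illustration, a bursty workload with this prompt/generation mix could be produced along the following lines. The Poisson arrival process, the short/long split ratio, and the token ranges below the stated caps are assumptions rather than details from the paper.

```python
# Minimal sketch of a traffic generator mixing short prompts (<= 64 tokens)
# and long generation requests (<= 1024 tokens). Arrival process and split
# ratio are assumptions, not the paper's workload definition.
import random

def generate_requests(duration_s: float, mean_rate_per_s: float, short_frac: float = 0.7):
    """Yield (arrival_time_s, prompt_tokens, max_new_tokens) tuples."""
    t = 0.0
    while t < duration_s:
        t += random.expovariate(mean_rate_per_s)   # Poisson arrivals -> bursty traffic
        if random.random() < short_frac:
            yield t, random.randint(8, 64), random.randint(16, 128)       # short request
        else:
            yield t, random.randint(64, 512), random.randint(256, 1024)   # long generation
```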
- Evaluation
  - Metrics: TTFT, overall latency, throughput (tokens/sec), and GPU utilization.
  - Baselines: an immediate‑dispatch scheduler and a naive fixed‑size‑batch scheduler.
  - Experiments run for 48 h to capture diurnal load variations (a metric‑computation sketch follows).
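These metrics can be derived from per‑request timestamps roughly as sketched below; the record fields and helper names are hypothetical, not the paper's logging schema.

```python
# Minimal sketch of the reported latency/throughput metrics, computed from
# hypothetical per-request records (field names are illustrative).
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    arrival_s: float       # request arrival time
    first_token_s: float   # time the first output token was emitted
    done_s: float          # time the request finished
    output_tokens: int     # number of generated tokens

def mean_ttft(records) -> float:
    return sum(r.first_token_s - r.arrival_s for r in records) / len(records)

def p99_latency(records) -> float:
    latencies = sorted(r.done_s - r.arrival_s for r in records)
    return quantiles(latencies, n=100)[-1]   # 99th-percentile end-to-end latency

def throughput_tokens_per_s(records) -> float:
    span = max(r.done_s for r in records) - min(r.arrival_s for r in records)
    return sum(r.output_tokens for r in records) / span
```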
Results & Findings
| Metric | Immediate Dispatch | Fixed‑Size Batch | Staggered Batch (SBS) |
|---|---|---|---|
| TTFT reduction (vs. immediate dispatch) | baseline | – | 30 %–40 % |
| Throughput gain (vs. immediate dispatch) | baseline | – | 15 %–20 % |
| GPU utilization (avg) | 68 % | 73 % | 81 % |
| 99th‑percentile latency | 1.8 s | 1.5 s | 1.1 s |
- TTFT improves because all DP workers start the prefill together, avoiding the “wait‑for‑expert” stalls that dominate early latency.
- Throughput rises as the scheduler eliminates idle periods, allowing the EP stage to stay fully occupied.
- The load‑aware allocator prevents a single DP replica from becoming a bottleneck, especially during decode‑heavy workloads.
- Profiling shows a 30 % reduction in intra‑engine queue depth, confirming the core hypothesis.
Practical Implications
- LLM SaaS providers can adopt SBS with minimal code changes (~200 LOC in the authors' deployment) to cut user‑perceived latency, a key competitive differentiator.
- Edge‑to‑cloud inference pipelines that already use DP+EP (e.g., MoE models) can benefit without hardware changes—SBS works purely at the scheduler level.
- Cost efficiency: higher GPU utilization translates to lower per‑token inference cost, enabling cheaper pricing or higher request volumes on the same hardware budget.
- Developer ergonomics: the approach is framework‑agnostic; the authors demonstrate integration with DeepSpeed, but the same ideas apply to TensorRT‑LLM, vLLM, or custom inference servers.
- Latency‑sensitive applications (chatbots, code assistants, real‑time translation) gain a smoother user experience because the first token appears faster, even under bursty traffic.
Limitations & Future Work
- Buffering trade‑off: SBS introduces a small, configurable delay (a few milliseconds). In ultra‑low‑latency scenarios (< 5 ms budgets), this could be noticeable.
- Model‑specific tuning: Optimal buffer window and batch size depend on model size, token length distribution, and hardware; the paper provides heuristics but not an automated tuner.
- Scalability beyond a single cluster: The current global load map assumes a shared control plane; extending to multi‑region or multi‑cloud deployments would need hierarchical scheduling.
- Future directions suggested by the authors include:
- Adaptive window sizing driven by live traffic statistics.
- Integration with prefetching of expert weights for MoE models.
- Exploration of reinforcement‑learning‑based schedulers that learn optimal batch formation policies over time.
Authors
- Jian Tian
- Shuailong Li
- Yang Cao
- Wenbo Cui
- Minghan Zhu
- Wenkang Wu
- Jianming Zhang
- Yanpeng Wang
- Zhiwen Xiao
- Zhenyu Hou
- Dou Shen
Paper Information
- arXiv ID: 2512.16134v1
- Categories: cs.DC, cs.LG
- Published: December 18, 2025