[Paper] Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

Published: December 17, 2025 at 10:45 PM EST
4 min read
Source: arXiv - 2512.16134v1

Overview

The paper tackles a subtle but costly inefficiency in modern large‑language‑model (LLM) serving stacks that split work between data‑parallel (DP) and expert‑parallel (EP) stages. In these “DP+EP” pipelines, sending each request straight to the model creates internal queuing “bubbles” that slow down the time‑to‑first‑token (TTFT)—the latency users feel the most. The authors propose Staggered Batch Scheduling (SBS), a lightweight buffering strategy that deliberately delays requests just enough to assemble well‑packed batches, while also redistributing load across DP replicas. Their production‑grade experiments on an H800‑GPU cluster serving Deepseek‑V3 show 30‑40 % lower TTFT and 15‑20 % higher throughput versus conventional immediate‑dispatch schedulers.

Key Contributions

  • Staggered Batch Scheduling (SBS): a simple “hold‑and‑release” mechanism that buffers incoming queries to create optimal batch sizes for the DP+EP pipeline, eliminating intra‑engine queuing bubbles.
  • Load‑Aware Global Allocation: a dynamic, system‑wide load‑balancing policy that spreads both prefill (prompt processing) and decode (token‑by‑token generation) work across DP replicas, preventing hotspots.
  • Real‑world deployment: integration of SBS and the load‑aware allocator into a production‑grade Deepseek‑V3 serving stack on a 64‑GPU H800 cluster, demonstrating measurable latency and throughput gains.
  • Comprehensive evaluation: extensive micro‑benchmarks and end‑to‑end user‑facing tests that compare SBS against state‑of‑the‑art immediate‑dispatch schedulers under realistic traffic patterns.
  • Open‑source insights: the authors release detailed design diagrams, scheduling algorithms, and profiling scripts to aid reproducibility and adoption.

Methodology

  1. Problem Characterization

    • The authors first profile a typical DP+EP serving pipeline (prefill on DP, expert routing on EP, decode back on DP).
    • They discover that immediate request dispatch creates asynchronous “bubbles”: some DP workers sit idle while waiting for EP stages to finish, inflating TTFT.
  2. Staggered Batch Scheduling (SBS)

    • Incoming requests are placed in a tiny time‑window buffer (e.g., 5–20 ms).
    • When the buffer fills or the window expires, the scheduler forms a batch that matches the optimal tensor shapes for the DP and EP kernels.
    • The batch is then dispatched atomically, guaranteeing that all DP workers start together and eliminating internal queuing (see the buffering sketch after this list).
  3. Load‑Aware Global Allocation

    • The system maintains a global load map of DP replicas (prefill and decode workloads).
    • When forming a batch, the allocator picks the DP replica with the lowest projected load, balancing both phases.
    • The algorithm runs in O(1) per request, making it suitable for high‑throughput environments (see the allocator sketch after this list).
  4. Implementation & Deployment

    • Integrated into the DeepSpeed‑Inference stack, with minimal code changes (~200 LOC).
    • Deployed on an H800 GPU cluster (8 nodes × 8 GPUs = 64 GPUs) serving the Deepseek‑V3 Mixture‑of‑Experts model.
    • Traffic is generated using a realistic mix of short prompts (≤ 64 tokens) and long generation requests (≤ 1024 tokens).
  5. Evaluation

    • Metrics: TTFT, overall latency, throughput (tokens / sec), and GPU utilization.
    • Baselines: an immediate‑dispatch scheduler and a naive “fixed‑size batch” scheduler.
    • Experiments run for 48 h to capture diurnal load variations.
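
To make the hold‑and‑release mechanism of Step 2 concrete, here is a minimal Python sketch of a time‑window batching buffer. It illustrates the general technique rather than the authors' implementation: the `StaggeredBatcher` class, the `window_ms`/`max_batch` parameters, and the `dispatch` callback are assumptions, and a production scheduler would additionally shape batches to match the DP/EP kernel tensor sizes.

```python
import threading
import time
from typing import Any, Callable, List, Optional


class StaggeredBatcher:
    """Illustrative hold-and-release buffer (not the paper's code).

    Requests are held for at most `window_ms`, or until `max_batch` requests
    have accumulated; the whole batch is then handed to `dispatch` at once,
    so all DP workers start on the same, well-packed batch.
    """

    def __init__(self, dispatch: Callable[[List[Any]], None],
                 window_ms: float = 10.0, max_batch: int = 32):
        self.dispatch = dispatch
        self.window_s = window_ms / 1000.0
        self.max_batch = max_batch
        self._buf: List[Any] = []
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def submit(self, request: Any) -> None:
        """Buffer one incoming request; release early if the batch is full."""
        with self._lock:
            self._buf.append(request)
            if len(self._buf) >= self.max_batch:
                self._release_locked()
            elif self._timer is None:
                # First request of a new window: arm the release timer.
                self._timer = threading.Timer(self.window_s, self._on_timeout)
                self._timer.daemon = True
                self._timer.start()

    def _on_timeout(self) -> None:
        with self._lock:
            self._release_locked()

    def _release_locked(self) -> None:
        """Hand the accumulated batch to the dispatcher in one shot (lock held)."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if self._buf:
            batch, self._buf = self._buf, []
            self.dispatch(batch)


if __name__ == "__main__":
    # Usage sketch: print batches instead of driving a real DP+EP pipeline.
    batcher = StaggeredBatcher(dispatch=lambda b: print(f"dispatch {len(b)} requests"),
                               window_ms=10.0, max_batch=4)
    for i in range(10):
        batcher.submit({"prompt_id": i})
        time.sleep(0.002)
    time.sleep(0.05)  # let the final window expire
```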
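
Likewise, the load‑aware allocation of Step 3 can be sketched as a "least projected load" choice over DP replicas. The load fields and weights below are illustrative assumptions; for clarity this sketch scans all replicas per batch, whereas the paper reports an O(1)-per-request algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaLoad:
    """Tracked load for one DP replica (fields are illustrative)."""
    prefill_tokens: int = 0   # outstanding prompt tokens still to prefill
    decode_seqs: int = 0      # sequences currently in the decode phase

    def projected(self) -> float:
        # Weighted proxy for remaining work; the weights are assumptions,
        # not values from the paper.
        return 1.0 * self.prefill_tokens + 50.0 * self.decode_seqs


class LoadAwareAllocator:
    """Send each batch to the DP replica with the lowest projected load."""

    def __init__(self, num_replicas: int):
        self.replicas: List[ReplicaLoad] = [ReplicaLoad() for _ in range(num_replicas)]

    def assign_batch(self, prompt_tokens: int, num_seqs: int) -> int:
        """Return the replica index that should run this batch, and book its load."""
        idx = min(range(len(self.replicas)),
                  key=lambda i: self.replicas[i].projected())
        self.replicas[idx].prefill_tokens += prompt_tokens
        self.replicas[idx].decode_seqs += num_seqs
        return idx

    def finish_prefill(self, idx: int, prompt_tokens: int) -> None:
        self.replicas[idx].prefill_tokens -= prompt_tokens

    def finish_decode(self, idx: int, num_seqs: int) -> None:
        self.replicas[idx].decode_seqs -= num_seqs


# Usage sketch: route a freshly formed batch to the least-loaded replica.
allocator = LoadAwareAllocator(num_replicas=4)
replica = allocator.assign_batch(prompt_tokens=2048, num_seqs=8)
```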

Results & Findings

| Metric | Immediate Dispatch | Fixed‑Size Batch | Staggered Batch (SBS) |
| --- | --- | --- | --- |
| TTFT reduction (vs. immediate dispatch) | – | – | 30 %–40 % |
| Throughput gain (vs. immediate dispatch) | – | – | 15 %–20 % |
| GPU utilization (avg) | 68 % | 73 % | 81 % |
| 99th‑percentile latency | 1.8 s | 1.5 s | 1.1 s |

  • TTFT improves because all DP workers start the prefill together, avoiding the “wait‑for‑expert” stalls that dominate early latency.
  • Throughput rises as the scheduler eliminates idle periods, allowing the EP stage to stay fully occupied.
  • The load‑aware allocator prevents a single DP replica from becoming a bottleneck, especially during decode‑heavy workloads.
  • Profiling shows a 30 % reduction in intra‑engine queue depth, confirming the core hypothesis.

Practical Implications

  • LLM SaaS providers can adopt SBS with modest, scheduler‑level code changes (~200 LOC in the authors' deployment) to cut user‑perceived latency, a key competitive differentiator.
  • Edge‑to‑cloud inference pipelines that already use DP+EP (e.g., MoE models) can benefit without hardware changes—SBS works purely at the scheduler level.
  • Cost efficiency: higher GPU utilization translates to lower per‑token inference cost, enabling cheaper pricing or higher request volumes on the same hardware budget (a rough estimate follows this list).
  • Developer ergonomics: the approach is framework‑agnostic; the authors demonstrate integration with DeepSpeed, but the same ideas apply to TensorRT‑LLM, vLLM, or custom inference servers.
  • Latency‑sensitive applications (chatbots, code assistants, real‑time translation) gain a smoother user experience because the first token appears faster, even under bursty traffic.
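
As a rough back‑of‑the‑envelope estimate (assuming a fixed hardware budget and a fully utilized cluster), per‑token cost scales as 1 / throughput, so the reported 15–20 % throughput gain lowers per‑token cost by roughly 13–17 % (1 − 1/1.15 ≈ 0.13; 1 − 1/1.20 ≈ 0.17).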

Limitations & Future Work

  • Buffering trade‑off: SBS introduces a small, configurable delay (a few milliseconds). In ultra‑low‑latency scenarios (< 5 ms), this could be noticeable.
  • Model‑specific tuning: Optimal buffer window and batch size depend on model size, token length distribution, and hardware; the paper provides heuristics but not an automated tuner.
  • Scalability beyond a single cluster: The current global load map assumes a shared control plane; extending to multi‑region or multi‑cloud deployments would need hierarchical scheduling.
  • Future directions suggested by the authors include:
    • Adaptive window sizing driven by live traffic statistics.
    • Integration with prefetching of expert weights for MoE models.
    • Exploration of reinforcement‑learning‑based schedulers that learn optimal batch formation policies over time.

Authors

  • Jian Tian
  • Shuailong Li
  • Yang Cao
  • Wenbo Cui
  • Minghan Zhu
  • Wenkang Wu
  • Jianming Zhang
  • Yanpeng Wang
  • Zhiwen Xiao
  • Zhenyu Hou
  • Dou Shen

Paper Information

  • arXiv ID: 2512.16134v1
  • Categories: cs.DC, cs.LG
  • Published: December 18, 2025
  • PDF: Download PDF