[Paper] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
Source: arXiv - 2512.04013v1
Overview
The paper introduces AugServe, a new inference serving framework that dramatically speeds up “augmented” large language model (LLM) workloads—LLMs that call external tools (search, calculators, APIs, etc.) during generation. By rethinking how requests are scheduled and how token batches are formed, AugServe cuts queuing delays and boosts the number of requests that can be satisfied within strict latency SLOs, a critical factor for real‑time web‑app experiences.
Key Contributions
- Two‑stage adaptive scheduling that first orders requests using static inference‑time features (e.g., expected tool calls, token length) and then continuously refines the order with live runtime metrics.
- Dynamic token‑batch sizing that reacts to current GPU/CPU load and request mix, replacing the static batch‑size limits used by existing servers.
- Comprehensive evaluation showing 4.7–33.1× higher effective throughput and up to 96 % lower time‑to‑first‑token (TTFT) compared with state‑of‑the‑art serving stacks such as vLLM and InferCept.
- Open‑source prototype (or at least a detailed design) that can be integrated into existing LLM serving pipelines with minimal code changes.
Methodology
- Feature Extraction (Stage I) – Each incoming request is profiled for attributes that affect inference cost:
  - Expected number of tool calls
  - Predicted output length (tokens)
  - Model‑specific latency estimates
  These features are fed into a lightweight priority function that reorders the queue, moving "lightweight" or "fast‑to‑complete" requests ahead of heavy ones that would otherwise cause head‑of‑line blocking.
- Runtime‑aware Re‑ordering (Stage II) – While the system processes the current batch, a monitor collects real‑time signals (GPU memory pressure, queue wait times, actual token‑generation speed). A feedback loop updates the priority scores and may reshuffle pending requests before they enter the next batch (a sketch of both scheduling stages follows this list).
- Dynamic Batching – Instead of a fixed maximum token count per batch (the common approach in vLLM), AugServe continuously tunes the batch size: when the hardware is under‑utilized, it expands the batch to pack in more tokens; under heavy load, it shrinks the batch to keep latency low (see the batching sketch after this list).
- Implementation – Built on top of a standard inference engine (e.g., PyTorch + CUDA kernels) and integrated with a request dispatcher that can pause and resume batches without dropping in‑flight tokens.
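To make the two‑stage scheduling concrete, here is a minimal Python sketch, assuming a linear cost model over the Stage I features and a Stage II refresh driven by GPU memory pressure and queue wait time. The class names, weights, and formulas are illustrative assumptions, not the paper's actual implementation.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float                                  # lower value = served earlier
    request_id: str = field(compare=False)
    expected_tool_calls: int = field(compare=False, default=0)
    predicted_output_tokens: int = field(compare=False, default=128)
    arrival_time: float = field(compare=False, default_factory=time.time)

def static_priority(req: Request) -> float:
    """Stage I: rough cost estimate from inference-time features.
    Weights are illustrative, not taken from the paper."""
    return 1.0 * req.predicted_output_tokens + 200.0 * req.expected_tool_calls

def runtime_priority(req: Request, gpu_mem_pressure: float, now: float) -> float:
    """Stage II: refine the score with live runtime signals.
    Heavy requests are penalized more when GPU memory is tight,
    while long-waiting requests are aged to avoid starvation."""
    wait = now - req.arrival_time
    return static_priority(req) * (1.0 + gpu_mem_pressure) - 50.0 * wait

class TwoStageScheduler:
    def __init__(self) -> None:
        self._queue: list[Request] = []

    def admit(self, req: Request) -> None:
        req.priority = static_priority(req)          # Stage I ordering at arrival
        heapq.heappush(self._queue, req)

    def reorder(self, gpu_mem_pressure: float) -> None:
        now = time.time()
        for req in self._queue:                      # Stage II refresh from monitor
            req.priority = runtime_priority(req, gpu_mem_pressure, now)
        heapq.heapify(self._queue)

    def next_request(self) -> Request | None:
        return heapq.heappop(self._queue) if self._queue else None
```

In a serving loop, `admit` would run at request arrival and `reorder` whenever the runtime monitor publishes fresh metrics, so the pending queue is re‑ranked before each new batch is formed.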
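The adaptive batch sizing could be driven by a simple feedback rule over a per‑batch token budget. The sketch below assumes the controller watches GPU utilization and tail queueing delay relative to the TTFT SLO; the function name, thresholds, and step factors are illustrative assumptions rather than AugServe's actual policy.

```python
def adapt_token_budget(current_budget: int,
                       gpu_util: float,
                       queue_wait_p99_ms: float,
                       ttft_slo_ms: float,
                       min_budget: int = 1024,
                       max_budget: int = 16384) -> int:
    """Grow the per-batch token budget when the GPU is under-utilized,
    shrink it when tail queueing delay threatens the TTFT SLO.
    Thresholds and step factors here are illustrative assumptions."""
    if queue_wait_p99_ms > 0.5 * ttft_slo_ms:
        # Latency headroom is shrinking: form smaller, faster batches.
        current_budget = int(current_budget * 0.8)
    elif gpu_util < 0.7:
        # Hardware has slack: pack more tokens into each batch.
        current_budget = int(current_budget * 1.25)
    return max(min_budget, min(max_budget, current_budget))
```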
Results & Findings
| Metric | AugServe vs. vLLM | AugServe vs. InferCept |
|---|---|---|
| Effective Throughput (requests / sec within SLO) | 4.7–33.1× improvement | 3.3–13.2× improvement |
| Time‑to‑First‑Token (TTFT) | up to 96.3 % lower | up to 95.0 % lower |
| Latency SLO Violation Rate | Near‑zero under tested loads | Near‑zero |
| GPU Utilization | More stable, higher average utilization | Higher average utilization |
The gains are most pronounced under bursty traffic and when requests involve many tool calls, scenarios where traditional first‑come‑first‑served (FCFS) queues suffer from severe head‑of‑line blocking.
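For reference, the two headline metrics can be reproduced from serving logs. The sketch below assumes each record carries arrival, first‑token, and completion timestamps plus a per‑request latency SLO; the `RequestLog` fields are hypothetical names, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    arrival_s: float        # when the request entered the queue
    first_token_s: float    # when the first output token was emitted
    finish_s: float         # when the full response completed
    slo_s: float            # per-request end-to-end latency SLO

def effective_throughput(logs: list[RequestLog], window_s: float) -> float:
    """Requests per second that finished within their latency SLO."""
    within_slo = sum(1 for r in logs if (r.finish_s - r.arrival_s) <= r.slo_s)
    return within_slo / window_s

def mean_ttft(logs: list[RequestLog]) -> float:
    """Average time-to-first-token across all requests."""
    return sum(r.first_token_s - r.arrival_s for r in logs) / len(logs)
```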
Practical Implications
- Web‑scale AI products (chatbots, code assistants, search‑augmented agents) can serve many more concurrent users without over‑provisioning hardware, directly lowering cloud costs.
- Latency‑critical services (e.g., real‑time recommendation or decision‑support systems) can meet sub‑second SLOs even when the LLM must invoke external APIs, improving user satisfaction.
- DevOps simplification – Dynamic batching removes the need for manual tuning of batch‑size limits per model or hardware, reducing operational overhead.
- Compatibility – Because AugServe works as a scheduling layer on top of existing inference runtimes, teams can adopt it without rewriting model code or retraining models.
- Edge deployment – The adaptive scheduler can be trimmed for smaller GPUs, enabling more efficient on‑device LLM inference for augmented applications.
Limitations & Future Work
- Tool‑call prediction accuracy – Stage I relies on heuristics to estimate how many external calls a request will need; mispredictions can still cause sub‑optimal ordering.
- Overhead of re‑ordering – Continuous priority updates add a small CPU cost; scaling to thousands of simultaneous requests may require more sophisticated data structures.
- Hardware diversity – Experiments focus on a handful of GPU models; extending the adaptive logic to heterogeneous clusters (CPU‑only, TPUs, multi‑node setups) remains an open challenge.
- Generalization to non‑augmented LLMs – While the paper shows benefits for tool‑augmented workloads, it is unclear how much gain applies to pure text‑generation services.
Future research directions include learning‑based priority functions that adapt over time, tighter integration with orchestration frameworks (Kubernetes, Ray), and exploring how AugServe interacts with emerging quantization and sparsity techniques.
Authors
- Ying Wang
- Zhen Jin
- Jiexiong Xu
- Wenhai Lin
- Yiquan Chen
- Wenzhi Chen
Paper Information
- arXiv ID: 2512.04013v1
- Categories: cs.CL
- Published: December 3, 2025