[Paper] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
Source: arXiv - 2512.04013v1
Overview
The paper introduces AugServe, a new inference serving framework that dramatically speeds up “augmented” large language model (LLM) workloads—LLMs that call external tools (search, calculators, APIs, etc.) during generation. By rethinking how requests are scheduled and how token batches are formed, AugServe cuts queuing delays and boosts the number of requests that can be satisfied within strict latency SLOs, a critical factor for real‑time web‑app experiences.
Key Contributions
- Two‑stage adaptive scheduling that first orders requests using static inference‑time features (e.g., expected tool calls, token length) and then continuously refines the order with live runtime metrics.
- Dynamic token‑batch sizing that reacts to current GPU/CPU load and request mix, replacing the static batch‑size limits used by existing servers.
- Comprehensive evaluation showing 4.7–33.1× higher effective throughput and up to 96 % lower time‑to‑first‑token (TTFT) compared with state‑of‑the‑art serving stacks such as vLLM and InferCept.
- Open‑source prototype (or at least a detailed design) that can be integrated into existing LLM serving pipelines with minimal code changes.
Methodology
- Feature Extraction (Stage I) – Each incoming request is profiled for attributes that affect inference cost:
  - Expected number of tool calls
  - Predicted output length (tokens)
  - Model‑specific latency estimates
  These features are fed into a lightweight priority function that reorders the queue, moving "lightweight" or "fast‑to‑complete" requests ahead of heavy ones that would otherwise cause head‑of‑line blocking.
- Runtime‑aware Re‑ordering (Stage II) – While the system processes the current batch, a monitor collects real‑time signals (GPU memory pressure, queue wait times, actual token‑generation speed). A feedback loop updates the priority scores and may reshuffle pending requests before they enter the next batch (a sketch of both scheduling stages follows this list).
- Dynamic Batching – Instead of a fixed maximum token count per batch (the common approach in vLLM), AugServe continuously tunes the batch size: when the hardware is under‑utilized, it expands the batch to pack in more tokens; under heavy load, it shrinks the batch to keep latency low (see the batching sketch after this list).
- Implementation – Built on top of a standard inference engine (e.g., PyTorch + CUDA kernels) and integrated with a request dispatcher that can pause and resume batches without dropping in‑flight tokens.
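To make the two‑stage scheduling concrete, here is a minimal Python sketch, assuming a linear cost model over the Stage I features and a Stage II refresh driven by GPU memory pressure and queue wait time. The class names, weights, and formulas are illustrative assumptions, not the paper's actual implementation.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float                                  # lower value = served earlier
    request_id: str = field(compare=False)
    expected_tool_calls: int = field(compare=False, default=0)
    predicted_output_tokens: int = field(compare=False, default=128)
    arrival_time: float = field(compare=False, default_factory=time.time)

def static_priority(req: Request) -> float:
    """Stage I: rough cost estimate from inference-time features.
    Weights are illustrative, not taken from the paper."""
    return 1.0 * req.predicted_output_tokens + 200.0 * req.expected_tool_calls

def runtime_priority(req: Request, gpu_mem_pressure: float, now: float) -> float:
    """Stage II: refine the score with live runtime signals.
    Heavy requests are penalized more when GPU memory is tight,
    while long-waiting requests are aged to avoid starvation."""
    wait = now - req.arrival_time
    return static_priority(req) * (1.0 + gpu_mem_pressure) - 50.0 * wait

class TwoStageScheduler:
    def __init__(self) -> None:
        self._queue: list[Request] = []

    def admit(self, req: Request) -> None:
        req.priority = static_priority(req)          # Stage I ordering at arrival
        heapq.heappush(self._queue, req)

    def reorder(self, gpu_mem_pressure: float) -> None:
        now = time.time()
        for req in self._queue:                      # Stage II refresh from monitor
            req.priority = runtime_priority(req, gpu_mem_pressure, now)
        heapq.heapify(self._queue)

    def next_request(self) -> Request | None:
        return heapq.heappop(self._queue) if self._queue else None
```

In a serving loop, `admit` would run at request arrival and `reorder` whenever the runtime monitor publishes fresh metrics, so the pending queue is re‑ranked before each new batch is formed.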
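The adaptive batch sizing could be driven by a simple feedback rule over a per‑batch token budget. The sketch below assumes the controller watches GPU utilization and tail queueing delay relative to the TTFT SLO; the function name, thresholds, and step factors are illustrative assumptions rather than AugServe's actual policy.

```python
def adapt_token_budget(current_budget: int,
                       gpu_util: float,
                       queue_wait_p99_ms: float,
                       ttft_slo_ms: float,
                       min_budget: int = 1024,
                       max_budget: int = 16384) -> int:
    """Grow the per-batch token budget when the GPU is under-utilized,
    shrink it when tail queueing delay threatens the TTFT SLO.
    Thresholds and step factors here are illustrative assumptions."""
    if queue_wait_p99_ms > 0.5 * ttft_slo_ms:
        # Latency headroom is shrinking: form smaller, faster batches.
        current_budget = int(current_budget * 0.8)
    elif gpu_util < 0.7:
        # Hardware has slack: pack more tokens into each batch.
        current_budget = int(current_budget * 1.25)
    return max(min_budget, min(max_budget, current_budget))
```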
Results & Findings
| Metric | AugServe vs. vLLM | AugServe vs. InferCept |
|---|---|---|
| Effective Throughput (requests / sec within SLO) | 4.7–33.1× improvement | 3.3–13.2× improvement |
| Time‑to‑First‑Token (TTFT) | up to 96.3 % lower | up to 95.0 % lower |
| Latency SLO Violation Rate | Near‑zero under tested loads | Near‑zero |
| GPU Utilization | More stable, higher average utilization | Higher average utilization |
The gains are most pronounced under bursty traffic and when requests involve many tool calls, scenarios where traditional first‑come‑first‑served (FCFS) queues suffer from severe head‑of‑line blocking.
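For reference, the two headline metrics can be reproduced from serving logs. The sketch below assumes each record carries arrival, first‑token, and completion timestamps plus a per‑request latency SLO; the `RequestLog` fields are hypothetical names, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    arrival_s: float        # when the request entered the queue
    first_token_s: float    # when the first output token was emitted
    finish_s: float         # when the full response completed
    slo_s: float            # per-request end-to-end latency SLO

def effective_throughput(logs: list[RequestLog], window_s: float) -> float:
    """Requests per second that finished within their latency SLO."""
    within_slo = sum(1 for r in logs if (r.finish_s - r.arrival_s) <= r.slo_s)
    return within_slo / window_s

def mean_ttft(logs: list[RequestLog]) -> float:
    """Average time-to-first-token across all requests."""
    return sum(r.first_token_s - r.arrival_s for r in logs) / len(logs)
```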
Practical Implications
- Web‑scale AI products (chatbots, code assistants, search‑augmented agents) can serve many more concurrent users without over‑provisioning hardware, directly lowering cloud costs.
- Latency‑critical services (e.g., real‑time recommendation or decision‑support systems) can meet sub‑second SLOs even when the LLM must invoke external APIs, improving user satisfaction.
- DevOps simplification – Dynamic batching removes the need for manual tuning of batch‑size limits per model or hardware, reducing operational overhead.
- Compatibility – Because AugServe works as a scheduling layer on top of existing inference runtimes, teams can adopt it without rewriting model code or retraining models.
- Edge deployment – The adaptive scheduler can be trimmed for smaller GPUs, enabling more efficient on‑device LLM inference for augmented applications.
Limitations & Future Work
- Tool‑call prediction accuracy – Stage I relies on heuristics to estimate how many external calls a request will need; mispredictions can still cause sub‑optimal ordering.
- Overhead of re‑ordering – Continuous priority updates add a small CPU cost; scaling to thousands of simultaneous requests may require more sophisticated data structures.
- Hardware diversity – Experiments focus on a handful of GPU models; extending the adaptive logic to heterogeneous clusters (CPU‑only, TPUs, multi‑node setups) remains an open challenge.
- Generalization to non‑augmented LLMs – While the paper shows benefits for tool‑augmented workloads, it is unclear how much gain applies to pure text‑generation services.
Future research directions include learning‑based priority functions that adapt over time, tighter integration with orchestration frameworks (Kubernetes, Ray), and exploring how AugServe interacts with emerging quantization and sparsity techniques.
Authors
- Ying Wang
- Zhen Jin
- Jiexiong Xu
- Wenhai Lin
- Yiquan Chen
- Wenzhi Chen
Paper Information
- arXiv ID: 2512.04013v1
- Categories: cs.CL
- Published: December 3, 2025