[Paper] Efficient Multi-round LLM Inference over Disaggregated Serving

Published: February 16, 2026 at 02:07 AM EST
5 min read
Source: arXiv - 2602.14516v1

Overview

Large Language Models (LLMs) are now being used in multi‑turn scenarios such as autonomous agents and iterative retrieval, where a single user request spawns a series of “prefill” (prompt processing) and “decode” (token generation) steps. Existing serving stacks treat these two phases as separate, static pipelines, which works for one‑shot queries but falls short when the workload constantly flips between prefill and decode. The paper introduces AMPD, a disaggregated serving framework that dynamically coordinates where and how each prefill‑decode pair runs, boosting the chance of meeting latency Service Level Objectives (SLOs) in multi‑round inference.

Key Contributions

  • Dynamic workload coordination – AMPD monitors real‑time request patterns and decides on the fly whether a prefill should run on the prefill accelerator or be offloaded to the decode side, minimizing idle time.
  • Adaptive scheduling algorithm – A novel planner computes optimal resource allocation (CPU/GPU, memory) and parallelism strategies for each round, balancing compute‑bound prefill and memory‑bound decode phases.
  • Unified disaggregated architecture – Extends the common prefill‑decode (PD) disaggregation model to support interleaved multi‑round workloads without requiring separate model replicas for each phase.
  • Empirical validation – Experiments on popular LLMs (e.g., LLaMA‑2, Falcon) show up to 30‑45 % higher SLO attainment compared with state‑of‑the‑art serving systems such as vLLM and DeepSpeed‑Inference.

Methodology

  1. Workload Characterization – The authors first profile multi‑round inference traces, exposing three patterns: (a) small incremental prefills, (b) long decode bursts, and (c) frequent switches between the two.
  2. Real‑time Dispatcher – A lightweight controller receives per‑request metrics (token count, remaining context length) and decides the execution venue for the next prefill: either the prefill node (high compute, low latency) or the decode node (high bandwidth memory).
  3. Planning Algorithm – Using a mixed‑integer linear program (MILP) that captures CPU/GPU capacity, memory bandwidth, and SLO deadlines, the planner outputs:
    • Resource split (how many GPUs to allocate to each phase)
    • Parallelism degree (how many requests to batch together)
    • Placement policy (which node runs the upcoming prefill)
      The planner runs periodically (e.g., every 100 ms) to adapt to workload spikes.
  4. Execution Engine – The chosen node executes the prefill, caches the intermediate KV‑cache, and hands off control to the decode node. Because the KV‑cache is stored in a disaggregated memory pool, both nodes can read/write it without costly data copies.
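The dispatcher's venue decision (step 2) can be sketched as a simple threshold rule. The paper does not publish its actual policy, so the function name, thresholds, and metrics below are illustrative assumptions only:

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    prefill_tokens: int   # tokens in the incoming prompt chunk
    context_tokens: int   # tokens already cached for this session


def choose_venue(m: RequestMetrics,
                 small_prefill_threshold: int = 128,
                 prefill_queue_depth: int = 0,
                 max_queue: int = 4) -> str:
    """Pick where the next prefill runs (heuristic sketch).

    Small incremental prefills (pattern (a) in the workload
    characterization) are cheap enough to run on the decode node,
    which already holds the session's KV-cache; large prefills go
    to the compute-rich prefill node unless its queue is saturated.
    """
    if m.prefill_tokens <= small_prefill_threshold:
        return "decode_node"      # tiny prefill: avoid a hand-off
    if prefill_queue_depth >= max_queue:
        return "decode_node"      # prefill node saturated: offload
    return "prefill_node"
```

The key idea the sketch captures is that venue selection is a per-request, runtime decision driven by token counts and current load, not a static assignment.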
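The planner (step 3) solves a MILP in the paper; as a rough illustration of its objective, the sketch below brute-forces the GPU split that balances estimated prefill and decode latencies. The linear cost model (latency = load / GPUs) is an invented simplification, not the paper's formulation:

```python
def plan_gpu_split(total_gpus: int,
                   prefill_load: float,   # pending prefill work (arbitrary units)
                   decode_load: float     # pending decode work (arbitrary units)
                   ) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) minimizing the slower phase.

    Stand-in for the paper's MILP: each phase's latency is modeled
    as load / gpus, and we pick the split whose bottleneck phase
    finishes fastest. A real planner would also model memory
    bandwidth, batching degree, and SLO deadlines.
    """
    best = None
    for prefill_gpus in range(1, total_gpus):
        decode_gpus = total_gpus - prefill_gpus
        bottleneck = max(prefill_load / prefill_gpus,
                         decode_load / decode_gpus)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, prefill_gpus, decode_gpus)
    return best[1], best[2]
```

For example, with 8 GPUs and a prefill load three times the decode load, the balanced split shifts toward prefill. Re-running this search every planning interval (the paper cites roughly every 100 ms) is what lets the system track workload spikes.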

Results & Findings

Metric                                  Baseline (vLLM)    Baseline (DeepSpeed‑Inference)    AMPD
99th‑percentile latency (multi‑round)   1.85 s             1.72 s                            1.12 s
SLO attainment @ 1 s                    62 %               68 %                              91 %
Throughput (requests/s)                 28                 31                                44
GPU memory overhead                     1.2 × model size   1.1 × model size                  1.0 × model size
  • Latency reduction stems from eliminating unnecessary prefill‑decode hand‑offs and better packing of small prefills into idle compute slots.
  • Higher SLO attainment is achieved because the planner can proactively reserve decode bandwidth for upcoming token bursts.
  • Memory efficiency: by keeping a single KV‑cache in the shared memory pool, AMPD avoids duplicating cached state on both nodes, which is reflected in the 1.0 × model‑size memory overhead.

Practical Implications

  • LLM‑powered agents (e.g., code assistants, autonomous chatbots) can now guarantee tighter response times even when they need to call external tools or perform iterative reasoning.
  • Cloud providers can pack more LLM instances onto the same hardware pool, reducing cost per token and improving multi‑tenant isolation.
  • Developers gain a simple API surface: they submit a “session” token, and the serving layer automatically handles the prefill‑decode juggling, freeing them from manual batching heuristics.
  • Edge deployments that separate compute (e.g., a small accelerator) from memory (e.g., high‑bandwidth DRAM) can adopt the same disaggregated model, extending the benefits to on‑device inference scenarios.
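The session-level API described above can be imagined roughly as follows. AMPD's actual client interface is not given in this summary, so every name here (`AMPDSession`, `generate`) is a hypothetical stand-in and the reply is faked:

```python
class AMPDSession:
    """Minimal stand-in for a multi-round serving session.

    A real client would stream tokens from the server; here we only
    record each round to show the session-token control flow: the
    caller never reasons about prefill vs. decode placement.
    """

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.rounds: list[str] = []

    def generate(self, prompt: str) -> str:
        # Server side would dispatch the prefill, reuse the shared
        # KV-cache, then decode; we fake a reply for illustration.
        self.rounds.append(prompt)
        return f"[reply to round {len(self.rounds)}]"


session = AMPDSession("agent-42")
first = session.generate("Plan a refactor of module X.")
second = session.generate("Now apply step 1.")  # reuses cached context
```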

Limitations & Future Work

  • Model size ceiling – The current prototype assumes the entire KV‑cache fits in the shared memory pool; extremely large models (>100 B parameters) may exceed available bandwidth.
  • Planner overhead – Although lightweight, the MILP solver adds a few milliseconds of latency; scaling to thousands of concurrent sessions could require a more approximate heuristic.
  • Hardware diversity – Experiments were run on homogeneous GPU clusters; heterogeneous setups (CPU‑only decode, FPGA prefill) remain unexplored.
  • Security & isolation – Sharing KV‑cache across nodes raises questions about cross‑tenant data leakage, which the authors plan to address with encrypted cache slices.

Bottom line: AMPD demonstrates that a smarter, adaptive orchestration of prefill and decode phases can unlock substantial latency and throughput gains for the next generation of multi‑round LLM applications. Developers looking to build responsive AI agents should keep an eye on disaggregated serving frameworks that embody these ideas.

Authors

  • Wenhao He
  • Youhe Jiang
  • Penghao Zhao
  • Quanqing Xu
  • Eiko Yoneki
  • Bin Cui
  • Fangcheng Fu

Paper Information

  • arXiv ID: 2602.14516v1
  • Categories: cs.DC
  • Published: February 16, 2026
