[Paper] A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
Source: arXiv
Abstract
To meet strict Service-Level Objectives (SLOs), contemporary Large Language Model (LLM) serving systems decouple the prefill and decoding stages of inference and place them on separate GPUs, mitigating the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes producer-consumer imbalance between the two instance types in such disaggregated architectures. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource-allocation mismatches caused by mixed-length requests under high concurrency.
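The abstract does not spell out how the optimal P/D split is computed; a minimal sketch of the load-driven rebalancing idea might look as follows. All names here (`PDController`, `LoadSnapshot`, the capacity figures) are hypothetical illustrations under assumed per-instance throughputs, not DOPD's actual API.

```python
# Sketch: pick a prefill/decode GPU split proportional to measured demand.
from dataclasses import dataclass

@dataclass
class LoadSnapshot:
    prefill_tokens_per_sec: float   # incoming prompt tokens (prefill demand)
    decode_tokens_per_sec: float    # generated tokens (decoding demand)

class PDController:
    def __init__(self, total_gpus: int,
                 prefill_capacity: float, decode_capacity: float):
        # Per-instance token throughput, profiled offline (assumed values).
        self.total_gpus = total_gpus
        self.prefill_capacity = prefill_capacity
        self.decode_capacity = decode_capacity

    def optimal_split(self, load: LoadSnapshot) -> tuple[int, int]:
        """Split GPUs so the prefill and decode pools are equally loaded."""
        prefill_demand = load.prefill_tokens_per_sec / self.prefill_capacity
        decode_demand = load.decode_tokens_per_sec / self.decode_capacity
        share = prefill_demand / max(prefill_demand + decode_demand, 1e-9)
        n_prefill = min(max(round(self.total_gpus * share), 1),
                        self.total_gpus - 1)  # keep at least one of each
        return n_prefill, self.total_gpus - n_prefill

# Example: a prompt-heavy burst shifts the split toward prefill instances.
ctrl = PDController(total_gpus=8, prefill_capacity=12000, decode_capacity=3000)
print(ctrl.optimal_split(LoadSnapshot(60000, 6000)))  # -> (6, 2)
```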
Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches, respectively), DOPD improves overall system goodput by up to 1.5×, reduces P90 time-to-first-token (TTFT) by up to 67.5%, and reduces P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while consuming fewer additional resources.
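Proactive reconfiguration from historical load can similarly be sketched as a forecast-then-replan loop. The exponentially weighted moving average (EWMA) forecaster below is our illustrative choice and is not claimed to be DOPD's method; it reuses the hypothetical `PDController` and `LoadSnapshot` from the sketch above.

```python
# Sketch: forecast near-future load from history, then plan the P/D split
# for the *predicted* load so instances are reassigned before the shift hits.
class ProactiveReconfigurer:
    def __init__(self, controller: PDController, alpha: float = 0.3):
        self.controller = controller  # e.g., the PDController sketched above
        self.alpha = alpha            # smoothing weight for recent samples
        self.forecast = None          # EWMA of (prefill, decode) load

    def observe(self, prefill_tps: float, decode_tps: float) -> None:
        """Fold a new load sample into the running forecast."""
        if self.forecast is None:
            self.forecast = (prefill_tps, decode_tps)
        else:
            fp, fd = self.forecast
            a = self.alpha
            self.forecast = (a * prefill_tps + (1 - a) * fp,
                             a * decode_tps + (1 - a) * fd)

    def plan(self) -> tuple[int, int]:
        """Return the P/D split for the forecast load."""
        fp, fd = self.forecast
        return self.controller.optimal_split(LoadSnapshot(fp, fd))

# Example: after a few prompt-heavy samples, the planned split leans prefill.
recon = ProactiveReconfigurer(ctrl)
for sample in [(30000, 6000), (45000, 6000), (60000, 6000)]:
    recon.observe(*sample)
print(recon.plan())
```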