GPU orchestration | EUNO.NEWS

1 month ago · devops

[Paper] A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPU...

#LLM inference #dynamic scaling #GPU orchestration #goodput optimization #serving architecture