LLM serving

1 month ago · ai

[Paper] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficie...

#LLM serving #adaptive scheduling #dynamic batching #inference optimization #augmented LLM

1 month ago · ai

[Paper] DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing spe...

#speculative decoding #LLM serving #edge-cloud inference #distributed inference #adaptive window control

1 month ago · ai

[Paper] Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows

Agentic workflows have emerged as a powerful paradigm for solving complex, multi-stage tasks, but serving them at scale is computationally expensive given the m...

#model routing #agentic workflows #LLM serving #scalable inference #cost optimization