Accelerating Large Language Model Decoding with Speculative Sampling
Imagine getting answers from a large language model almost twice as fast. Researchers use a small, quick helper model that drafts a few words ahead, then the big model checks the drafted words in a single pass, keeping the ones it agrees with and correcting the first one it does not.
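The acceptance rule behind that "checking" step is worth seeing concretely. Below is a minimal NumPy sketch of one verification round; the names `speculative_step`, `target_probs`, `draft_probs`, and `drafted_tokens` are illustrative, not from the article, but the accept-with-probability min(1, p/q) test and residual resampling follow the standard speculative sampling scheme.

```python
import numpy as np

def speculative_step(target_probs, draft_probs, drafted_tokens, rng):
    """Verify K drafted tokens against the large model (sketch).

    draft_probs:  K vocab distributions, one per drafted position,
                  from the small helper model.
    target_probs: K + 1 distributions from the large model, scored for
                  all drafted positions (plus one extra slot) in a
                  single parallel forward pass.
    Returns the accepted prefix plus one resampled or bonus token, so
    the output distribution matches the large model exactly.
    """
    out = []
    for i, tok in enumerate(drafted_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):   # accept with prob min(1, p/q)
            out.append(tok)
        else:
            # On rejection, resample from the normalized residual
            # max(0, p - q); this correction makes the scheme exact.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # Every draft accepted: take one free bonus token from the big model.
    bonus = target_probs[len(drafted_tokens)]
    out.append(int(rng.choice(len(bonus), p=bonus)))
    return out

# Toy usage with random distributions over a 5-token vocabulary.
rng = np.random.default_rng(0)
V, K = 5, 3
draft = rng.dirichlet(np.ones(V), size=K)
target = rng.dirichlet(np.ones(V), size=K + 1)
drafted = [int(rng.choice(V, p=draft[i])) for i in range(K)]
print(speculative_step(target, draft, drafted, rng))
```

The speedup comes from the big model scoring all K drafted positions in one forward pass instead of K sequential ones, while the rejection test guarantees the output is distributed exactly as if the big model had generated every token itself.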
UC San Diego Lab Advances Generative AI Research With NVIDIA DGX B200 System (December 17, 2025, by [Zoe Kessler](https://blogs.nvidia.com/blog/author/zoekessler/))
Introduction — What is Key‑Value Cache and Why We Need It? [Figure: KV Cache illustration]
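As a rough illustration of what the cache holds, here is a toy single-head attention decode loop in NumPy; all names and shapes are illustrative rather than any particular library's API. The point is the append-per-token pattern: each new token contributes one key and one value once, so later steps avoid recomputing K and V for the whole sequence.

```python
import numpy as np

d = 8                      # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend the new token embedding x (shape [d]) over cached K/V.

    Without the cache, every step would re-project keys and values
    for the entire sequence; with it, each step adds O(1) new work.
    """
    q = x @ Wq
    k_cache.append(x @ Wk)   # cache this token's key ...
    v_cache.append(x @ Wv)   # ... and value, exactly once
    K = np.stack(k_cache)    # [seq_len, d]
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V       # attention output for the new token

for _ in range(4):           # decode a few steps
    out = decode_step(rng.standard_normal(d))
```

The trade-off the article's title hints at: the cache turns repeated computation into memory, which is why KV-cache size becomes the binding constraint during decoding.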
Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging.
To meet strict Service-Level Objectives (SLOs), contemporary Large Language Model (LLM) serving systems decouple the prefill and decoding stages and place them on separate GPUs.
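To make that split concrete, here is a schematic sketch under loose assumptions: `prefill_worker` and `decode_worker` are hypothetical stand-ins for processes pinned to separate GPU pools, and the artifact handed between them is the prompt's KV cache, which is what disaggregated serving systems transfer between the two stages.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    # In a real system: per-layer GPU tensors. A list stands in here.
    entries: list

def prefill_worker(prompt_tokens):
    """Compute-bound stage: process the whole prompt in one batched
    pass and build its KV cache (runs on the prefill GPU pool)."""
    cache = KVCache(entries=[("kv", tok) for tok in prompt_tokens])
    first_token = "t0"          # placeholder for the first sampled token
    return first_token, cache

def decode_worker(first_token, cache, max_new=4):
    """Memory-bound stage: generate one token at a time, reusing the
    transferred cache (runs on a separate decode GPU pool)."""
    token, generated = first_token, []
    for step in range(max_new):
        cache.entries.append(("kv", token))  # extend cache per token
        token = f"t{step + 1}"               # placeholder sampling
        generated.append(token)
    return generated

# The cache built by prefill is shipped to the decode worker (over
# NVLink/RDMA in real deployments), which continues generation.
tok0, kv = prefill_worker(["The", "cat", "sat"])
print(decode_worker(tok0, kv))
```

Separating the two stages lets each pool be sized and scheduled for its own bottleneck: prefill saturates compute on long prompts, while decode is dominated by memory bandwidth for the cache, which is why co-locating them makes it hard to hit both latency SLOs at once.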