inference acceleration

1 month ago · ai

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

AdaSPEC is a new method that speeds up large language models by using a small draft model for the initial generation pass, followed by verificatio...

#speculative decoding #knowledge distillation #large language models #inference acceleration #draft model #AdaSPEC #AI efficiency #model compression

1 month ago · ai

[Paper] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

The rapid increase in LLM sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving system...

#CXL #LLM #KVCache #memory architecture #inference acceleration