inference acceleration | EUNO.NEWS

1 month ago · ai

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

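Introduction: AdaSPEC is a new method that accelerates large language models by having a small draft model generate tokens first, which the larger target model then verifies.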

#speculative decoding #knowledge distillation #large language models #inference acceleration #draft model #AdaSPEC #AI efficiency #model compression
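
The draft-then-verify loop described in the teaser can be illustrated with a minimal sketch of generic greedy speculative decoding. This is not AdaSPEC's selective distillation procedure, which the teaser does not detail. The names `speculative_decode`, `greedy_decode`, `target_model`, and `draft_model` are hypothetical, and both models are toy functions over integer tokens, assumed here purely for illustration.

```python
from typing import Callable, List

# A "model" here is a toy stand-in: it maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: decode one token at a time with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2) The target model checks each proposed position. A real system would
        #    score all positions in one batched forward pass; this loop is a
        #    simplification that calls the target once per position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every proposal was accepted, so the target adds one more token
            # if there is still room under the length limit.
            if len(tokens) + len(accepted) < limit:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens


def target_model(prefix: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def draft_model(prefix: List[int]) -> int:
    # Toy draft: agrees with the target except after tokens 3 and 7.
    return 0 if prefix[-1] % 4 == 3 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(target_model, draft_model, [0], max_new_tokens=12))
```

Under greedy verification the output matches what the target model alone would produce. Any speedup comes from how many draft tokens are accepted per verification step, which is why a draft model that agrees more often with the target is valuable.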
1 month ago · ai

[Paper] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

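The rapid growth in LLM sizes and the increasing demand for long-context inference have made memory a critical bottleneck for GPU-accelerated serving systems...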

#CXL #LLM #KVCache #memory architecture #inference acceleration