inference optimization

5 days ago · ai

DeepSeek's conditional memory fixes silent LLM waste: GPU cycles lost to static lookups

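When an enterprise LLM retrieves a product name, a technical specification, or a standard contract clause, it is using expensive GPU compute designed for complex tasks…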

#LLM #conditional memory #GPU efficiency #inference optimization #AI infrastructure #model serving
1 week ago · ai

Fast Transformer Decoding: One Write-Head Is All You Need

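Overview: Imagine your phone building a sentence word by word and having to fetch the same large chunks of information again and again, which slows down its replies.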

#transformer decoding #inference optimization #shared memory #write-head #on-device AI
3 weeks ago · ai

ChatLLM proposes a streamlined solution to AI's real bottleneck

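Over the past few years, much of the discussion about AI has revolved around a single, deceptively simple question: which model is the best? But the new…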

#AI bottleneck #model selection #LLM performance #ChatLLM #inference optimization #multimodal AI #reasoning models
1 month ago · ai

[Paper] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

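As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving the efficiency of augmented LLM inference serving…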

#LLM serving #adaptive scheduling #dynamic batching #inference optimization #augmented LLM