[Paper] LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
Source: arXiv - 2602.09323v1
Overview
LLM‑CoOpt is a new co‑design framework that tackles three persistent pain points of large‑language‑model (LLM) inference: memory‑bandwidth bottlenecks, redundant computation, and the difficulty of handling very long input sequences. By designing the algorithms jointly with hardware‑friendly data paths, the authors show that inference can be made faster and more memory‑efficient without sacrificing model quality.
Key Contributions
- Opt‑KV (Key‑Value Cache Optimization) – redesigns the KV‑cache read/write pipeline and applies FP8 quantization to shrink cache size while preserving accuracy.
- Opt‑GQA (Grouped‑Query Attention) – replaces the standard multi‑head self‑attention with a grouped‑query formulation that shares key/value projections across heads, cutting FLOPs and memory traffic.
- Opt‑Pa (Paged Attention) – introduces a two‑step “segment‑then‑lazy‑map” strategy that breaks ultra‑long sequences into chunks and only materializes the necessary attention windows, dramatically lowering memory pressure.
- End‑to‑end co‑optimization – integrates the three techniques into a single inference stack and validates the approach on a real‑world LLaMa‑13B‑GPTQ model.
- Performance gains – demonstrates up to 13.4 % higher throughput and 16.8 % lower latency with negligible impact on downstream task accuracy.
Methodology
1. Cache Redesign (Opt‑KV)
- The KV cache, which stores the per‑token key/value tensors reused during autoregressive generation, is traditionally kept in FP16/FP32. LLM‑CoOpt compresses these tensors to FP8, halving the memory bandwidth needed per token relative to FP16.
- A custom write‑back buffer and prefetch logic reorder cache accesses to improve spatial locality, reducing cache‑miss stalls on CPUs/GPUs.
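The summary does not include the authors' kernels, but the quantization step at the heart of Opt‑KV can be sketched in PyTorch. The FP8 variant (E4M3) and the per‑tensor absmax scaling below are illustrative assumptions; the paper may use a different format or per‑channel calibration.

```python
import torch

FP8_MAX = 448.0  # dynamic range of torch.float8_e4m3fn (assumed format)

def quantize_kv(kv: torch.Tensor):
    """Compress a key/value tensor from FP16 to FP8 before caching it."""
    scale = kv.abs().amax().clamp(min=1e-8) / FP8_MAX
    q = (kv / scale).to(torch.float8_e4m3fn)  # 1 byte/element vs 2 in FP16
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Restore an FP16 view just before the attention matmul."""
    return q.to(torch.float16) * scale
```

Storing the FP8 tensor plus a single scale per tensor is what yields the ≈50 % cache‑footprint reduction reported in the results table.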
2. Grouped‑Query Attention (Opt‑GQA)
- Instead of independent query/key/value projections per head, Opt‑GQA groups several heads to share the same key/value matrices while keeping distinct query matrices.
- This cuts the number of key/value projection matrices from H (heads) to G (groups) and lets the same key/value data be reused across all heads in a group, which is especially beneficial on SIMD‑friendly hardware.
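A minimal sketch of this formulation, assuming the common pattern where each group of query heads is served by one shared key/value head via repetition (the paper's exact kernel layout is not given in this summary):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (B, H, S, D); k, v: (B, G, S, D), with H a multiple of G."""
    B, H, S, D = q.shape
    G = k.shape[1]
    # Each group of H // G query heads reads the same K/V head, so only
    # G key/value projections need to be computed and cached.
    k = k.repeat_interleave(H // G, dim=1)
    v = v.repeat_interleave(H // G, dim=1)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5
    return F.softmax(scores, dim=-1) @ v

# With H = 32 heads and G = 8 groups, the K/V projections (and the
# KV-cache entries) shrink 4x versus standard multi-head attention.
q = torch.randn(1, 32, 16, 64)
k, v = torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64)
out = grouped_query_attention(q, k, v)  # -> (1, 32, 16, 64)
```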
3. Paged Attention (Opt‑Pa)
- Long sequences are first split into fixed‑size pages (e.g., 512 tokens).
- During generation, only the pages that intersect the current attention window are materialized (“lazy mapping”), while the rest stay in compressed storage.
- The approach leverages OS‑level page‑fault handling and custom kernels to keep the active working set small.
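Without released kernels, the segment‑then‑lazy‑map bookkeeping can be illustrated with a simple page table. The 512‑token page size comes from the paper's example; the dict‑based table, head dimension, and block layout are assumptions made for the sketch:

```python
import torch

PAGE_SIZE = 512  # tokens per page, per the paper's example
HEAD_DIM = 128   # assumed; depends on the model configuration

class PagedKVCache:
    """Maps logical pages to physical blocks only on first touch."""

    def __init__(self):
        self.page_table = {}   # logical page index -> physical block
        self.free_blocks = []  # blocks recycled from finished requests

    def block_for(self, token_pos: int) -> torch.Tensor:
        page = token_pos // PAGE_SIZE
        if page not in self.page_table:  # lazy mapping: allocate on demand
            self.page_table[page] = (
                self.free_blocks.pop() if self.free_blocks
                else torch.empty(PAGE_SIZE, HEAD_DIM, dtype=torch.float16)
            )
        return self.page_table[page]

    def write(self, token_pos: int, kv_vec: torch.Tensor) -> None:
        self.block_for(token_pos)[token_pos % PAGE_SIZE] = kv_vec

cache = PagedKVCache()
cache.write(10_000, torch.randn(HEAD_DIM, dtype=torch.float16))
assert len(cache.page_table) == 1  # only the touched page is materialized
```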
4. Integration & Evaluation
- The three optimizations are composed into a single inference pipeline.
- Experiments run on a server‑grade GPU (NVIDIA A100) and a CPU‑only baseline, using the LLaMa‑13B‑GPTQ checkpoint.
- Accuracy is measured on standard language‑model benchmarks (e.g., WikiText‑103, LAMBADA) to ensure the quantization and algorithmic changes do not degrade performance.
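One standard way to run such a perplexity check, sketched here under the assumption of a Hugging Face‑style causal LM where model(input_ids=x, labels=x) returns a .loss holding the mean next‑token cross‑entropy; the paper's exact evaluation harness is not specified:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor, window_len: int = 1024) -> float:
    """Corpus perplexity over non-overlapping windows of a tokenized text."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.size(1), window_len):
        window = token_ids[:, start : start + window_len]
        if window.size(1) < 2:  # need at least one next-token prediction
            break
        out = model(input_ids=window, labels=window)  # labels shifted internally
        n_pred = window.size(1) - 1
        total_nll += out.loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```

A matching perplexity before and after enabling the three optimizations is what the "within 0.2 % of baseline" row in the results table refers to.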
Results & Findings
| Metric | Baseline | LLM‑CoOpt (combined) | Δ |
|---|---|---|---|
| Throughput (tokens / s) | 1.00× | 1.13× | +13.4 % |
| End‑to‑end latency (ms / token) | 1.00× | 0.83× | –16.8 % |
| KV‑cache memory footprint | 100 % | ≈50 % (FP8) | –50 % |
| Accuracy (perplexity / LAMBADA) | Baseline | Within 0.2 % of baseline | No noticeable drop |
The data show that each individual optimization contributes to the overall gain, but the biggest jump comes from the combination of reduced memory traffic (Opt‑KV) and fewer FLOPs (Opt‑GQA). Opt‑Pa shines on inputs longer than 4 k tokens, where baseline memory usage would otherwise explode.
Practical Implications
- Faster SaaS APIs – Cloud providers can serve more requests per GPU, lowering cost per token for services like chatbots or code assistants.
- Edge & On‑Device Inference – The FP8 cache and reduced compute make it feasible to run 13 B‑scale models on high‑end mobile or embedded GPUs with limited memory bandwidth.
- Long‑Context Applications – Retrieval‑augmented generation, document summarization, and code analysis often need >8 k token windows; Opt‑Pa enables these workloads without resorting to expensive model‑splitting tricks.
- Simplified Deployment – Because the optimizations are implemented as drop‑in kernel replacements (e.g., via custom CUDA kernels or ONNX Runtime extensions), existing inference stacks can adopt LLM‑CoOpt with minimal code changes.
Limitations & Future Work
- Hardware Specificity – The current implementation is tuned for NVIDIA GPUs and x86 CPUs; performance on AMD GPUs or ARM‑based accelerators remains untested.
- Quantization Sensitivity – While FP8 works well for LLaMa‑13B‑GPTQ, other model families (e.g., dense‑trained or instruction‑tuned variants) may require per‑layer calibration to avoid accuracy loss.
- Scalability to >100 B – The authors note that for models beyond 100 B parameters, additional hierarchy (e.g., multi‑node KV caching) will be needed.
- Future Directions – Extending Opt‑Pa to support dynamic page sizes, integrating sparsity‑aware attention kernels, and automating the co‑design process via compiler‑level optimizations are highlighted as promising next steps.
Authors
- Jie Kong
- Wei Wang
- Jiehan Zhou
- Chen Yu
Paper Information
- arXiv ID: 2602.09323v1
- Categories: cs.DC
- Published: February 10, 2026