[Paper] LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
Source: arXiv - 2602.09323v1
Overview
LLM‑CoOpt is a new co‑design framework that tackles three persistent pain points of large‑language‑model (LLM) inference: memory‑bandwidth bottlenecks, redundant computation, and the difficulty of handling very long input sequences. By designing the algorithms jointly with hardware‑friendly data paths, the authors show that inference can be made faster and more memory‑efficient without sacrificing model quality.
Key Contributions
- Opt‑KV (Key‑Value Cache Optimization) – redesigns the KV‑cache read/write pipeline and applies FP8 quantization to shrink cache size while preserving accuracy.
- Opt‑GQA (Grouped‑Query Attention) – replaces the standard multi‑head self‑attention with a grouped‑query formulation that shares key/value projections across heads, cutting FLOPs and memory traffic.
- Opt‑Pa (Paged Attention) – introduces a two‑step “segment‑then‑lazy‑map” strategy that breaks ultra‑long sequences into chunks and only materializes the necessary attention windows, dramatically lowering memory pressure.
- End‑to‑end co‑optimization – integrates the three techniques into a single inference stack and validates the approach on a real‑world LLaMa‑13B‑GPTQ model.
- Performance gains – demonstrates up to 13.4 % higher throughput and 16.8 % lower latency with negligible impact on downstream task accuracy.
Methodology
1. Cache Redesign (Opt‑KV)
- The KV cache, which stores the per‑token key/value tensors reused during autoregressive generation, is traditionally kept in FP16/FP32. LLM‑CoOpt compresses these tensors to FP8, halving the memory bandwidth needed per token relative to FP16.
- A custom write‑back buffer and prefetch logic reorder cache accesses to improve spatial locality, reducing cache‑miss stalls on CPUs/GPUs.
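The summary does not include the authors' kernels, but the quantization step at the heart of Opt‑KV can be sketched in PyTorch. The FP8 variant (E4M3) and the per‑tensor absmax scaling below are illustrative assumptions; the paper may use a different format or per‑channel calibration.

```python
import torch

FP8_MAX = 448.0  # dynamic range of torch.float8_e4m3fn (assumed format)

def quantize_kv(kv: torch.Tensor):
    """Compress a key/value tensor from FP16 to FP8 before caching it."""
    scale = kv.abs().amax().clamp(min=1e-8) / FP8_MAX
    q = (kv / scale).to(torch.float8_e4m3fn)  # 1 byte/element vs 2 in FP16
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Restore an FP16 view just before the attention matmul."""
    return q.to(torch.float16) * scale
```

Storing the FP8 tensor plus a single scale per tensor is what yields the ≈50 % cache‑footprint reduction reported in the results table.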
2. Grouped‑Query Attention (Opt‑GQA)
- Instead of independent query/key/value projections per head, Opt‑GQA groups several heads to share the same key/value matrices while keeping distinct query matrices.
- This cuts the number of key/value projection matrices from H (heads) to G (groups) and lets the same key/value data be reused across all heads in a group, which is especially beneficial on SIMD‑friendly hardware.
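A minimal sketch of this formulation, assuming the common pattern where each group of query heads is served by one shared key/value head via repetition (the paper's exact kernel layout is not given in this summary):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (B, H, S, D); k, v: (B, G, S, D), with H a multiple of G."""
    B, H, S, D = q.shape
    G = k.shape[1]
    # Each group of H // G query heads reads the same K/V head, so only
    # G key/value projections need to be computed and cached.
    k = k.repeat_interleave(H // G, dim=1)
    v = v.repeat_interleave(H // G, dim=1)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5
    return F.softmax(scores, dim=-1) @ v

# With H = 32 heads and G = 8 groups, the K/V projections (and the
# KV-cache entries) shrink 4x versus standard multi-head attention.
q = torch.randn(1, 32, 16, 64)
k, v = torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64)
out = grouped_query_attention(q, k, v)  # -> (1, 32, 16, 64)
```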
3. Paged Attention (Opt‑Pa)
- Long sequences are first split into fixed‑size pages (e.g., 512 tokens).
- During generation, only the pages that intersect the current attention window are materialized (“lazy mapping”), while the rest stay in compressed storage.
- The approach leverages OS‑level page‑fault handling and custom kernels to keep the active working set small.
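Without released kernels, the segment‑then‑lazy‑map bookkeeping can be illustrated with a simple page table. The 512‑token page size comes from the paper's example; the dict‑based table, head dimension, and block layout are assumptions made for the sketch:

```python
import torch

PAGE_SIZE = 512  # tokens per page, per the paper's example
HEAD_DIM = 128   # assumed; depends on the model configuration

class PagedKVCache:
    """Maps logical pages to physical blocks only on first touch."""

    def __init__(self):
        self.page_table = {}   # logical page index -> physical block
        self.free_blocks = []  # blocks recycled from finished requests

    def block_for(self, token_pos: int) -> torch.Tensor:
        page = token_pos // PAGE_SIZE
        if page not in self.page_table:  # lazy mapping: allocate on demand
            self.page_table[page] = (
                self.free_blocks.pop() if self.free_blocks
                else torch.empty(PAGE_SIZE, HEAD_DIM, dtype=torch.float16)
            )
        return self.page_table[page]

    def write(self, token_pos: int, kv_vec: torch.Tensor) -> None:
        self.block_for(token_pos)[token_pos % PAGE_SIZE] = kv_vec

cache = PagedKVCache()
cache.write(10_000, torch.randn(HEAD_DIM, dtype=torch.float16))
assert len(cache.page_table) == 1  # only the touched page is materialized
```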
4. Integration & Evaluation
- The three optimizations are composed into a single inference pipeline.
- Experiments run on a server‑grade GPU (NVIDIA A100) and a CPU‑only baseline, using the LLaMa‑13B‑GPTQ checkpoint.
- Accuracy is measured on standard language‑model benchmarks (e.g., WikiText‑103, LAMBADA) to ensure the quantization and algorithmic changes do not degrade performance.
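One standard way to run such a perplexity check, sketched here under the assumption of a Hugging Face‑style causal LM where model(input_ids=x, labels=x) returns a .loss holding the mean next‑token cross‑entropy; the paper's exact evaluation harness is not specified:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor, window_len: int = 1024) -> float:
    """Corpus perplexity over non-overlapping windows of a tokenized text."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.size(1), window_len):
        window = token_ids[:, start : start + window_len]
        if window.size(1) < 2:  # need at least one next-token prediction
            break
        out = model(input_ids=window, labels=window)  # labels shifted internally
        n_pred = window.size(1) - 1
        total_nll += out.loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```

A matching perplexity before and after enabling the three optimizations is what the "within 0.2 % of baseline" row in the results table refers to.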
Results & Findings
| Metric | Baseline | LLM‑CoOpt (combined) | Δ |
|---|---|---|---|
| Throughput (tokens / s) | 1.00× | 1.13× | +13.4 % |
| End‑to‑end latency (ms / token) | 1.00× | 0.83× | –16.8 % |
| KV‑cache memory footprint | 100 % | ≈50 % (FP8) | –50 % |
| Accuracy (perplexity / LAMBADA) | Baseline | Within 0.2 % of baseline | No noticeable drop |
The data show that each individual optimization contributes to the overall gain, but the biggest jump comes from the combination of reduced memory traffic (Opt‑KV) and fewer FLOPs (Opt‑GQA). Opt‑Pa shines on inputs longer than 4 k tokens, where baseline memory usage would otherwise explode.
Practical Implications
- Faster SaaS APIs – Cloud providers can serve more requests per GPU, lowering cost per token for services like chatbots or code assistants.
- Edge & On‑Device Inference – The FP8 cache and reduced compute make it feasible to run 13 B‑scale models on high‑end mobile or embedded GPUs with limited memory bandwidth.
- Long‑Context Applications – Retrieval‑augmented generation, document summarization, and code analysis often need >8 k token windows; Opt‑Pa enables these workloads without resorting to expensive model‑splitting tricks.
- Simplified Deployment – Because the optimizations are implemented as drop‑in kernel replacements (e.g., via custom CUDA kernels or ONNX Runtime extensions), existing inference stacks can adopt LLM‑CoOpt with minimal code changes.
Limitations & Future Work
- Hardware Specificity – The current implementation is tuned for NVIDIA GPUs and x86 CPUs; performance on AMD GPUs or ARM‑based accelerators remains untested.
- Quantization Sensitivity – While FP8 works well for LLaMa‑13B‑GPTQ, other model families (e.g., dense‑trained or instruction‑tuned variants) may require per‑layer calibration to avoid accuracy loss.
- Scalability to >100 B – The authors note that for models beyond 100 B parameters, additional hierarchy (e.g., multi‑node KV caching) will be needed.
- Future Directions – Extending Opt‑Pa to support dynamic page sizes, integrating sparsity‑aware attention kernels, and automating the co‑design process via compiler‑level optimizations are highlighted as promising next steps.
Authors
- Jie Kong
- Wei Wang
- Jiehan Zhou
- Chen Yu
Paper Information
- arXiv ID: 2602.09323v1
- Categories: cs.DC
- Published: February 10, 2026