[Paper] ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp
Source: arXiv - 2512.10576v1
Overview
The paper presents ESS (Extended Sparse Server), a system‑level redesign that tackles the memory bottleneck in the Decode stage of DeepSeek‑V3.2‑Exp, a large language model (LLM) that uses a sparse‑attention mechanism for long‑context inference. By offloading the growing Latent‑Cache to CPU memory while keeping latency‑critical work on the GPU, ESS unlocks much larger batch sizes and dramatically speeds up decoding for contexts up to 128 K tokens.
Key Contributions
- Offload‑centric architecture that selectively moves the Latent‑Cache from GPU to CPU without sacrificing decode latency.
- Memory‑decoupled batch scaling, allowing batch size to grow independently of GPU memory limits.
- High‑fidelity simulation framework that models GPU/CPU bandwidth, cache eviction, and scheduling to evaluate ESS under realistic deployment conditions.
- Performance gains: up to 69.4 % throughput improvement at 32 K tokens and 123 % at 128 K tokens compared with the baseline DeepSeek‑V3.2‑Exp serving stack.
- Cost‑effective deployment insights showing reduced GPU provisioning needs for long‑context workloads.
Methodology
- Profiling the bottleneck – The authors first instrumented DeepSeek‑V3.2‑Exp to pinpoint that the Latent‑Cache (a per‑token hidden‑state buffer) grows linearly with sequence length, quickly exhausting GPU memory and forcing tiny batch sizes.
- Designing the offload policy – ESS introduces a lightweight runtime that:
- Keeps the attention kernel and the next‑token sampler on the GPU (these are latency‑sensitive).
- Streams the Latent‑Cache to pinned CPU memory using asynchronous DMA, exploiting the fact that cache reads/writes are bandwidth‑bound rather than compute‑bound.
- Employs a simple LRU‑style eviction to keep only the most recent cache slices on GPU, ensuring the active decoding window stays resident (a minimal sketch of this offload loop appears after this list).
- Simulation environment – A cycle‑accurate simulator models:
- GPU compute throughput (tensor cores).
- PCIe/CPU‑GPU bandwidth (including contention).
- Cache hit/miss patterns under varying sequence lengths and batch sizes.
The simulator is calibrated against real hardware runs so that its estimates track measured behavior; a deliberately simplified cost model in the same spirit is sketched after this list.
- Evaluation – Experiments sweep context lengths (8 K–128 K tokens) and batch sizes, comparing baseline (no offload) vs. ESS on throughput (tokens / second) and memory footprint.
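To make the offload runtime concrete, here is a minimal sketch of the loop described above. It is illustrative only, not the authors' implementation: it assumes PyTorch, a single fp16 latent vector per token, a fixed GPU-resident window of recent tokens, and pinned host memory with a dedicated CUDA stream so copies can overlap with decode compute. The class name, buffer sizes, and latent width are placeholders.
```python
# Minimal sketch of an offload loop for a per-token latent cache (assumptions,
# not the paper's code): fp16 latents of width LATENT_DIM, a GPU-resident ring
# buffer for the active window, and async mirroring into pinned host memory.
import torch

LATENT_DIM = 576         # assumed per-token latent width (placeholder)
MAX_SEQ    = 128 * 1024  # longest context the host-side cache must hold
GPU_WINDOW = 4096        # most recent tokens kept resident on the GPU

class LatentCacheOffloader:
    def __init__(self, device: str = "cuda"):
        self.device = device
        # The full cache lives in pinned host memory so H2D/D2H copies can be
        # issued asynchronously (DMA without an intermediate staging copy).
        self.cpu_cache = torch.empty(MAX_SEQ, LATENT_DIM,
                                     dtype=torch.float16, pin_memory=True)
        # Ring buffer holding only the active decoding window on the GPU.
        self.gpu_window = torch.empty(GPU_WINDOW, LATENT_DIM,
                                      dtype=torch.float16, device=device)
        self.copy_stream = torch.cuda.Stream(device=device)
        self.seq_len = 0

    def append(self, latent: torch.Tensor) -> None:
        """Keep the newest token's latent on the GPU and mirror it to the host."""
        slot = self.seq_len % GPU_WINDOW          # overwrite the oldest slice
        self.gpu_window[slot].copy_(latent)
        # Let the copy stream wait for the write above, then mirror the slice
        # to CPU in the background while the next decode kernels run.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            self.cpu_cache[self.seq_len].copy_(self.gpu_window[slot],
                                               non_blocking=True)
        self.seq_len += 1

    def gather(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Fetch the latent slices selected by a sparse-attention step.

        `token_ids` is a 1-D CPU LongTensor; slices outside the resident
        window are read back from host memory.
        """
        self.copy_stream.synchronize()   # make sure mirrored writes have landed
        # Fancy indexing allocates a fresh (unpinned) host tensor, so this
        # upload is effectively synchronous in the sketch; a production runtime
        # would gather into a pinned staging buffer instead.
        return self.cpu_cache[token_ids].to(self.device, non_blocking=True)
```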
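The simulator itself is not spelled out in this summary; the following first-order cost model only illustrates the question it answers, namely whether latent-cache transfers hide behind decode compute. It is not the authors' cycle-accurate simulator, and every constant below is an assumed placeholder rather than a figure from the paper.
```python
# Back-of-envelope decode-step model: copies run on their own stream, so a
# step is bounded by the slower of GPU compute and PCIe transfer. Contention
# and kernel-launch overheads are ignored; all constants are assumptions.
def decode_step_time_ms(batch: int,
                        host_reads_per_seq: int,        # slices missing the GPU window
                        bytes_per_token: float = 70e3,  # assumed latent bytes per token
                        pcie_bytes_per_s: float = 25e9, # assumed effective PCIe Gen4 x16
                        gpu_compute_ms: float = 20.0):  # assumed GPU time per decode step
    """Return (step_ms, transfer_bound) for one batched decode step."""
    moved_bytes = batch * host_reads_per_seq * bytes_per_token
    transfer_ms = moved_bytes / pcie_bytes_per_s * 1e3
    return max(gpu_compute_ms, transfer_ms), transfer_ms > gpu_compute_ms

# Example: a 16-way batch where each sequence pulls 256 non-resident slices.
step_ms, transfer_bound = decode_step_time_ms(batch=16, host_reads_per_seq=256)
print(f"step ≈ {step_ms:.1f} ms, transfer-bound: {transfer_bound}")  # ≈ 20.0 ms, False
```
Sweeping batch size, context length, and the host-read volume through such a model is the kind of exploration the paper's calibrated simulator performs at much higher fidelity.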
Results & Findings
| Context Length | Baseline Throughput (tokens/s) | ESS Throughput (tokens/s) | Improvement |
|---|---|---|---|
| 32 K | 1.12k | 1.90k | 69.4 % |
| 64 K | 0.78k | 1.45k | 86 % |
| 128 K | 0.45k | 1.00k | 123 % |
- GPU memory usage drops from >24 GB (baseline) to <12 GB with ESS, freeing room for larger batches.
- Latency impact is minimal: the added CPU‑GPU transfers cost <5 ms per 1 K tokens, well within typical LLM serving SLAs (a back‑of‑envelope check of this figure follows this list).
- Scalability: ESS maintains near‑linear throughput gains as batch size increases, a behavior absent in the baseline where GPU memory caps batch growth.
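The <5 ms per 1 K tokens figure is consistent with a quick estimate under assumed hardware numbers; both constants below are placeholders (a rough per-token latent size and an effective PCIe Gen4 x16 bandwidth), not values reported in this summary.
```python
# Rough check of the "<5 ms per 1 K tokens" transfer figure under assumptions:
# ~70 KB of fp16 latent state per token and ~25 GB/s effective PCIe bandwidth.
bytes_per_token = 70e3
pcie_bytes_per_s = 25e9
ms_per_1k_tokens = 1024 * bytes_per_token / pcie_bytes_per_s * 1e3
print(f"{ms_per_1k_tokens:.1f} ms per 1K tokens")  # ≈ 2.9 ms, under the 5 ms budget
```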
Practical Implications
- Cost reduction – Data‑center operators can halve the number of high‑memory GPUs needed for long‑context services (e.g., document‑level QA, code‑base analysis).
- Simplified deployment – Existing inference frameworks (TensorRT, vLLM) can integrate ESS’s offload runtime as a plug‑in, avoiding major code rewrites.
- Developer ergonomics – API remains unchanged; developers continue to request a context length and batch size, while ESS handles the memory choreography under the hood.
- Broader applicability – Any transformer‑style model that maintains a per‑token hidden cache (e.g., Retrieval‑Augmented Generation, RNN‑style decoders) can adopt the same offload pattern.
- Edge‑to‑cloud hybrid – The architecture opens the door to using modest‑capacity GPUs on the edge while leveraging host‑CPU memory, enabling on‑device long‑context inference for privacy‑sensitive workloads.
Limitations & Future Work
- CPU‑GPU bandwidth dependency – ESS’s gains assume a high‑speed interconnect (PCIe Gen4/5). On slower buses, offload overhead could dominate.
- Cache eviction policy – The current LRU scheme is simple; more sophisticated predictors (e.g., attention‑heat‑map‑guided) might further shrink transfer volume.
- Generalization to other models – While the authors argue the technique is model‑agnostic, empirical validation on non‑sparse‑attention LLMs (e.g., GPT‑4) is pending.
- Real‑world latency testing – The paper relies heavily on simulation; production‑scale latency measurements under mixed workloads would strengthen the claim of “latency‑critical components remain unaffected.”
Overall, ESS offers a pragmatic, system‑level lever for developers grappling with the memory‑throughput trade‑off in long‑context LLM serving, and it paves the way for more cost‑effective, scalable deployments.
Authors
- Xinhang Chen
- Chao Zhang
- Jiahuan He
- Wei Liu
- Jianming Zhang
- Wenlong Zhou
- Xiao Li
- Pai Zeng
- Shiyong Li
- Yuanpan Qian
- Dong Li
- Zhaogeng Li
Paper Information
- arXiv ID: 2512.10576v1
- Categories: cs.DC
- Published: December 11, 2025