[Paper] ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp
Source: arXiv - 2512.10576v1
Overview
The paper presents ESS (Extended Sparse Server), a system‑level redesign that tackles the memory bottleneck in the Decode stage of DeepSeek‑V3.2‑Exp, a large language model (LLM) that uses a sparse‑attention mechanism for long‑context inference. By offloading the growing Latent‑Cache to CPU memory while keeping latency‑critical work on the GPU, ESS unlocks much larger batch sizes and dramatically speeds up decoding for contexts up to 128 K tokens.
Key Contributions
- Offload‑centric architecture that selectively moves the Latent‑Cache from GPU to CPU without sacrificing decode latency.
- Memory‑decoupled batch scaling, allowing batch size to grow independently of GPU memory limits.
- High‑fidelity simulation framework that models GPU/CPU bandwidth, cache eviction, and scheduling to evaluate ESS under realistic deployment conditions.
- Performance gains: up to 69.4 % throughput improvement at 32 K tokens and 123 % at 128 K tokens compared with the baseline DeepSeek‑V3.2‑Exp serving stack.
- Cost‑effective deployment insights showing reduced GPU provisioning needs for long‑context workloads.
Methodology
- Profiling the bottleneck – The authors first instrumented DeepSeek‑V3.2‑Exp to pinpoint that the Latent‑Cache (a per‑token hidden‑state buffer) grows linearly with sequence length, quickly exhausting GPU memory and forcing tiny batch sizes.
- Designing the offload policy – ESS introduces a lightweight runtime that:
- Keeps the attention kernel and the next‑token sampler on the GPU (these are latency‑sensitive).
- Streams the Latent‑Cache to pinned CPU memory using asynchronous DMA, exploiting the fact that cache reads/writes are bandwidth‑bound rather than compute‑bound.
- Employs a simple LRU‑style eviction to keep only the most recent cache slices on GPU, ensuring the active decoding window stays resident (a minimal sketch of this offload loop appears after this list).
- Simulation environment – A cycle‑accurate simulator models:
- GPU compute throughput (tensor cores).
- PCIe/CPU‑GPU bandwidth (including contention).
- Cache hit/miss patterns under varying sequence lengths and batch sizes.
The simulator is calibrated against real hardware runs so that its estimates track measured behavior; a deliberately simplified cost model in the same spirit is sketched after this list.
- Evaluation – Experiments sweep context lengths (8 K–128 K tokens) and batch sizes, comparing baseline (no offload) vs. ESS on throughput (tokens / second) and memory footprint.
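To make the offload runtime concrete, here is a minimal sketch of the loop described above. It is illustrative only, not the authors' implementation: it assumes PyTorch, a single fp16 latent vector per token, a fixed GPU-resident window of recent tokens, and pinned host memory with a dedicated CUDA stream so copies can overlap with decode compute. The class name, buffer sizes, and latent width are placeholders.
```python
# Minimal sketch of an offload loop for a per-token latent cache (assumptions,
# not the paper's code): fp16 latents of width LATENT_DIM, a GPU-resident ring
# buffer for the active window, and async mirroring into pinned host memory.
import torch

LATENT_DIM = 576         # assumed per-token latent width (placeholder)
MAX_SEQ    = 128 * 1024  # longest context the host-side cache must hold
GPU_WINDOW = 4096        # most recent tokens kept resident on the GPU

class LatentCacheOffloader:
    def __init__(self, device: str = "cuda"):
        self.device = device
        # The full cache lives in pinned host memory so H2D/D2H copies can be
        # issued asynchronously (DMA without an intermediate staging copy).
        self.cpu_cache = torch.empty(MAX_SEQ, LATENT_DIM,
                                     dtype=torch.float16, pin_memory=True)
        # Ring buffer holding only the active decoding window on the GPU.
        self.gpu_window = torch.empty(GPU_WINDOW, LATENT_DIM,
                                      dtype=torch.float16, device=device)
        self.copy_stream = torch.cuda.Stream(device=device)
        self.seq_len = 0

    def append(self, latent: torch.Tensor) -> None:
        """Keep the newest token's latent on the GPU and mirror it to the host."""
        slot = self.seq_len % GPU_WINDOW          # overwrite the oldest slice
        self.gpu_window[slot].copy_(latent)
        # Let the copy stream wait for the write above, then mirror the slice
        # to CPU in the background while the next decode kernels run.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            self.cpu_cache[self.seq_len].copy_(self.gpu_window[slot],
                                               non_blocking=True)
        self.seq_len += 1

    def gather(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Fetch the latent slices selected by a sparse-attention step.

        `token_ids` is a 1-D CPU LongTensor; slices outside the resident
        window are read back from host memory.
        """
        self.copy_stream.synchronize()   # make sure mirrored writes have landed
        # Fancy indexing allocates a fresh (unpinned) host tensor, so this
        # upload is effectively synchronous in the sketch; a production runtime
        # would gather into a pinned staging buffer instead.
        return self.cpu_cache[token_ids].to(self.device, non_blocking=True)
```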
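The simulator itself is not spelled out in this summary; the following first-order cost model only illustrates the question it answers, namely whether latent-cache transfers hide behind decode compute. It is not the authors' cycle-accurate simulator, and every constant below is an assumed placeholder rather than a figure from the paper.
```python
# Back-of-envelope decode-step model: copies run on their own stream, so a
# step is bounded by the slower of GPU compute and PCIe transfer. Contention
# and kernel-launch overheads are ignored; all constants are assumptions.
def decode_step_time_ms(batch: int,
                        host_reads_per_seq: int,        # slices missing the GPU window
                        bytes_per_token: float = 70e3,  # assumed latent bytes per token
                        pcie_bytes_per_s: float = 25e9, # assumed effective PCIe Gen4 x16
                        gpu_compute_ms: float = 20.0):  # assumed GPU time per decode step
    """Return (step_ms, transfer_bound) for one batched decode step."""
    moved_bytes = batch * host_reads_per_seq * bytes_per_token
    transfer_ms = moved_bytes / pcie_bytes_per_s * 1e3
    return max(gpu_compute_ms, transfer_ms), transfer_ms > gpu_compute_ms

# Example: a 16-way batch where each sequence pulls 256 non-resident slices.
step_ms, transfer_bound = decode_step_time_ms(batch=16, host_reads_per_seq=256)
print(f"step ≈ {step_ms:.1f} ms, transfer-bound: {transfer_bound}")  # ≈ 20.0 ms, False
```
Sweeping batch size, context length, and the host-read volume through such a model is the kind of exploration the paper's calibrated simulator performs at much higher fidelity.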
Results & Findings
| Context Length | Baseline Throughput (tokens/s) | ESS Throughput (tokens/s) | Improvement |
|---|---|---|---|
| 32 K | 1.12k | 1.90k | 69.4 % |
| 64 K | 0.78k | 1.45k | 86 % |
| 128 K | 0.45k | 1.00k | 123 % |
- GPU memory usage drops from >24 GB (baseline) to <12 GB with ESS, freeing room for larger batches.
- Latency impact is minimal: the added CPU‑GPU transfers cost <5 ms per 1 K tokens, well within typical LLM serving SLAs (a back‑of‑envelope check of this figure follows this list).
- Scalability: ESS maintains near‑linear throughput gains as batch size increases, a behavior absent in the baseline where GPU memory caps batch growth.
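The <5 ms per 1 K tokens figure is consistent with a quick estimate under assumed hardware numbers; both constants below are placeholders (a rough per-token latent size and an effective PCIe Gen4 x16 bandwidth), not values reported in this summary.
```python
# Rough check of the "<5 ms per 1 K tokens" transfer figure under assumptions:
# ~70 KB of fp16 latent state per token and ~25 GB/s effective PCIe bandwidth.
bytes_per_token = 70e3
pcie_bytes_per_s = 25e9
ms_per_1k_tokens = 1024 * bytes_per_token / pcie_bytes_per_s * 1e3
print(f"{ms_per_1k_tokens:.1f} ms per 1K tokens")  # ≈ 2.9 ms, under the 5 ms budget
```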
Practical Implications
- Cost reduction – Data‑center operators can halve the number of high‑memory GPUs needed for long‑context services (e.g., document‑level QA, code‑base analysis).
- Simplified deployment – Existing inference frameworks (TensorRT, vLLM) can integrate ESS’s offload runtime as a plug‑in, avoiding major code rewrites.
- Developer ergonomics – API remains unchanged; developers continue to request a context length and batch size, while ESS handles the memory choreography under the hood.
- Broader applicability – Any transformer‑style model that maintains a per‑token hidden cache (e.g., Retrieval‑Augmented Generation, RNN‑style decoders) can adopt the same offload pattern.
- Edge‑to‑cloud hybrid – The architecture opens the door to using modest‑capacity GPUs on the edge while leveraging host‑CPU memory, enabling on‑device long‑context inference for privacy‑sensitive workloads.
Limitations & Future Work
- CPU‑GPU bandwidth dependency – ESS’s gains assume a high‑speed interconnect (PCIe Gen4/5). On slower buses, offload overhead could dominate.
- Cache eviction policy – The current LRU scheme is simple; more sophisticated predictors (e.g., attention‑heat‑map‑guided) might further shrink transfer volume.
- Generalization to other models – While the authors argue the technique is model‑agnostic, empirical validation on non‑sparse‑attention LLMs (e.g., GPT‑4) is pending.
- Real‑world latency testing – The paper relies heavily on simulation; production‑scale latency measurements under mixed workloads would strengthen the claim of “latency‑critical components remain unaffected.”
Overall, ESS offers a pragmatic, system‑level lever for developers grappling with the memory‑throughput trade‑off in long‑context LLM serving, and it paves the way for more cost‑effective, scalable deployments.
Authors
- Xinhang Chen
- Chao Zhang
- Jiahuan He
- Wei Liu
- Jianming Zhang
- Wenlong Zhou
- Xiao Li
- Pai Zeng
- Shiyong Li
- Yuanpan Qian
- Dong Li
- Zhaogeng Li
Paper Information
- arXiv ID: 2512.10576v1
- Categories: cs.DC
- Published: December 11, 2025