[Paper] SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving
Source: arXiv - 2512.00719v1
Overview
The paper introduces SIMPLE, a novel architecture that moves the sampling step of large‑language‑model (LLM) inference off the GPU and onto a lightweight CPU service. By decoupling this “decision plane” from the heavily optimized GPU data plane (attention, GEMM, KV‑cache), SIMPLE eliminates a growing bottleneck that limits throughput and latency in modern, highly parallel LLM deployments.
Key Contributions
- Decision‑plane disaggregation: Turns sampling into a separate, CPU‑side service that can run in parallel with GPU computation.
- Sequence‑parallel sampling: Shards the batch dimension across CPU workers, removing expensive vocabulary‑axis collective operations.
- Linear‑time CPU sampling kernels: Introduces column‑wise penalties and a “truncation‑first” filter that achieve single‑pass, O(vocab) complexity without costly sorting.
- Speculative Hot‑Vocab Sampling (SHVS): Dynamically samples from a small, high‑probability “hot” vocabulary set, applying a rejection‑based correction step that preserves exact sampling while dramatically cutting work.
- Zero‑code‑change integration: SIMPLE plugs into existing serving stacks without requiring changes to user applications or model code.
Methodology
- Disaggregating the pipeline – The authors treat sampling as an independent micro‑service. While the GPU continues to compute attention and update the KV cache, the CPU concurrently receives the logits, performs sampling, and streams the selected token back to the next pipeline stage (a minimal sketch of this decoupling follows the list).
- Sequence‑parallel work division – Instead of gathering the full logits matrix (batch × vocab) on a single node, each CPU worker processes a slice of the batch. This removes the vocabulary‑axis collective (gather/reduce) that is the primary scaling choke point; a batch‑sharding sketch also appears below the list.
- Efficient CPU algorithm
- Column‑wise penalties apply temperature, top‑p, and other constraints directly to the logits columns in a single vectorized pass, avoiding repeated per‑token scans.
- Truncation‑first filtering quickly discards low‑probability tokens before any sorting, guaranteeing a single linear pass over the vocabulary (illustrated in the truncation‑first sketch after the list).
- Speculative Hot‑Vocab Sampling (SHVS)
- A lightweight model predicts the size of a “hot” vocab set that captures most of the probability mass.
- Sampling is performed only on this reduced set; if the draw falls outside the hot set, a rejection/correction step re‑samples from the full distribution, preserving exactness (see the SHVS sketch below).
- Overlap with GPU work – The CPU service runs asynchronously, so its latency is hidden behind the GPU’s compute time, effectively shrinking the decision‑plane’s contribution to the critical path.
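To make the decoupling concrete, the following minimal sketch runs a CPU‑side sampler thread asynchronously next to a stubbed GPU loop. It is not the paper's implementation: `SamplerService`, `fake_gpu_decode_step`, the queue‑based hand‑off, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of the decoupled "decision plane": a CPU sampler thread
# consumes logits asynchronously while the GPU loop keeps producing them.
# All names here are illustrative; the GPU step is stubbed with random logits.
import queue
import threading

import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x.astype(np.float64)
    x -= x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class SamplerService:
    """CPU-side decision plane: pulls logits from a queue, samples, returns token IDs."""

    def __init__(self, temperature: float = 0.8) -> None:
        self.temperature = temperature
        self.inbox: queue.Queue = queue.Queue()
        self.outbox: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            step, logits = self.inbox.get()                # logits: [batch, vocab]
            probs = softmax(logits / self.temperature)
            rng = np.random.default_rng(step)
            tokens = np.array([rng.choice(p.size, p=p) for p in probs])
            self.outbox.put((step, tokens))


def fake_gpu_decode_step(batch: int = 4, vocab: int = 32_000) -> np.ndarray:
    """Stub for the GPU data plane (attention, GEMM, KV-cache update)."""
    return np.random.randn(batch, vocab).astype(np.float32)


sampler = SamplerService()
for step in range(3):
    logits = fake_gpu_decode_step()
    sampler.inbox.put((step, logits))      # hand off to the decision plane
    # ...the GPU loop could now advance other micro-batches/requests...
    done_step, tokens = sampler.outbox.get()
    print(done_step, tokens)
```

In a real deployment the GPU loop would keep advancing other requests or micro‑batches while the sampler drains its queue, which is how the decision plane's latency stays off the critical path.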
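Sequence‑parallel sampling can be pictured as sharding the batch rows of the logits matrix across CPU workers. The sketch below uses a process pool purely for illustration; the worker count, shard sizes, and plain temperature sampling are assumptions rather than the paper's configuration.

```python
# Sketch of sequence-parallel sampling: split the [batch, vocab] logits along
# the batch axis; each CPU worker samples its shard independently, so no
# collective along the vocabulary axis is needed.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def sample_shard(shard: np.ndarray, seed: int, temperature: float = 1.0) -> np.ndarray:
    """Sample one token per row of a [rows, vocab] logits shard."""
    rng = np.random.default_rng(seed)
    x = shard.astype(np.float64) / temperature
    x -= x.max(axis=-1, keepdims=True)
    probs = np.exp(x)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(p.size, p=p) for p in probs])


def sequence_parallel_sample(logits: np.ndarray, num_workers: int = 4) -> np.ndarray:
    shards = np.array_split(logits, num_workers, axis=0)   # split batch rows
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(sample_shard, shards, range(num_workers)))
    return np.concatenate(parts)


if __name__ == "__main__":
    tokens = sequence_parallel_sample(np.random.randn(16, 32_000).astype(np.float32))
    print(tokens.shape)   # (16,)
```

Because every worker holds complete vocabulary rows for its slice of the batch, no gather or reduce along the vocabulary axis is required.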
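The paper's CPU kernels are not reproduced here; the sketch below is a reconstruction of the truncation‑first idea under stated assumptions: temperature and a repetition‑style penalty are applied in one vectorized sweep, a cheap threshold against the maximum logit prunes the vocabulary before any sorting, and top‑p is resolved only on the small survivor set, with a full‑vocabulary fallback to keep the result correct. The `logit_margin` cutoff and the penalty rule are illustrative choices, not the paper's.

```python
# Illustrative single-pass, truncation-first top-p sampler for one request.
import numpy as np


def truncation_first_top_p(
    logits,              # [vocab] raw logits for one request
    prev_tokens,         # token IDs already generated (used for the penalty)
    temperature=0.8,
    top_p=0.95,
    rep_penalty=1.1,
    logit_margin=12.0,   # prune tokens this far below the max logit
    rng=None,
):
    rng = rng or np.random.default_rng()
    x = logits.astype(np.float64)

    # Penalty and temperature applied in one vectorized pass; no sorting here.
    prev = np.asarray(prev_tokens)
    x[prev] = np.where(x[prev] > 0, x[prev] / rep_penalty, x[prev] * rep_penalty)
    x /= temperature

    # The softmax normalizer is still a single cheap linear pass over the vocab.
    probs = np.exp(x - x.max())
    probs /= probs.sum()

    # Truncation-first filter: keep only tokens near the max logit.
    keep = np.flatnonzero(x >= x.max() - logit_margin)

    # If the survivors do not cover the nucleus, fall back to the full vocab.
    if probs[keep].sum() < top_p:
        keep = np.arange(x.size)

    # Resolve top-p on the (usually tiny) survivor set only.
    order = keep[np.argsort(-probs[keep])]
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))


# Example call with random data; vocabulary size and token history are made up.
rng = np.random.default_rng(0)
print(truncation_first_top_p(rng.standard_normal(32_000), np.array([5, 17, 5]), rng=rng))
```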
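Finally, the sketch below shows one way a hot‑vocab fast path can remain exact: split the probability mass between the hot set and its complement, take the cheap path with probability equal to the hot mass, and otherwise fall back to the rest of the vocabulary. This is a reconstruction of the SHVS idea, not the paper's algorithm; the hot‑set construction at the end (top‑512 logits) is purely an assumption.

```python
# Illustrative hot-vocab sampling with an exactness-preserving fallback:
# with probability equal to the hot set's mass, sample cheaply inside the
# hot set; otherwise sample from its complement. The mixture equals the
# original distribution exactly.
import numpy as np


def shvs_sample(logits: np.ndarray, hot_ids: np.ndarray, rng: np.random.Generator) -> int:
    x = logits.astype(np.float64)
    probs = np.exp(x - x.max())
    probs /= probs.sum()              # normalizer: one linear pass over the vocab

    hot_mass = probs[hot_ids].sum()
    if rng.random() < hot_mass:
        # Fast path: renormalize and sample within the small hot set only.
        return int(rng.choice(hot_ids, p=probs[hot_ids] / hot_mass))

    # Slow path: the draw fell outside the hot set, so sample from the
    # complement, renormalized. Taken with probability 1 - hot_mass.
    cold_mask = np.ones(probs.size, dtype=bool)
    cold_mask[hot_ids] = False
    cold_ids = np.flatnonzero(cold_mask)
    return int(rng.choice(cold_ids, p=probs[cold_ids] / probs[cold_ids].sum()))


# Example: peaked random logits; the 512 highest-logit tokens act as the hot set.
rng = np.random.default_rng(0)
logits = 4.0 * rng.standard_normal(32_000)     # scaled so a few tokens dominate
hot_ids = np.argpartition(-logits, 512)[:512]
print(shvs_sample(logits, hot_ids, rng))
```

When the hot set really does capture most of the mass, the slow path is rarely taken, so the expected per‑token work shrinks while the sampled distribution is unchanged.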
Results & Findings
| Metric | Baseline (GPU‑only) | SIMPLE | Improvement |
|---|---|---|---|
| End‑to‑end throughput (tokens/s) | 1.0× | up to 1.96× | up to +96 % |
| P95 per‑token latency | 100 ms (illustrative baseline) | 35–80 ms | –20 % to –65 % |
| GPU utilization (last PP stage) | 70 % (capped by sampling) | >90 % | – |
| Scaling with TP/PP | Degrades as GPUs get faster | Remains linear | – |
Key takeaways
- The decision plane’s share of iteration time drops from ~30 % to <5 % with SIMPLE.
- SHVS alone accounts for most of the speed‑up, especially when the hot‑vocab size is tuned per model/temperature.
- SIMPLE works with existing tensor‑parallel and pipeline‑parallel frameworks (e.g., Megatron‑LM, DeepSpeed) without code modifications.
Practical Implications
- Higher throughput for LLM APIs: Cloud providers can serve more requests per GPU, reducing cost per token.
- Lower tail latency: Interactive applications (code assistants, chatbots) benefit from tighter 95th‑percentile response times, improving user experience.
- Future‑proof scaling: As GPU compute continues to accelerate, the decision plane no longer becomes the limiting factor, allowing TP/PP to scale unhindered.
- Simplified deployment: Teams can adopt SIMPLE as a drop‑in service layer, avoiding invasive changes to model graphs or inference code.
- CPU‑friendly workloads: The approach leverages under‑utilized CPU resources in typical inference clusters, improving overall hardware efficiency.
Limitations & Future Work
- CPU load balancing: In extreme batch‑size regimes, the CPU side may become saturated; adaptive load‑shedding or multi‑node CPU scaling is an open question.
- Hot‑vocab model accuracy: The heuristic for hot‑vocab sizing is simple; more sophisticated, model‑aware predictors could further boost throughput.
- Memory overhead: Maintaining hot‑vocab tables per model adds modest CPU memory usage, which may be non‑trivial for extremely large vocabularies.
- Generality beyond decoder‑only LLMs: The paper focuses on autoregressive models; extending SIMPLE to encoder‑decoder or multimodal architectures remains future work.
Bottom line: SIMPLE demonstrates that moving the sampling step off the GPU and redesigning it for CPU parallelism can nearly double LLM serving throughput and cut tail latency—without any changes to user code. For developers building scalable LLM services, it offers a pragmatic path to unlock the next generation of performance gains.
Authors
- Bohan Zhao
- Zane Cao
- Yongchao He
Paper Information
- arXiv ID: 2512.00719v1
- Categories: cs.DC
- Published: November 30, 2025