[Paper] SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving
Source: arXiv - 2512.00719v1
Overview
The paper introduces SIMPLE, a novel architecture that moves the sampling step of large‑language‑model (LLM) inference off the GPU and onto a lightweight CPU service. By decoupling this “decision plane” from the heavily optimized GPU data plane (attention, GEMM, KV‑cache), SIMPLE eliminates a growing bottleneck that limits throughput and latency in modern, highly parallel LLM deployments.
Key Contributions
- Decision‑plane disaggregation: Turns sampling into a separate, CPU‑side service that can run in parallel with GPU computation.
- Sequence‑parallel sampling: Shards the batch dimension across CPU workers, removing expensive vocabulary‑axis collective operations.
- Linear‑time CPU sampling kernels: Introduces column‑wise penalties and a “truncation‑first” filter that achieve single‑pass, O(vocab) complexity without costly sorting.
- Speculative Hot‑Vocab Sampling (SHVS): Dynamically samples from a small, high‑probability “hot” vocabulary set, applying a rejection‑based correction step that preserves exact sampling while dramatically cutting work.
- Zero‑code‑change integration: SIMPLE plugs into existing serving stacks without requiring changes to user applications or model code.
Methodology
- Disaggregating the pipeline – The authors treat sampling as an independent micro‑service. While the GPU continues to compute attention and update the KV cache, the CPU concurrently receives the logits, performs sampling, and streams the selected token back to the next pipeline stage (a minimal sketch of this decoupling follows the list).
- Sequence‑parallel work division – Instead of gathering the full logits matrix (batch × vocab) on a single node, each CPU worker processes a slice of the batch. This removes the vocabulary‑axis collective (gather/reduce) that is the primary scaling choke point; a batch‑sharding sketch also appears below the list.
- Efficient CPU algorithm
- Column‑wise penalties apply temperature, top‑p, and other constraints directly to the logits columns in a single vectorized pass, avoiding repeated per‑token scans.
- Truncation‑first filtering quickly discards low‑probability tokens before any sorting, guaranteeing a single linear pass over the vocabulary (illustrated in the truncation‑first sketch after the list).
- Speculative Hot‑Vocab Sampling (SHVS)
- A lightweight model predicts the size of a “hot” vocab set that captures most of the probability mass.
- Sampling is performed only on this reduced set; if the draw falls outside the hot set, a rejection/correction step re‑samples from the full distribution, preserving exactness (see the SHVS sketch below).
- Overlap with GPU work – The CPU service runs asynchronously, so its latency is hidden behind the GPU’s compute time, effectively shrinking the decision‑plane’s contribution to the critical path.
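To make the decoupling concrete, the following minimal sketch runs a CPU‑side sampler thread asynchronously next to a stubbed GPU loop. It is not the paper's implementation: `SamplerService`, `fake_gpu_decode_step`, the queue‑based hand‑off, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of the decoupled "decision plane": a CPU sampler thread
# consumes logits asynchronously while the GPU loop keeps producing them.
# All names here are illustrative; the GPU step is stubbed with random logits.
import queue
import threading

import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x.astype(np.float64)
    x -= x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class SamplerService:
    """CPU-side decision plane: pulls logits from a queue, samples, returns token IDs."""

    def __init__(self, temperature: float = 0.8) -> None:
        self.temperature = temperature
        self.inbox: queue.Queue = queue.Queue()
        self.outbox: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            step, logits = self.inbox.get()                # logits: [batch, vocab]
            probs = softmax(logits / self.temperature)
            rng = np.random.default_rng(step)
            tokens = np.array([rng.choice(p.size, p=p) for p in probs])
            self.outbox.put((step, tokens))


def fake_gpu_decode_step(batch: int = 4, vocab: int = 32_000) -> np.ndarray:
    """Stub for the GPU data plane (attention, GEMM, KV-cache update)."""
    return np.random.randn(batch, vocab).astype(np.float32)


sampler = SamplerService()
for step in range(3):
    logits = fake_gpu_decode_step()
    sampler.inbox.put((step, logits))      # hand off to the decision plane
    # ...the GPU loop could now advance other micro-batches/requests...
    done_step, tokens = sampler.outbox.get()
    print(done_step, tokens)
```

In a real deployment the GPU loop would keep advancing other requests or micro‑batches while the sampler drains its queue, which is how the decision plane's latency stays off the critical path.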
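Sequence‑parallel sampling can be pictured as sharding the batch rows of the logits matrix across CPU workers. The sketch below uses a process pool purely for illustration; the worker count, shard sizes, and plain temperature sampling are assumptions rather than the paper's configuration.

```python
# Sketch of sequence-parallel sampling: split the [batch, vocab] logits along
# the batch axis; each CPU worker samples its shard independently, so no
# collective along the vocabulary axis is needed.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def sample_shard(shard: np.ndarray, seed: int, temperature: float = 1.0) -> np.ndarray:
    """Sample one token per row of a [rows, vocab] logits shard."""
    rng = np.random.default_rng(seed)
    x = shard.astype(np.float64) / temperature
    x -= x.max(axis=-1, keepdims=True)
    probs = np.exp(x)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(p.size, p=p) for p in probs])


def sequence_parallel_sample(logits: np.ndarray, num_workers: int = 4) -> np.ndarray:
    shards = np.array_split(logits, num_workers, axis=0)   # split batch rows
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(sample_shard, shards, range(num_workers)))
    return np.concatenate(parts)


if __name__ == "__main__":
    tokens = sequence_parallel_sample(np.random.randn(16, 32_000).astype(np.float32))
    print(tokens.shape)   # (16,)
```

Because every worker holds complete vocabulary rows for its slice of the batch, no gather or reduce along the vocabulary axis is required.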
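The paper's CPU kernels are not reproduced here; the sketch below is a reconstruction of the truncation‑first idea under stated assumptions: temperature and a repetition‑style penalty are applied in one vectorized sweep, a cheap threshold against the maximum logit prunes the vocabulary before any sorting, and top‑p is resolved only on the small survivor set, with a full‑vocabulary fallback to keep the result correct. The `logit_margin` cutoff and the penalty rule are illustrative choices, not the paper's.

```python
# Illustrative single-pass, truncation-first top-p sampler for one request.
import numpy as np


def truncation_first_top_p(
    logits,              # [vocab] raw logits for one request
    prev_tokens,         # token IDs already generated (used for the penalty)
    temperature=0.8,
    top_p=0.95,
    rep_penalty=1.1,
    logit_margin=12.0,   # prune tokens this far below the max logit
    rng=None,
):
    rng = rng or np.random.default_rng()
    x = logits.astype(np.float64)

    # Penalty and temperature applied in one vectorized pass; no sorting here.
    prev = np.asarray(prev_tokens)
    x[prev] = np.where(x[prev] > 0, x[prev] / rep_penalty, x[prev] * rep_penalty)
    x /= temperature

    # The softmax normalizer is still a single cheap linear pass over the vocab.
    probs = np.exp(x - x.max())
    probs /= probs.sum()

    # Truncation-first filter: keep only tokens near the max logit.
    keep = np.flatnonzero(x >= x.max() - logit_margin)

    # If the survivors do not cover the nucleus, fall back to the full vocab.
    if probs[keep].sum() < top_p:
        keep = np.arange(x.size)

    # Resolve top-p on the (usually tiny) survivor set only.
    order = keep[np.argsort(-probs[keep])]
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))


# Example call with random data; vocabulary size and token history are made up.
rng = np.random.default_rng(0)
print(truncation_first_top_p(rng.standard_normal(32_000), np.array([5, 17, 5]), rng=rng))
```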
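Finally, the sketch below shows one way a hot‑vocab fast path can remain exact: split the probability mass between the hot set and its complement, take the cheap path with probability equal to the hot mass, and otherwise fall back to the rest of the vocabulary. This is a reconstruction of the SHVS idea, not the paper's algorithm; the hot‑set construction at the end (top‑512 logits) is purely an assumption.

```python
# Illustrative hot-vocab sampling with an exactness-preserving fallback:
# with probability equal to the hot set's mass, sample cheaply inside the
# hot set; otherwise sample from its complement. The mixture equals the
# original distribution exactly.
import numpy as np


def shvs_sample(logits: np.ndarray, hot_ids: np.ndarray, rng: np.random.Generator) -> int:
    x = logits.astype(np.float64)
    probs = np.exp(x - x.max())
    probs /= probs.sum()              # normalizer: one linear pass over the vocab

    hot_mass = probs[hot_ids].sum()
    if rng.random() < hot_mass:
        # Fast path: renormalize and sample within the small hot set only.
        return int(rng.choice(hot_ids, p=probs[hot_ids] / hot_mass))

    # Slow path: the draw fell outside the hot set, so sample from the
    # complement, renormalized. Taken with probability 1 - hot_mass.
    cold_mask = np.ones(probs.size, dtype=bool)
    cold_mask[hot_ids] = False
    cold_ids = np.flatnonzero(cold_mask)
    return int(rng.choice(cold_ids, p=probs[cold_ids] / probs[cold_ids].sum()))


# Example: peaked random logits; the 512 highest-logit tokens act as the hot set.
rng = np.random.default_rng(0)
logits = 4.0 * rng.standard_normal(32_000)     # scaled so a few tokens dominate
hot_ids = np.argpartition(-logits, 512)[:512]
print(shvs_sample(logits, hot_ids, rng))
```

When the hot set really does capture most of the mass, the slow path is rarely taken, so the expected per‑token work shrinks while the sampled distribution is unchanged.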
Results & Findings
| Metric | Baseline (GPU‑only) | SIMPLE | Improvement |
|---|---|---|---|
| End‑to‑end throughput (tokens/s) | 1.0× | up to 1.96× | up to +96 % |
| P95 per‑token latency | 100 ms (illustrative baseline) | 35–80 ms | –20 % to –65 % |
| GPU utilization (last PP stage) | 70 % (capped by sampling) | >90 % | – |
| Scaling with TP/PP | Degrades as GPUs get faster | Remains linear | – |
Key takeaways
- The decision plane’s share of iteration time drops from ~30 % to <5 % with SIMPLE.
- SHVS alone accounts for most of the speed‑up, especially when the hot‑vocab size is tuned per model/temperature.
- SIMPLE works with existing tensor‑parallel and pipeline‑parallel frameworks (e.g., Megatron‑LM, DeepSpeed) without code modifications.
Practical Implications
- Higher throughput for LLM APIs: Cloud providers can serve more requests per GPU, reducing cost per token.
- Lower tail latency: Interactive applications (code assistants, chatbots) benefit from tighter 95th‑percentile response times, improving user experience.
- Future‑proof scaling: As GPU compute continues to accelerate, the decision plane no longer becomes the limiting factor, allowing TP/PP to scale unhindered.
- Simplified deployment: Teams can adopt SIMPLE as a drop‑in service layer, avoiding invasive changes to model graphs or inference code.
- CPU‑friendly workloads: The approach leverages under‑utilized CPU resources in typical inference clusters, improving overall hardware efficiency.
Limitations & Future Work
- CPU load balancing: In extreme batch‑size regimes, the CPU side may become saturated; adaptive load‑shedding or multi‑node CPU scaling is an open question.
- Hot‑vocab model accuracy: The heuristic for hot‑vocab sizing is simple; more sophisticated, model‑aware predictors could further boost throughput.
- Memory overhead: Maintaining hot‑vocab tables per model adds modest CPU memory usage, which may be non‑trivial for extremely large vocabularies.
- Generality beyond decoder‑only LLMs: The paper focuses on autoregressive models; extending SIMPLE to encoder‑decoder or multimodal architectures remains future work.
Bottom line: SIMPLE demonstrates that moving the sampling step off the GPU and redesigning it for CPU parallelism can nearly double LLM serving throughput and cut tail latency—without any changes to user code. For developers building scalable LLM services, it offers a pragmatic path to unlock the next generation of performance gains.
Authors
- Bohan Zhao
- Zane Cao
- Yongchao He
Paper Information
- arXiv ID: 2512.00719v1
- Categories: cs.DC
- Published: November 30, 2025