How Many Users Can Your LLM Server Really Handle?
Source: VMware Blog
The Problem
Infrastructure engineers often face a vast configuration space and ask questions such as:
- Will tuning `--max-num-batched-tokens` or `--gpu-memory-utilization` in vLLM improve throughput?
- Could these changes unintentionally degrade tail latency?
The official vLLM documentation explains how to tune these parameters, but it rarely provides a systematic method for discovering the optimal configuration for a specific workload, hardware architecture, and strict Service Level Agreement (SLA).
Our Solution
We conducted a comprehensive capacity‑planning initiative for a 120‑billion‑parameter Mixture‑of‑Experts (MoE) model (gpt‑oss‑120b) deployed across multiple NVIDIA H100 and H200 clusters to power an internal AI coding assistant.
Rather than simply publishing the final capacity metrics, we documented the end‑to‑end methodology we used to achieve them.
Read the full technical white paper:
SPOC: a Stateful, Profile‑based Optimization for LLM Capacity Planning Methodology
The white paper serves as a comprehensive guide to LLM performance engineering, equipping infrastructure teams with the analytical tools and empirical techniques required to:
- Construct stateful, multi‑turn datasets that accurately simulate developers querying shared enterprise monorepos.
- Apply multi‑objective evolutionary algorithms (Optuna NSGA‑II) to mathematically explore the inference engine’s parameter space, replacing heuristic guesswork with rigorous optimization.
- Deploy an advanced telemetry stack (Prometheus + DCGM Exporter) to correlate internal inference‑engine metrics with physical hardware state.
- Capture and interpret kernel‑level NVIDIA Nsight Systems traces to identify true architectural bottlenecks—often contrary to simple theoretical roofline predictions.
Who Should Read This?
If you are responsible for scaling LLM infrastructure, this paper provides the empirical blueprint needed to move from estimating capacity to systematically measuring and optimizing it.
The Problem with the “Just Run a Benchmark” Approach
Standard LLM benchmarks (MLPerf, GenAI Perf, InferenceMax) send a fixed prompt at a fixed concurrency and report average latency or other single‑turn metrics. That works for leaderboard comparisons, but it falls short for capacity planning in real‑world use cases—e.g., many follow‑up questions for coding tasks or log analysis. In those scenarios, multi‑turn traffic simulation is a must.
Why Real Traffic Is Messy
| Traffic segment | Share of users | Request size (tokens) | How it stresses the system |
|---|---|---|---|
| Short | 70 % | 5 k → 50 k | Sets the floor for time‑to‑first‑token (TTFT) |
| Medium | 20 % | 15 k → 120 k | Balances TTFT and compute load |
| Large | 10 % | 75 k → >128 k (hits context limit) | Dominates GPU memory bandwidth and prefill compute |
- Short requests dominate the request rate, determining the minimum latency users will see.
- Large requests consume the most GPU memory and compute resources, often becoming the bottleneck.
- Treating all traffic as “average‑sized” yields a single number that doesn’t predict where the system will actually break.
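To make the heterogeneity concrete, here is a minimal sketch of a traffic‑mix sampler. The segment shares and token ranges come from the table above; the uniform sampling within each range, and all names in the code, are our own illustrative assumptions, not the paper's actual generator.

```python
import random

# Illustrative traffic-mix sampler. Shares and token ranges follow the
# table above; uniform sampling within each range is a simplification.
SEGMENTS = [
    # (name, share, min_tokens, max_tokens)
    ("short",  0.70,  5_000,  50_000),
    ("medium", 0.20, 15_000, 120_000),
    ("large",  0.10, 75_000, 128_000),
]

def sample_request_size(rng: random.Random) -> tuple[str, int]:
    """Pick a traffic segment by its share, then a request size within it."""
    r = rng.random()
    cumulative = 0.0
    for name, share, lo, hi in SEGMENTS:
        cumulative += share
        if r <= cumulative:
            return name, rng.randint(lo, hi)
    name, _, lo, hi = SEGMENTS[-1]
    return name, rng.randint(lo, hi)

rng = random.Random(42)
sizes = [sample_request_size(rng) for _ in range(10_000)]
short_share = sum(1 for name, _ in sizes if name == "short") / len(sizes)
print(f"observed short share: {short_share:.2f}")
```

Averaging this mix into one request size hides exactly the behavior that matters: the 10 % of large requests that dominate memory bandwidth and prefill compute.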
Bottom Line
We need a benchmark that reflects the heterogeneous mix of request sizes and multi‑turn interactions typical of production workloads—not just a single‑turn, average‑size test.
What We Built
The white paper describes a framework with three core stages:
1. Workload Modeling
- User profiles – Defined three profiles (P0, P1, P2) calibrated from observed usage patterns. Each profile has its own prompt‑size distribution, output budget, and think time.
- Stateful corpus – Built from open‑source trajectories.
- Simulation – Used Locust to generate multi‑turn streaming conversations that mimic real developers interacting with a coding assistant. The simulation includes a Partial Common Ground geometry to emulate shared enterprise monorepos.
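The essential property of this simulation is stateful context accumulation: each turn resends the growing conversation history. The sketch below captures that idea in plain Python; the profile numbers and class names are placeholders, not the paper's calibrated P0/P1/P2 values or its actual Locust implementation.

```python
from dataclasses import dataclass, field

# Minimal sketch of stateful, multi-turn context accumulation, the core
# idea behind the Locust-based simulation described above. All numbers
# here are placeholders, not the paper's calibrated profile values.
@dataclass
class UserProfile:
    name: str
    prompt_tokens: int      # tokens added per user turn
    output_budget: int      # tokens the model may generate per turn
    think_time_s: float     # idle time between turns

@dataclass
class Conversation:
    profile: UserProfile
    context_tokens: int = 0
    turns: list = field(default_factory=list)

    def take_turn(self) -> int:
        """Each turn appends the new prompt plus the prior reply to the
        context, so request size grows across the conversation."""
        self.context_tokens += self.profile.prompt_tokens
        request_size = self.context_tokens      # full history is resent
        self.context_tokens += self.profile.output_budget
        self.turns.append(request_size)
        return request_size

p0 = UserProfile("P0", prompt_tokens=2_000, output_budget=500, think_time_s=20.0)
conv = Conversation(p0)
sizes = [conv.take_turn() for _ in range(5)]
print(sizes)  # [2000, 4500, 7000, 9500, 12000] — monotonically increasing
```

This is why a single‑turn benchmark underestimates load: by turn five, a "2k‑token" user is already submitting a 12k‑token request.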
2. Evolutionary Parameter Search
Instead of manual tuning or exhaustive grid search, we employed Optuna with its NSGA‑II sampler to explore the vLLM parameter space at our target concurrency.
NSGA‑II is a multi‑objective evolutionary algorithm that simultaneously optimizes:
- Throughput
- Time‑to‑first‑token (TTFT)
- Inter‑token latency
The algorithm discovers the Pareto front—the set of configurations where improving one metric would degrade another.
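The Pareto‑front concept can be illustrated without any optimizer at all. The toy code below extracts the non‑dominated set from a handful of invented configurations; it shows what NSGA‑II converges toward, not how Optuna itself is invoked.

```python
# Toy illustration of the Pareto front NSGA-II converges toward. Each
# candidate is (throughput, ttft_ms, itl_ms): throughput is maximized,
# both latencies are minimized. The configurations are invented.
def dominates(a, b):
    """True if config `a` is no worse than `b` on every objective and
    strictly better on at least one (higher throughput, lower latency)."""
    at, attft, aitl = a
    bt, bttft, bitl = b
    no_worse = at >= bt and attft <= bttft and aitl <= bitl
    strictly_better = at > bt or attft < bttft or aitl < bitl
    return no_worse and strictly_better

def pareto_front(configs):
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other != c)]

candidates = [
    (120.0, 800.0, 45.0),   # high throughput, slow TTFT
    (100.0, 400.0, 40.0),   # balanced
    ( 90.0, 900.0, 60.0),   # dominated by the first two
    ( 80.0, 300.0, 30.0),   # low throughput, fast latencies
]
front = pareto_front(candidates)
print(front)  # the dominated third candidate is filtered out
```

Every configuration on the front is a defensible choice; which one you deploy depends on which SLA (TTFT vs. ITL vs. throughput) binds first.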
3. Kernel‑Level Profiling
Captured NVIDIA Nsight Systems traces during steady‑state load at capacity ceilings (300 concurrent users on 4 × H100, and 85 users on 2 × H200).
Decomposed GPU active time into functional categories:
- Flash Attention
- MoE Expert GEMMs
- NCCL collectives
The traces revealed that, for this sparse MoE architecture at large batch sizes, the system becomes heavily bound by Attention compute and memory bandwidth, contradicting simple roofline predictions.
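The decomposition itself is mechanical once the trace is exported: bucket kernel durations by substring match on kernel names. The patterns and sample rows below are illustrative; real Nsight Systems exports use exact kernel symbol names.

```python
# Sketch of bucketing GPU active time from an Nsight Systems trace export
# into functional categories. Patterns and sample rows are illustrative.
CATEGORIES = {
    "flash_attn": "Flash Attention",
    "gemm": "MoE Expert GEMMs",
    "nccl": "NCCL collectives",
}

def categorize(kernel_name: str) -> str:
    lowered = kernel_name.lower()
    for pattern, label in CATEGORIES.items():
        if pattern in lowered:
            return label
    return "Other"

# (kernel name, total duration in microseconds), as exported from a trace
rows = [
    ("flash_attn_varlen_fwd", 5200.0),
    ("cutlass_gemm_moe_expert", 3100.0),
    ("ncclDevKernel_AllReduce", 900.0),
    ("elementwise_add", 300.0),
]
totals: dict[str, float] = {}
for name, us in rows:
    label = categorize(name)
    totals[label] = totals.get(label, 0.0) + us
print(totals)
```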
4. Scaling the Best Configuration
- Swept the optimal configuration across various concurrency levels.
- Collected metrics with Prometheus and DCGM Exporter hardware counters.
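As a sketch of what "correlating engine internals with hardware state" looks like in practice, queries along these lines pair vLLM's latency histograms with DCGM's SM‑activity counters (the 5‑minute window is an arbitrary choice):

```promql
# p99 time-to-first-token from vLLM's Prometheus histogram
histogram_quantile(0.99,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# SM activity per GPU from DCGM Exporter, to correlate with the above
avg(DCGM_FI_PROF_SM_ACTIVE) by (gpu)
```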
What You Will Learn from the Paper
The paper serves as both a reference and a practical guide. It covers the following topics:
- Workload simulation – Designing simulations that reflect real‑user behavior and stateful context accumulation, rather than relying on stateless synthetic averages.
- Multi‑objective optimization – Efficiently searching the vLLM parameter space and seeing firsthand how optimization cycles dramatically improve performance on your GPUs.
- Observability stack – Setting up Prometheus and DCGM Exporter to obtain simultaneous visibility into inference‑engine internals and GPU hardware state.
- Kernel tracing – Capturing and interpreting NVIDIA Nsight Systems traces from a containerized vLLM deployment under load.
Key Findings
Chunked prefill is a vital trade‑off
- To protect inter‑token latency (ITL) for ongoing generations from massive prefill spikes caused by 128k‑token users, `--max-num-batched-tokens` must be tuned carefully.
- We found that setting it to 2048 on a 4× H100 system or 1024 on a 2× H200 system sacrifices a bit of TTFT speed but yields smooth streaming and prevents CUDA‑graph compilation timeouts.
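As a deployment sketch, the 4× H100 finding maps onto a launch command along these lines (the model path, port, and memory‑utilization value are placeholders, not values from the paper):

```shell
# Illustrative vLLM launch reflecting the chunked-prefill finding above.
# Model path, port, and --gpu-memory-utilization are placeholders.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --max-num-batched-tokens 2048 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000
```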
GPU utilization is not an SLA metric
- At the capacity ceiling we measured ~37 % SM active.
- Although it looks like ~60 % of compute capacity is idle, pushing utilization higher fills scheduling gaps but degrades per‑step decode latency (ITL) and causes SLA violations.
VRAM is not always the bottleneck
- Even with 10 % of users submitting massive 80k–128k‑token contexts, active KV‑cache usage stayed low (~10.5 % on 4× H100).
- Because the dataset simulates a shared enterprise monorepo, vLLM’s prefix caching deduplicates shared roots efficiently. The system is compute‑bound by attention kernels and memory bandwidth, not VRAM capacity.
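The deduplication mechanism can be sketched in a few lines: KV blocks are hashed together with their entire prefix, so two conversations share cached blocks exactly as far as their contexts agree. The block size and token streams below are toys; this illustrates the idea behind vLLM's prefix caching, not its implementation.

```python
import hashlib

# Toy model of block-level prefix caching. Block size and token streams
# are illustrative; the point is that a shared root is stored once.
BLOCK = 16  # tokens per KV block

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with everything before it, so two
    sequences share a block only if they share the entire prefix."""
    hashes = []
    h = hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

shared_root = list(range(64))            # e.g. the shared monorepo context
user_a = shared_root + [1000, 1001] * 8  # divergent 16-token tails
user_b = shared_root + [2000, 2001] * 8

cache = set(block_hashes(user_a))        # user_a populates the cache
new_blocks = [hb for hb in block_hashes(user_b) if hb not in cache]
print(f"user_b allocates {len(new_blocks)} of {len(block_hashes(user_b))} blocks")
```

With a large shared root, the second user allocates only the blocks for its divergent tail, which is why active KV‑cache usage stayed so low despite huge nominal contexts.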
Hardware scaling is non‑linear under tail‑latency constraints
- The 4× H100 system achieved ~3.5× the capacity of the 2× H200 system (300 vs. 85 users), rather than the expected 2×.
- This stems from aggregate memory‑bandwidth gains, Tensor‑Parallelism math division, and the chunked‑prefill penalty on smaller GPU clusters.
Thermal vulnerabilities in Tensor Parallelism
- With TP > 1, the entire inference step proceeds at the speed of the slowest GPU.
- A single GPU that throttles thermally forces all healthy GPUs to wait at NVLink synchronization barriers, causing severe, system‑wide latency spikes.
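The arithmetic behind this is worth making explicit: because every step ends at a collective, step time is the maximum over ranks, not the average. The timings below are invented for illustration.

```python
# Sketch of why one throttling GPU slows every rank under tensor
# parallelism: step time is the max per-rank time, since all ranks
# wait at the NVLink sync barrier. Timings are invented.
def tp_step_time(per_rank_ms: list[float]) -> float:
    """All ranks wait at the collective; the slowest rank sets the pace."""
    return max(per_rank_ms)

healthy = [12.0, 12.1, 11.9, 12.0]
throttled = [12.0, 12.1, 11.9, 19.5]  # one GPU thermally throttling

print(tp_step_time(healthy))    # 12.1
print(tp_step_time(throttled))  # 19.5 — a ~60% system-wide slowdown
```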
Hardware profiling realities vs. theoretical models
- Assumptions about quantization can mislead capacity planning.
- For example, while gpt‑oss‑120b stores expert weights in MXFP4 (4‑bit), vLLM on H100s unpacks them to BF16 in SM registers before matrix multiplication (W4A16).
- Assuming the model runs entirely in FP4 leads to mis‑predicting the bottleneck regime—a discrepancy confirmed by our kernel profiling.
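The storage‑vs‑compute distinction behind W4A16 can be shown with a toy nibble unpack: weights live in memory two‑per‑byte, but each is widened before the matmul, so compute proceeds at 16‑bit rates. The packing layout here is illustrative, not the actual MXFP4 format.

```python
# Toy illustration of the W4A16 point above: 4-bit weights are stored
# packed two-per-byte, but each is unpacked to a wider type before the
# matmul. The low-nibble-first layout is an illustrative assumption.
def unpack_nibbles(packed: bytes) -> list[int]:
    """Expand each byte into two unsigned 4-bit values (low nibble first)."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)         # low nibble
        out.append((byte >> 4) & 0x0F)  # high nibble
    return out

packed_weights = bytes([0x21, 0x43])   # packs the values 1, 2, 3, 4
print(unpack_nibbles(packed_weights))  # [1, 2, 3, 4]
```

Memory traffic reflects the 4‑bit storage, but arithmetic intensity reflects the unpacked width — conflating the two is exactly the mis‑prediction the kernel profiling exposed.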
Read the White Paper
We cannot claim to know the optimal number of users for your deployment; each deployment combines model, hardware, workload mix, and latency targets in a way that produces a different answer. The value of our research lies in the methodology detailed in our white paper: a repeatable process for finding your own answer with confidence.
The full paper is available here: SPOC: a Stateful, Profile‑based Optimization for LLM Capacity Planning Methodology.
We would love to hear how it goes if you adapt the framework to your own setup. The best benchmarks are the ones that reflect your actual users.