How Many Users Can Your LLM Server Really Handle?

Published: April 30, 2026 at 09:17 PM EDT
7 min read
Source: VMware Blog

The Problem

Infrastructure engineers often face a vast configuration space and ask questions such as:

  • Will tuning --max-num-batched-tokens or --gpu-memory-utilization in vLLM improve throughput?
  • Could these changes unintentionally degrade tail latency?

The official vLLM documentation explains how to tune these parameters, but it rarely provides a systematic method for discovering the optimal configuration for a specific workload, hardware architecture, and strict Service Level Agreement (SLA).

Our Solution

We conducted a comprehensive capacity‑planning initiative for a 120‑billion‑parameter Mixture‑of‑Experts (MoE) model (gpt‑oss‑120b) deployed across multiple NVIDIA H100 and H200 clusters to power an internal AI coding assistant.

Rather than simply publishing the final capacity metrics, we documented the end‑to‑end methodology we used to achieve them.

Read the full technical white paper:
SPOC: a Stateful, Profile‑based Optimization for LLM Capacity Planning Methodology

The white paper serves as a comprehensive guide to LLM performance engineering, equipping infrastructure teams with the analytical tools and empirical techniques required to:

  • Construct stateful, multi‑turn datasets that accurately simulate developers querying shared enterprise monorepos.
  • Apply multi‑objective evolutionary algorithms (Optuna NSGA‑II) to mathematically explore the inference engine’s parameter space, replacing heuristic guesswork with rigorous optimization.
  • Deploy an advanced telemetry stack (Prometheus + DCGM Exporter) to correlate internal inference‑engine metrics with physical hardware state.
  • Capture and interpret kernel‑level NVIDIA Nsight Systems traces to identify true architectural bottlenecks—often contrary to simple theoretical roofline predictions.

Who Should Read This?

If you are responsible for scaling LLM infrastructure, this paper provides the empirical blueprint needed to move from estimating capacity to systematically measuring and optimizing it.

The Problem with the "Just Run a Benchmark" Approach

Standard LLM benchmarks (MLPerf, GenAI Perf, InferenceMax) send a fixed prompt at a fixed concurrency and report average latency or other single‑turn metrics. That works for leaderboard comparisons, but it falls short for capacity planning in real‑world use cases such as coding tasks or log analysis, where users ask many follow‑up questions. In those scenarios, multi‑turn traffic simulation is a must.

Why Real Traffic Is Messy

| Traffic segment | Share of users | Request size (tokens) | How it stresses the system |
|---|---|---|---|
| Short | 70 % | 5 k → 50 k | Sets the floor for time‑to‑first‑token (TTFT) |
| Medium | 20 % | 15 k → 120 k | Balances TTFT and compute load |
| Large | 10 % | 75 k → >128 k (hits context limit) | Dominates GPU memory bandwidth and prefill compute |
  • Short requests dominate the request rate, determining the minimum latency users will see.
  • Large requests consume the most GPU memory and compute resources, often becoming the bottleneck.
  • Treating all traffic as “average‑sized” yields a single number that doesn’t predict where the system will actually break.

Bottom Line
We need a benchmark that reflects the heterogeneous mix of request sizes and multi‑turn interactions typical of production workloads, not just a single‑turn, average‑size test. The sketch below shows one way to draw such a mix.
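To make that mix concrete, here is a minimal Python sketch of a request‑size sampler. The segment shares and token ranges come from the table above; uniform sampling within each range is our own simplifying assumption, since real workloads may follow heavier‑tailed distributions.

```python
import random

# Traffic segments from the table above: (name, share of users, prompt-size range in tokens).
# Uniform sampling within each range is an illustrative assumption.
SEGMENTS = [
    ("short",  0.70, (5_000,  50_000)),
    ("medium", 0.20, (15_000, 120_000)),
    ("large",  0.10, (75_000, 131_072)),  # large requests may hit the context limit
]

def sample_request_size(rng: random.Random) -> tuple[str, int]:
    """Pick a traffic segment by its share, then a prompt size within its range."""
    r = rng.random()
    cumulative = 0.0
    for name, share, (lo, hi) in SEGMENTS:
        cumulative += share
        if r <= cumulative:
            return name, rng.randint(lo, hi)
    # Float rounding fallback: return the last segment.
    name, _, (lo, hi) = SEGMENTS[-1]
    return name, rng.randint(lo, hi)

rng = random.Random(42)
print([sample_request_size(rng) for _ in range(5)])
```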

What We Built

The white paper describes a framework with four core stages:

1. Workload Modeling

  • User profiles – Defined three profiles (P0, P1, P2) calibrated from observed usage patterns. Each profile has its own prompt‑size distribution, output budget, and think time.
  • Stateful corpus – Built from open‑source trajectories.
  • Simulation – Used Locust to generate multi‑turn streaming conversations that mimic real developers interacting with a coding assistant; a minimal sketch follows below. The simulation includes a Partial Common Ground geometry to emulate shared enterprise monorepos.
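As a rough illustration of that simulation layer, a multi‑turn Locust user might look like the sketch below. The endpoint and payload follow vLLM's OpenAI‑compatible chat API; the think times, turn content, and token budgets are illustrative placeholders, not the calibrated P0/P1/P2 profile values from the paper.

```python
import random
from locust import HttpUser, task, between

class CodingAssistantUser(HttpUser):
    """Simulates a developer holding a multi-turn conversation with the assistant.

    The think time and turn content below are placeholders; in the real harness
    each user profile has its own prompt-size distribution and output budget.
    """
    wait_time = between(5, 30)  # "think time" between turns

    def on_start(self):
        # Each simulated user accumulates its own conversation state across turns.
        self.messages = [{"role": "system", "content": "You are a coding assistant."}]

    @task
    def ask_follow_up(self):
        self.messages.append(
            {"role": "user", "content": "Explain the failure in this service, step by step."}
        )
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-oss-120b",
                "messages": self.messages,
                "stream": True,  # streaming lets the harness observe TTFT and ITL
                "max_tokens": random.randint(256, 1024),
            },
        )
        # Append a placeholder assistant turn so context grows turn over turn.
        self.messages.append({"role": "assistant", "content": "..."})
```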
2. Multi‑Objective Optimization

  • Instead of manual tuning or exhaustive grid search, we employed Optuna with its NSGA‑II sampler to explore the vLLM parameter space at our target concurrency; a sketch of the setup appears after this list.

  • NSGA‑II is a multi‑objective evolutionary algorithm that simultaneously optimizes:

    1. Throughput
    2. Time‑to‑first‑token (TTFT)
    3. Inter‑token latency
  • The algorithm discovers the Pareto front—the set of configurations where improving one metric would degrade another.
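A minimal sketch of this setup is shown below. The search ranges are illustrative, and the synthetic load‑test stub stands in for the real harness that replays the workload and measures the three objectives.

```python
import optuna

def run_load_test(mnbt: int, gmu: float):
    """Placeholder for the real harness: launch vLLM with these settings, replay
    the workload at target concurrency, and measure (throughput, TTFT, ITL).
    The synthetic formulas below exist only to make the sketch runnable."""
    throughput = gmu * mnbt ** 0.5        # stand-in: larger batches help throughput
    ttft = mnbt / 4096 + (1 - gmu)        # stand-in: larger prefill chunks delay TTFT
    itl = 0.02 + mnbt / 200_000           # stand-in: larger chunks stall decoding
    return throughput, ttft, itl

def objective(trial: optuna.Trial):
    # Search ranges are illustrative, not the paper's exact bounds.
    mnbt = trial.suggest_int("max_num_batched_tokens", 512, 8192, step=512)
    gmu = trial.suggest_float("gpu_memory_utilization", 0.80, 0.95)
    return run_load_test(mnbt, gmu)

study = optuna.create_study(
    directions=["maximize", "minimize", "minimize"],  # throughput up, latencies down
    sampler=optuna.samplers.NSGAIISampler(seed=42),
)
study.optimize(objective, n_trials=50)

# study.best_trials holds the Pareto front: configurations where no metric
# can improve without degrading another.
for t in study.best_trials:
    print(t.params, t.values)
```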

3. Kernel‑Level Profiling

  • Captured NVIDIA Nsight Systems traces during steady‑state load at capacity ceilings (300 concurrent users on 4 × H100, and 85 users on 2 × H200).

  • Decomposed GPU active time into functional categories:

    • Flash Attention
    • MoE Expert GEMMs
    • NCCL collectives
  • The traces revealed that, for this sparse MoE architecture at large batch sizes, the system is heavily bound by attention compute and memory bandwidth, contradicting simple roofline predictions; a capture sketch follows below.
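For reference, a capture along these lines can be scripted as below. The nsys flags and server command are a plausible baseline rather than the paper's exact invocation; in a containerized deployment you would run nsys inside the container and adjust paths and model names accordingly.

```python
import subprocess

# Illustrative only: wraps vLLM's OpenAI-compatible server in an Nsight Systems
# capture. The delay skips warm-up so tracing starts at steady-state load.
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx",   # capture CUDA kernels and NVTX ranges
    "--delay=120",         # wait out warm-up before tracing begins
    "--duration=60",       # one minute of steady state at the capacity ceiling
    "-o", "vllm_capacity_trace",
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "openai/gpt-oss-120b",   # assumed model identifier
    "--tensor-parallel-size", "4",
]
subprocess.run(cmd, check=True)
```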

4. Scaling the Best Configuration

  • Swept the optimal configuration across various concurrency levels.
  • Collected metrics with Prometheus and DCGM Exporter hardware counters; an example query sketch follows below.
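As an example of correlating the two sources, instant queries against the Prometheus HTTP API might look like the sketch below. The metric names are the standard DCGM Exporter and vLLM names, but verify them against your exporter and vLLM versions, and the Prometheus address is an assumption.

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

def query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    r = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    r.raise_for_status()
    return r.json()["data"]["result"]

# Engine-side view: requests currently running in the vLLM scheduler.
running = query("vllm:num_requests_running")

# Hardware-side view: SM activity per GPU from DCGM Exporter.
sm_active = query("avg by (gpu) (DCGM_FI_PROF_SM_ACTIVE)")

# p99 time-to-first-token over the last 5 minutes, from vLLM's histogram.
ttft_p99 = query(
    "histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))"
)

print(running, sm_active, ttft_p99, sep="\n")
```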

What You Will Learn from the Paper

The paper serves as both a reference and a practical guide. It covers the following topics:

  • Workload simulation – Designing simulations that reflect real‑user behavior and stateful context accumulation, rather than relying on stateless synthetic averages.
  • Multi‑objective optimization – Efficiently searching the vLLM parameter space and seeing firsthand how successive optimization cycles improve performance on your GPUs.
  • Observability stack – Setting up Prometheus and DCGM Exporter to obtain simultaneous visibility into inference‑engine internals and GPU hardware state.
  • Kernel tracing – Capturing and interpreting NVIDIA Nsight Systems traces from a containerized vLLM deployment under load.

Key Findings

  • Chunked prefill is a vital trade‑off

    • To protect inter‑token latency (ITL) for ongoing generations from massive prefill spikes caused by 128k‑token users, --max-num-batched-tokens must be tuned carefully.
    • We found that setting it to 2048 on a 4× H100 system or 1024 on a 2× H200 system sacrifices a bit of TTFT speed but yields smooth streaming and prevents CUDA‑graph compilation timeouts (the configuration sketch after this list shows these values in context).
  • GPU utilization is not an SLA metric

    • At the capacity ceiling we measured ~37 % SM active.
    • Although this implies roughly 63 % of compute capacity sits idle, pushing utilization higher fills scheduling gaps but degrades per‑step decode latency (ITL) and causes SLA violations.
  • VRAM is not always the bottleneck

    • Even with 10 % of users submitting massive 80k–128k‑token contexts, active KV‑cache usage stayed low (~10.5 % on 4× H100).
    • Because the dataset simulates a shared enterprise monorepo, vLLM’s prefix caching deduplicates shared roots efficiently. The system is compute‑bound by attention kernels and memory bandwidth, not VRAM capacity.
  • Hardware scaling is non‑linear under tail‑latency constraints

    • The 4× H100 system achieved ~3.5× the capacity of the 2× H200 system (300 vs. 85 users), rather than the expected 2×.
    • This stems from aggregate memory‑bandwidth gains, how the work divides across Tensor‑Parallel ranks, and the chunked‑prefill penalty on smaller GPU clusters.
  • Thermal vulnerabilities in Tensor Parallelism

    • With TP > 1, the entire inference step proceeds at the speed of the slowest GPU.
    • A single GPU that throttles thermally forces all healthy GPUs to wait at NVLink synchronization barriers, causing severe, system‑wide latency spikes.
  • Hardware profiling realities vs. theoretical models

    • Assumptions about quantization can mislead capacity planning.
    • For example, while gpt‑oss‑120b stores expert weights in MXFP4 (4‑bit), vLLM on H100s unpacks them to BF16 in SM registers before matrix multiplication (W4A16).
    • Assuming the model runs entirely in FP4 leads to mis‑predicting the bottleneck regime—a discrepancy confirmed by our kernel profiling.
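Pulling the tuned values above together, a 4× H100 deployment along these lines could be expressed through vLLM's Python API. This is a sketch implied by the findings, not the paper's exact launch configuration; parameter names and defaults vary across vLLM versions, and the memory‑utilization value here is illustrative.

```python
from vllm import LLM

# Sketch of the tuned 4x H100 configuration implied by the findings above.
# Parameter names follow vLLM's EngineArgs; verify against your vLLM version.
llm = LLM(
    model="openai/gpt-oss-120b",     # assumed model identifier
    tensor_parallel_size=4,
    max_num_batched_tokens=2048,     # chunked-prefill budget: protects ITL from 128k prefills
    enable_chunked_prefill=True,
    enable_prefix_caching=True,      # deduplicates shared monorepo context across users
    gpu_memory_utilization=0.90,     # illustrative; tune via the methodology above
    max_model_len=131_072,
)
```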

Read the White Paper

We cannot claim to know the optimal number of users for your deployment; each deployment is a unique combination of model, hardware, workload mix, and latency targets, and each combination produces a different answer. The value of our research lies in the methodology detailed in the white paper: a repeatable process for finding your own answer with confidence.

The full paper is available here: SPOC: a Stateful, Profile‑based Optimization for LLM Capacity Planning Methodology.

We would love to hear how it goes if you adapt the framework to your own setup. The best benchmarks are the ones that reflect your actual users.
