[Paper] LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Published: February 26, 2026
4 min read
Source: arXiv (2602.23036v1)

Overview

LLMServingSim 2.0 is a system‑level simulator that lets engineers explore how heterogeneous accelerators (GPUs, TPUs, emerging near‑memory chips) and disaggregated serving architectures (separate compute, memory, and model shards) interact at runtime. By unifying hardware and software decisions in a single simulation loop, the tool makes it possible to predict latency, memory usage, and power for complex LLM deployments with near‑real‑world accuracy.

Key Contributions

  • Unified runtime‑driven simulation that couples serving‑stack decisions (batching, routing, offloading) with detailed hardware behavior.
  • Profile‑based extensibility for plugging in new accelerators, memory technologies, and interconnects without rewriting the core simulator.
  • High fidelity validation: average error < 1 % on latency, memory, and power when compared against production clusters.
  • Fast turnaround: end‑to‑end runs of realistic configurations complete in ~10 minutes on a single workstation.
  • Open‑source reference implementation (released under a permissive license) with documentation and example workloads.

Methodology

  1. Runtime Loop Integration – The simulator models a single “serving tick” where it first applies scheduling policies (e.g., which request goes to which accelerator), then updates hardware state (resource occupancy, memory bandwidth, power draw), and finally advances time. This tight loop captures feedback effects such as queue buildup or memory contention.
  2. Profile‑Based Hardware Models – Each accelerator or memory module is described by a JSON/YAML profile containing latency tables, bandwidth limits, power curves, and compute throughput. Adding a new device is as simple as supplying a calibrated profile.
  3. Disaggregated Component Modeling – Compute nodes, memory pools, and model‑shard repositories are instantiated as separate entities linked by a configurable interconnect (PCIe, NVLink, CXL). Data movement costs are computed per‑request based on the chosen routing policy.
  4. Serving Stack Hooks – The simulator exposes APIs that mimic popular serving frameworks (e.g., vLLM, TGI). Researchers can plug in custom batching or routing algorithms and see their impact instantly.
  5. Validation Suite – Real‑world traces from a multi‑GPU cluster running GPT‑3‑style workloads were used to calibrate profiles and verify that simulated latency, memory footprint, and power match measured values.
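The runtime loop and profile-based hardware models described above can be sketched in a few lines of Python. This is a toy illustration, not the simulator's actual API: the device names, profile fields, and the earliest-free-device routing policy are all assumptions chosen to show the tick structure (schedule, update hardware state, advance time).

```python
# Hypothetical device profiles, mirroring the paper's idea of JSON/YAML
# profiles that capture throughput, memory capacity, and power behavior.
PROFILES = {
    "gpu-a": {"tokens_per_s": 12_000, "mem_gb": 80, "peak_w": 400},
    "npu-b": {"tokens_per_s": 4_000, "mem_gb": 32, "peak_w": 120},
}

def schedule(request, devices):
    """Toy routing policy: pick the device that frees up earliest."""
    return min(devices, key=lambda d: d["free_at"])

def simulate(requests, profiles=PROFILES):
    """One 'serving tick' loop: apply the scheduling policy, update
    hardware state (occupancy, energy), then advance time."""
    devices = [{"name": n, "free_at": 0.0, "energy_j": 0.0, **p}
               for n, p in profiles.items()]
    results = []
    for arrival, tokens in sorted(requests):        # (arrival_s, n_tokens)
        dev = schedule({"tokens": tokens}, devices)
        start = max(arrival, dev["free_at"])        # queueing delay
        service = tokens / dev["tokens_per_s"]      # latency-table stand-in
        dev["free_at"] = start + service            # occupancy update
        dev["energy_j"] += service * dev["peak_w"]  # power-curve stand-in
        results.append({"device": dev["name"],
                        "latency_s": dev["free_at"] - arrival})
    return devices, results
```

Even at this fidelity the loop reproduces the feedback effects the paper emphasizes: a second request arriving while the fast device is busy gets routed to the slower one, and queue buildup shows up directly in per-request latency.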

Results & Findings

Metric                              Simulated vs. Real              Avg. Error
End‑to‑end request latency          99.2 % of observed              0.8 %
Peak memory consumption             100.1 % of observed             0.1 %
Power draw (cluster‑wide)           98.9 % of observed              1.1 %
Simulation time (complex config)    ~10 min vs. hours of real run   n/a

Key takeaways

  • Heterogeneity matters – Mixing a high‑throughput GPU with a low‑latency near‑memory accelerator can reduce tail latency by up to 30 % when the scheduler is aware of the trade‑offs.
  • Disaggregation overhead – Offloading model shards to a remote memory pool adds ~2 µs per token; however, the same offload can free on‑chip memory, enabling larger batch sizes that offset the cost.
  • Power‑aware routing – Simple power‑capped policies can shave 15 % of energy consumption with < 5 % latency penalty, a trade‑off that is difficult to discover without a simulator.
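The disaggregation trade-off above can be checked with back-of-envelope arithmetic. Only the ~2 µs/token offload overhead comes from the paper; the per-token compute time and batch sizes below are illustrative assumptions.

```python
OFFLOAD_OVERHEAD_S = 2e-6    # ~2 µs per token over the interconnect (from the paper)
PER_TOKEN_COMPUTE_S = 20e-6  # assumed on-chip decode time per token
BATCH_LOCAL = 16             # assumed batch size with shards held on-chip
BATCH_OFFLOADED = 32         # assumed batch size after freeing on-chip memory

def throughput(batch, per_token_s):
    """Tokens/s for a batch, assuming near-linear batching efficiency."""
    return batch / per_token_s

local = throughput(BATCH_LOCAL, PER_TOKEN_COMPUTE_S)
remote = throughput(BATCH_OFFLOADED, PER_TOKEN_COMPUTE_S + OFFLOAD_OVERHEAD_S)
print(f"local: {local:,.0f} tok/s, offloaded: {remote:,.0f} tok/s")
```

Under these assumptions the doubled batch size more than offsets the per-token offload cost, which is exactly the kind of non-obvious crossover the simulator is built to surface.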

Practical Implications

  • Accelerator vendors can use LLMServingSim 2.0 to benchmark new chips in realistic serving pipelines before silicon is available, guiding design choices (e.g., memory bandwidth vs. compute density).
  • Cloud providers gain a sandbox for evaluating disaggregated architectures (CXL‑based memory pools, composable compute) and for sizing SLAs based on predicted tail‑latency under mixed workloads.
  • ML engineers can experiment with custom batching or token‑routing strategies and instantly see how they affect cost and latency, accelerating the iteration cycle from days to minutes.
  • Tooling ecosystem – Because the simulator mimics popular serving APIs, it can be integrated into CI pipelines, enabling automated regression testing of new hardware‑software co‑designs.
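The CI-integration idea above amounts to gating builds on simulated SLOs. A minimal sketch of such a gate follows; the latency values are stand-ins for a simulator run's output, and none of the names here belong to LLMServingSim's real interface.

```python
def p99(latencies):
    """99th-percentile latency from a list of per-request latencies."""
    s = sorted(latencies)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def check_slo(latencies, slo_s=0.5):
    """CI gate: True iff simulated p99 latency meets the SLO."""
    return p99(latencies) <= slo_s

# Example: per-request latencies produced by a simulated config.
assert check_slo([0.1, 0.2, 0.3, 0.4], slo_s=0.5)
```

Because a full simulation finishes in roughly ten minutes, a check like this can run on every pull request that changes a scheduling policy or hardware profile.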

Limitations & Future Work

  • Model granularity – The current profiles abstract away micro‑architectural details (e.g., cache hierarchy effects), which may matter for ultra‑low‑latency use cases.
  • Network topology – Only a few standard interconnect topologies are pre‑modeled; more exotic fabrics (e.g., hierarchical CXL fabrics) require manual extension.
  • Workload diversity – Validation focused on autoregressive LLM inference; future work will broaden to retrieval‑augmented generation, fine‑tuning, and multi‑modal models.
  • Dynamic scaling – The simulator assumes a static cluster size; adding support for elastic scaling (autoscaling nodes on demand) is on the roadmap.

LLMServingSim 2.0 bridges the gap between hardware innovation and serving‑system design, giving developers a practical, fast, and accurate way to explore the next generation of heterogeneous, disaggregated LLM infrastructures.

Authors

  • Jaehong Cho
  • Hyunmin Choi
  • Guseul Heo
  • Jongse Park

Paper Information

  • arXiv ID: 2602.23036v1
  • Categories: cs.DC, cs.AI
  • Published: February 26, 2026