[Paper] LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Published: February 26, 2026
4 min read
Source: arXiv (2602.23036v1)

Overview

LLMServingSim 2.0 is a system‑level simulator that lets engineers explore how heterogeneous accelerators (GPUs, TPUs, emerging near‑memory chips) and disaggregated serving architectures (separate compute, memory, and model shards) interact at runtime. By unifying hardware and software decisions in a single simulation loop, the tool makes it possible to predict latency, memory usage, and power for complex LLM deployments with near‑real‑world accuracy.

Key Contributions

  • Unified runtime‑driven simulation that couples serving‑stack decisions (batching, routing, offloading) with detailed hardware behavior.
  • Profile‑based extensibility for plugging in new accelerators, memory technologies, and interconnects without rewriting the core simulator.
  • High fidelity validation: average error < 1 % on latency, memory, and power when compared against production clusters.
  • Fast turnaround: end‑to‑end runs of realistic configurations complete in ~10 minutes on a single workstation.
  • Open‑source reference implementation (released under a permissive license) with documentation and example workloads.

Methodology

  1. Runtime Loop Integration – The simulator models a single “serving tick” where it first applies scheduling policies (e.g., which request goes to which accelerator), then updates hardware state (resource occupancy, memory bandwidth, power draw), and finally advances time. This tight loop captures feedback effects such as queue buildup or memory contention.
  2. Profile‑Based Hardware Models – Each accelerator or memory module is described by a JSON/YAML profile containing latency tables, bandwidth limits, power curves, and compute throughput. Adding a new device is as simple as supplying a calibrated profile.
  3. Disaggregated Component Modeling – Compute nodes, memory pools, and model‑shard repositories are instantiated as separate entities linked by a configurable interconnect (PCIe, NVLink, CXL). Data movement costs are computed per‑request based on the chosen routing policy.
  4. Serving Stack Hooks – The simulator exposes APIs that mimic popular serving frameworks (e.g., vLLM, TGI). Researchers can plug in custom batching or routing algorithms and see their impact instantly.
  5. Validation Suite – Real‑world traces from a multi‑GPU cluster running GPT‑3‑style workloads were used to calibrate profiles and verify that simulated latency, memory footprint, and power match measured values.
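The runtime loop and profile-based hardware models described above can be sketched in a few lines of Python. This is a toy illustration, not the simulator's actual API: the device names, profile fields, and the earliest-free-device routing policy are all assumptions chosen to show the tick structure (schedule, update hardware state, advance time).

```python
# Hypothetical device profiles, mirroring the paper's idea of JSON/YAML
# profiles that capture throughput, memory capacity, and power behavior.
PROFILES = {
    "gpu-a": {"tokens_per_s": 12_000, "mem_gb": 80, "peak_w": 400},
    "npu-b": {"tokens_per_s": 4_000, "mem_gb": 32, "peak_w": 120},
}

def schedule(request, devices):
    """Toy routing policy: pick the device that frees up earliest."""
    return min(devices, key=lambda d: d["free_at"])

def simulate(requests, profiles=PROFILES):
    """One 'serving tick' loop: apply the scheduling policy, update
    hardware state (occupancy, energy), then advance time."""
    devices = [{"name": n, "free_at": 0.0, "energy_j": 0.0, **p}
               for n, p in profiles.items()]
    results = []
    for arrival, tokens in sorted(requests):        # (arrival_s, n_tokens)
        dev = schedule({"tokens": tokens}, devices)
        start = max(arrival, dev["free_at"])        # queueing delay
        service = tokens / dev["tokens_per_s"]      # latency-table stand-in
        dev["free_at"] = start + service            # occupancy update
        dev["energy_j"] += service * dev["peak_w"]  # power-curve stand-in
        results.append({"device": dev["name"],
                        "latency_s": dev["free_at"] - arrival})
    return devices, results
```

Even at this fidelity the loop reproduces the feedback effects the paper emphasizes: a second request arriving while the fast device is busy gets routed to the slower one, and queue buildup shows up directly in per-request latency.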

Results & Findings

Metric                              Simulated vs. Real              Avg. Error
End‑to‑end request latency          99.2 % of observed              0.8 %
Peak memory consumption             100.1 % of observed             0.1 %
Power draw (cluster‑wide)           98.9 % of observed              1.1 %
Simulation time (complex config)    ~10 min vs. hours of real run   n/a

Key takeaways

  • Heterogeneity matters – Mixing a high‑throughput GPU with a low‑latency near‑memory accelerator can reduce tail latency by up to 30 % when the scheduler is aware of the trade‑offs.
  • Disaggregation overhead – Offloading model shards to a remote memory pool adds ~2 µs per token; however, the same offload can free on‑chip memory, enabling larger batch sizes that offset the cost.
  • Power‑aware routing – Simple power‑capped policies can shave 15 % of energy consumption with < 5 % latency penalty, a trade‑off that is difficult to discover without a simulator.
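The disaggregation trade-off above can be checked with back-of-envelope arithmetic. Only the ~2 µs/token offload overhead comes from the paper; the per-token compute time and batch sizes below are illustrative assumptions.

```python
OFFLOAD_OVERHEAD_S = 2e-6    # ~2 µs per token over the interconnect (from the paper)
PER_TOKEN_COMPUTE_S = 20e-6  # assumed on-chip decode time per token
BATCH_LOCAL = 16             # assumed batch size with shards held on-chip
BATCH_OFFLOADED = 32         # assumed batch size after freeing on-chip memory

def throughput(batch, per_token_s):
    """Tokens/s for a batch, assuming near-linear batching efficiency."""
    return batch / per_token_s

local = throughput(BATCH_LOCAL, PER_TOKEN_COMPUTE_S)
remote = throughput(BATCH_OFFLOADED, PER_TOKEN_COMPUTE_S + OFFLOAD_OVERHEAD_S)
print(f"local: {local:,.0f} tok/s, offloaded: {remote:,.0f} tok/s")
```

Under these assumptions the doubled batch size more than offsets the per-token offload cost, which is exactly the kind of non-obvious crossover the simulator is built to surface.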

Practical Implications

  • Accelerator vendors can use LLMServingSim 2.0 to benchmark new chips in realistic serving pipelines before silicon is available, guiding design choices (e.g., memory bandwidth vs. compute density).
  • Cloud providers gain a sandbox for evaluating disaggregated architectures (CXL‑based memory pools, composable compute) and for sizing SLAs based on predicted tail‑latency under mixed workloads.
  • ML engineers can experiment with custom batching or token‑routing strategies and instantly see how they affect cost and latency, accelerating the iteration cycle from days to minutes.
  • Tooling ecosystem – Because the simulator mimics popular serving APIs, it can be integrated into CI pipelines, enabling automated regression testing of new hardware‑software co‑designs.
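The CI-integration idea above amounts to gating builds on simulated SLOs. A minimal sketch of such a gate follows; the latency values are stand-ins for a simulator run's output, and none of the names here belong to LLMServingSim's real interface.

```python
def p99(latencies):
    """99th-percentile latency from a list of per-request latencies."""
    s = sorted(latencies)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def check_slo(latencies, slo_s=0.5):
    """CI gate: True iff simulated p99 latency meets the SLO."""
    return p99(latencies) <= slo_s

# Example: per-request latencies produced by a simulated config.
assert check_slo([0.1, 0.2, 0.3, 0.4], slo_s=0.5)
```

Because a full simulation finishes in roughly ten minutes, a check like this can run on every pull request that changes a scheduling policy or hardware profile.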

Limitations & Future Work

  • Model granularity – The current profiles abstract away micro‑architectural details (e.g., cache hierarchy effects), which may matter for ultra‑low‑latency use cases.
  • Network topology – Only a few standard interconnect topologies are pre‑modeled; more exotic fabrics (e.g., hierarchical CXL fabrics) require manual extension.
  • Workload diversity – Validation focused on autoregressive LLM inference; future work will broaden to retrieval‑augmented generation, fine‑tuning, and multi‑modal models.
  • Dynamic scaling – The simulator assumes a static cluster size; adding support for elastic scaling (autoscaling nodes on demand) is on the roadmap.

LLMServingSim 2.0 bridges the gap between hardware innovation and serving‑system design, giving developers a practical, fast, and accurate way to explore the next generation of heterogeneous, disaggregated LLM infrastructures.

Authors

  • Jaehong Cho
  • Hyunmin Choi
  • Guseul Heo
  • Jongse Park

Paper Information

  • arXiv ID: 2602.23036v1
  • Categories: cs.DC, cs.AI
  • Published: February 26, 2026