[Paper] ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
Source: arXiv - 2601.21473v1
Overview
The paper introduces ScaleSim, a system that makes it practical to run thousands of LLM‑powered agents in a single simulation without blowing out GPU memory. By observing that agents are only occasionally active and that their future activation order can be predicted, the authors devise a new “invocation distance” abstraction that drives smarter memory prefetching and eviction, delivering noticeable speedups on real‑world simulation workloads.
Key Contributions
- Invocation Distance abstraction – a lightweight metric that estimates how far away each agent’s next LLM request is, enabling proactive memory management.
- Proactive prefetching & priority‑based eviction – agents with short invocation distances are kept resident, while those far in the future are swapped out, reducing GPU memory pressure.
- Modular memory interface – supports heterogeneous per‑agent state (model weights, prefix caches, adapters, etc.) without hard‑coding any specific representation.
- ScaleSim runtime – a drop‑in serving layer that integrates with existing LLM back‑ends (e.g., SGLang) and delivers up to 1.74× speedup on multi‑agent benchmarks.
- Comprehensive workload analysis – characterizes real simulation workloads to validate the sparsity of agent activation and the predictability of invocation order.
Methodology
- Workload Characterization – The authors profile several representative multi‑agent simulations (e.g., game AI, economic modeling) and find two recurring patterns:
  - Sparse activation: at any given step, only a small subset of agents actually issue LLM calls.
  - Predictable ordering: the sequence in which agents will be invoked can be estimated from the simulation's control flow.
- Defining Invocation Distance – For each agent, the system tracks the number of steps (or time) until its next expected LLM request. This distance is continuously updated as the simulation progresses.
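The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the ScaleSim implementation; the class and method names are invented for clarity:

```python
class InvocationTracker:
    """Tracks, per agent, the predicted step of its next LLM call and
    derives the invocation distance from the current simulation step."""

    def __init__(self):
        self.next_call_step = {}  # agent_id -> predicted step of next LLM call
        self.current_step = 0

    def predict(self, agent_id, step):
        # Record or refresh the predicted next invocation step for an agent,
        # e.g. derived from the simulation's control flow.
        self.next_call_step[agent_id] = step

    def advance(self):
        # Called once per simulation step.
        self.current_step += 1

    def distance(self, agent_id):
        # Steps until the agent's next expected LLM request (0 = imminent).
        # Agents with no prediction are treated as infinitely far away.
        nxt = self.next_call_step.get(agent_id, float("inf"))
        return max(0, nxt - self.current_step)
```

Because predictions are refreshed as the simulation runs, the distance adapts on the fly when activation patterns change, matching the robustness claim in the results.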
- Memory Management Policy –
  - Prefetching: when an agent's distance drops below a configurable threshold, its private state (model shard, cache, adapters) is proactively loaded onto the GPU.
  - Eviction: agents with the largest distances are selected for eviction first, freeing space for imminent agents.
  - The policy is implemented as a priority queue keyed by invocation distance, allowing O(log N) updates.
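A simplified sketch of this prefetch/evict decision logic is shown below. The threshold value and the one-shot heap rebuild are simplifications for illustration (the paper's priority queue supports incremental O(log N) updates); all names are hypothetical:

```python
import heapq

PREFETCH_THRESHOLD = 3  # steps; a tunable knob, value chosen arbitrarily here

def plan_memory_moves(distances, resident, capacity):
    """Decide which agents to prefetch onto the GPU and which to evict.

    distances: {agent_id: steps until next LLM call}
    resident:  set of agent ids currently on the GPU
    capacity:  max number of agents the GPU can hold
    Returns (to_prefetch, to_evict).
    """
    # Prefetch agents whose next call is imminent and who are not yet resident.
    to_prefetch = [a for a, d in distances.items()
                   if d <= PREFETCH_THRESHOLD and a not in resident]

    # If prefetching would exceed capacity, evict the residents whose next
    # invocation is farthest in the future.
    overflow = len(resident) + len(to_prefetch) - capacity
    to_evict = []
    if overflow > 0:
        # Max-heap on distance (negated for Python's min-heap).
        heap = [(-distances.get(a, float("inf")), a) for a in resident]
        heapq.heapify(heap)
        for _ in range(overflow):
            _, victim = heapq.heappop(heap)
            to_evict.append(victim)
    return to_prefetch, to_evict
```

The key design point is symmetry: the same metric that triggers prefetching (small distance) also ranks eviction victims (large distance), so one priority queue drives both directions of data movement.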
- Modular State Interface – Developers can plug in custom per‑agent data structures (e.g., LoRA adapters, prompt prefixes) by implementing a small API; ScaleSim handles the movement of these blobs between host and device memory.
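Such a plug-in interface might look like the sketch below. The method names and the `LoraAdapterState` example are assumptions for illustration, not the actual ScaleSim API:

```python
from abc import ABC, abstractmethod

class AgentState(ABC):
    """Contract a per-agent state blob must satisfy so the runtime can move
    it between host and device memory and budget GPU space for it."""

    @abstractmethod
    def to_device(self):
        """Move this agent's private state (weights, prefix cache, adapters) to the GPU."""

    @abstractmethod
    def to_host(self):
        """Offload the state back to host memory."""

    @abstractmethod
    def nbytes(self):
        """Size of the blob in bytes, used for GPU memory accounting."""

class LoraAdapterState(AgentState):
    """Toy example: an agent's LoRA adapter weights as raw byte buffers."""

    def __init__(self, adapter_weights):
        self.w = adapter_weights
        self.on_gpu = False

    def to_device(self):
        self.on_gpu = True   # placeholder for a real host-to-device copy

    def to_host(self):
        self.on_gpu = False  # placeholder for a real device-to-host copy

    def nbytes(self):
        return sum(len(buf) for buf in self.w)
```

Keeping the interface this narrow is what lets the runtime manage heterogeneous state (adapters, prefix caches, model shards) without hard-coding any specific representation.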
- Integration & Evaluation – ScaleSim is built on top of the SGLang serving stack. Benchmarks compare raw SGLang, SGLang + naive swapping, and ScaleSim across varying agent counts and model sizes.
Results & Findings
| Metric | Baseline (SGLang) | SGLang + Naïve Swap | ScaleSim |
|---|---|---|---|
| Throughput (relative) | 1.0× | 1.12× | 1.74× |
| Peak GPU memory usage | 100 % (max) | 78 % (due to aggressive swapping) | 55 % |
| Latency per LLM call (ms) | 120 | 135 (swap overhead) | 95 |
| Max agents served | 500 | 800 | >1500 |
- Speedup: ScaleSim's prefetch‑evict strategy cuts average per‑call latency by roughly 20 % (120 ms → 95 ms) and delivers up to 1.74× higher throughput than the SGLang baseline, with the advantage growing as simulations exceed 1 k agents.
- Memory savings: By keeping only the “near‑future” agents resident, GPU memory consumption drops by roughly half, allowing larger base models (e.g., 13 B parameters) to be used.
- Robustness: The system gracefully handles dynamic changes in activation patterns; the invocation distance metric adapts on‑the‑fly without needing a full re‑analysis.
Practical Implications
- Game & Virtual World AI: Studios can now populate massive open‑world environments with LLM‑driven NPCs without needing a farm of GPUs.
- Economic & Social Simulations: Researchers can scale agent counts into the tens of thousands, enabling richer scenario testing (e.g., market dynamics, pandemic modeling).
- Edge & Cloud Hybrid Deployments: The modular memory interface lets developers offload rarely‑used agent state to host RAM or even remote storage, keeping only hot agents on expensive GPU instances.
- Cost Reduction: Lower GPU memory footprints translate directly into cheaper cloud GPU rentals or the ability to fit more agents on a single workstation.
- Developer Productivity: ScaleSim works as a thin layer over existing LLM serving stacks, meaning teams can adopt it without rewriting their simulation logic.
Limitations & Future Work
- Prediction Accuracy: Invocation distance relies on the simulation’s control flow being reasonably predictable; highly stochastic or adversarial agent schedules could degrade performance.
- Overhead of Distance Updates: Maintaining the priority queue adds modest CPU overhead, which may become noticeable in ultra‑high‑frequency simulations.
- Support for Multi‑GPU / Distributed Settings: The current prototype targets a single GPU; extending the policy across multiple devices or a cluster is left for future exploration.
- Dynamic Model Updates: The system assumes static per‑agent models; handling on‑the‑fly fine‑tuning or adapter swaps would require additional bookkeeping.
The authors suggest investigating learning‑based predictors for invocation distance, integrating with distributed tensor parallelism, and exploring tighter coupling with LLM inference kernels to further shrink latency.
Authors
- Zaifeng Pan
- Yipeng Shen
- Zhengding Hu
- Zhuang Wang
- Aninda Manocha
- Zheng Wang
- Zhongkai Yu
- Yue Guan
- Yufei Ding
Paper Information
- arXiv ID: 2601.21473v1
- Categories: cs.AI, cs.DC
- Published: January 29, 2026