[Paper] ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
Source: arXiv - 2601.21473v1
Overview
The paper introduces ScaleSim, a system that makes it practical to run thousands of LLM‑powered agents in a single simulation without blowing out GPU memory. By observing that agents are only occasionally active and that their future activation order can be predicted, the authors devise a new “invocation distance” abstraction that drives smarter memory prefetching and eviction, delivering noticeable speedups on real‑world simulation workloads.
Key Contributions
- Invocation Distance abstraction – a lightweight metric that estimates how far away each agent’s next LLM request is, enabling proactive memory management.
- Proactive prefetching & priority‑based eviction – agents with short invocation distances are kept resident, while those far in the future are swapped out, reducing GPU memory pressure.
- Modular memory interface – supports heterogeneous per‑agent state (model weights, prefix caches, adapters, etc.) without hard‑coding any specific representation.
- ScaleSim runtime – a drop‑in serving layer that integrates with existing LLM back‑ends (e.g., SGLang) and delivers up to 1.74× speedup on multi‑agent benchmarks.
- Comprehensive workload analysis – characterizes real simulation workloads to validate the sparsity of agent activation and the predictability of invocation order.
Methodology
- Workload Characterization – The authors profile several representative multi‑agent simulations (e.g., game AI, economic modeling) and find two recurring patterns:
  - Sparse activation: at any given step, only a small subset of agents actually issue LLM calls.
  - Predictable ordering: the sequence in which agents will be invoked can be estimated from the simulation's control flow.
- Defining Invocation Distance – For each agent, the system tracks the number of steps (or time) until its next expected LLM request. This distance is continuously updated as the simulation progresses.
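The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the ScaleSim implementation; the class and method names are invented for clarity:

```python
class InvocationTracker:
    """Tracks, per agent, the predicted step of its next LLM call and
    derives the invocation distance from the current simulation step."""

    def __init__(self):
        self.next_call_step = {}  # agent_id -> predicted step of next LLM call
        self.current_step = 0

    def predict(self, agent_id, step):
        # Record or refresh the predicted next invocation step for an agent,
        # e.g. derived from the simulation's control flow.
        self.next_call_step[agent_id] = step

    def advance(self):
        # Called once per simulation step.
        self.current_step += 1

    def distance(self, agent_id):
        # Steps until the agent's next expected LLM request (0 = imminent).
        # Agents with no prediction are treated as infinitely far away.
        nxt = self.next_call_step.get(agent_id, float("inf"))
        return max(0, nxt - self.current_step)
```

Because predictions are refreshed as the simulation runs, the distance adapts on the fly when activation patterns change, matching the robustness claim in the results.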
- Memory Management Policy –
  - Prefetching: when an agent's distance drops below a configurable threshold, its private state (model shard, cache, adapters) is proactively loaded onto the GPU.
  - Eviction: agents with the largest distances are selected for eviction first, freeing space for imminent agents.
  - The policy is implemented as a priority queue keyed by invocation distance, allowing O(log N) updates.
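A simplified sketch of this prefetch/evict decision logic is shown below. The threshold value and the one-shot heap rebuild are simplifications for illustration (the paper's priority queue supports incremental O(log N) updates); all names are hypothetical:

```python
import heapq

PREFETCH_THRESHOLD = 3  # steps; a tunable knob, value chosen arbitrarily here

def plan_memory_moves(distances, resident, capacity):
    """Decide which agents to prefetch onto the GPU and which to evict.

    distances: {agent_id: steps until next LLM call}
    resident:  set of agent ids currently on the GPU
    capacity:  max number of agents the GPU can hold
    Returns (to_prefetch, to_evict).
    """
    # Prefetch agents whose next call is imminent and who are not yet resident.
    to_prefetch = [a for a, d in distances.items()
                   if d <= PREFETCH_THRESHOLD and a not in resident]

    # If prefetching would exceed capacity, evict the residents whose next
    # invocation is farthest in the future.
    overflow = len(resident) + len(to_prefetch) - capacity
    to_evict = []
    if overflow > 0:
        # Max-heap on distance (negated for Python's min-heap).
        heap = [(-distances.get(a, float("inf")), a) for a in resident]
        heapq.heapify(heap)
        for _ in range(overflow):
            _, victim = heapq.heappop(heap)
            to_evict.append(victim)
    return to_prefetch, to_evict
```

The key design point is symmetry: the same metric that triggers prefetching (small distance) also ranks eviction victims (large distance), so one priority queue drives both directions of data movement.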
- Modular State Interface – Developers can plug in custom per‑agent data structures (e.g., LoRA adapters, prompt prefixes) by implementing a small API; ScaleSim handles the movement of these blobs between host and device memory.
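Such a plug-in interface might look like the sketch below. The method names and the `LoraAdapterState` example are assumptions for illustration, not the actual ScaleSim API:

```python
from abc import ABC, abstractmethod

class AgentState(ABC):
    """Contract a per-agent state blob must satisfy so the runtime can move
    it between host and device memory and budget GPU space for it."""

    @abstractmethod
    def to_device(self):
        """Move this agent's private state (weights, prefix cache, adapters) to the GPU."""

    @abstractmethod
    def to_host(self):
        """Offload the state back to host memory."""

    @abstractmethod
    def nbytes(self):
        """Size of the blob in bytes, used for GPU memory accounting."""

class LoraAdapterState(AgentState):
    """Toy example: an agent's LoRA adapter weights as raw byte buffers."""

    def __init__(self, adapter_weights):
        self.w = adapter_weights
        self.on_gpu = False

    def to_device(self):
        self.on_gpu = True   # placeholder for a real host-to-device copy

    def to_host(self):
        self.on_gpu = False  # placeholder for a real device-to-host copy

    def nbytes(self):
        return sum(len(buf) for buf in self.w)
```

Keeping the interface this narrow is what lets the runtime manage heterogeneous state (adapters, prefix caches, model shards) without hard-coding any specific representation.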
- Integration & Evaluation – ScaleSim is built on top of the SGLang serving stack. Benchmarks compare raw SGLang, SGLang + naive swapping, and ScaleSim across varying agent counts and model sizes.
Results & Findings
| Metric | Baseline (SGLang) | SGLang + Naïve Swap | ScaleSim |
|---|---|---|---|
| Throughput (relative) | 1.0× | 1.12× | 1.74× |
| Peak GPU memory usage | 100 % (max) | 78 % (due to aggressive swapping) | 55 % |
| Latency per LLM call (ms) | 120 | 135 (swap overhead) | 95 |
| Max agents served | 500 | 800 | >1500 |
- Speedup: ScaleSim's prefetch‑evict strategy cuts average per‑call latency by roughly 20 % (120 ms → 95 ms) and delivers up to 1.74× higher throughput than the SGLang baseline, with the advantage growing as simulations exceed 1 k agents.
- Memory savings: By keeping only the “near‑future” agents resident, GPU memory consumption drops by roughly half, allowing larger base models (e.g., 13 B parameters) to be used.
- Robustness: The system gracefully handles dynamic changes in activation patterns; the invocation distance metric adapts on‑the‑fly without needing a full re‑analysis.
Practical Implications
- Game & Virtual World AI: Studios can now populate massive open‑world environments with LLM‑driven NPCs without needing a farm of GPUs.
- Economic & Social Simulations: Researchers can scale agent counts into the tens of thousands, enabling richer scenario testing (e.g., market dynamics, pandemic modeling).
- Edge & Cloud Hybrid Deployments: The modular memory interface lets developers offload rarely‑used agent state to host RAM or even remote storage, keeping only hot agents on expensive GPU instances.
- Cost Reduction: Lower GPU memory footprints translate directly into cheaper cloud GPU rentals or the ability to fit more agents on a single workstation.
- Developer Productivity: ScaleSim works as a thin layer over existing LLM serving stacks, meaning teams can adopt it without rewriting their simulation logic.
Limitations & Future Work
- Prediction Accuracy: Invocation distance relies on the simulation’s control flow being reasonably predictable; highly stochastic or adversarial agent schedules could degrade performance.
- Overhead of Distance Updates: Maintaining the priority queue adds modest CPU overhead, which may become noticeable in ultra‑high‑frequency simulations.
- Support for Multi‑GPU / Distributed Settings: The current prototype targets a single GPU; extending the policy across multiple devices or a cluster is left for future exploration.
- Dynamic Model Updates: The system assumes static per‑agent models; handling on‑the‑fly fine‑tuning or adapter swaps would require additional bookkeeping.
The authors suggest investigating learning‑based predictors for invocation distance, integrating with distributed tensor parallelism, and exploring tighter coupling with LLM inference kernels to further shrink latency.
Authors
- Zaifeng Pan
- Yipeng Shen
- Zhengding Hu
- Zhuang Wang
- Aninda Manocha
- Zheng Wang
- Zhongkai Yu
- Yue Guan
- Yufei Ding
Paper Information
- arXiv ID: 2601.21473v1
- Categories: cs.AI, cs.DC
- Published: January 29, 2026