[Paper] ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Published: May 7, 2026 at 12:33 PM EDT

Source: arXiv - 2605.06534v1

Overview

The paper introduces ROSE, a system that taps idle GPU capacity in production serving clusters to speed up the costly rollout phase of agentic reinforcement learning (RL) for large language models (LLMs). By sharing GPUs between serving traffic and RL rollouts in a “co‑elastic” fashion, ROSE improves end‑to‑end training throughput by 1.20–3.31× without violating the latency guarantees of the serving side.

Key Contributions

Co‑elastic GPU sharing model – Demonstrates that serving clusters typically have surplus GPU memory/compute that can be safely harvested for RL rollouts.
SLO‑safe co‑serving executor – A runtime that multiplexes serving and rollout kernels on the same GPU while preserving the serving workload's Service Level Objectives (SLOs) for latency and throughput.
Cross‑cluster weight transfer engine – Uses weight sharding and sparsity‑aware compression to synchronize policy weights between the rollout pool and the serving pool with minimal bandwidth.
Elastic rollout scheduler – Dynamically decides how many rollout jobs to place on dedicated rollout GPUs vs. opportunistic serving GPUs, reacting to traffic bursts and GPU availability.
Empirical validation – Shows 1.20–3.31× higher throughput across a range of model sizes (7B–70B) and cluster configurations compared to static‑GPU baselines and prior elastic systems.

Methodology

Profiling serving clusters – The authors first measured real‑world GPU utilization in production inference services and found consistent headroom (≈30‑50 % memory, 20‑40 % compute).
Design of the co‑serving executor
- Memory partitioning: Serves inference requests in a pre‑allocated memory region while allocating a separate region for rollout tensors.
- Compute interleaving: Uses CUDA streams and priority scheduling so that inference kernels pre‑empt rollout kernels when latency SLOs are at risk.
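The interleaving policy described above can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the paper's executor: on real hardware the ordering is handled by the CUDA driver and GPU, and the names `Kernel`, `CoServingQueue`, and the idea of splitting work into unit-sized "chunks" are hypothetical, introduced here for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

INFERENCE, ROLLOUT = 0, 1  # lower value = higher priority (assumed convention)

@dataclass(order=True)
class Kernel:
    priority: int
    seq: int                             # FIFO tie-breaker within a priority
    name: str = field(compare=False)
    chunks: int = field(compare=False)   # remaining units of work

class CoServingQueue:
    """Toy dispatcher: one chunk runs per step, highest priority first."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, name, priority, chunks):
        heapq.heappush(self._heap, Kernel(priority, next(self._seq), name, chunks))

    def run(self, arrivals=None):
        """arrivals maps a step number to a list of (name, priority, chunks)."""
        arrivals = arrivals or {}
        trace, step = [], 0
        while self._heap or any(s >= step for s in arrivals):
            for args in arrivals.get(step, []):
                self.submit(*args)
            if self._heap:
                k = heapq.heappop(self._heap)
                trace.append(k.name)
                k.chunks -= 1
                if k.chunks > 0:
                    # Unfinished work is re-queued, so a newly arrived
                    # inference kernel can cut in at the next chunk boundary.
                    heapq.heappush(self._heap, k)
            step += 1
        return trace

q = CoServingQueue()
q.submit("rollout-A", ROLLOUT, 3)
print(q.run({1: [("infer-1", INFERENCE, 2)]}))
```

In this toy model, the rollout kernel yields as soon as the inference kernel arrives and resumes only after inference work drains, which is the behaviour the executor relies on to protect latency.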
Weight synchronization
- Model weights are split into shards; only the shards that changed significantly are transmitted.
- Sparsity‑aware compression (e.g., top‑k masking) reduces the payload, enabling fast cross‑cluster updates over commodity networking.
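A minimal sketch of the shard-plus-top‑k idea follows. It assumes weights are a flat list of floats, that "changed significantly" means the L2 norm of a shard's delta exceeds a threshold, and that top‑k is applied per shard; the paper's actual criteria and wire format are not given in this summary, and all function names here are hypothetical.

```python
import math

def split_shards(weights, shard_size):
    """Cut a flat weight list into consecutive fixed-size shards."""
    return [weights[i:i + shard_size] for i in range(0, len(weights), shard_size)]

def topk_delta(old, new, k):
    """Keep the k largest-magnitude changes as (local_index, delta) pairs."""
    deltas = [(i, n - o) for i, (o, n) in enumerate(zip(old, new))]
    deltas.sort(key=lambda p: abs(p[1]), reverse=True)
    return sorted((i, d) for i, d in deltas[:k] if d != 0.0)

def build_update(old_w, new_w, shard_size, k, threshold):
    """Return {shard_id: sparse delta} for shards whose change exceeds threshold."""
    update = {}
    pairs = zip(split_shards(old_w, shard_size), split_shards(new_w, shard_size))
    for sid, (o, n) in enumerate(pairs):
        norm = math.sqrt(sum((b - a) ** 2 for a, b in zip(o, n)))
        if norm > threshold:              # assumed "significant change" test
            update[sid] = topk_delta(o, n, k)
    return update

def apply_update(weights, update, shard_size):
    """Receiver side: add the sparse deltas onto its copy of the weights."""
    out = list(weights)
    for sid, pairs in update.items():
        for i, d in pairs:
            out[sid * shard_size + i] += d
    return out

old = [0.0] * 8
new = [0.0, 0.5, 0.0, 0.01, 0.0, 0.0, 0.0, 0.001]
upd = build_update(old, new, shard_size=4, k=1, threshold=0.1)
print(upd)                                # only shard 0 is sent
print(apply_update(old, upd, shard_size=4))
```

Note that this sketch is lossy: changes that fall outside the top‑k or below the threshold are simply dropped, so the receiver's weights only approximate the sender's.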
Elastic scheduler
- Monitors serving request latency and GPU utilization in real time.
- When latency is comfortably below the SLO, the scheduler “leases” a portion of the GPU to rollout workers; when traffic spikes, it revokes the lease instantly.
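The lease/revoke behaviour can be sketched as a small controller. The thresholds (grant below 70 % of the SLO, revoke above 90 %), the 10 % step, and the 50 % cap are illustrative assumptions rather than values from the paper, and `LeaseController` is a hypothetical name.

```python
from dataclasses import dataclass

@dataclass
class LeaseController:
    slo_ms: float
    grant_below: float = 0.7    # assumed: grow the lease when p99 < 70% of SLO
    revoke_above: float = 0.9   # assumed: revoke when p99 > 90% of SLO
    step: float = 0.1           # GPU fraction added per tick
    max_share: float = 0.5      # never lease more than half the GPU
    share: float = 0.0          # fraction currently leased to rollouts

    def update(self, p99_ms: float) -> float:
        """Feed the latest serving p99 latency; return the new rollout share."""
        if p99_ms > self.revoke_above * self.slo_ms:
            self.share = 0.0    # traffic spike: revoke the whole lease at once
        elif p99_ms < self.grant_below * self.slo_ms:
            self.share = min(self.max_share, round(self.share + self.step, 6))
        # Between the two thresholds the lease is held steady (hysteresis).
        return self.share

ctl = LeaseController(slo_ms=100.0)
for p99 in (50, 60, 80, 95):
    print(p99, ctl.update(p99))
```

Growing the lease gradually but revoking it in one step is a deliberately asymmetric choice: it trades some rollout throughput for a faster reaction when serving latency approaches the SLO.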
Evaluation setup
- Benchmarks on internal clusters (8‑GPU to 64‑GPU nodes) with agentic RL pipelines (e.g., ReAct‑style tool‑use tasks).
- Baselines include a static‑GPU rollout pool, an existing elastic framework (ElasticTrainer), and a naïve “share‑all” approach that ignores SLOs.

Results & Findings

Metric	Static GPU baseline	ElasticTrainer	ROSE (best config)
End‑to‑end RL throughput (relative to static baseline)	1.0×	1.15×	1.20–3.31×
Serving latency, 99th percentile	100 ms (target)	120 ms (SLO breach)	≤ 100 ms
GPU memory overhead for rollout	0 % (unused)	15 % (reserved)	5 %
Network traffic for weight sync (GB/epoch)	2.4	1.8	0.9

Throughput gains grow with model size because larger models have bigger memory footprints, leaving more “spare” memory on serving GPUs that ROSE can exploit.
SLO compliance is maintained: latency spikes never exceed the pre‑defined threshold, thanks to the priority‑based executor.
Cross‑cluster sync reduces bandwidth by ~60 % versus naïve full‑model broadcast, making the system viable even on standard Ethernet.

Practical Implications

Cost savings – Companies can squeeze more RL training work out of existing inference hardware, delaying or avoiding expensive GPU purchases.
Faster iteration on agentic LLMs – Shorter rollout times mean quicker feedback loops for tool‑use and reasoning research, accelerating product feature roll‑outs.
Zero‑downtime upgrades – Because inference kernels take priority and rollout work is preempted whenever the SLO is at risk, production services stay responsive while training runs in the background.
Generalizable pattern – The cooperative elasticity concept can be applied to other compute‑heavy workloads (e.g., diffusion model sampling, batch inference) that coexist with latency‑critical services.
Implementation hints for engineers
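- Create CUDA streams with cudaStreamCreateWithPriority (using the range reported by cudaDeviceGetStreamPriorityRange) so that inference kernels are scheduled ahead of rollout kernels; note that stream priority is a scheduling hint and does not preempt kernels that are already running.
- Partition GPU memory with explicit memory pools (e.g., cudaMemPoolCreate with cudaMallocFromPoolAsync) to avoid fragmentation.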
- Adopt a lightweight RPC (e.g., gRPC with protobuf) for weight shard exchange, combined with a simple top‑k compressor.
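To make the memory-pool hint concrete, here is a bookkeeping-only sketch of a fixed-size block pool over a pre-reserved rollout region. It allocates no GPU memory; `BlockPool` and its methods are hypothetical, and a real implementation would sit on top of the CUDA allocator rather than a Python list.

```python
class BlockPool:
    """Fixed-size block allocator over a pre-reserved region (bookkeeping only)."""

    def __init__(self, region_bytes, block_bytes):
        self.block_bytes = block_bytes
        self._free = list(range(region_bytes // block_bytes))  # free block ids
        self._owned = {}                                       # owner -> block ids

    def alloc(self, owner, nbytes):
        """Reserve enough whole blocks for nbytes, or return None if full."""
        need = -(-nbytes // self.block_bytes)  # ceiling division
        if need > len(self._free):
            return None  # caller backs off; the pool never grows past its region
        blocks = [self._free.pop() for _ in range(need)]
        self._owned.setdefault(owner, []).extend(blocks)
        return blocks

    def release(self, owner):
        """Return all of an owner's blocks to the free list."""
        self._free.extend(self._owned.pop(owner, []))

    @property
    def free_bytes(self):
        return len(self._free) * self.block_bytes

rollout_pool = BlockPool(region_bytes=1024, block_bytes=256)
print(rollout_pool.alloc("traj-1", 600))   # three blocks
print(rollout_pool.alloc("traj-2", 512))   # None: only one block left
rollout_pool.release("traj-1")
print(rollout_pool.free_bytes)
```

Because every block has the same size, any freed block can satisfy any later request, which is why this style of pool does not fragment; the cost is internal waste when a request does not fill its last block.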

Limitations & Future Work

Assumes predictable serving headroom – Under highly volatile traffic, the idle GPU capacity available for harvesting may shrink, limiting rollout gains.
GPU heterogeneity – The current prototype targets homogeneous GPU clusters; mixed‑generation fleets would need more sophisticated scheduling heuristics.
Security & isolation – Running training kernels on the same GPU as production inference raises concerns about side‑channel leakage; the paper suggests sandboxing but does not evaluate it.
Future directions proposed by the authors include: extending ROSE to multi‑node TPU clusters, integrating more advanced weight compression (e.g., quantized diff‑sync), and exploring formal SLO verification methods.

Authors

Wei Gao
Yuheng Zhao
Dilxat Muhtar
Dakai An
Xuchun Shang
Tianyuan Wu
Lunxi Cao
Shaopan Xiong
Weixun Wang
Ju Huang
Teng Ma
Siran Yang
Jiamang Wang
Lin Qu
Bo Zheng
Wei Wang

Paper Information

arXiv ID: 2605.06534v1
Categories: cs.DC
Published: May 7, 2026