[Paper] ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
Source: arXiv - 2605.06534v1
Overview
The paper introduces ROSE, a system that lets you tap into idle GPU capacity in production serving clusters to speed up the costly rollout phase of agentic reinforcement learning (RL) for large language models (LLMs). By sharing GPUs between serving traffic and RL rollouts in a “co‑elastic” fashion, ROSE achieves up to a 3.3× boost in end‑to‑end training throughput without breaking the latency guarantees of the serving side.
Key Contributions
- Co‑elastic GPU sharing model – Demonstrates that serving clusters typically have surplus GPU memory/compute that can be safely harvested for RL rollouts.
- SLO‑safe co‑serving executor – A runtime that multiplexes serving and rollout kernels on the same GPU while keeping the serving workload within its Service Level Objectives (SLOs) for latency and throughput.
- Cross‑cluster weight transfer engine – Uses weight sharding and sparsity‑aware compression to synchronize policy weights between the rollout pool and the serving pool with minimal bandwidth.
- Elastic rollout scheduler – Dynamically decides how many rollout jobs to place on dedicated rollout GPUs vs. opportunistic serving GPUs, reacting to traffic bursts and GPU availability.
- Empirical validation – Shows 1.20–3.31× higher throughput across a range of model sizes (7B–70B) and cluster configurations compared to static‑GPU baselines and prior elastic systems.
Methodology
- Profiling serving clusters – The authors first measured real‑world GPU utilization in production inference services and found consistent headroom (≈30‑50 % memory, 20‑40 % compute).
- Design of the co‑serving executor
  - Memory partitioning: Serves inference requests in a pre‑allocated memory region while allocating a separate region for rollout tensors.
  - Compute interleaving: Uses CUDA streams and priority scheduling so that inference kernels pre‑empt rollout kernels when latency SLOs are at risk.
- Weight synchronization
  - Model weights are split into shards; only the shards that changed significantly are transmitted.
  - Sparsity‑aware compression (e.g., top‑k masking) reduces the payload, enabling fast cross‑cluster updates over commodity networking.
- Elastic scheduler
  - Monitors serving request latency and GPU utilization in real time.
  - When latency is comfortably below the SLO, the scheduler “leases” a portion of the GPU to rollout workers; when traffic spikes, it revokes the lease instantly.
- Evaluation setup
  - Benchmarks on internal clusters (8‑GPU to 64‑GPU nodes) with agentic RL pipelines (e.g., ReAct‑style tool‑use tasks).
  - Baselines include a static‑GPU rollout pool, an existing elastic framework (ElasticTrainer), and a naïve “share‑all” approach that ignores SLOs.
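The lease-and-revoke behaviour of the elastic scheduler can be illustrated with a small control loop. This is a minimal sketch and not the paper's scheduler: the class name `LeaseScheduler`, the 70 % / 90 % thresholds, the 10 % grant step, and the 50 % cap are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LeaseScheduler:
    """Toy lease controller for one serving GPU (all names and values are hypothetical)."""
    slo_ms: float = 100.0      # serving p99 latency target
    grant_below: float = 0.7   # grant more only when p99 < 70% of the SLO (assumed)
    revoke_above: float = 0.9  # revoke when p99 > 90% of the SLO (assumed)
    step: float = 0.1          # fraction of the GPU added per grant (assumed)
    max_share: float = 0.5     # never lease more than half the GPU (assumed)
    share: float = 0.0         # fraction currently leased to rollout workers

    def update(self, p99_ms: float) -> float:
        """Return the rollout share after observing the latest serving p99."""
        if p99_ms > self.revoke_above * self.slo_ms:
            self.share = 0.0   # traffic spike: revoke the whole lease at once
        elif p99_ms < self.grant_below * self.slo_ms:
            self.share = min(self.max_share, round(self.share + self.step, 10))
        # otherwise hold the current lease (hysteresis band between the thresholds)
        return self.share


sched = LeaseScheduler()
trace = [40, 45, 50, 80, 95, 60]          # observed p99 latencies in ms
shares = [sched.update(p) for p in trace]
print(shares)
```

The gap between the grant and revoke thresholds keeps the lease from oscillating when latency hovers near a single cutoff; revocation drops the share to zero in one step, while grants are incremental.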
Results & Findings
| Metric | Static GPU baseline | ElasticTrainer | ROSE (best config) |
|---|---|---|---|
| End‑to‑end RL throughput (relative to static baseline) | 1.0× (baseline) | 1.15× | 1.20–3.31× |
| Serving latency, 99th percentile | 100 ms (target) | 120 ms (SLO breach) | ≤ 100 ms |
| GPU memory overhead for rollout | 0 % (unused) | 15 % (reserved) | 5 % |
| Network traffic for weight sync (GB/epoch) | 2.4 | 1.8 | 0.9 |
- Throughput gains grow with model size because larger models have bigger memory footprints, leaving more “spare” memory on serving GPUs that ROSE can exploit.
- SLO compliance is maintained: latency spikes never exceed the pre‑defined threshold, thanks to the priority‑based executor.
- Cross‑cluster sync reduces bandwidth by ~60 % versus naïve full‑model broadcast, making the system viable even on standard Ethernet.
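To make the bandwidth saving concrete, the shard-level filtering and top‑k masking can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's transfer engine: the function names, the max-absolute-change criterion for "changed significantly", and the plain Python lists standing in for tensors are all hypothetical.

```python
def topk_sparsify(delta, k):
    """Keep the k largest-magnitude entries of a dense delta as (index, value) pairs."""
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return sorted((i, delta[i]) for i in order[:k])


def build_update(old_shards, new_shards, threshold, k):
    """Return {shard_id: sparse delta} for shards whose largest change exceeds threshold."""
    update = {}
    for sid, new in new_shards.items():
        delta = [n - o for n, o in zip(new, old_shards[sid])]
        if max(abs(d) for d in delta) > threshold:
            update[sid] = topk_sparsify(delta, k)
    return update


def apply_update(shards, update):
    """Apply sparse deltas in place on the receiving side."""
    for sid, pairs in update.items():
        for i, v in pairs:
            shards[sid][i] += v


old = {"layer0": [0.0, 1.0, 2.0, 3.0], "layer1": [1.0, 1.0, 1.0, 1.0]}
new = {"layer0": [0.5, 1.0, 2.0, 2.0], "layer1": [1.0, 1.0, 1.001, 1.0]}
update = build_update(old, new, threshold=0.01, k=2)
print(update)  # only layer0 is sent, as two (index, value) pairs
```

In this toy case two values cross the network instead of eight. The scheme is lossy: the small change in `layer1` is dropped, and a real system would need some way to account for skipped updates, which this sketch does not attempt.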
Practical Implications
- Cost savings – Companies can squeeze more RL training work out of existing inference hardware, delaying or avoiding expensive GPU purchases.
- Faster iteration on agentic LLMs – Shorter rollout times mean quicker feedback loops for tool‑use and reasoning research, accelerating product feature roll‑outs.
- Zero‑downtime upgrades – Because rollout work yields to inference whenever the SLO is at risk, production services stay responsive while training runs in the background.
- Generalizable pattern – The cooperative elasticity concept can be applied to other compute‑heavy workloads (e.g., diffusion model sampling, batch inference) that coexist with latency‑critical services.
- Implementation hints for engineers
  - Use CUDA streams created with a priority (e.g., `cudaStreamCreateWithPriority`) so that inference kernels take precedence over rollout kernels.
  - Partition GPU memory via `cudaMallocManaged` or explicit memory pools to avoid fragmentation.
  - Adopt a lightweight RPC (e.g., gRPC with protobuf) for weight shard exchange, combined with a simple top‑k compressor.
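The priority idea behind the first hint can be modelled on the host side with an ordinary priority queue. This is a simplified sketch of the ordering only, with hypothetical names (`PriorityDispatcher`, `submit`, `run_next`); it does not call CUDA, and on a real GPU stream priorities influence how pending work is scheduled rather than giving the strict ordering shown here.

```python
import heapq
import itertools

INFERENCE, ROLLOUT = 0, 1  # lower number = higher priority, mirroring CUDA's convention


class PriorityDispatcher:
    """Toy single-GPU queue: pending inference work always launches before rollout work."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def run_next(self):
        """Pop and return the name of the next kernel to launch, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


gpu = PriorityDispatcher()
gpu.submit(ROLLOUT, "rollout-0")
gpu.submit(ROLLOUT, "rollout-1")
launched = [gpu.run_next()]        # GPU is otherwise idle, so rollout work runs
gpu.submit(INFERENCE, "infer-0")   # a serving request arrives while rollouts are queued
launched += [gpu.run_next(), gpu.run_next()]
print(launched)
```

The inference kernel jumps ahead of the rollout kernel that was queued before it, which is the behaviour the hint aims for; rollout work only fills the gaps when no inference work is pending.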
Limitations & Future Work
- Assumes predictable serving headroom – In highly volatile traffic patterns, the amount of idle GPU may shrink, limiting rollout gains.
- GPU heterogeneity – The current prototype targets homogeneous GPU clusters; mixed‑generation fleets would need more sophisticated scheduling heuristics.
- Security & isolation – Running training kernels on the same GPU as production inference raises concerns about side‑channel leakage; the paper suggests sandboxing but does not evaluate it.
- Future directions proposed by the authors include: extending ROSE to multi‑node TPU clusters, integrating more advanced weight compression (e.g., quantized diff‑sync), and exploring formal SLO verification methods.
Authors
- Wei Gao
- Yuheng Zhao
- Dilxat Muhtar
- Dakai An
- Xuchun Shang
- Tianyuan Wu
- Lunxi Cao
- Shaopan Xiong
- Weixun Wang
- Ju Huang
- Teng Ma
- Siran Yang
- Jiamang Wang
- Lin Qu
- Bo Zheng
- Wei Wang
Paper Information
- arXiv ID: 2605.06534v1
- Categories: cs.DC
- Published: May 7, 2026