[Paper] RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training
Source: arXiv - 2512.11306v1
Overview
RollMux tackles a bottleneck that has emerged as reinforcement‑learning (RL) post‑training workloads move to disaggregated architectures: separate clusters for the rollout (data‑generation) phase and the training (model‑update) phase. Because on‑policy algorithms require strict synchronization between these phases, one cluster often sits idle while the other is busy, wasting expensive GPU resources. RollMux introduces a cross‑cluster scheduling framework that “fills” these idle periods, delivering up to 1.84× better cost efficiency on a production‑scale GPU testbed.
Key Contributions
- Co‑execution Group Abstraction: Partitions the overall hardware pool into isolated locality domains, allowing jobs to share resources without interfering with each other’s memory footprint.
- Two‑Tier Scheduler:
  - An inter‑group scheduler uses conservative stochastic planning to decide where to place each RL job (rollout vs. training) across groups.
  - An intra‑group scheduler implements a provably optimal round‑robin scheme that maximizes GPU utilization within each group.
- Warm‑Start Context Switching: Enforces a residency constraint so that large model states stay cached in host memory, enabling near‑instant switching between rollout and training phases.
- Production‑Scale Evaluation: Demonstrates 1.84× cost‑efficiency improvement over vanilla disaggregation and 1.38× over the best co‑located baselines on a 656‑GPU (328 H20 + 328 H800) cluster, with 100 % service‑level‑objective (SLO) compliance.
Methodology
- Problem Modeling: The authors model the RL pipeline as two alternating, resource‑heavy phases (rollout = memory‑bound, training = compute‑bound) that must stay synchronized.
- Group Formation: The hardware pool is split into co‑execution groups—each group contains a set of GPUs and associated host memory that can be reserved for a single job’s entire lifecycle. This isolates the massive model state and avoids costly data movement.
- Inter‑Group Scheduling: A stochastic planner evaluates the expected idle time (“bubble”) each phase would generate in a candidate group and assigns jobs to groups that minimize overall bubble cost. The planner is conservative: it prefers placements that guarantee SLO adherence even under workload variance.
- Intra‑Group Scheduling: Within a group, RollMux runs a round‑robin schedule that alternates rollout and training tasks from different jobs, effectively “multiplexing” the GPUs. The authors prove that this schedule maximizes utilization given the fixed group size and residency constraints.
- Implementation & Integration: The framework hooks into existing RL orchestration stacks (e.g., Ray RLlib) and leverages standard container runtimes, requiring only a lightweight daemon to enforce group boundaries and perform the scheduling decisions.
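As a rough illustration of the conservative inter‑group placement rule described above, the sketch below scores each candidate group by a pessimistic bubble estimate (mean idle time plus `k` standard deviations) and refuses any placement whose pessimistic estimate would break the SLO. The scoring formula, function names, and `k=2.0` default are our assumptions for illustration, not the paper's exact planner:

```python
def conservative_bubble(mean_idle, std_idle, k=2.0):
    """Pessimistic bubble estimate: mean idle time plus k standard
    deviations, so workload variance is absorbed before an SLO is promised."""
    return mean_idle + k * std_idle

def place_job(candidates, slo_budget, k=2.0):
    """Assign a job to the group with the smallest conservative bubble
    estimate, skipping groups whose pessimistic estimate exceeds the SLO
    budget. `candidates` maps group id -> (mean_idle, std_idle) in seconds."""
    feasible = {
        group: conservative_bubble(mean, std, k)
        for group, (mean, std) in candidates.items()
        if conservative_bubble(mean, std, k) <= slo_budget
    }
    if not feasible:
        return None  # defer the job rather than risk an SLO miss
    return min(feasible, key=feasible.get)
```

Note the conservative twist: a group with a low *mean* bubble but high variance (e.g., mean 3 s, std 20 s) loses to a steadier group (mean 12 s, std 1 s), mirroring the paper's preference for placements that guarantee SLO adherence under variance.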
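The benefit of intra‑group multiplexing can be seen with a back‑of‑envelope utilization model (a sketch, not the paper's optimality proof; the two‑job, fixed‑phase‑length setup and function names are our simplifying assumptions):

```python
def single_job_utilization(rollout_ticks, train_ticks):
    """Plain disaggregation, one job: the training pool is busy only during
    training and the rollout pool only during rollout, so each pool idles
    for the other phase of every cycle (the "bubble")."""
    cycle = rollout_ticks + train_ticks
    return train_ticks / cycle, rollout_ticks / cycle

def multiplexed_utilization(rollout_ticks, train_ticks):
    """Two phase-offset jobs sharing one co-execution group: one job's
    training overlaps the other's rollout, so the remaining idle time is
    only the mismatch between the two phase lengths."""
    cycle = max(rollout_ticks, train_ticks)
    return train_ticks / cycle, rollout_ticks / cycle
```

For example, with a 6‑tick rollout and a 4‑tick training phase, a lone job keeps the training pool busy only 40 % of the time, while two multiplexed jobs raise that to about 67 % and keep the rollout pool fully busy, which is the "bubble‑filling" effect the paper exploits.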
Results & Findings
| Metric | Baseline (plain disaggregation) | State‑of‑the‑art co‑located | RollMux |
|---|---|---|---|
| Cost efficiency (throughput per $) | 1.0× | 1.38× | 1.84× |
| GPU utilization (average) | ~45 % | ~60 % | ~82 % |
| SLO attainment (deadline compliance) | 96 % | 98 % | 100 % |
| Warm‑start latency (phase switch) | 120 ms | 95 ms | ≈30 ms |
Key takeaways
- By overlapping the idle “bubble” of one phase with the active phase of another, RollMux eliminates most of the dead time that plagues on‑policy RL pipelines.
- The residency constraint keeps the model state cached in host memory, cutting phase‑switch latency from roughly 120 ms under plain disaggregation to about 30 ms.
- Even under heavy load (full 656‑GPU cluster), the scheduler maintains deterministic SLO guarantees, a critical requirement for production RL services.
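The residency idea can be sketched as a toy host‑memory cache: once a job's model state is admitted into a group's host RAM, it stays pinned for the job's lifetime, so every later phase switch is a fast local fetch instead of a slow reload. The class and method names are hypothetical, and the latency constants simply reuse the table's figures for illustration:

```python
class HostResidencyCache:
    """Toy model of the residency constraint: model states pinned in a
    co-execution group's host memory are never evicted mid-job, so the
    group can never over-commit and phase switches stay warm."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.resident = {}  # job id -> pinned state size in GB

    def admit(self, job_id, size_gb):
        """Pin a job's model state; refuse admission rather than over-commit."""
        if sum(self.resident.values()) + size_gb > self.capacity_gb:
            raise MemoryError("residency constraint violated: group over-committed")
        self.resident[job_id] = size_gb

    def switch_cost_ms(self, job_id, cold_ms=120, warm_ms=30):
        """Phase-switch latency: warm if the state is resident, else a
        full cold reload (constants are illustrative, from the table above)."""
        return warm_ms if job_id in self.resident else cold_ms
```

Admission control at group formation time is what makes the warm path reliable: because a job's state is guaranteed to fit for its entire lifecycle, the scheduler never has to fall back to the cold‑reload path mid‑run.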
Practical Implications
- Lower Cloud Bills: Companies running large‑scale RL (e.g., robotics, recommendation systems, autonomous driving simulators) can achieve near‑double the throughput for the same GPU spend.
- Simplified Cluster Ops: The group abstraction lets ops teams allocate a fixed “slot” per RL job, avoiding ad‑hoc memory‑pinning tricks and reducing the risk of out‑of‑memory crashes.
- Faster Experiment Turn‑around: Warm‑start context switching means developers can iterate on policy updates without waiting for long data‑generation phases to finish, accelerating the research‑to‑production cycle.
- Compatibility: RollMux works as a plug‑in on top of popular RL frameworks, so existing codebases need minimal changes—primarily configuration of group sizes and residency policies.
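To make the "minimal changes" point concrete, a job submission might only need to declare its group shape and residency policy. The schema below is entirely hypothetical (the paper does not publish a configuration format); it just shows the small surface area such a plug‑in could expose:

```python
# Hypothetical RollMux job configuration. Every key name here is our
# invention for illustration, not the framework's actual schema.
rollmux_config = {
    "group": {
        "gpus": 16,              # GPUs reserved for this co-execution group
        "host_memory_gb": 2048,  # host RAM pool backing warm-start residency
    },
    "residency": {
        "pin_model_state": True,  # keep model state cached in host memory
        "evict_on_exit": True,    # release the pin when the job completes
    },
    "slo": {
        "iteration_deadline_s": 900,  # deadline the conservative planner must honor
    },
}
```

Everything else (phase alternation, placement, multiplexing) would be decided by the two‑tier scheduler, keeping the user‑facing surface limited to group sizes and residency policies as the paper describes.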
Limitations & Future Work
- On‑Policy Focus: The current design assumes strict rollout‑training synchronization; off‑policy or asynchronous RL algorithms may not benefit as much.
- Static Group Sizes: Groups are defined at job start‑up; dynamic resizing (e.g., scaling out when a job spikes) is not yet supported.
- Hardware Diversity: Evaluation was performed on homogeneous NVIDIA H20/H800 GPUs; heterogeneous accelerators (TPUs, AMD GPUs) could introduce new scheduling challenges.
- Future Directions: Extending the stochastic planner to handle heterogeneous resources, supporting dynamic group re‑partitioning, and exploring applicability to other pipeline‑style workloads (e.g., video transcoding, large‑scale data preprocessing).
Authors
- Tianyuan Wu
- Lunxi Cao
- Yining Wei
- Wei Gao
- Yuheng Zhao
- Dakai An
- Shaopan Xiong
- Zhiqiang Lv
- Ju Huang
- Siran Yang
- Yinghao Yu
- Jiamang Wang
- Lin Qu
- Wei Wang
Paper Information
- arXiv ID: 2512.11306v1
- Categories: cs.DC
- Published: December 12, 2025