[Paper] RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training
Source: arXiv - 2512.11306v1
Overview
RollMux tackles a bottleneck that has emerged as reinforcement‑learning (RL) post‑training workloads move to disaggregated architectures: separate clusters for the rollout (data‑generation) phase and the training (model‑update) phase. Because on‑policy algorithms require strict synchronization between these phases, one cluster often sits idle while the other is busy, wasting expensive GPU resources. RollMux introduces a cross‑cluster scheduling framework that “fills” these idle periods, delivering up to 1.84× better cost efficiency on a production‑scale GPU testbed.
Key Contributions
- Co‑execution Group Abstraction: Partitions the overall hardware pool into isolated locality domains, allowing jobs to share resources without interfering with each other’s memory footprint.
- Two‑Tier Scheduler:
  - An inter‑group scheduler uses conservative stochastic planning to decide where to place each RL job (rollout vs. training) across groups.
  - An intra‑group scheduler implements a provably optimal round‑robin scheme that maximizes GPU utilization within each group.
- Warm‑Start Context Switching: Enforces a residency constraint so that large model states stay cached in host memory, enabling near‑instant switching between rollout and training phases.
- Production‑Scale Evaluation: Demonstrates 1.84× cost‑efficiency improvement over vanilla disaggregation and 1.38× over the best co‑located baselines on a 656‑GPU (328 H20 + 328 H800) cluster, with 100 % service‑level‑objective (SLO) compliance.
Methodology
- Problem Modeling: The authors model the RL pipeline as two alternating, resource‑heavy phases (rollout = memory‑bound, training = compute‑bound) that must stay synchronized.
- Group Formation: The hardware pool is split into co‑execution groups—each group contains a set of GPUs and associated host memory that can be reserved for a single job’s entire lifecycle. This isolates the massive model state and avoids costly data movement.
- Inter‑Group Scheduling: A stochastic planner evaluates the expected idle time (“bubble”) each phase would generate in a candidate group and assigns jobs to groups that minimize overall bubble cost. The planner is conservative: it prefers placements that guarantee SLO adherence even under workload variance.
- Intra‑Group Scheduling: Within a group, RollMux runs a round‑robin schedule that alternates rollout and training tasks from different jobs, effectively “multiplexing” the GPUs. The authors prove that this schedule maximizes utilization given the fixed group size and residency constraints.
- Implementation & Integration: The framework hooks into existing RL orchestration stacks (e.g., Ray RLlib) and leverages standard container runtimes, requiring only a lightweight daemon to enforce group boundaries and perform the scheduling decisions.
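As a rough illustration of the conservative inter‑group placement rule described above, the sketch below scores each candidate group by a pessimistic bubble estimate (mean idle time plus `k` standard deviations) and refuses any placement whose pessimistic estimate would break the SLO. The scoring formula, function names, and `k=2.0` default are our assumptions for illustration, not the paper's exact planner:

```python
def conservative_bubble(mean_idle, std_idle, k=2.0):
    """Pessimistic bubble estimate: mean idle time plus k standard
    deviations, so workload variance is absorbed before an SLO is promised."""
    return mean_idle + k * std_idle

def place_job(candidates, slo_budget, k=2.0):
    """Assign a job to the group with the smallest conservative bubble
    estimate, skipping groups whose pessimistic estimate exceeds the SLO
    budget. `candidates` maps group id -> (mean_idle, std_idle) in seconds."""
    feasible = {
        group: conservative_bubble(mean, std, k)
        for group, (mean, std) in candidates.items()
        if conservative_bubble(mean, std, k) <= slo_budget
    }
    if not feasible:
        return None  # defer the job rather than risk an SLO miss
    return min(feasible, key=feasible.get)
```

Note the conservative twist: a group with a low *mean* bubble but high variance (e.g., mean 3 s, std 20 s) loses to a steadier group (mean 12 s, std 1 s), mirroring the paper's preference for placements that guarantee SLO adherence under variance.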
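The benefit of intra‑group multiplexing can be seen with a back‑of‑envelope utilization model (a sketch, not the paper's optimality proof; the two‑job, fixed‑phase‑length setup and function names are our simplifying assumptions):

```python
def single_job_utilization(rollout_ticks, train_ticks):
    """Plain disaggregation, one job: the training pool is busy only during
    training and the rollout pool only during rollout, so each pool idles
    for the other phase of every cycle (the "bubble")."""
    cycle = rollout_ticks + train_ticks
    return train_ticks / cycle, rollout_ticks / cycle

def multiplexed_utilization(rollout_ticks, train_ticks):
    """Two phase-offset jobs sharing one co-execution group: one job's
    training overlaps the other's rollout, so the remaining idle time is
    only the mismatch between the two phase lengths."""
    cycle = max(rollout_ticks, train_ticks)
    return train_ticks / cycle, rollout_ticks / cycle
```

For example, with a 6‑tick rollout and a 4‑tick training phase, a lone job keeps the training pool busy only 40 % of the time, while two multiplexed jobs raise that to about 67 % and keep the rollout pool fully busy, which is the "bubble‑filling" effect the paper exploits.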
Results & Findings
| Metric | Baseline (plain disaggregation) | State‑of‑the‑art co‑located | RollMux |
|---|---|---|---|
| Cost efficiency (throughput per $) | 1.0× | 1.38× | 1.84× |
| GPU utilization (average) | ~45 % | ~60 % | ~82 % |
| SLO attainment (deadline compliance) | 96 % | 98 % | 100 % |
| Warm‑start latency (phase switch) | 120 ms | 95 ms | ≈30 ms |
Key takeaways
- By overlapping the idle “bubble” of one phase with the active phase of another, RollMux eliminates most of the dead time that plagues on‑policy RL pipelines.
- The residency constraint keeps the model state cached in host memory, cutting phase‑switch latency from roughly 120 ms under plain disaggregation to about 30 ms.
- Even under heavy load (full 656‑GPU cluster), the scheduler maintains deterministic SLO guarantees, a critical requirement for production RL services.
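The residency idea can be sketched as a toy host‑memory cache: once a job's model state is admitted into a group's host RAM, it stays pinned for the job's lifetime, so every later phase switch is a fast local fetch instead of a slow reload. The class and method names are hypothetical, and the latency constants simply reuse the table's figures for illustration:

```python
class HostResidencyCache:
    """Toy model of the residency constraint: model states pinned in a
    co-execution group's host memory are never evicted mid-job, so the
    group can never over-commit and phase switches stay warm."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.resident = {}  # job id -> pinned state size in GB

    def admit(self, job_id, size_gb):
        """Pin a job's model state; refuse admission rather than over-commit."""
        if sum(self.resident.values()) + size_gb > self.capacity_gb:
            raise MemoryError("residency constraint violated: group over-committed")
        self.resident[job_id] = size_gb

    def switch_cost_ms(self, job_id, cold_ms=120, warm_ms=30):
        """Phase-switch latency: warm if the state is resident, else a
        full cold reload (constants are illustrative, from the table above)."""
        return warm_ms if job_id in self.resident else cold_ms
```

Admission control at group formation time is what makes the warm path reliable: because a job's state is guaranteed to fit for its entire lifecycle, the scheduler never has to fall back to the cold‑reload path mid‑run.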
Practical Implications
- Lower Cloud Bills: Companies running large‑scale RL (e.g., robotics, recommendation systems, autonomous driving simulators) can achieve near‑double the throughput for the same GPU spend.
- Simplified Cluster Ops: The group abstraction lets ops teams allocate a fixed “slot” per RL job, avoiding ad‑hoc memory‑pinning tricks and reducing the risk of out‑of‑memory crashes.
- Faster Experiment Turn‑around: Warm‑start context switching means developers can iterate on policy updates without waiting for long data‑generation phases to finish, accelerating the research‑to‑production cycle.
- Compatibility: RollMux works as a plug‑in on top of popular RL frameworks, so existing codebases need minimal changes—primarily configuration of group sizes and residency policies.
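To make the "minimal changes" point concrete, a job submission might only need to declare its group shape and residency policy. The schema below is entirely hypothetical (the paper does not publish a configuration format); it just shows the small surface area such a plug‑in could expose:

```python
# Hypothetical RollMux job configuration. Every key name here is our
# invention for illustration, not the framework's actual schema.
rollmux_config = {
    "group": {
        "gpus": 16,              # GPUs reserved for this co-execution group
        "host_memory_gb": 2048,  # host RAM pool backing warm-start residency
    },
    "residency": {
        "pin_model_state": True,  # keep model state cached in host memory
        "evict_on_exit": True,    # release the pin when the job completes
    },
    "slo": {
        "iteration_deadline_s": 900,  # deadline the conservative planner must honor
    },
}
```

Everything else (phase alternation, placement, multiplexing) would be decided by the two‑tier scheduler, keeping the user‑facing surface limited to group sizes and residency policies as the paper describes.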
Limitations & Future Work
- On‑Policy Focus: The current design assumes strict rollout‑training synchronization; off‑policy or asynchronous RL algorithms may not benefit as much.
- Static Group Sizes: Groups are defined at job start‑up; dynamic resizing (e.g., scaling out when a job spikes) is not yet supported.
- Hardware Diversity: Evaluation was performed on homogeneous NVIDIA H20/H800 GPUs; heterogeneous accelerators (TPUs, AMD GPUs) could introduce new scheduling challenges.
- Future Directions: Extending the stochastic planner to handle heterogeneous resources, supporting dynamic group re‑partitioning, and exploring applicability to other pipeline‑style workloads (e.g., video transcoding, large‑scale data preprocessing).
Authors
- Tianyuan Wu
- Lunxi Cao
- Yining Wei
- Wei Gao
- Yuheng Zhao
- Dakai An
- Shaopan Xiong
- Zhiqiang Lv
- Ju Huang
- Siran Yang
- Yinghao Yu
- Jiamang Wang
- Lin Qu
- Wei Wang
Paper Information
- arXiv ID: 2512.11306v1
- Categories: cs.DC
- Published: December 12, 2025