[Paper] Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Source: arXiv - 2512.10271v1
Overview
Deep‑learning (DL) training jobs now dominate cloud GPU workloads, but the rapid growth of heterogeneous GPU clusters (mixing GPU models, memory sizes, and interconnects) makes it hard for traditional schedulers to keep GPUs utilized and job completion times low. The paper introduces RLTune, a reinforcement‑learning (RL) driven scheduler that requires no per‑job profiling: it couples RL‑based job prioritization with a mixed‑integer linear programming (MILP) optimizer that maps jobs to the most suitable nodes in real time.
Key Contributions
- Application‑agnostic RL prioritizer – learns to rank incoming DL jobs using only observable metrics (e.g., requested resources, historical queue times), eliminating the need for offline profiling.
- Hybrid RL + MILP framework – combines fast, learned priority scores with an exact MILP solver that produces optimal job‑to‑GPU‑node assignments under multiple objectives (completion time, queue delay, utilization).
- Large‑scale production evaluation – trained and validated on trace data from Microsoft Philly, Helios, and Alibaba, demonstrating real‑world relevance.
- Significant performance gains – up to 20 % higher GPU utilization, 81 % lower queueing delay, and 70 % shorter job completion times compared with state‑of‑the‑art schedulers.
- Generalizable design – works across diverse DL workloads (CNNs, Transformers, RL agents) without hand‑crafted heuristics or model‑specific tuning.
Methodology
- Data Collection – the authors harvested millions of job submissions from three production clusters, extracting lightweight features such as requested GPU count, memory, estimated runtime, and current cluster state.
- RL Prioritization – a policy network (a small feed‑forward neural net) receives the feature vector and outputs a priority score. The policy is trained with a reward that balances three goals: (a) minimizing job completion time, (b) reducing queue length, and (c) maximizing overall GPU utilization. Proximal Policy Optimization (PPO) is used for stable learning; a minimal sketch of such a scoring network appears after this list.
- MILP Mapping – given the ordered list of jobs from the RL module, a MILP formulation decides which GPU node each job should run on. Constraints capture heterogeneity (different GPU memory, compute capability, PCIe/NVLink bandwidth) and system limits (max jobs per node, fairness caps). The objective mirrors the RL reward but is solved to optimality for the current batch; a small assignment‑model sketch follows this list.
- Online Loop – the scheduler runs in a sliding‑window fashion: every few seconds it re‑evaluates pending jobs, updates priorities, re‑solves the MILP, and dispatches the resulting assignments. This keeps the system responsive to workload bursts and node failures; the loop sketch after this list ties the two components together.
- Training & Deployment – the RL policy is pre‑trained offline on historical traces, then fine‑tuned online with a small learning rate to adapt to evolving workloads.
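To make the prioritization step concrete, here is a minimal sketch of a feed‑forward scoring network of the kind the paper describes. It is an illustration, not the authors' implementation: the feature set, layer sizes, and batch of pending jobs are assumptions, and the PPO loop that would update the policy's weights is omitted.

```python
# Minimal sketch (not RLTune's code) of an application-agnostic priority scorer:
# a small feed-forward policy network mapping observable job features to a score.
import torch
import torch.nn as nn

class PriorityPolicy(nn.Module):
    def __init__(self, num_features: int = 6, hidden: int = 64):
        super().__init__()
        # Two hidden layers; the paper only specifies "a small feed-forward net".
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar priority score per job
        )

    def forward(self, job_features: torch.Tensor) -> torch.Tensor:
        # job_features: (num_pending_jobs, num_features), e.g. requested GPUs,
        # requested memory, time in queue, estimated runtime, cluster load, ...
        return self.net(job_features).squeeze(-1)

policy = PriorityPolicy()
features = torch.rand(8, 6)       # 8 pending jobs with 6 hypothetical features
priorities = policy(features)     # higher score = considered for dispatch first
order = torch.argsort(priorities, descending=True)
```

In RLTune the weights of such a network would be trained with PPO against the combined JCT/queue/utilization reward; here the scores simply order the pending queue for the mapping step.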
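The mapping step can be sketched as a small binary program. The model below uses the open‑source PuLP library with invented job and node data; the capacity and memory constraints and the priority‑weighted objective are an assumed simplification of the paper's richer multi‑objective formulation (which also accounts for interconnect bandwidth and fairness caps), not the exact MILP.

```python
# Sketch of a job-to-node assignment MILP (assumed simplification of the paper's
# formulation).  x[j, n] = 1 iff job j is placed on node n this window.
import pulp

jobs = {            # hypothetical pending jobs: required GPUs, min per-GPU memory (GB)
    "job_a": {"gpus": 2, "mem": 16, "priority": 0.9},
    "job_b": {"gpus": 1, "mem": 32, "priority": 0.4},
}
nodes = {           # hypothetical heterogeneous nodes: free GPUs, per-GPU memory (GB)
    "v100_node": {"free_gpus": 4, "gpu_mem": 32},
    "t4_node":   {"free_gpus": 2, "gpu_mem": 16},
}

prob = pulp.LpProblem("job_to_node_assignment", pulp.LpMaximize)
x = {(j, n): pulp.LpVariable(f"x_{j}_{n}", cat="Binary")
     for j in jobs for n in nodes}

# Each job is placed on at most one node (unplaced jobs wait for the next window).
for j in jobs:
    prob += pulp.lpSum(x[j, n] for n in nodes) <= 1

# Respect free-GPU capacity and GPU-memory compatibility on every node.
for n, nd in nodes.items():
    prob += pulp.lpSum(jobs[j]["gpus"] * x[j, n] for j in jobs) <= nd["free_gpus"]
    for j, jd in jobs.items():
        if jd["mem"] > nd["gpu_mem"]:
            prob += x[j, n] == 0   # this node's GPUs are too small for the job

# Objective: place as much priority-weighted work as possible in this window,
# a stand-in for the paper's multi-objective of JCT, queue delay, and utilization.
prob += pulp.lpSum(jobs[j]["priority"] * x[j, n] for j in jobs for n in nodes)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
placement = {j: n for (j, n), var in x.items() if var.value() == 1}
print(placement)
```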
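Finally, a schematic of the sliding‑window loop that ties the two components together. The window length, cluster API (pending_jobs, job_features, node_state), and dispatch hook are hypothetical placeholders; the paper only states that the scheduler re‑evaluates, re‑ranks, re‑solves, and dispatches every few seconds.

```python
# Sketch of the online sliding-window loop (function names are assumptions).
import time

WINDOW_SECONDS = 5  # assumed interval; the paper says "every few seconds"

def scheduling_loop(cluster, policy, solve_assignment, dispatch):
    while True:
        pending = cluster.pending_jobs()              # hypothetical cluster API
        if pending:
            features = cluster.job_features(pending)  # observable metrics only
            priorities = policy(features)             # RL prioritizer (see above)
            placement = solve_assignment(pending, priorities, cluster.node_state())
            dispatch(placement)                       # launch the mapped jobs
        time.sleep(WINDOW_SECONDS)                    # wait for the next window
```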
Results & Findings
| Metric | Baseline (Kubernetes‑GPU) | Prior art (Tiresias) | RLTune |
|---|---|---|---|
| GPU Utilization | 62 % | 68 % | 78 % |
| Avg. Queue Delay | 12 min | 6 min | 2.3 min |
| Avg. Job Completion Time | 4.5 h | 3.2 h | 1.35 h |
| Fairness (JCT variance) | 1.8× | 1.4× | 1.1× |
- Utilization boost stems mainly from the MILP’s ability to pack small jobs onto under‑utilized GPUs and to co‑locate compatible jobs on the same node.
- Queue reduction is driven by the RL prioritizer, which learns to promote short‑running or latency‑sensitive jobs when the system is congested.
- JCT improvement is a compound effect of better packing and smarter ordering, especially noticeable for long‑running training runs that would otherwise monopolize high‑end GPUs.
- The system remains stable under workload spikes, with the RL component quickly re‑ranking jobs and the MILP re‑optimizing within sub‑second runtimes (average solve time < 200 ms for clusters of up to 256 GPUs).
Practical Implications
- Cloud providers can integrate RLTune into existing orchestration layers (e.g., Kubernetes‑GPU, Slurm) to squeeze more work out of the same hardware, reducing capital expenses and improving customer SLAs.
- ML engineers benefit from lower wait times for training jobs, enabling faster iteration cycles and more aggressive hyper‑parameter searches.
- Energy & sustainability – higher utilization translates directly into lower per‑job energy consumption, aligning with green‑computing initiatives.
- Multi‑tenant fairness – because the RL reward balances fairness, smaller teams or bursty workloads are less likely to be starved, which is crucial for shared‑resource platforms.
- Extensibility – the hybrid RL + MILP pattern can be reused for other heterogeneous resources (TPUs, FPGAs) or for scheduling inference workloads with latency constraints.
Limitations & Future Work
- Scalability of MILP – the current implementation solves assignment instances for clusters of up to ~256 GPUs in sub‑second time, but larger clusters may require decomposition or heuristic approximations.
- Feature set simplicity – the RL policy uses only coarse‑grained job descriptors; richer signals (e.g., model architecture, data I/O patterns) could further improve predictions but would increase overhead.
- Cold‑start behavior – the system relies on a pre‑trained policy; in brand‑new clusters with no historical traces, performance may initially lag until enough data is collected.
- Robustness to failures – the paper assumes node failures are rare; integrating fault‑tolerance (e.g., dynamic re‑mapping of jobs mid‑run) is left for future exploration.
- Generalization beyond DL – extending RLTune to non‑DL GPU workloads (e.g., graphics rendering, scientific simulations) would test the true application‑agnostic claim.
Overall, RLTune showcases how a blend of learning‑based prioritization and classic optimization can tackle the growing complexity of heterogeneous GPU scheduling, offering a practical pathway for cloud operators to deliver faster, fairer, and more efficient deep‑learning services.
Authors
- Shruti Dongare
- Redwan Ibne Seraj Khan
- Hadeel Albahar
- Nannan Zhao
- Diego Melendez Maita
- Ali R. Butt
Paper Information
- arXiv ID: 2512.10271v1
- Categories: cs.DC, cs.AI, cs.LG
- Published: December 11, 2025