[Paper] Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

Published: February 19, 2026

Source: arXiv - 2602.17318v1

Overview

The paper investigates how malleable jobs—applications that can shrink or grow their allocated compute nodes while running—can dramatically improve the efficiency of large‑scale HPC clusters. By replaying real workload traces from three flagship supercomputers (Cori, Eagle, and Theta) with a custom simulator, the authors show that even a modest fraction of malleable jobs can cut turnaround times and boost node utilization.

Key Contributions

  • Real‑world evaluation: Uses production traces from three distinct supercomputers, providing a realistic assessment of malleability in practice.
  • Comprehensive scheduler comparison: Benchmarks five scheduling strategies, including a novel “preferred‑allocation‑preserving” policy that tries to keep malleable jobs at their optimal size.
  • Quantified benefits: Demonstrates up to 67 % reduction in turnaround time, 99 % drop in wait time, and 52 % increase in node utilization when the workload is fully malleable.
  • Sensitivity analysis: Shows that making even 20 % of jobs malleable yields sizable gains (e.g., 37 % faster turnaround).
  • Correlation insights: Links workload characteristics (job length, node demand) to the effectiveness of each scheduling strategy, guiding future scheduler design.

Methodology

  1. Trace collection – The authors extracted job submission logs (arrival time, requested nodes, runtime) from the Cori, Eagle, and Theta supercomputers.
  2. Simulation platform (ElastiSim) – A discrete‑event simulator that can dynamically resize jobs during execution, mimicking a malleable runtime system.
  3. Workload transformation – For each trace, a configurable percentage (0–100 %) of jobs is marked as malleable. The rest remain rigid.
  4. Scheduling policies – Five policies are compared, including:
    • Baseline rigid‑only FIFO/Backfill
    • Simple elastic backfill (elastic jobs can shrink to fit)
    • Elastic backfill with aggressive expansion
    • Preferred‑allocation‑preserving (novel) – keeps a job at its “sweet‑spot” size unless the system is under‑utilized.
  5. Metrics captured – Turnaround time, makespan, wait time, and node utilization are recorded for each simulation run.

The approach abstracts away low‑level hardware details, focusing on the resource management layer that most HPC operators and scheduler developers care about.
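The workload-transformation step (step 3 above) can be sketched as follows; the trace schema and function name are illustrative assumptions, not the authors' actual tooling:

```python
import random

def mark_malleable(jobs, fraction, seed=42):
    """Tag a given fraction of trace jobs as malleable; the rest stay rigid.

    `jobs` is a list of dicts with at least an 'id' field (hypothetical
    trace schema). Sampling is seeded so a sweep over fractions is repeatable.
    """
    rng = random.Random(seed)
    k = round(len(jobs) * fraction)
    chosen = set(rng.sample(range(len(jobs)), k))
    return [dict(job, malleable=(i in chosen)) for i, job in enumerate(jobs)]

# Synthetic stand-in for a production trace
trace = [{"id": i, "nodes": 2 ** (i % 5), "runtime": 3600} for i in range(100)]
elastic_trace = mark_malleable(trace, 0.20)  # the paper's 20 % scenario
```

Sweeping the fraction from 0.0 to 1.0 over the same trace reproduces the sensitivity axis used in the study.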

Results & Findings

| Metric (best case) | Cori | Eagle | Theta |
| --- | --- | --- | --- |
| Turnaround time ↓ | 37 % | 45 % | 67 % |
| Makespan ↓ | 16 % | 32 % | 65 % |
| Wait time ↓ | 73 % | 88 % | 99 % |
| Node utilization ↑ | 5 % | 21 % | 52 % |
  • Malleability shines: All malleability‑aware policies outperform the rigid baseline; the novel preferred‑allocation‑preserving policy consistently ranks highest on each system.
  • Diminishing returns: Gains plateau after ~60 % malleable jobs, but the curve is steep early on—20 % malleable already cuts wait time by >70 % on Theta.
  • Workload shape matters: Systems with many short, small‑node jobs (e.g., Eagle) benefit more from aggressive expansion, while those with long, large‑node jobs (Theta) see the biggest utilization boost from the preferred‑allocation policy.
  • Correlation: Longer runtimes amplify the impact of elasticity because there’s more opportunity to re‑size the job as the cluster state evolves.
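The direction of the wait‑time findings can be illustrated with a toy FCFS model. This is a minimal sketch, not ElastiSim: a single queue, work‑conserving shrinking (halving nodes doubles runtime), and malleable jobs allowed to start at half their requested size:

```python
import heapq

def avg_wait(jobs, malleable_ids, total_nodes):
    """Toy FCFS simulator. `jobs` are (arrival, nodes, runtime) tuples in
    arrival order; jobs whose index is in `malleable_ids` may start shrunk
    to half their request, with runtime scaled so node-hours are conserved."""
    t, free, running, waits = 0.0, total_nodes, [], []
    for i, (arrival, req, runtime) in enumerate(jobs):
        t = max(t, arrival)
        while running and running[0][0] <= t:        # reclaim finished jobs
            free += heapq.heappop(running)[1]
        need = max(1, req // 2) if i in malleable_ids else req
        while free < need:                           # wait for completions
            end, n = heapq.heappop(running)
            t, free = max(t, end), free + n
        alloc = min(req, free)                       # shrink to fit if allowed
        waits.append(t - arrival)
        heapq.heappush(running, (t + runtime * req / alloc, alloc))
        free -= alloc
    return sum(waits) / len(waits)

jobs = [(0, 2, 10), (0, 4, 10), (2, 4, 10)]          # (arrival, nodes, runtime)
rigid = avg_wait(jobs, set(), total_nodes=4)
elastic = avg_wait(jobs, {0, 1, 2}, total_nodes=4)
```

On this tiny trace the fully malleable run cuts average wait severalfold, mirroring the direction (though of course not the magnitude) of the paper's results.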

Practical Implications

  • Scheduler developers can adopt the preferred‑allocation‑preserving heuristic with minimal changes; it requires only a “desired node count” field in the job descriptor and a runtime hook to shrink/expand jobs.
  • Application developers are encouraged to instrument their codes with MPI dynamic process management (e.g., MPI_Comm_spawn, MPI_Comm_disconnect) or use frameworks like Charm++, HPX, or Adaptive MPI, enabling the runtime to honor scheduler resize requests.
  • HPC operators can start with a pilot program where a subset of users opt‑in to malleable jobs, immediately reaping lower queue times and higher cluster throughput without hardware upgrades.
  • Cloud‑HPC hybrids: The same elasticity concepts translate to spot‑instance management, where jobs can shed nodes when spot prices spike, improving cost efficiency.
  • Tooling: The open‑source ElastiSim simulator can be integrated into existing scheduler testbeds (e.g., Slurm’s srun/sbatch emulation) to evaluate policy tweaks before production rollout.
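A minimal sketch of the preferred‑allocation‑preserving idea described above, assuming a job descriptor with `min_nodes`, `preferred_nodes`, and `max_nodes` fields and a utilization threshold (all names and the 0.9 target are assumptions, not the paper's exact formulation):

```python
def preferred_allocation(job, free_nodes, cluster_util, util_target=0.9):
    """Return how many nodes to grant a job at scheduling time, or 0 if it
    cannot start. Keeps a malleable job at its preferred ("sweet-spot") size
    unless the cluster is under-utilized, in which case it may expand."""
    if not job["malleable"]:
        # Rigid jobs get exactly their preferred size or nothing
        return job["preferred_nodes"] if free_nodes >= job["preferred_nodes"] else 0
    if cluster_util < util_target and free_nodes > job["preferred_nodes"]:
        return min(free_nodes, job["max_nodes"])     # expand into idle capacity
    if free_nodes >= job["preferred_nodes"]:
        return job["preferred_nodes"]                # stay at the sweet spot
    if free_nodes >= job["min_nodes"]:
        return free_nodes                            # shrink to fit
    return 0                                         # cannot start yet
```

The key design choice is that expansion is opportunistic (only when utilization is low) while the preferred size is the default, which is what keeps malleable jobs near their efficiency optimum.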

Limitations & Future Work

  • Simulation fidelity: While ElastiSim models job resizing, it abstracts away communication overhead and checkpoint/restart costs that real applications may incur when changing node counts.
  • User adoption barrier: The study assumes jobs can be made malleable; in practice, retrofitting legacy codes is non‑trivial.
  • Policy scope: Only five heuristics were explored; machine‑learning‑driven or predictive policies could further improve outcomes.
  • Heterogeneous resources: The experiments focus on homogeneous node pools; extending the analysis to GPU‑accelerated or memory‑tiered nodes is an open direction.

Overall, the paper provides compelling evidence that even modest adoption of malleable job scheduling can unlock substantial efficiency gains in today’s HPC ecosystems, offering a clear roadmap for developers, scheduler engineers, and system operators alike.

Authors

  • Patrick Zojer
  • Jonas Posner
  • Taylan Özden

Paper Information

  • arXiv ID: 2602.17318v1
  • Categories: cs.DC
  • Published: February 19, 2026