[Paper] A Real-Time Digital Twin for Adaptive Scheduling

Published: December 21, 2025
4 min read

Source: arXiv - 2512.18894v1

Overview

The paper introduces SchedTwin, a real‑time digital‑twin framework that continuously mirrors an HPC cluster’s scheduler, runs rapid “what‑if” simulations of alternative policies, and automatically picks the best one for the current workload. By turning the traditionally static, heuristic‑driven scheduling loop into an adaptive decision engine, the authors demonstrate measurable gains on a production PBS system with only a few seconds of overhead per scheduling cycle.

Key Contributions

  • Digital‑twin architecture for scheduling – a lightweight, continuously updated replica of the live scheduler that can evaluate multiple policies in parallel.
  • Fast what‑if simulation engine – a high‑fidelity discrete‑event simulator optimized to return results within seconds, enabling real‑time feedback.
  • Policy‑selection controller – an algorithm that maps simulation outcomes to administrator‑defined objectives (e.g., throughput, fairness, energy).
  • Open‑source implementation – SchedTwin is released under a permissive license and integrated with the widely used PBS scheduler.
  • Empirical validation – experiments on a production HPC cluster show consistent performance improvements over static policies such as FCFS, backfill, and priority‑based scheduling.

Methodology

  1. Event Ingestion – SchedTwin hooks into the production scheduler (PBS) and periodically pulls job submissions, completions, and resource‑state updates (a minimal ingestion sketch follows this list).
  2. State Replication – The captured events are used to reconstruct the current cluster state inside a discrete‑event simulation model that mirrors the real hardware (node counts, core counts, network topology).
  3. Policy Evaluation – For each scheduling cycle, the twin runs several candidate policies (e.g., backfill, shortest‑job‑first, energy‑aware) on the simulated state. Because the simulator is highly optimized (event‑driven, minimal bookkeeping), each run finishes in a few seconds.
  4. Objective‑Driven Selection – The outcomes (e.g., predicted job wait time, system utilization, power consumption) are scored against the administrator’s objective function. The policy with the best score is selected and its decisions are handed back to the live scheduler (steps 3–4 are sketched after this list).
  5. Feedback Loop – After the real scheduler executes the chosen decisions, the next cycle repeats, keeping the twin synchronized with the actual system.
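
To make step 1 concrete, here is a minimal ingestion sketch, not the authors’ released code: it polls a PBS server for job and node state and packs the result into a snapshot object. The `ClusterSnapshot` container is a hypothetical stand‑in, and the JSON output flags assume a reasonably recent PBS Pro installation whose `qstat` and `pbsnodes` support `-F json`.

```python
import json
import subprocess
from dataclasses import dataclass, field


@dataclass
class ClusterSnapshot:
    """Hypothetical container for the twin's view of the cluster."""
    jobs: dict = field(default_factory=dict)   # job id -> attribute dict
    nodes: dict = field(default_factory=dict)  # node name -> attribute dict


def poll_pbs() -> ClusterSnapshot:
    """Pull current job and node state from a PBS Pro server.

    Assumes `qstat -f -F json` and `pbsnodes -a -F json` are available;
    the top-level JSON keys ("Jobs", "nodes") are likewise assumptions
    about the PBS Pro output format.
    """
    snap = ClusterSnapshot()
    jobs_raw = subprocess.run(
        ["qstat", "-f", "-F", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    snap.jobs = json.loads(jobs_raw).get("Jobs", {})
    nodes_raw = subprocess.run(
        ["pbsnodes", "-a", "-F", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    snap.nodes = json.loads(nodes_raw).get("nodes", {})
    return snap
```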
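
Steps 3 and 4 reduce to a compare‑and‑select loop: run each candidate policy on an isolated copy of the replicated state, score the predicted metrics with the administrator’s weighted objective, and hand the winner back to the live scheduler. The sketch below illustrates that loop under assumed interfaces; the policy callables, metric names, and weights are illustrative, not the paper’s actual API.

```python
import copy
from typing import Callable, Dict

# A candidate policy maps a cluster snapshot to predicted outcome
# metrics. In SchedTwin this would drive the discrete-event simulator;
# here it is an opaque callable for illustration.
Policy = Callable[[dict], Dict[str, float]]


def score(outcome: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted objective; lower is better. Utilization enters with a
    negative weight so that higher utilization improves the score."""
    return sum(weights[k] * outcome[k] for k in weights)


def select_policy(snapshot: dict,
                  policies: Dict[str, Policy],
                  weights: Dict[str, float]) -> str:
    """Simulate every candidate on an independent copy of the state and
    return the name of the best-scoring policy (steps 3-4)."""
    best_name, best_score = None, float("inf")
    for name, policy in policies.items():
        outcome = policy(copy.deepcopy(snapshot))  # isolate each what-if run
        s = score(outcome, weights)
        if s < best_score:
            best_name, best_score = name, s
    return best_name


if __name__ == "__main__":
    # Illustrative weights: wait time and energy are penalized,
    # utilization is rewarded. All metric names are assumptions.
    weights = {"mean_wait_min": 1.0, "utilization": -10.0, "kwh_per_job": 5.0}
    # Dummy candidates whose "predictions" echo the paper's measured
    # numbers, standing in for real simulator runs.
    policies = {
        "fcfs":     lambda s: {"mean_wait_min": 12.4, "utilization": 0.78, "kwh_per_job": 0.42},
        "backfill": lambda s: {"mean_wait_min": 9.1,  "utilization": 0.84, "kwh_per_job": 0.38},
    }
    print(select_policy({"jobs": {}, "nodes": {}}, policies, weights))  # -> backfill
```

In the full feedback loop (step 5), this selection would run once per scheduling cycle on a freshly synchronized snapshot.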

Results & Findings

| Metric | Static Policy (Baseline) | SchedTwin (Best Policy) | Improvement |
| --- | --- | --- | --- |
| Average job wait time | 12.4 min | 9.1 min | -27 % |
| System utilization (CPU) | 78 % | 84 % | +6 points |
| Energy‑to‑solution (kWh per job) | 0.42 | 0.38 | -9 % |
| Overhead per scheduling cycle | n/a | 2–4 s | negligible vs. multi‑hour scheduling windows |

The authors emphasize that SchedTwin never degrades performance: even when the simulated “best” policy proves sub‑optimal for a particular workload, the per‑cycle overhead is low enough that the live scheduler can fall back to its default policy without noticeable impact.

Practical Implications

  • Dynamic workload adaptation – Data centers can automatically shift between throughput‑oriented and fairness‑oriented policies as job mixes change throughout the day.
  • Energy savings – By selecting energy‑aware policies when utilization is low, operators can reduce power consumption without sacrificing job turnaround.
  • Reduced admin burden – Administrators no longer need to manually tune heuristic parameters; the twin continuously optimizes against the chosen objective.
  • Plug‑and‑play for existing stacks – Because SchedTwin integrates with PBS (and, with minor adapters, other batch schedulers such as Slurm), organizations can adopt it without a full system redesign.
  • Foundation for AI‑enhanced scheduling – The digital‑twin framework provides a sandbox where machine‑learning models can be trained and evaluated safely before deployment.

Limitations & Future Work

  • Scalability to exascale clusters – The current prototype is validated on a mid‑size production system; scaling the simulation to tens of thousands of nodes may require further parallelization.
  • Policy library breadth – Only a handful of classic policies were evaluated; extending the framework to incorporate more sophisticated, domain‑specific heuristics (e.g., GPU‑aware scheduling) is left for future work.
  • Robustness to prediction errors – The twin assumes that the simulated model faithfully reflects real hardware behavior; mismatches (e.g., network contention) could lead to sub‑optimal selections.
  • User‑level QoS constraints – Incorporating per‑user or per‑project SLAs into the objective function remains an open challenge.

The authors plan to explore distributed simulation techniques, richer policy catalogs, and tighter integration with machine‑learning‑based decision makers to address these gaps.

Authors

  • Yihe Zhang
  • Yash Kurkure
  • Yiheng Tao
  • Michael E. Papka
  • Zhiling Lan

Paper Information

  • arXiv ID: 2512.18894v1
  • Categories: cs.DC
  • Published: December 21, 2025