[Paper] Enabling Scientific Workflow Scheduling Research in Non-Uniform Memory Access Architectures

Published: November 24, 2025 at 08:50 PM EST
4 min read
Source: arXiv - 2511.19832v1

Overview

The paper presents nFlows, a runtime system that brings NUMA‑awareness to scientific workflow scheduling on modern high‑performance computing (HPC) nodes. By exposing the memory‑locality quirks of multi‑domain CPUs, HBM/DRAM hierarchies, and attached accelerators, nFlows lets researchers and engineers model, simulate, and run workflows with realistic NUMA effects—something most existing schedulers, built for Grid or Cloud environments, simply ignore.

Key Contributions

  • nFlows Runtime – a full‑stack execution environment that models NUMA domains, heterogeneous memory (HBM vs. DRAM), and accelerator placement (GPUs, FPGAs, NICs).
  • Unified Simulation‑to‑Bare‑Metal Flow – the same workflow description can be executed in a fast discrete‑event simulator or on real hardware without code changes.
  • NUMA‑Aware Scheduling API – hooks for plugging in custom placement heuristics that account for data locality at the node level (see the sketch after this list).
  • Validation Framework – systematic methodology to compare simulated predictions against measurements on actual NUMA‑based HPC nodes.
  • Open‑source Prototype – the authors release the core components, enabling the community to reproduce experiments and extend the platform.
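
The summary above does not show the concrete scheduling API, so the following is a hypothetical Python sketch of the kind of placement hook such an API could accept: given a task's memory footprint and the domains holding its inputs, return the NUMA domain it should run on. The names (`Task`, `place_task`) and domain numbering are illustrative, not taken from nFlows.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    """Hypothetical task descriptor: memory footprint in MiB and the NUMA
    domains where the task's input data currently resides."""
    name: str
    mem_mib: int
    input_domains: List[int]

def place_task(task: Task, free_mem_mib: Dict[int, int]) -> int:
    """Toy placement hook: prefer a domain that already holds the task's
    inputs and has enough free memory; otherwise fall back to the domain
    with the most free memory."""
    for domain in task.input_domains:
        if free_mem_mib.get(domain, 0) >= task.mem_mib:
            return domain
    return max(free_mem_mib, key=free_mem_mib.get)

# Example: two domains; the 2 GiB input already lives on domain 1.
print(place_task(Task("fft_stage", 2048, [1]), {0: 8192, 1: 4096}))  # -> 1
```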

Methodology

  1. System Modeling – The authors first characterize a typical modern HPC node (multiple CPU sockets, each with several NUMA domains, HBM stacks, DRAM banks, and PCIe‑attached devices). They capture latency and bandwidth matrices for memory accesses across these domains.
  2. Workflow Representation – Scientific workflows are expressed as directed acyclic graphs (DAGs) where nodes are tasks and edges are data dependencies. Each task carries metadata about required memory size, compute intensity, and optional accelerator affinity.
  3. Runtime Engine – nFlows parses the DAG, queries the NUMA topology via Linux numactl/hwloc, and schedules tasks onto specific cores and memory regions. It also pins data buffers to the chosen NUMA node to enforce locality.
  4. Simulation Layer – A discrete‑event simulator reuses the same scheduling code but replaces actual execution with estimated compute/transfer times derived from the latency/bandwidth model. This enables rapid “what‑if” studies (a minimal sketch of such a model follows this list).
  5. Validation – The authors run a set of representative data‑intensive workflows (e.g., genomics pipelines, climate simulations) both in simulation and on a 2‑socket, 8‑NUMA‑domain testbed equipped with HBM and GPUs. They compare makespan, memory bandwidth utilization, and inter‑NUMA traffic.
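
As a minimal sketch (not the authors' implementation) of how steps 1–4 fit together, the snippet below encodes a toy per-node latency/bandwidth matrix, uses it to pick the NUMA domain that minimizes the estimated input-transfer cost for a task, and reuses the same model to produce a simulation-style time estimate. All numbers, helper names (e.g. `transfer_time_s`), and the two-domain topology are illustrative assumptions.

```python
# Inter-domain bandwidth in GiB/s and latency in microseconds for a node
# with two NUMA domains (row = source domain, column = destination domain).
# These values are placeholders, not measured characterization data.
BANDWIDTH = [[50.0, 20.0],
             [20.0, 50.0]]
LATENCY_US = [[0.1, 0.3],
              [0.3, 0.1]]

def transfer_time_s(size_gib: float, src: int, dst: int) -> float:
    """Estimated time to move `size_gib` of data from domain `src` to `dst`."""
    return LATENCY_US[src][dst] * 1e-6 + size_gib / BANDWIDTH[src][dst]

def best_domain(input_gib: float, data_domain: int, domains=(0, 1)) -> int:
    """Pick the domain that minimizes the estimated input-transfer cost."""
    return min(domains, key=lambda d: transfer_time_s(input_gib, data_domain, d))

# A DAG task whose 4 GiB input currently sits on domain 0:
task = {"name": "align_reads", "input_gib": 4.0, "data_domain": 0}
chosen = best_domain(task["input_gib"], task["data_domain"])
print("place on domain", chosen,
      "estimated transfer (s):",
      transfer_time_s(task["input_gib"], task["data_domain"], chosen))
```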

Results & Findings

  • Simulation Accuracy – Predicted makespans were within ±8 % of the measured runs, confirming that the latency/bandwidth model captures the dominant NUMA effects.
  • Performance Gains – NUMA‑aware placement reduced inter‑domain memory traffic by 30‑45 %, translating into 10‑20 % lower overall workflow execution time compared to a naïve round‑robin scheduler.
  • Accelerator Co‑Location – Pinning GPU‑bound tasks to the same NUMA node as their associated NIC cut data‑transfer latency by ≈15 %, benefitting I/O‑heavy stages.
  • In‑Memory Execution Feasibility – By keeping intermediate datasets in HBM on the same domain as the consuming task, the authors demonstrated further speed‑ups for memory‑bound kernels.

Practical Implications

  • HPC Application Developers can integrate nFlows (or its API concepts) into existing workflow engines (e.g., Pegasus, Airflow) to exploit NUMA locality automatically without hand‑tuning; a small pinning sketch follows this list.
  • Scheduler Vendors gain a testbed for prototyping NUMA‑aware heuristics—such as domain‑aware backfilling or HBM‑first placement—before shipping them to production clusters.
  • System Administrators receive a diagnostic tool that highlights NUMA‑induced bottlenecks, helping them to configure BIOS/OS settings (e.g., memory interleaving) for optimal throughput.
  • Cloud‑Edge Providers that expose bare‑metal instances with NUMA characteristics can use nFlows to offer “NUMA‑optimized” workflow services, differentiating themselves from generic VM‑based offerings.
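
nFlows itself relies on numactl/hwloc for placement; as an illustration of the node-level pinning a workflow engine could apply on Linux, here is a small Python-only sketch that reads a NUMA node's CPU list from sysfs and restricts the current process to those cores. The sysfs path is standard on Linux, but the helper names and the choice of node 0 are assumptions, and memory binding is deliberately left out.

```python
import os

def cpus_of_numa_node(node):
    """Parse /sys/devices/system/node/nodeN/cpulist (e.g. '0-7,16-23')."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_node(node):
    """Restrict the calling process (pid 0 = self) to one NUMA domain's cores.
    CPU affinity only; memory binding would be handled separately
    (e.g. via numactl --membind or hwloc)."""
    os.sched_setaffinity(0, cpus_of_numa_node(node))

if __name__ == "__main__":
    pin_to_node(0)  # node 0 chosen purely for illustration
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```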

Limitations & Future Work

  • The current prototype targets Linux x86‑64 nodes; ARM‑based or emerging disaggregated memory systems are not yet supported.
  • Only a subset of accelerators (NVIDIA GPUs, Intel FPGAs) were evaluated; extending support to AMD GPUs or custom ASICs remains open.
  • The authors acknowledge that their latency model assumes static bandwidth; dynamic contention (e.g., from OS background traffic) can degrade prediction accuracy.
  • Future directions include adaptive scheduling that reacts to runtime telemetry, integration with container orchestration (Kubernetes), and support for distributed NUMA across multiple nodes (e.g., via RDMA‑aware placement).

Authors

  • Aurelio Vivas
  • Harold Castro

Paper Information

  • arXiv ID: 2511.19832v1
  • Categories: cs.DC
  • Published: November 25, 2025
  • PDF: Download PDF
