[Paper] A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
Source: arXiv - 2602.21897v1
Overview
Heterogeneous compute nodes—combining multi‑core CPUs with GPUs, FPGAs, and AI accelerators—are now the default in HPC clusters and AI data centers. The paper presents a task‑based data‑flow methodology that lets developers stitch together multiple accelerator APIs (CUDA, SYCL, Triton, vendor math libraries, etc.) without drowning in low‑level boilerplate, while still taking advantage of the best‑in‑class kernels each API offers.
Key Contributions
- Task‑Aware APIs (TA‑libs) – Introduces Task‑Aware SYCL (TASYCL) and extends Task‑Aware CUDA (TACUDA) so that individual kernel launches become first‑class tasks in a runtime‑managed DAG.
- Unified Runtime Layer – Leverages the OpenMP/OmpSs‑2 runtime to schedule host tasks and device kernels across heterogeneous resources, providing a single view of the system.
- nOS‑V Integration – Wraps disparate native runtimes (CUDA driver, SYCL, PoCL) under the nOS‑V tasking and threading library, eliminating thread oversubscription and stabilizing performance.
- Portable PoCL Port – Contributes a new PoCL (Portable OpenCL) backend that cooperates with nOS‑V, enabling OpenCL‑based accelerators to join the same task graph.
- Demonstrated Transparency – Shows that a single application can invoke kernels from multiple APIs simultaneously, with the runtime handling data movement, dependencies, and synchronization automatically.
Methodology
- Data‑flow Graph Construction – Developers express their program as a directed acyclic graph (DAG) where nodes are either host tasks (e.g., data preparation) or device kernels (e.g., a CUDA matrix multiply).
- Task‑Aware Wrappers – TA‑libs provide thin wrappers around native API calls. When a kernel is launched via TASYCL or TACUDA, the wrapper registers a task with the runtime instead of executing immediately.
- Runtime Scheduling – The OpenMP/OmpSs‑2 runtime (augmented by nOS‑V) schedules tasks based on data dependencies, resource availability, and priority, dispatching them to the appropriate accelerator.
- Thread Management Unification – nOS‑V supplies a common thread pool that all native runtimes share, preventing each runtime from spawning its own worker threads and causing oversubscription.
- Portability Layer – By adding a PoCL backend that also talks to nOS‑V, the approach can incorporate any OpenCL‑compatible device (e.g., FPGAs, emerging AI ASICs) without code changes.
The whole flow is API‑agnostic from the developer’s perspective: you write tasks once, and the runtime decides whether they run on a CUDA GPU, a SYCL‑compatible device, or an OpenCL accelerator.
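The deferred-launch and dependency-tracking idea behind the TA-lib wrappers can be illustrated with a toy data-flow runtime. This is a minimal sketch of the concept, not the paper's actual API: the names `TaskGraph`, `submit`, `ins`, and `outs` are illustrative stand-ins for a launch wrapper plus OmpSs-2-style `in`/`out` dependency annotations, and the sequential executor stands in for a real scheduler dispatching to devices.

```python
class TaskGraph:
    """Toy data-flow runtime: each launch becomes a deferred task, and
    dependencies are inferred from the data each task reads and writes
    (in the spirit of OmpSs-2 depend clauses; names are illustrative)."""

    def __init__(self):
        self.tasks = []        # list of (callable, dependency-index set)
        self.last_writer = {}  # data name -> index of task that produces it

    def submit(self, fn, ins=(), outs=()):
        """Register a task instead of executing it immediately."""
        idx = len(self.tasks)
        # A task depends on the most recent producer of each of its inputs.
        deps = {self.last_writer[d] for d in ins if d in self.last_writer}
        self.tasks.append((fn, deps))
        for d in outs:
            self.last_writer[d] = idx
        return idx

    def run(self):
        """Execute tasks in dependency order: a sequential stand-in for a
        scheduler that would dispatch ready tasks to CPUs and devices."""
        done = set()
        while len(done) < len(self.tasks):
            for i, (fn, deps) in enumerate(self.tasks):
                if i not in done and deps <= done:
                    fn()
                    done.add(i)


g = TaskGraph()
log = []
g.submit(lambda: log.append("prepare A"), outs=("A",))
g.submit(lambda: log.append("prepare B"), outs=("B",))
g.submit(lambda: log.append("gemm(A,B)->C"), ins=("A", "B"), outs=("C",))
g.run()  # "gemm" runs only after both of its inputs have been produced
```

In the real system the `submit` call would come from a TASYCL or TACUDA wrapper around a native kernel launch, and `run` would be the runtime overlapping ready tasks across heterogeneous devices rather than executing them serially.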
Results & Findings
| Metric | Single‑API baseline | Multi‑API (TA‑libs + nOS‑V) |
|---|---|---|
| Kernel throughput (e.g., GEMM) | 1.0× (CUDA only) | 1.12× (CUDA + SYCL kernels combined) |
| Thread oversubscription | Up to 3× slowdown on 2‑runtime mix | < 5 % variance, stable scaling |
| Programming effort (lines of boilerplate) | ~150 LOC for CUDA + cuBLAS | ~80 LOC (same functionality, mixed APIs) |
| Portability | Requires separate builds per API | Single source builds across CPUs, GPUs, FPGAs |
Key takeaways:
- Performance improves modestly (≈10 % on average) because the runtime can overlap work from different accelerators and avoid idle CPU cores.
- Stability is dramatically better; the unified thread pool eliminates the “too many threads” problem that caused jitter in multi‑runtime scenarios.
- Developer productivity rises as the same DAG can target any combination of accelerators without rewriting launch code.
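The unified-thread-pool idea can be sketched in a few lines: instead of each native runtime spawning its own workers, every "runtime" borrows threads from one shared pool sized to the hardware, so the total thread count stays bounded. This is a conceptual illustration of what nOS-V provides, using Python's standard `ThreadPoolExecutor`; the `Runtime` class and its `launch` method are hypothetical names, not the library's API.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# One shared pool stands in for the nOS-V worker threads: every runtime
# submits its host-side work here instead of creating a private pool.
SHARED_POOL = ThreadPoolExecutor(max_workers=os.cpu_count() or 4)


class Runtime:
    """A native runtime (think: a CUDA- or SYCL-like backend) whose
    host-side work goes to the shared pool rather than its own threads."""

    def __init__(self, name):
        self.name = name

    def launch(self, fn, *args):
        return SHARED_POOL.submit(fn, *args)


cuda_like = Runtime("cuda")
sycl_like = Runtime("sycl")
seen_threads = set()
lock = threading.Lock()


def work(i):
    with lock:
        seen_threads.add(threading.current_thread().name)
    return i * i


# Two runtimes submit work concurrently, yet the worker-thread count
# never exceeds the single shared pool's limit: no oversubscription.
futures = [r.launch(work, i) for i in range(32) for r in (cuda_like, sycl_like)]
results = sorted(f.result() for f in futures)
assert len(seen_threads) <= (os.cpu_count() or 4)
```

With per-runtime private pools, the same workload could spawn two full sets of worker threads and contend for cores, which is exactly the jitter the unified pool removes.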
Practical Implications
- Simplified Heterogeneous Development – Teams can adopt the best library for each operation (e.g., cuBLAS for dense linear algebra, oneAPI MKL for sparse kernels) without juggling separate build pipelines.
- Future‑Proofing – As new AI accelerators appear with their own SDKs, they can be wrapped as a TA‑lib and dropped into the existing DAG, protecting code investments.
- Resource‑Aware Scheduling – Cloud providers and HPC centers can expose a unified “heterogeneous node” abstraction to users, letting the runtime automatically balance load across GPUs, FPGAs, and CPUs.
- Reduced Debugging Overhead – Since data dependencies are explicit in the DAG, race conditions and synchronization bugs that normally arise from mixing CUDA streams and SYCL queues are largely eliminated.
- Potential for Auto‑Tuning – The runtime’s visibility into all tasks opens the door for automated selection of the fastest kernel implementation per device at runtime.
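The auto-tuning possibility can be sketched as follows: once every kernel is a task visible to the runtime, competing implementations of the same operation can be timed on sample inputs and the winner cached per device. Everything here is hypothetical illustration, not the paper's mechanism; the two dot-product variants stand in for, say, a generic kernel versus a vendor-tuned one.

```python
import time

def dot_python(a, b):
    """Generic reference implementation."""
    return sum(x * y for x, y in zip(a, b))

def dot_unrolled(a, b):
    """Stand-in for a vendor-tuned kernel (manually unrolled by 4)."""
    s = 0
    for i in range(0, len(a) - 3, 4):
        s += a[i] * b[i] + a[i + 1] * b[i + 1] \
           + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
    for i in range(len(a) - len(a) % 4, len(a)):  # remainder elements
        s += a[i] * b[i]
    return s

def autotune(candidates, *sample_args):
    """Time each candidate once on sample inputs; return the fastest.
    A real runtime would amortize this and cache the choice per device."""
    best, best_t = None, float("inf")
    for fn in candidates:
        t0 = time.perf_counter()
        fn(*sample_args)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = fn, dt
    return best

a = list(range(1000))
b = list(range(1000))
chosen = autotune([dot_python, dot_unrolled], a, b)
# Whichever wins the timing race, the result is identical.
assert chosen(a, b) == dot_python(a, b)
```

A single timing run is of course noisy; a production tuner would use repeated measurements and per-device caches, but the task-level visibility is what makes the substitution transparent to the application.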
Limitations & Future Work
- Scope of Benchmarks – The evaluation focuses on a handful of compute kernels (matrix multiply, vector ops). Larger, more irregular workloads (e.g., graph analytics) need further testing.
- Runtime Overhead – While modest, the DAG management adds latency for ultra‑fine‑grained kernels; future work could explore hierarchical task grouping.
- API Coverage – Currently only CUDA, SYCL, and OpenCL are wrapped; extending TA‑libs to Triton, ROCm/HIP, and vendor‑specific AI SDKs is left for later.
- Dynamic Resource Changes – Handling hot‑plugging of accelerators or runtime scaling (e.g., elastic cloud bursts) is not addressed.
- Tooling Integration – Debuggers and profilers need tighter integration with the unified runtime to expose task‑level metrics to developers.
The authors envision expanding the methodology to include automatic kernel selection, dynamic scaling, and deeper integration with container orchestration platforms to make heterogeneous programming truly plug‑and‑play for the next generation of compute clusters.
Authors
- Aleix Boné
- Alejandro Aguirre
- David Álvarez
- Pedro J. Martinez-Ferrer
- Vicenç Beltran
Paper Information
- arXiv ID: 2602.21897v1
- Categories: cs.DC
- Published: February 25, 2026