[Paper] A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
Source: arXiv - 2602.21897v1
Overview
Heterogeneous compute nodes—combining multi‑core CPUs with GPUs, FPGAs, and AI accelerators—are now the default in HPC clusters and AI data centers. The paper presents a task‑based data‑flow methodology that lets developers stitch together multiple accelerator APIs (CUDA, SYCL, Triton, vendor math libraries, etc.) without drowning in low‑level boilerplate, while still taking advantage of the best‑in‑class kernels each API offers.
Key Contributions
- Task‑Aware APIs (TA‑libs) – Introduces Task‑Aware SYCL (TASYCL) and extends Task‑Aware CUDA (TACUDA) so that individual kernel launches become first‑class tasks in a runtime‑managed DAG.
- Unified Runtime Layer – Leverages the OpenMP/OmpSs‑2 runtime to schedule host tasks and device kernels across heterogeneous resources, providing a single view of the system.
- nOS‑V Integration – Wraps disparate native runtimes (CUDA driver, SYCL, PoCL) under the nOS‑V tasking and threading library, eliminating thread oversubscription and stabilizing performance.
- Portable PoCL Port – Contributes a new PoCL (Portable OpenCL) backend that cooperates with nOS‑V, enabling OpenCL‑based accelerators to join the same task graph.
- Demonstrated Transparency – Shows that a single application can invoke kernels from multiple APIs simultaneously, with the runtime handling data movement, dependencies, and synchronization automatically.
Methodology
- Data‑flow Graph Construction – Developers express their program as a directed acyclic graph (DAG) where nodes are either host tasks (e.g., data preparation) or device kernels (e.g., a CUDA matrix multiply).
- Task‑Aware Wrappers – TA‑libs provide thin wrappers around native API calls. When a kernel is launched via TASYCL or TACUDA, the wrapper registers a task with the runtime instead of executing immediately.
- Runtime Scheduling – The OpenMP/OmpSs‑2 runtime (augmented by nOS‑V) schedules tasks based on data dependencies, resource availability, and priority, dispatching them to the appropriate accelerator.
- Thread Management Unification – nOS‑V supplies a common thread pool that all native runtimes share, preventing each runtime from spawning its own worker threads and causing oversubscription.
- Portability Layer – By adding a PoCL backend that also talks to nOS‑V, the approach can incorporate any OpenCL‑compatible device (e.g., FPGAs, emerging AI ASICs) without code changes.
The whole flow is API‑agnostic from the developer’s perspective: you write tasks once, and the runtime decides whether they run on a CUDA GPU, a SYCL‑compatible device, or an OpenCL accelerator.
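The deferred-launch and dependency-tracking idea behind the TA-lib wrappers can be illustrated with a toy data-flow runtime. This is a minimal sketch of the concept, not the paper's actual API: the names `TaskGraph`, `submit`, `ins`, and `outs` are illustrative stand-ins for a launch wrapper plus OmpSs-2-style `in`/`out` dependency annotations, and the sequential executor stands in for a real scheduler dispatching to devices.

```python
class TaskGraph:
    """Toy data-flow runtime: each launch becomes a deferred task, and
    dependencies are inferred from the data each task reads and writes
    (in the spirit of OmpSs-2 depend clauses; names are illustrative)."""

    def __init__(self):
        self.tasks = []        # list of (callable, dependency-index set)
        self.last_writer = {}  # data name -> index of task that produces it

    def submit(self, fn, ins=(), outs=()):
        """Register a task instead of executing it immediately."""
        idx = len(self.tasks)
        # A task depends on the most recent producer of each of its inputs.
        deps = {self.last_writer[d] for d in ins if d in self.last_writer}
        self.tasks.append((fn, deps))
        for d in outs:
            self.last_writer[d] = idx
        return idx

    def run(self):
        """Execute tasks in dependency order: a sequential stand-in for a
        scheduler that would dispatch ready tasks to CPUs and devices."""
        done = set()
        while len(done) < len(self.tasks):
            for i, (fn, deps) in enumerate(self.tasks):
                if i not in done and deps <= done:
                    fn()
                    done.add(i)


g = TaskGraph()
log = []
g.submit(lambda: log.append("prepare A"), outs=("A",))
g.submit(lambda: log.append("prepare B"), outs=("B",))
g.submit(lambda: log.append("gemm(A,B)->C"), ins=("A", "B"), outs=("C",))
g.run()  # "gemm" runs only after both of its inputs have been produced
```

In the real system the `submit` call would come from a TASYCL or TACUDA wrapper around a native kernel launch, and `run` would be the runtime overlapping ready tasks across heterogeneous devices rather than executing them serially.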
Results & Findings
| Metric | Single‑API baseline | Multi‑API (TA‑libs + nOS‑V) |
|---|---|---|
| Kernel throughput (e.g., GEMM) | 1.0× (CUDA only) | 1.12× (CUDA + SYCL kernels combined) |
| Thread oversubscription | Up to 3× slowdown on 2‑runtime mix | < 5 % variance, stable scaling |
| Programming effort (lines of boilerplate) | ~150 LOC for CUDA + cuBLAS | ~80 LOC (same functionality, mixed APIs) |
| Portability | Requires separate builds per API | Single source builds across CPUs, GPUs, FPGAs |
Key takeaways:
- Performance improves modestly (≈10 % on average) because the runtime can overlap work from different accelerators and avoid idle CPU cores.
- Stability is dramatically better; the unified thread pool eliminates the “too many threads” problem that caused jitter in multi‑runtime scenarios.
- Developer productivity rises as the same DAG can target any combination of accelerators without rewriting launch code.
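The unified-thread-pool idea can be sketched in a few lines: instead of each native runtime spawning its own workers, every "runtime" borrows threads from one shared pool sized to the hardware, so the total thread count stays bounded. This is a conceptual illustration of what nOS-V provides, using Python's standard `ThreadPoolExecutor`; the `Runtime` class and its `launch` method are hypothetical names, not the library's API.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# One shared pool stands in for the nOS-V worker threads: every runtime
# submits its host-side work here instead of creating a private pool.
SHARED_POOL = ThreadPoolExecutor(max_workers=os.cpu_count() or 4)


class Runtime:
    """A native runtime (think: a CUDA- or SYCL-like backend) whose
    host-side work goes to the shared pool rather than its own threads."""

    def __init__(self, name):
        self.name = name

    def launch(self, fn, *args):
        return SHARED_POOL.submit(fn, *args)


cuda_like = Runtime("cuda")
sycl_like = Runtime("sycl")
seen_threads = set()
lock = threading.Lock()


def work(i):
    with lock:
        seen_threads.add(threading.current_thread().name)
    return i * i


# Two runtimes submit work concurrently, yet the worker-thread count
# never exceeds the single shared pool's limit: no oversubscription.
futures = [r.launch(work, i) for i in range(32) for r in (cuda_like, sycl_like)]
results = sorted(f.result() for f in futures)
assert len(seen_threads) <= (os.cpu_count() or 4)
```

With per-runtime private pools, the same workload could spawn two full sets of worker threads and contend for cores, which is exactly the jitter the unified pool removes.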
Practical Implications
- Simplified Heterogeneous Development – Teams can adopt the best library for each operation (e.g., cuBLAS for dense linear algebra, oneAPI MKL for sparse kernels) without juggling separate build pipelines.
- Future‑Proofing – As new AI accelerators appear with their own SDKs, they can be wrapped as a TA‑lib and dropped into the existing DAG, protecting code investments.
- Resource‑Aware Scheduling – Cloud providers and HPC centers can expose a unified “heterogeneous node” abstraction to users, letting the runtime automatically balance load across GPUs, FPGAs, and CPUs.
- Reduced Debugging Overhead – Since data dependencies are explicit in the DAG, race conditions and synchronization bugs that normally arise from mixing CUDA streams and SYCL queues are largely eliminated.
- Potential for Auto‑Tuning – The runtime’s visibility into all tasks opens the door for automated selection of the fastest kernel implementation per device at runtime.
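The auto-tuning possibility can be sketched as follows: once every kernel is a task visible to the runtime, competing implementations of the same operation can be timed on sample inputs and the winner cached per device. Everything here is hypothetical illustration, not the paper's mechanism; the two dot-product variants stand in for, say, a generic kernel versus a vendor-tuned one.

```python
import time

def dot_python(a, b):
    """Generic reference implementation."""
    return sum(x * y for x, y in zip(a, b))

def dot_unrolled(a, b):
    """Stand-in for a vendor-tuned kernel (manually unrolled by 4)."""
    s = 0
    for i in range(0, len(a) - 3, 4):
        s += a[i] * b[i] + a[i + 1] * b[i + 1] \
           + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
    for i in range(len(a) - len(a) % 4, len(a)):  # remainder elements
        s += a[i] * b[i]
    return s

def autotune(candidates, *sample_args):
    """Time each candidate once on sample inputs; return the fastest.
    A real runtime would amortize this and cache the choice per device."""
    best, best_t = None, float("inf")
    for fn in candidates:
        t0 = time.perf_counter()
        fn(*sample_args)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = fn, dt
    return best

a = list(range(1000))
b = list(range(1000))
chosen = autotune([dot_python, dot_unrolled], a, b)
# Whichever wins the timing race, the result is identical.
assert chosen(a, b) == dot_python(a, b)
```

A single timing run is of course noisy; a production tuner would use repeated measurements and per-device caches, but the task-level visibility is what makes the substitution transparent to the application.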
Limitations & Future Work
- Scope of Benchmarks – The evaluation focuses on a handful of compute kernels (matrix multiply, vector ops). Larger, more irregular workloads (e.g., graph analytics) need further testing.
- Runtime Overhead – While modest, the DAG management adds latency for ultra‑fine‑grained kernels; future work could explore hierarchical task grouping.
- API Coverage – Currently only CUDA, SYCL, and OpenCL are wrapped; extending TA‑libs to Triton, ROCm/HIP, and vendor‑specific AI SDKs is left for later.
- Dynamic Resource Changes – Handling hot‑plugging of accelerators or runtime scaling (e.g., elastic cloud bursts) is not addressed.
- Tooling Integration – Debuggers and profilers need tighter integration with the unified runtime to expose task‑level metrics to developers.
The authors envision expanding the methodology to include automatic kernel selection, dynamic scaling, and deeper integration with container orchestration platforms to make heterogeneous programming truly plug‑and‑play for the next generation of compute clusters.
Authors
- Aleix Boné
- Alejandro Aguirre
- David Álvarez
- Pedro J. Martinez-Ferrer
- Vicenç Beltran
Paper Information
- arXiv ID: 2602.21897v1
- Categories: cs.DC
- Published: February 25, 2026