[Paper] SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines

Published: June 3, 2026 at 06:38 PM EDT

Source: arXiv - 2606.05495v1

Overview

Achieving peak GPU performance remains a significant challenge because system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines because of these scheduling overheads. To address these problems, we propose a CUDA runtime framework for task-parallel pipelines that minimizes synchronization overheads and the gaps between kernel executions. The proposed solution combines two innovations: (1) a multi-stream task-parallel pipeline programming model that leverages event-chaining and work-stealing mechanisms to fully utilize available hardware resources; (2) a graph-based execution flow with per-stream buffers to ensure memory safety for multiple in-flight jobs running concurrently. Extensive evaluations on representative real-world workloads show a 1.15x to 1.44x speedup and an 18% to 54% reduction in scheduling overheads compared to state-of-the-art CUDA graph baselines.
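To make the abstract's argument concrete, here is a toy timing model in Python. It is not the paper's framework and does not call CUDA. It only illustrates, under stated assumptions, why removing a host round trip between dependent stages and allowing several in-flight jobs can shorten a pipeline. The stage names, engine labels, durations, the `sync_latency` parameter, and the `makespan` helper are all hypothetical and chosen for illustration. The numbers it produces are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str        # hypothetical stage label
    engine: str      # hardware resource the stage occupies (copy engine, compute)
    duration: float  # made-up duration in microseconds


def makespan(stages, num_jobs, num_buffers, sync_latency):
    """Return the finish time of the last job in a toy pipeline model.

    Assumptions of this sketch (not taken from the paper):
    - Each engine runs one stage at a time, in job order.
    - sync_latency is a fixed delay paid before every stage when the host
      must observe the previous stage's completion before launching the next.
      Device-side event chaining is modeled as sync_latency = 0.
    - Job j reuses the buffer of job j - num_buffers, so it cannot start
      until that earlier job has finished. This bounds in-flight jobs.
    """
    engine_free = {s.engine: 0.0 for s in stages}
    job_done = []
    for j in range(num_jobs):
        # Wait for the buffer slot this job will reuse, if any.
        t = job_done[j - num_buffers] if j >= num_buffers else 0.0
        for s in stages:
            # A stage starts once its dependency is visible and its engine is idle.
            start = max(t + sync_latency, engine_free[s.engine])
            t = start + s.duration
            engine_free[s.engine] = t
        job_done.append(t)
    return job_done[-1]


# Hypothetical three-stage job: copy in, compute, copy out.
STAGES = [
    Stage("h2d_copy", "copy_in", 10.0),
    Stage("kernel", "compute", 20.0),
    Stage("d2h_copy", "copy_out", 10.0),
]

if __name__ == "__main__":
    host_synced = makespan(STAGES, num_jobs=4, num_buffers=2, sync_latency=5.0)
    event_chained = makespan(STAGES, num_jobs=4, num_buffers=2, sync_latency=0.0)
    single_buffer = makespan(STAGES, num_jobs=4, num_buffers=1, sync_latency=0.0)
    print("host-synchronized:", host_synced)
    print("event-chained:", event_chained)
    print("event-chained, one buffer:", single_buffer)
```

In this model, dropping the per-stage host delay shortens the schedule, and allowing only one buffer serializes the jobs even without that delay, which is the intuition behind pairing device-side dependencies with per-stream buffers. How large either effect is on real hardware depends on the workload and is exactly what the paper's evaluation measures.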

Key Contributions

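The paper combines two innovations:

A multi-stream task-parallel pipeline programming model that uses event-chaining and work-stealing to keep compute cores and copy engines busy.
A graph-based execution flow with per-stream buffers that keeps memory safe while multiple jobs are in flight concurrently.

It is filed under the following arXiv categories:

cs.DC (Distributed, Parallel, and Cluster Computing)
cs.AR (Hardware Architecture)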

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

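For pipelines already built on CUDA graphs, the reported gains come from reducing host-device synchronization and the gaps between kernel executions, not from making individual kernels faster. The authors report a 1.15x to 1.44x speedup and 18% to 54% lower scheduling overheads against state-of-the-art CUDA graph baselines.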

Authors

Zhengxiong Li
Tsung-Wei Huang
Umit Ogras

Paper Information

arXiv ID: 2606.05495v1
Categories: cs.DC, cs.AR
Published: June 3, 2026
