[Paper] SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines

Published: June 3, 2026 at 06:38 PM EDT
Source: arXiv - 2606.05495v1

Overview

Achieving peak GPU performance remains difficult because system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Existing approaches also often underutilize hardware resources such as compute cores and copy engines because of these scheduling overheads. To address these problems, we propose a CUDA runtime framework for task-parallel pipelines that minimizes synchronization overheads and the gaps between kernel executions. The proposed solution combines two innovations: (1) a multi-stream task-parallel pipeline programming model that uses event-chaining and work-stealing mechanisms to fully utilize available hardware resources; and (2) a graph-based execution flow with per-stream buffers that ensures memory safety for multiple in-flight jobs running concurrently. Extensive evaluations on representative real-world workloads show a 1.15–1.44× speedup and an 18–54% reduction in scheduling overheads compared to state-of-the-art CUDA graph baselines.
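To make the intuition behind event chaining and per-stream buffers concrete, here is a minimal timing model in Python. It is a sketch under stated assumptions, not the paper's implementation: the stage names, durations, `sync_cost`, and both function names are hypothetical, each stage is assumed to own one in-order stream, and work-stealing and CUDA graph capture are not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    duration: float  # arbitrary time units (hypothetical values)


def synchronous_makespan(stages, num_jobs, sync_cost):
    """Baseline model: the host waits after every stage, so nothing overlaps."""
    per_job = sum(stage.duration + sync_cost for stage in stages)
    return num_jobs * per_job


def chained_makespan(stages, num_jobs, num_buffers):
    """Pipelined model: one in-order stream per stage, chained by events.

    finish[j][s] is the time job j leaves stage s. A stage starts once
      (a) the same job has finished the previous stage (event dependency),
      (b) the previous job has left this stage (in-order stream), and
      (c) for the first stage only, the buffer slot being reused has been
          released by job j - num_buffers (memory safety for in-flight jobs).
    """
    finish = [[0.0] * len(stages) for _ in range(num_jobs)]
    for j in range(num_jobs):
        for s, stage in enumerate(stages):
            ready = finish[j][s - 1] if s > 0 else 0.0
            stream_free = finish[j - 1][s] if j > 0 else 0.0
            reuses_slot = s == 0 and j >= num_buffers
            slot_free = finish[j - num_buffers][-1] if reuses_slot else 0.0
            finish[j][s] = max(ready, stream_free, slot_free) + stage.duration
    return finish[-1][-1]


# Hypothetical three-stage pipeline: copy in, compute, copy out.
stages = [Stage("h2d", 2.0), Stage("kernel", 3.0), Stage("d2h", 2.0)]
baseline = synchronous_makespan(stages, num_jobs=4, sync_cost=0.5)
pipelined = chained_makespan(stages, num_jobs=4, num_buffers=2)
print(f"baseline={baseline}, pipelined={pipelined}")
```

In this toy model, a single buffer forces jobs to run back to back, while additional buffers let copies and kernels from different jobs overlap. The numbers it produces are illustrative only and say nothing about the speedups reported in the paper.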

Key Contributions

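The abstract identifies two main contributions:

  • A multi-stream task-parallel pipeline programming model that uses event-chaining and work-stealing to keep compute cores and copy engines busy.
  • A graph-based execution flow with per-stream buffers that keeps multiple concurrent in-flight jobs memory-safe.

The paper is filed under cs.DC (Distributed, Parallel, and Cluster Computing) and cs.AR (Hardware Architecture).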

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

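For developers building CUDA graph pipelines, the reported 1.15–1.44× speedup and 18–54% reduction in scheduling overheads suggest that stream-event-triggered scheduling can recover throughput otherwise lost to host-device synchronization and gaps between kernel executions.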

Authors

  • Zhengxiong Li
  • Tsung-Wei Huang
  • Umit Ogras

Paper Information

  • arXiv ID: 2606.05495v1
  • Categories: cs.DC, cs.AR
  • Published: June 3, 2026
  • PDF: Download PDF