[Paper] Hummingbird: SLO-Oriented GPU Preemption at Microsecond-scale

Published: January 7, 2026 at 11:36 AM EST
3 min read
Source: arXiv - 2601.04071v1

Overview

The paper introduces Hummingbird, a GPU‑scheduling framework that can preempt running kernels on closed‑source GPUs within a few microseconds. By doing so, it lets high‑priority workloads meet their Service‑Level Objectives (SLOs) while still squeezing out idle GPU cycles for lower‑priority jobs, dramatically improving both latency guarantees and overall utilization.

Key Contributions

  • Microsecond‑scale preemption on commodity, closed‑source GPUs without hardware modifications.
  • SLO‑oriented scheduler that dynamically decides when to preempt based on each task’s latency target.
  • Idle‑time harvesting mechanism that safely inserts low‑priority kernels into the gaps left by preempted high‑priority work.
  • Comprehensive evaluation across multiple GPU architectures showing up to 9.7× better SLO attainment for high‑priority tasks and 2.4× higher throughput for low‑priority tasks compared to prior spatial/temporal sharing schemes.
  • Minimal impact on exclusive execution: when a high‑priority job runs alongside low‑priority jobs under Hummingbird, its SLO degradation is < 1 % relative to running alone.

Methodology

  1. Preemption Engine – The authors reverse‑engineer the GPU command submission pipeline to insert a lightweight “checkpoint” that can abort a running kernel and restore the GPU state in ~10 µs.
  2. SLO‑aware Scheduler – Each incoming kernel is annotated with an SLO deadline. The scheduler continuously monitors progress and predicts whether the current kernel will miss its deadline; if so, it triggers preemption.
  3. Idle‑Slice Collector – When a high‑priority kernel is preempted, the scheduler looks for short idle windows (often just a few hundred microseconds) and packs low‑priority kernels into them, using a simple bin‑packing heuristic (sketched, together with the deadline check from step 2, after this list).
  4. Evaluation Suite – Experiments were run on NVIDIA RTX 3080, RTX 4090, and a data‑center‑grade A100, using a mix of deep‑learning inference, video transcoding, and scientific simulation kernels. Baselines included the best‑known spatial sharing (MPS) and temporal sharing (GPU‑time slicing) systems.
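
The paper describes these mechanisms at the systems level rather than publishing scheduler code. As a rough illustration, the following is a minimal, self‑contained C++ sketch of how the deadline check in step 2 and the idle‑slice packing in step 3 could look. `Kernel`, `should_preempt`, `pack_idle_slice`, and the concrete numbers are assumptions made for this sketch, not the authors' API; the real logic runs inside the GPU command‑submission path, not in host‑side C++.

```cpp
// Minimal sketch (not the authors' implementation) of the two heuristics
// described in steps 2 and 3. All types, names, and constants are illustrative.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Kernel {
    uint64_t deadline_us;     // absolute SLO deadline (high-priority work)
    uint64_t est_runtime_us;  // profiled or predicted runtime
};

// Step 2: preempt the running low-priority kernel if waiting for it to drain
// would push the pending high-priority kernel past its SLO deadline.
bool should_preempt(uint64_t now_us,
                    uint64_t running_remaining_us,
                    const Kernel& pending_high_prio) {
    uint64_t finish_if_we_wait =
        now_us + running_remaining_us + pending_high_prio.est_runtime_us;
    return finish_if_we_wait > pending_high_prio.deadline_us;
}

// Step 3: shortest-first, first-fit packing of low-priority kernels into an
// idle window of `window_us` microseconds.
std::vector<Kernel> pack_idle_slice(std::vector<Kernel> candidates,
                                    uint64_t window_us) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Kernel& a, const Kernel& b) {
                  return a.est_runtime_us < b.est_runtime_us;
              });
    std::vector<Kernel> packed;
    uint64_t used = 0;
    for (const Kernel& k : candidates) {
        if (used + k.est_runtime_us <= window_us) {
            used += k.est_runtime_us;
            packed.push_back(k);
        }
    }
    return packed;
}

int main() {
    // A 400 us inference kernel with a 1 ms deadline, behind 800 us of
    // low-priority work: waiting would miss the SLO, so preempt.
    Kernel inference{ /*deadline_us=*/1000, /*est_runtime_us=*/400 };
    printf("preempt? %d\n", should_preempt(/*now=*/0, /*remaining=*/800, inference));

    // Pack background kernels into a 600 us idle window.
    auto packed = pack_idle_slice({{0, 300}, {0, 150}, {0, 500}}, /*window_us=*/600);
    printf("packed %zu low-priority kernels\n", packed.size());
    return 0;
}
```

A shortest‑first, first‑fit pass is one straightforward reading of the "simple bin‑packing heuristic" the authors mention; their exact heuristic and progress‑prediction model may differ.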

Results & Findings

| Metric | Hummingbird vs. Spatial Sharing | Hummingbird vs. Temporal Sharing |
|---|---|---|
| High‑priority SLO attainment | 9.7× improvement | 3.5× improvement |
| Low‑priority throughput | 2.4× higher | |
| SLO degradation vs. exclusive run | < 1 % | |
| Preemption latency | ~12 µs (average) | |

  • Latency guarantees: High‑priority jobs consistently finish within their deadlines, even when co‑running with multiple low‑priority workloads.
  • Utilization boost: The system fills > 80 % of otherwise wasted GPU idle time, raising overall utilization from ~55 % (baseline) to > 90 %.
  • Scalability: Performance gains hold across different GPU generations, indicating the approach is not tied to a specific hardware revision.

Practical Implications

  • Cloud GPU services can offer tiered pricing (premium low‑latency vs. bulk low‑cost) while still guaranteeing SLOs, without needing specialized hardware.
  • Edge AI devices (e.g., autonomous drones) can run safety‑critical inference kernels alongside background analytics, ensuring real‑time response without sacrificing battery‑friendly throughput.
  • CI/CD pipelines for ML can schedule model training (low‑priority) and inference serving (high‑priority) on the same GPU node, cutting infrastructure costs.
  • Framework integration: The preemption primitives could be exposed via CUDA driver extensions or as a middleware layer, allowing existing libraries (TensorRT, PyTorch) to benefit with minimal code changes (see the sketch below).
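
The authors do not ship such an integration layer. As a rough point of reference, the sketch below uses CUDA's existing stream‑priority API, which is only a coarse scheduling hint and does not provide microsecond‑scale preemption, to show where an application‑visible SLO annotation could plug in. The `cudaStream*` calls are real CUDA runtime APIs; `hbStreamSetSLO` is a purely hypothetical call used to mark the proposed extension point.

```cpp
// Sketch of where a Hummingbird-style middleware hook could sit in an
// application that already separates latency-critical and background work
// onto prioritized CUDA streams. hbStreamSetSLO() is hypothetical.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int least, greatest;  // in CUDA, numerically lower values mean higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t serving, training;
    // High-priority stream for latency-critical inference kernels.
    cudaStreamCreateWithPriority(&serving, cudaStreamNonBlocking, greatest);
    // Low-priority stream for throughput work that harvests idle time.
    cudaStreamCreateWithPriority(&training, cudaStreamNonBlocking, least);

    // Hypothetical Hummingbird-style extension: attach an SLO deadline so the
    // scheduler knows when preempting the low-priority stream is required.
    // hbStreamSetSLO(serving, /*deadline_us=*/500);

    // ... launch kernels on `serving` and `training` as usual ...

    cudaStreamDestroy(serving);
    cudaStreamDestroy(training);
    printf("priority range: %d (lowest) .. %d (highest)\n", least, greatest);
    return 0;
}
```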

Limitations & Future Work

  • Closed‑source reliance: The technique depends on undocumented GPU driver behavior; future driver updates could break the preemption path.
  • Overhead for very short kernels: When kernels run for < 50 µs, the preemption cost can dominate, limiting applicability to longer workloads.
  • Scheduler heuristics: Current bin‑packing is simple; more sophisticated predictive models could further reduce SLO miss rates.
  • Multi‑GPU coordination: The paper focuses on a single GPU; extending Hummingbird to orchestrate preemption across a GPU cluster is an open challenge.

Overall, Hummingbird demonstrates that fine‑grained, microsecond‑level preemption is feasible on today’s GPUs and can unlock a new class of latency‑aware, high‑utilization workloads for both cloud and edge environments.

Authors

  • Tiancheng Hu
  • Chenxi Wang
  • Ting Cao
  • Jin Qin
  • Lei Chen
  • Xinyu Xiao
  • Junhao Hu
  • Hongliang Tian
  • Shoumeng Yan
  • Huimin Cui
  • Quan Chen
  • Tao Xie

Paper Information

  • arXiv ID: 2601.04071v1
  • Categories: cs.DC
  • Published: January 7, 2026