[Paper] Hummingbird: SLO-Oriented GPU Preemption at Microsecond-scale

Published: January 7, 2026 at 11:36 AM EST
3 min read
Source: arXiv - 2601.04071v1

Overview

The paper introduces Hummingbird, a GPU‑scheduling framework that can preempt running kernels on closed‑source GPUs within a few microseconds. By doing so, it lets high‑priority workloads meet their Service‑Level Objectives (SLOs) while still squeezing out idle GPU cycles for lower‑priority jobs, dramatically improving both latency guarantees and overall utilization.

Key Contributions

  • Microsecond‑scale preemption on commodity, closed‑source GPUs without hardware modifications.
  • SLO‑oriented scheduler that dynamically decides when to preempt based on each task’s latency target.
  • Idle‑time harvesting mechanism that safely inserts low‑priority kernels into the gaps left by preempted high‑priority work.
  • Comprehensive evaluation across multiple GPU architectures showing up to 9.7× better SLO attainment for high‑priority tasks and 2.4× higher throughput for low‑priority tasks compared to prior spatial/temporal sharing schemes.
  • Minimal impact on exclusive execution: when a high‑priority job runs alongside low‑priority jobs under Hummingbird, its SLO degradation is < 1 % relative to running alone.

Methodology

  1. Preemption Engine – The authors reverse‑engineer the GPU command submission pipeline to insert a lightweight “checkpoint” that can abort a running kernel and restore the GPU state in ~10 µs.
  2. SLO‑aware Scheduler – Each incoming kernel is annotated with an SLO deadline. The scheduler continuously monitors progress and predicts whether the current kernel will miss its deadline; if so, it triggers preemption.
  3. Idle‑Slice Collector – When a high‑priority kernel is preempted, the scheduler looks for short idle windows (often just a few hundred microseconds) and packs low‑priority kernels into them, using a simple bin‑packing heuristic (sketched, together with the deadline check from step 2, after this list).
  4. Evaluation Suite – Experiments were run on NVIDIA RTX 3080, RTX 4090, and a data‑center‑grade A100, using a mix of deep‑learning inference, video transcoding, and scientific simulation kernels. Baselines included the best‑known spatial sharing (MPS) and temporal sharing (GPU‑time slicing) systems.
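
The paper describes these mechanisms at the systems level rather than publishing scheduler code. As a rough illustration, the following is a minimal, self‑contained C++ sketch of how the deadline check in step 2 and the idle‑slice packing in step 3 could look. `Kernel`, `should_preempt`, `pack_idle_slice`, and the concrete numbers are assumptions made for this sketch, not the authors' API; the real logic runs inside the GPU command‑submission path, not in host‑side C++.

```cpp
// Minimal sketch (not the authors' implementation) of the two heuristics
// described in steps 2 and 3. All types, names, and constants are illustrative.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Kernel {
    uint64_t deadline_us;     // absolute SLO deadline (high-priority work)
    uint64_t est_runtime_us;  // profiled or predicted runtime
};

// Step 2: preempt the running low-priority kernel if waiting for it to drain
// would push the pending high-priority kernel past its SLO deadline.
bool should_preempt(uint64_t now_us,
                    uint64_t running_remaining_us,
                    const Kernel& pending_high_prio) {
    uint64_t finish_if_we_wait =
        now_us + running_remaining_us + pending_high_prio.est_runtime_us;
    return finish_if_we_wait > pending_high_prio.deadline_us;
}

// Step 3: shortest-first, first-fit packing of low-priority kernels into an
// idle window of `window_us` microseconds.
std::vector<Kernel> pack_idle_slice(std::vector<Kernel> candidates,
                                    uint64_t window_us) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Kernel& a, const Kernel& b) {
                  return a.est_runtime_us < b.est_runtime_us;
              });
    std::vector<Kernel> packed;
    uint64_t used = 0;
    for (const Kernel& k : candidates) {
        if (used + k.est_runtime_us <= window_us) {
            used += k.est_runtime_us;
            packed.push_back(k);
        }
    }
    return packed;
}

int main() {
    // A 400 us inference kernel with a 1 ms deadline, behind 800 us of
    // low-priority work: waiting would miss the SLO, so preempt.
    Kernel inference{ /*deadline_us=*/1000, /*est_runtime_us=*/400 };
    printf("preempt? %d\n", should_preempt(/*now=*/0, /*remaining=*/800, inference));

    // Pack background kernels into a 600 us idle window.
    auto packed = pack_idle_slice({{0, 300}, {0, 150}, {0, 500}}, /*window_us=*/600);
    printf("packed %zu low-priority kernels\n", packed.size());
    return 0;
}
```

A shortest‑first, first‑fit pass is one straightforward reading of the "simple bin‑packing heuristic" the authors mention; their exact heuristic and progress‑prediction model may differ.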

Results & Findings

| Metric | Hummingbird vs. Spatial Sharing | Hummingbird vs. Temporal Sharing |
|---|---|---|
| High‑priority SLO attainment | 9.7× improvement | 3.5× improvement |
| Low‑priority throughput | 2.4× higher | |
| SLO degradation vs. exclusive run | < 1 % | |
| Preemption latency | ~12 µs (average) | |

  • Latency guarantees: High‑priority jobs consistently finish within their deadlines, even when co‑running with multiple low‑priority workloads.
  • Utilization boost: The system fills > 80 % of otherwise wasted GPU idle time, raising overall utilization from ~55 % (baseline) to > 90 %.
  • Scalability: Performance gains hold across different GPU generations, indicating the approach is not tied to a specific hardware revision.

Practical Implications

  • Cloud GPU services can offer tiered pricing (premium low‑latency vs. bulk low‑cost) while still guaranteeing SLOs, without needing specialized hardware.
  • Edge AI devices (e.g., autonomous drones) can run safety‑critical inference kernels alongside background analytics, ensuring real‑time response without sacrificing battery‑friendly throughput.
  • CI/CD pipelines for ML can schedule model training (low‑priority) and inference serving (high‑priority) on the same GPU node, cutting infrastructure costs.
  • Framework integration: The preemption primitives could be exposed via CUDA driver extensions or as a middleware layer, allowing existing libraries (TensorRT, PyTorch) to benefit with minimal code changes (see the sketch below).
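
The authors do not ship such an integration layer. As a rough point of reference, the sketch below uses CUDA's existing stream‑priority API, which is only a coarse scheduling hint and does not provide microsecond‑scale preemption, to show where an application‑visible SLO annotation could plug in. The `cudaStream*` calls are real CUDA runtime APIs; `hbStreamSetSLO` is a purely hypothetical call used to mark the proposed extension point.

```cpp
// Sketch of where a Hummingbird-style middleware hook could sit in an
// application that already separates latency-critical and background work
// onto prioritized CUDA streams. hbStreamSetSLO() is hypothetical.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int least, greatest;  // in CUDA, numerically lower values mean higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t serving, training;
    // High-priority stream for latency-critical inference kernels.
    cudaStreamCreateWithPriority(&serving, cudaStreamNonBlocking, greatest);
    // Low-priority stream for throughput work that harvests idle time.
    cudaStreamCreateWithPriority(&training, cudaStreamNonBlocking, least);

    // Hypothetical Hummingbird-style extension: attach an SLO deadline so the
    // scheduler knows when preempting the low-priority stream is required.
    // hbStreamSetSLO(serving, /*deadline_us=*/500);

    // ... launch kernels on `serving` and `training` as usual ...

    cudaStreamDestroy(serving);
    cudaStreamDestroy(training);
    printf("priority range: %d (lowest) .. %d (highest)\n", least, greatest);
    return 0;
}
```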

Limitations & Future Work

  • Closed‑source reliance: The technique depends on undocumented GPU driver behavior; future driver updates could break the preemption path.
  • Overhead for very short kernels: When kernels run for < 50 µs, the preemption cost can dominate, limiting applicability to longer workloads.
  • Scheduler heuristics: Current bin‑packing is simple; more sophisticated predictive models could further reduce SLO miss rates.
  • Multi‑GPU coordination: The paper focuses on a single GPU; extending Hummingbird to orchestrate preemption across a GPU cluster is an open challenge.

Overall, Hummingbird demonstrates that fine‑grained, microsecond‑level preemption is feasible on today’s GPUs and can unlock a new class of latency‑aware, high‑utilization workloads for both cloud and edge environments.

Authors

  • Tiancheng Hu
  • Chenxi Wang
  • Ting Cao
  • Jin Qin
  • Lei Chen
  • Xinyu Xiao
  • Junhao Hu
  • Hongliang Tian
  • Shoumeng Yan
  • Huimin Cui
  • Quan Chen
  • Tao Xie

Paper Information

  • arXiv ID: 2601.04071v1
  • Categories: cs.DC
  • Published: January 7, 2026