[Paper] ZipFlow: a Compiler-based Framework to Unleash Compressed Data Movement for Modern GPUs
Source: arXiv - 2602.08190v1
Overview
The paper introduces ZipFlow, a compiler‑driven framework that automatically restructures the compress‑transfer‑decompress pipeline for GPU‑accelerated analytics. By treating compression algorithms as instances of a few parallelism patterns, ZipFlow generates GPU kernels that fully exploit modern PCIe Gen4/5 links and GPU compute units, cutting data‑movement latency and improving end‑to‑end query performance.
Key Contributions
- Pattern‑based classification of compression algorithms into three parallelism categories, enabling a unified optimization strategy.
- Compiler‑level scheduling that automatically maps each pattern to the most efficient GPU execution model (e.g., warp‑level, block‑level, or cooperative groups).
- Holistic end‑to‑end optimization that co‑optimizes compression, PCIe transfer, and decompression, rather than treating them as isolated steps.
- Portable performance across diverse GPU architectures (NVIDIA Ampere, Ada Lovelace, etc.) without hand‑tuned kernel code.
- Empirical validation on the TPC‑H benchmark showing a 2.08× speedup over the best existing GPU compression library (nvCOMP) and 3.14× over CPU‑only engines like DuckDB.
Methodology
1. Pattern Identification – The authors analyze a wide range of lossless compressors (e.g., Snappy, LZ4, ZSTD) and distill three core parallelism patterns:
   - Embarrassingly parallel (independent blocks)
   - Fine‑grained data‑dependent (byte‑wise streams)
   - Hybrid (a mix of block‑level and intra‑block parallelism)
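A toy sketch of how such a three‑way classification might look as code. The predicate names and rules here are illustrative assumptions, not the paper's actual analysis:

```python
from enum import Enum

class Pattern(Enum):
    EMBARRASSINGLY_PARALLEL = "independent blocks"
    DATA_DEPENDENT = "byte-wise streams"
    HYBRID = "block + intra-block"

def classify(independent_blocks: bool, intra_block_dependencies: bool) -> Pattern:
    """Hypothetical classification rule over two per-algorithm properties."""
    if independent_blocks and not intra_block_dependencies:
        return Pattern.EMBARRASSINGLY_PARALLEL
    if independent_blocks and intra_block_dependencies:
        return Pattern.HYBRID
    return Pattern.DATA_DEPENDENT

# e.g., block compression with byte-wise matching inside each block
print(classify(independent_blocks=True, intra_block_dependencies=True).name)  # HYBRID
```

The point of such a split is that each category maps cleanly onto one GPU execution model, which is what the later scheduling step exploits.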
2. Compiler Front‑end – ZipFlow extends LLVM with custom passes that recognize these patterns in user‑provided compression code (or library calls) and annotate them with metadata.
3. Scheduling Engine – Based on the pattern tag, the engine selects a pre‑tuned kernel template:
   - Block‑level kernels for independent blocks (maximizing occupancy)
   - Warp‑level cooperative kernels for data‑dependent streams (leveraging warp shuffles)
   - Hybrid kernels that dynamically switch between the two during execution
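The dispatch step can be pictured as a lookup from pattern tag to kernel template. The tag and template names below are placeholders for illustration, not ZipFlow's real identifiers:

```python
# Placeholder mapping from pattern tag to kernel template name.
KERNEL_TEMPLATES = {
    "embarrassingly_parallel": "block_level",  # one thread block per data block
    "data_dependent": "warp_cooperative",      # warp shuffles over a byte stream
    "hybrid": "hybrid_switching",              # switches granularity at runtime
}

def select_kernel(pattern_tag: str) -> str:
    """Pick a pre-tuned kernel template for an annotated pattern tag."""
    try:
        return KERNEL_TEMPLATES[pattern_tag]
    except KeyError:
        raise ValueError(f"unknown pattern tag: {pattern_tag}")
```

Because the mapping is data, adding support for a new pattern or GPU generation means registering a new template rather than rewriting the scheduler.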
4. Data‑Movement Orchestration – The generated code pipelines compression on the CPU (or a dedicated “compression GPU”) with PCIe DMA, overlapping transfer with decompression on the target GPU. The compiler inserts asynchronous copy commands and stream synchronizations to hide latency.
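The overlap idea can be mimicked on the CPU with a small two‑stage pipeline: a producer "transfers" chunk i+1 while the consumer "decompresses" chunk i. This is a minimal Python sketch standing in for CUDA streams and asynchronous copies, not the paper's implementation:

```python
import queue
import threading

def pipeline(chunks, transfer, decompress):
    """Overlap 'transfer' of the next chunk with 'decompress' of the current one."""
    q = queue.Queue(maxsize=2)  # bounded queue acts as a double buffer
    results = []

    def producer():
        for c in chunks:
            q.put(transfer(c))  # stands in for an async PCIe copy
        q.put(None)             # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        results.append(decompress(item))  # stands in for the GPU kernel
    t.join()
    return results
```

In the real system the same shape is expressed with pinned host buffers, `cudaMemcpyAsync`-style copies on one stream, and decompression kernels on another, with synchronization events in place of the queue.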
5. Auto‑tuning – A lightweight runtime profiler evaluates a few candidate configurations (block size, shared‑memory usage) on the target hardware and picks the best one for the workload.
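The auto‑tuning step amounts to a brief timed search over candidate configurations. A minimal sketch follows; the config fields and function names are assumptions for illustration, not taken from the paper:

```python
import time

def autotune(candidates, run, trials=3):
    """Time each candidate configuration briefly and return the fastest.

    'run(config)' executes the workload once under that configuration;
    configs would hold, e.g., block size and shared-memory usage.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            run(cfg)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg
```

Averaging a few trials per candidate keeps the profiling overhead small relative to the workload, which matches the paper's "lightweight runtime profiler" framing.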
Results & Findings
| Metric | Baseline (nvCOMP) | Baseline (DuckDB) | ZipFlow |
|---|---|---|---|
| TPC‑H Q1 latency | 12.4 s | 38.7 s | 5.9 s |
| Average speedup (all queries) | 1.0× | 1.0× | 2.08× (vs. nvCOMP) / 3.14× (vs. DuckDB) |
| PCIe bandwidth utilization | ~45 % | N/A | ~78 % (thanks to overlapped transfer) |
| GPU compute utilization | 30 % (compression only) | N/A | 62 % (compression + decompression) |
Key takeaways
- Matching the right parallelism pattern to the GPU’s execution model reduces compression/decompression overhead by up to 45 %.
- Overlapping transfer with computation yields a 33 % reduction in effective I/O latency.
- The framework scales across GPU generations without manual retuning, confirming its portability.
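The benefit of overlap follows from a simple pipeline model: with perfect overlap, the steady‑state cost per batch is the slower of the two stages rather than their sum. An illustrative calculation with made‑up numbers (not figures from the paper):

```python
def effective_latency(transfer_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Idealized per-batch cost: overlapped pipelines pay max(), serial ones pay the sum."""
    return max(transfer_ms, compute_ms) if overlapped else transfer_ms + compute_ms

# e.g. a hypothetical 6 ms transfer + 4 ms decompression per batch:
serial = effective_latency(6, 4, overlapped=False)    # 10 ms per batch
pipelined = effective_latency(6, 4, overlapped=True)  # 6 ms per batch
```

In this toy case overlap hides the entire 4 ms of decompression behind the transfer, a 40 % reduction; the paper's measured 33 % reduction is consistent with imperfect overlap on real hardware.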
Practical Implications
- Data‑Lake Ingestion Pipelines – Engineers can plug ZipFlow into ETL jobs that move terabytes of CSV/Parquet data into GPU‑accelerated analytics engines (e.g., RAPIDS, BlazingSQL) and expect roughly 2× faster ingestion.
- Real‑time Dashboards – For low‑latency BI dashboards that rely on GPU‑powered query engines, ZipFlow’s reduced transfer time translates directly into fresher data and tighter SLAs.
- Cost Savings – Faster end‑to‑end queries mean fewer GPU‑seconds per workload, which can directly lower cloud GPU spend for analytics clusters.
- Developer Productivity – Because ZipFlow works at the compiler level, developers keep writing code against familiar compression libraries; the heavy lifting of kernel selection and stream orchestration is automated.
- Future‑Proofing – As PCIe 5.0 and NVLink become mainstream, ZipFlow’s pattern‑based approach will continue to extract the best trade‑off between I/O and compute without rewriting kernels.
Limitations & Future Work
- Compression Algorithm Coverage – The study focuses on a subset of widely used lossless compressors; exotic or domain‑specific codecs may not fit cleanly into the three patterns.
- CPU‑Side Compression Overhead – ZipFlow assumes the CPU can keep up with the GPU’s consumption rate; on systems with weak CPUs, the pipeline may still be bottlenecked before the data even reaches the GPU.
- Multi‑GPU Scaling – The current implementation optimizes a single‑GPU data path; extending the scheduler to coordinate compression across multiple GPUs (e.g., in a DGX‑2) is left for future research.
- Dynamic Workloads – For workloads where the compression ratio varies dramatically per batch, the static pattern classification may need runtime re‑profiling, which adds modest overhead.
The authors suggest exploring adaptive pattern detection, tighter integration with columnar storage formats, and extending ZipFlow to support emerging hardware accelerators (e.g., DPUs) as next steps.
Authors
- Gwangoo Yeo
- Zhiyang Shen
- Wei Cui
- Matteo Interlandi
- Rathijit Sen
- Bailu Ding
- Qi Chen
- Minsoo Rhu
Paper Information
- arXiv ID: 2602.08190v1
- Categories: cs.DB, cs.AR, cs.DC
- Published: February 9, 2026