[Paper] ZipFlow: a Compiler-based Framework to Unleash Compressed Data Movement for Modern GPUs
Source: arXiv - 2602.08190v1
Overview
The paper introduces ZipFlow, a compiler‑driven framework that automatically restructures the compress‑transfer‑decompress pipeline for GPU‑accelerated analytics. By treating compression algorithms as instances of a few parallelism patterns, ZipFlow generates GPU kernels that fully exploit modern PCIe Gen4/5 links and GPU compute units, cutting data‑movement latency and improving end‑to‑end query performance.
Key Contributions
- Pattern‑based classification of compression algorithms into three parallelism categories, enabling a unified optimization strategy.
- Compiler‑level scheduling that automatically maps each pattern to the most efficient GPU execution model (e.g., warp‑level, block‑level, or cooperative groups).
- Holistic end‑to‑end optimization that co‑optimizes compression, PCIe transfer, and decompression, rather than treating them as isolated steps.
- Portable performance across diverse GPU architectures (NVIDIA Ampere, Ada Lovelace, etc.) without hand‑tuned kernel code.
- Empirical validation on the TPC‑H benchmark showing a 2.08× speedup over the best existing GPU compression library (nvCOMP) and 3.14× over CPU‑only engines like DuckDB.
Methodology
1. Pattern Identification – The authors analyze a wide range of lossless compressors (e.g., Snappy, LZ4, ZSTD) and distill three core parallelism patterns:
   - Embarrassingly parallel (independent blocks)
   - Fine‑grained data‑dependent (byte‑wise streams)
   - Hybrid (a mix of block‑level and intra‑block parallelism)
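A toy sketch of how such a three‑way classification might look as code. The predicate names and rules here are illustrative assumptions, not the paper's actual analysis:

```python
from enum import Enum

class Pattern(Enum):
    EMBARRASSINGLY_PARALLEL = "independent blocks"
    DATA_DEPENDENT = "byte-wise streams"
    HYBRID = "block + intra-block"

def classify(independent_blocks: bool, intra_block_dependencies: bool) -> Pattern:
    """Hypothetical classification rule over two per-algorithm properties."""
    if independent_blocks and not intra_block_dependencies:
        return Pattern.EMBARRASSINGLY_PARALLEL
    if independent_blocks and intra_block_dependencies:
        return Pattern.HYBRID
    return Pattern.DATA_DEPENDENT

# e.g., block compression with byte-wise matching inside each block
print(classify(independent_blocks=True, intra_block_dependencies=True).name)  # HYBRID
```

The point of such a split is that each category maps cleanly onto one GPU execution model, which is what the later scheduling step exploits.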
2. Compiler Front‑end – ZipFlow extends LLVM with custom passes that recognize these patterns in user‑provided compression code (or library calls) and annotate them with metadata.
3. Scheduling Engine – Based on the pattern tag, the engine selects a pre‑tuned kernel template:
   - Block‑level kernels for independent blocks (maximizing occupancy)
   - Warp‑level cooperative kernels for data‑dependent streams (leveraging warp shuffles)
   - Hybrid kernels that dynamically switch between the two during execution
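The dispatch step can be pictured as a lookup from pattern tag to kernel template. The tag and template names below are placeholders for illustration, not ZipFlow's real identifiers:

```python
# Placeholder mapping from pattern tag to kernel template name.
KERNEL_TEMPLATES = {
    "embarrassingly_parallel": "block_level",  # one thread block per data block
    "data_dependent": "warp_cooperative",      # warp shuffles over a byte stream
    "hybrid": "hybrid_switching",              # switches granularity at runtime
}

def select_kernel(pattern_tag: str) -> str:
    """Pick a pre-tuned kernel template for an annotated pattern tag."""
    try:
        return KERNEL_TEMPLATES[pattern_tag]
    except KeyError:
        raise ValueError(f"unknown pattern tag: {pattern_tag}")
```

Because the mapping is data, adding support for a new pattern or GPU generation means registering a new template rather than rewriting the scheduler.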
4. Data‑Movement Orchestration – The generated code pipelines compression on the CPU (or a dedicated “compression GPU”) with PCIe DMA, overlapping transfer with decompression on the target GPU. The compiler inserts asynchronous copy commands and stream synchronizations to hide latency.
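The overlap idea can be mimicked on the CPU with a small two‑stage pipeline: a producer "transfers" chunk i+1 while the consumer "decompresses" chunk i. This is a minimal Python sketch standing in for CUDA streams and asynchronous copies, not the paper's implementation:

```python
import queue
import threading

def pipeline(chunks, transfer, decompress):
    """Overlap 'transfer' of the next chunk with 'decompress' of the current one."""
    q = queue.Queue(maxsize=2)  # bounded queue acts as a double buffer
    results = []

    def producer():
        for c in chunks:
            q.put(transfer(c))  # stands in for an async PCIe copy
        q.put(None)             # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        results.append(decompress(item))  # stands in for the GPU kernel
    t.join()
    return results
```

In the real system the same shape is expressed with pinned host buffers, `cudaMemcpyAsync`-style copies on one stream, and decompression kernels on another, with synchronization events in place of the queue.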
5. Auto‑tuning – A lightweight runtime profiler evaluates a few candidate configurations (block size, shared‑memory usage) on the target hardware and picks the best one for the workload.
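The auto‑tuning step amounts to a brief timed search over candidate configurations. A minimal sketch follows; the config fields and function names are assumptions for illustration, not taken from the paper:

```python
import time

def autotune(candidates, run, trials=3):
    """Time each candidate configuration briefly and return the fastest.

    'run(config)' executes the workload once under that configuration;
    configs would hold, e.g., block size and shared-memory usage.
    """
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            run(cfg)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg
```

Averaging a few trials per candidate keeps the profiling overhead small relative to the workload, which matches the paper's "lightweight runtime profiler" framing.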
Results & Findings
| Metric | Baseline (nvCOMP) | Baseline (DuckDB) | ZipFlow |
|---|---|---|---|
| TPC‑H Q1 latency | 12.4 s | 38.7 s | 5.9 s |
| Average speedup (all queries) | 1.0× | 1.0× | 2.08× (vs. nvCOMP) / 3.14× (vs. DuckDB) |
| PCIe bandwidth utilization | ~45 % | N/A | ~78 % (thanks to overlapped transfer) |
| GPU compute utilization | 30 % (compression only) | N/A | 62 % (compression + decompression) |
Key takeaways
- Matching the right parallelism pattern to the GPU’s execution model reduces compression/decompression overhead by up to 45 %.
- Overlapping transfer with computation yields a 33 % reduction in effective I/O latency.
- The framework scales across GPU generations without manual retuning, confirming its portability.
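The benefit of overlap follows from a simple pipeline model: with perfect overlap, the steady‑state cost per batch is the slower of the two stages rather than their sum. An illustrative calculation with made‑up numbers (not figures from the paper):

```python
def effective_latency(transfer_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Idealized per-batch cost: overlapped pipelines pay max(), serial ones pay the sum."""
    return max(transfer_ms, compute_ms) if overlapped else transfer_ms + compute_ms

# e.g. a hypothetical 6 ms transfer + 4 ms decompression per batch:
serial = effective_latency(6, 4, overlapped=False)    # 10 ms per batch
pipelined = effective_latency(6, 4, overlapped=True)  # 6 ms per batch
```

In this toy case overlap hides the entire 4 ms of decompression behind the transfer, a 40 % reduction; the paper's measured 33 % reduction is consistent with imperfect overlap on real hardware.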
Practical Implications
- Data‑Lake Ingestion Pipelines – Engineers can plug ZipFlow into ETL jobs that move terabytes of CSV/Parquet data into GPU‑accelerated analytics engines (e.g., RAPIDS, BlazingSQL) and expect roughly 2× faster ingestion.
- Real‑time Dashboards – For low‑latency BI dashboards that rely on GPU‑powered query engines, ZipFlow’s reduced transfer time translates directly into fresher data and tighter SLAs.
- Cost Savings – Faster end‑to‑end queries mean fewer GPU‑seconds per workload, which can directly lower cloud GPU spend for analytics clusters.
- Developer Productivity – Because ZipFlow works at the compiler level, developers keep writing code against familiar compression libraries; the heavy lifting of kernel selection and stream orchestration is automated.
- Future‑Proofing – As PCIe 5.0 and NVLink become mainstream, ZipFlow’s pattern‑based approach will continue to extract the best trade‑off between I/O and compute without rewriting kernels.
Limitations & Future Work
- Compression Algorithm Coverage – The study focuses on a subset of widely used lossless compressors; exotic or domain‑specific codecs may not fit cleanly into the three patterns.
- CPU‑Side Compression Overhead – ZipFlow assumes the CPU can keep up with the GPU’s consumption rate; on systems with weak CPUs, the pipeline may still be bottlenecked before the data even reaches the GPU.
- Multi‑GPU Scaling – The current implementation optimizes a single‑GPU data path; extending the scheduler to coordinate compression across multiple GPUs (e.g., in a DGX‑2) is left for future research.
- Dynamic Workloads – For workloads where the compression ratio varies dramatically per batch, the static pattern classification may need runtime re‑profiling, which adds modest overhead.
The authors suggest exploring adaptive pattern detection, tighter integration with columnar storage formats, and extending ZipFlow to support emerging hardware accelerators (e.g., DPUs) as next steps.
Authors
- Gwangoo Yeo
- Zhiyang Shen
- Wei Cui
- Matteo Interlandi
- Rathijit Sen
- Bailu Ding
- Qi Chen
- Minsoo Rhu
Paper Information
- arXiv ID: 2602.08190v1
- Categories: cs.DB, cs.AR, cs.DC
- Published: February 9, 2026