[Paper] Do GPUs Really Need New Tabular File Formats?

Published: February 19, 2026 at 08:07 AM EST
4 min read
Source: arXiv - 2602.17335v1

Overview

Parquet is the go‑to columnar storage format for modern analytics, but its default settings were tuned for CPU‑only workloads. As more data pipelines move to GPU‑accelerated engines, those same defaults can cripple performance, turning what should be a fast scan into a bottleneck. This paper investigates why Parquet “fails” on GPUs and shows that the issue is not the format itself but the way it is configured. By simply tweaking Parquet’s layout parameters for GPU characteristics, the authors achieve up to 125 GB/s effective read bandwidth—without changing the Parquet spec at all.

Key Contributions

  • Systematic analysis of how individual Parquet configuration knobs (row group size, page size, compression, encoding) impact GPU scan throughput.
  • Identification of GPU‑unfriendly defaults (e.g., tiny row groups, sub‑optimal page sizes) that cause excessive kernel launches and low memory coalescing.
  • GPU‑aware configuration guidelines that maximize parallelism and memory bandwidth while staying fully compatible with existing Parquet readers.
  • Empirical validation on multiple GPU platforms (NVIDIA A100, RTX 4090) showing up to 125 GB/s read bandwidth—a >3× speed‑up over default settings.
  • Open‑source tooling that automatically rewrites Parquet metadata to the recommended settings, enabling drop‑in adoption.
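The kernel-launch problem called out above comes down to simple arithmetic: each row group is decoded as a largely independent unit of work, so small row groups multiply per-file overhead. A back-of-envelope sketch (the file size and constants here are illustrative assumptions, not figures from the paper):

```python
# Back-of-envelope: how row-group size affects per-file work granularity.
# A 4 GiB file is an assumed example; the 64 MiB default and 256 MiB
# recommendation are taken from the paper's analysis.

def row_group_count(file_size_mib: int, row_group_mib: int) -> int:
    """Number of row groups (~ independent GPU decode units) in a file."""
    return -(-file_size_mib // row_group_mib)  # ceiling division

file_mib = 4096  # assumed 4 GiB Parquet file

default = row_group_count(file_mib, 64)   # CPU-centric default
tuned = row_group_count(file_mib, 256)    # GPU-aware recommendation

print(default, tuned)  # 64 vs 16 row groups: 4x fewer decode units
```

Fewer, larger row groups mean fewer kernel launches and longer contiguous reads, which is exactly what GPU memory systems reward.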

Methodology

  1. Benchmark Suite – The authors built a micro‑benchmark that measures raw GPU read bandwidth for a variety of Parquet files, isolating the scan phase from downstream processing.
  2. Parameter Sweep – They varied key Parquet parameters (row‑group size, column‑chunk size, page size, dictionary vs. plain encoding, compression codec) across a wide range while keeping the data content constant.
  3. GPU Profiling – Using NVIDIA Nsight and CUPTI, they captured kernel launch counts, memory transaction efficiency, and occupancy to understand why certain configurations performed poorly.
  4. Guideline Derivation – From the profiling data, they derived a set of “GPU‑friendly” defaults (e.g., larger row groups ≈ 256 MiB, page sizes ≈ 8 MiB, avoid dictionary encoding for high‑cardinality columns).
  5. Cross‑validation – The recommended settings were tested on different hardware generations and with popular GPU‑accelerated query engines (BlazingSQL, RAPIDS cuDF) to ensure robustness.
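The parameter sweep in step 2 amounts to benchmarking a Cartesian product of layout knobs. A minimal sketch of such a grid (the specific values are illustrative; the paper's exact grid may differ):

```python
from itertools import product

# Illustrative sweep grid over the Parquet knobs the paper varies.
row_group_sizes = [64, 128, 256, 512]   # MiB
page_sizes = [1, 4, 8, 16]              # MiB
encodings = ["dictionary", "plain"]
codecs = ["snappy", "zstd", "none"]

configs = list(product(row_group_sizes, page_sizes, encodings, codecs))
print(len(configs))  # 4 * 4 * 2 * 3 = 96 configurations to benchmark
```

Each configuration is then written out with identical data content and scanned on the GPU, so any bandwidth difference is attributable to layout alone.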

Results & Findings

| Configuration Aspect | Default (CPU-centric) | GPU-aware Recommendation | Bandwidth Impact |
| --- | --- | --- | --- |
| Row-group size | 64 MiB | 256 MiB – 512 MiB | +2.1× |
| Page size | 1 MiB | 8 MiB – 16 MiB | +1.8× |
| Encoding (high-cardinality) | Dictionary | Plain/Delta | +1.4× |
| Compression codec | Snappy (default) | ZSTD (higher ratio) or none for raw scans | +1.2× (when I/O bound) |
| Column-chunk layout | Small chunks per column | Larger contiguous chunks | +1.5× |

Overall, the best‑case configuration delivered 125 GB/s effective read bandwidth on an A100, compared to ≈ 38 GB/s with the stock Parquet defaults—a >3× improvement. Importantly, these gains were achieved without any changes to the Parquet file format; only the metadata (row‑group and page boundaries) needed adjustment.

Practical Implications

  • Faster ETL pipelines – Data ingestion jobs that simply scan Parquet files on GPUs can now run in a fraction of the time, reducing end‑to‑end latency for analytics dashboards.
  • Cost savings on cloud GPU instances – Higher throughput means fewer GPU hours needed for the same workload, translating directly into lower cloud spend.
  • Seamless integration – Because the Parquet spec remains untouched, existing tools (Spark, Hive, Presto) can still read the files; only GPU‑accelerated engines benefit from the tuned layout.
  • Tooling support – The authors’ open‑source metadata rewriter can be plugged into CI pipelines to automatically produce GPU‑optimized Parquet files during data export.
  • Guidance for data lake architects – When designing a lake that will serve both CPU and GPU workloads, teams can now choose a “dual‑mode” configuration that balances row‑group size for CPUs while still delivering decent GPU performance, or maintain separate “GPU‑optimized” copies for hot data.

Limitations & Future Work

  • CPU vs. GPU trade‑offs – Some GPU‑friendly settings (larger row groups) may degrade CPU scan performance; the paper does not explore a unified configuration that optimally serves both.
  • Workload diversity – Benchmarks focus on pure scan bandwidth; downstream operations (joins, aggregations) could exhibit different sensitivities to layout.
  • Hardware scope – Experiments are limited to NVIDIA GPUs; AMD or Intel GPUs may have different optimal page/row‑group sizes.
  • Dynamic adaptation – Future work could investigate runtime‑aware file layout selection, where the storage system automatically rewrites or shards Parquet files based on observed workload patterns.

Overall, the study demonstrates that “new” file formats aren’t required for GPU analytics—just smarter configuration of the trusted Parquet format. This insight can immediately boost performance for any organization already leveraging GPU‑accelerated data processing.

Authors

  • Jigao Luo
  • Qi Chen
  • Carsten Binnig

Paper Information

  • arXiv ID: 2602.17335v1
  • Categories: cs.DB, cs.DC
  • Published: February 19, 2026