[Paper] Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
Source: arXiv - 2604.21449v1
Overview
A recent study by Ivan Borodii and Halyna Osukhivska benchmarks three leading Data Lakehouse technologies—Apache Hudi, Apache Iceberg, and Delta Lake—using Apache Spark as the processing engine. By loading both structured (CSV) and semi‑structured (JSON) datasets up to 7 GB, the authors pinpoint which architecture delivers the best trade‑off between load speed and storage footprint for analytical workloads.
Key Contributions
- First‑ever side‑by‑side performance comparison of Hudi, Iceberg, and Delta Lake on identical Spark ETL pipelines.
- Evaluation of two core metrics: (1) data loading time, and (2) on‑disk size of the resulting tables.
- Insightful guidance on which lakehouse to pick based on data type, volume, and priority (speed vs. storage efficiency).
- Open‑source‑ready experimental framework (four sequential ETL jobs) that can be reused for further benchmarking.
Methodology
- Data Sets – Two formats were used: CSV (fully structured) and JSON (semi‑structured). Files ranged from a few hundred megabytes to 7 GB to simulate realistic batch loads.
- Processing Engine – All jobs ran on Apache Spark with the same Spark version, cluster size, and configuration, keeping the compute layer constant.
- ETL Workflow – For each lakehouse system the authors executed four sequential steps:
  1. Read the source files into Spark DataFrames.
  2. Transform the data (simple type casting, column renaming).
  3. Write the data into a lakehouse table using the system‑specific write API.
  4. Commit the transaction (where applicable).
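The four steps above can be sketched as a single PySpark job; the Delta Lake write path is shown here, and the file paths, column names (`amount`, `ts`), and CSV header option are illustrative assumptions rather than values from the paper.

```python
# Sketch of the paper's four-step batch load, shown for the Delta Lake path.
# Assumes a SparkSession built with the delta-spark extensions
# (spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension).
def run_batch_load(spark, src_path: str, table_path: str) -> None:
    # 1. Read the source CSV into a DataFrame (JSON loads are analogous).
    df = spark.read.option("header", "true").csv(src_path)
    # 2. Transform: simple type casting and column renaming, as in the study.
    df = df.selectExpr("cast(amount as double) as amount",
                       "ts as event_time")
    # 3. Write into a lakehouse table via the system-specific API; for Delta,
    # 4. the transaction-log commit happens atomically as part of the save.
    df.write.format("delta").mode("overwrite").save(table_path)
```

For Iceberg and Hudi only the final write call changes; the read and transform stages stay identical, which is what keeps the comparison fair.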
- Metrics Captured – Two metrics were recorded per run:
  - Loading Time: wall‑clock time from the Spark read to the successful commit.
  - Disk Footprint: total size of the table directory after compaction/optimization (where the system performed it automatically).
- Repetition & Averaging – Each experiment was repeated multiple times and the results were averaged to smooth out transient cluster noise.
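A minimal harness for the two metrics and the averaging step might look like the following stdlib-only sketch; the load callable (e.g. a Spark job submission) is supplied by the caller and is not part of the paper's published code.

```python
import os
import time
from statistics import mean

def timed_runs(load_fn, repeats: int = 3) -> float:
    """Average wall-clock seconds of `load_fn` over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()  # e.g. the Spark batch-load job
        durations.append(time.perf_counter() - start)
    return mean(durations)

def table_size_bytes(table_dir: str) -> int:
    """Disk footprint: total size of the table directory, metadata included."""
    total = 0
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Walking the whole directory matters because the lakehouse formats store transaction logs and metadata files alongside the data files, and those count toward the on-disk footprint.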
Results & Findings
| Lakehouse | Best for Loading Speed | Best for Storage Efficiency | Observations |
|---|---|---|---|
| Delta Lake | Fastest across all file sizes and both CSV/JSON. Consistently lower load times even for 7 GB files. | Moderate – larger on‑disk size than Iceberg, but still acceptable. | Optimized write path and efficient transaction log handling give it an edge in pure throughput scenarios. |
| Apache Iceberg | Slightly slower than Delta Lake but still competitive. | Most compact storage; benefits from built‑in file‑pruning and metadata management. | Ideal when disk cost or long‑term archival matters more than raw ingest speed. |
| Apache Hudi | Slowest in batch load tests; high latency for both CSV and JSON. | Larger on‑disk footprint compared to Iceberg. | Not suited for bulk batch loads; shines in incremental upserts and streaming use‑cases (outside the scope of this study). |
- Data type impact: JSON (semi‑structured) incurred a modest overhead for all systems, but the relative ranking remained unchanged.
- Scale impact: As file size grew, the performance gap between Delta Lake and the others widened, confirming Delta’s scalability for high‑volume ingestion.
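The "system‑specific write API" behind these differences can be illustrated with the three Spark connectors. This is a hedged sketch: the table identifiers, paths, and key columns are hypothetical, and each connector's packages and catalog configuration must already be present on the Spark session.

```python
# How the same Spark DataFrame lands in each lakehouse: one write per
# connector. Paths, table identifiers, and key columns are placeholders.
def write_all(df, base: str) -> None:
    # Delta Lake: path-based write; the transaction log commits automatically.
    df.write.format("delta").mode("overwrite").save(f"{base}/orders_delta")

    # Apache Iceberg: catalog-table write via the DataFrameWriterV2 API.
    df.writeTo("local.db.orders_iceberg").createOrReplace()

    # Apache Hudi: requires a record key and precombine field, reflecting its
    # upsert-oriented design (and part of its batch-load overhead).
    (df.write.format("hudi")
       .option("hoodie.table.name", "orders_hudi")
       .option("hoodie.datasource.write.recordkey.field", "order_id")
       .option("hoodie.datasource.write.precombine.field", "event_time")
       .mode("overwrite")
       .save(f"{base}/orders_hudi"))
```

The extra bookkeeping Hudi performs per record is consistent with the study's finding that it trails the other two in pure batch ingestion.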
Practical Implications
- Data Engineers can adopt Delta Lake when building pipelines that prioritize fast, repeatable batch loads (e.g., nightly data warehouses, ETL refreshes).
- Architects designing cost‑sensitive analytical platforms (e.g., multi‑tenant lakehouses, long‑term data lakes) may favor Apache Iceberg to squeeze out storage savings without sacrificing reasonable ingest speed.
- Streaming / CDC workloads: Although Hudi lagged in this batch‑oriented benchmark, its upsert‑centric design makes it a strong candidate for real‑time change‑data‑capture pipelines, event‑driven analytics, or IoT streams.
- The study’s Spark‑centric benchmark suite can be dropped into CI pipelines to continuously validate performance after cluster upgrades or configuration tweaks.
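As a sketch of that CI idea: compare fresh benchmark numbers against a stored baseline and fail the pipeline on regressions beyond a tolerance. The 15 % threshold and the dictionary shape are illustrative assumptions, not part of the study's framework.

```python
def check_regression(current: dict, baseline: dict,
                     tolerance: float = 0.15) -> list:
    """Return the systems whose load time regressed beyond `tolerance`.

    `current` and `baseline` map a lakehouse name to its averaged
    loading time in seconds, as produced by the benchmark runs.
    """
    regressed = []
    for system, seconds in current.items():
        base = baseline.get(system)
        if base is not None and seconds > base * (1 + tolerance):
            regressed.append(system)
    return sorted(regressed)
```

A CI job would run the four ETL steps, feed the averaged timings in, and fail the build if the returned list is non-empty.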
Limitations & Future Work
- Scope limited to batch ETL – Incremental, streaming, and merge‑on‑read scenarios (where Hudi typically excels) were not evaluated.
- Experiments ran on a single Spark cluster configuration; results may vary with different hardware, cloud providers, or Spark tuning parameters.
- Only CSV and JSON formats were tested; other common sources (Parquet, Avro, ORC) could affect compression and layout behavior.
- Future research could extend the benchmark to concurrent query workloads, metadata scalability, and cost modeling across cloud storage tiers.
Authors
- Ivan Borodii
- Halyna Osukhivska
Paper Information
- arXiv ID: 2604.21449v1
- Categories: cs.DC, cs.DB
- Published: April 23, 2026