[Paper] Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
Source: arXiv - 2604.21449v1
Overview
A recent study by Ivan Borodii and Halyna Osukhivska benchmarks three leading Data Lakehouse technologies—Apache Hudi, Apache Iceberg, and Delta Lake—using Apache Spark as the processing engine. By loading both structured (CSV) and semi‑structured (JSON) datasets up to 7 GB, the authors pinpoint which architecture delivers the best trade‑off between load speed and storage footprint for analytical workloads.
Key Contributions
- First‑ever side‑by‑side performance comparison of Hudi, Iceberg, and Delta Lake on identical Spark ETL pipelines.
- Evaluation of two core metrics: (1) data loading time, and (2) on‑disk size of the resulting tables.
- Insightful guidance on which lakehouse to pick based on data type, volume, and priority (speed vs. storage efficiency).
- Open‑source‑ready experimental framework (four sequential ETL jobs) that can be reused for further benchmarking.
Methodology
- Data Sets – Two formats were used: CSV (fully structured) and JSON (semi‑structured). Files ranged from a few hundred megabytes to 7 GB to simulate realistic batch loads.
- Processing Engine – All jobs ran on Apache Spark with the same Spark version, cluster size, and configuration, keeping the compute layer constant.
- ETL Workflow – For each lakehouse system the authors executed four sequential steps:
  1. Read the source files into Spark DataFrames.
  2. Transform the data (simple type casting, column renaming).
  3. Write the data into a lakehouse table using the system‑specific write API.
  4. Commit the transaction (where applicable).
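The four steps above can be sketched as a single PySpark job; the Delta Lake write path is shown here, and the file paths, column names (`amount`, `ts`), and CSV header option are illustrative assumptions rather than values from the paper.

```python
# Sketch of the paper's four-step batch load, shown for the Delta Lake path.
# Assumes a SparkSession built with the delta-spark extensions
# (spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension).
def run_batch_load(spark, src_path: str, table_path: str) -> None:
    # 1. Read the source CSV into a DataFrame (JSON loads are analogous).
    df = spark.read.option("header", "true").csv(src_path)
    # 2. Transform: simple type casting and column renaming, as in the study.
    df = df.selectExpr("cast(amount as double) as amount",
                       "ts as event_time")
    # 3. Write into a lakehouse table via the system-specific API; for Delta,
    # 4. the transaction-log commit happens atomically as part of the save.
    df.write.format("delta").mode("overwrite").save(table_path)
```

For Iceberg and Hudi only the final write call changes; the read and transform stages stay identical, which is what keeps the comparison fair.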
- Metrics Captured – Two metrics were recorded per run:
  - Loading Time: wall‑clock time from the Spark read to the successful commit.
  - Disk Footprint: total size of the table directory after compaction/optimization (where the system performed it automatically).
- Repetition & Averaging – Each experiment was repeated multiple times and the results were averaged to smooth out transient cluster noise.
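A minimal harness for the two metrics and the averaging step might look like the following stdlib-only sketch; the load callable (e.g. a Spark job submission) is supplied by the caller and is not part of the paper's published code.

```python
import os
import time
from statistics import mean

def timed_runs(load_fn, repeats: int = 3) -> float:
    """Average wall-clock seconds of `load_fn` over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()  # e.g. the Spark batch-load job
        durations.append(time.perf_counter() - start)
    return mean(durations)

def table_size_bytes(table_dir: str) -> int:
    """Disk footprint: total size of the table directory, metadata included."""
    total = 0
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Walking the whole directory matters because the lakehouse formats store transaction logs and metadata files alongside the data files, and those count toward the on-disk footprint.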
Results & Findings
| Lakehouse | Best for Loading Speed | Best for Storage Efficiency | Observations |
|---|---|---|---|
| Delta Lake | Fastest across all file sizes and both CSV/JSON. Consistently lower load times even for 7 GB files. | Moderate – larger on‑disk size than Iceberg, but still acceptable. | Optimized write path and efficient transaction log handling give it an edge in pure throughput scenarios. |
| Apache Iceberg | Slightly slower than Delta Lake but still competitive. | Most compact storage; benefits from built‑in file‑pruning and metadata management. | Ideal when disk cost or long‑term archival matters more than raw ingest speed. |
| Apache Hudi | Slowest in batch load tests; high latency for both CSV and JSON. | Larger on‑disk footprint compared to Iceberg. | Not suited for bulk batch loads; shines in incremental upserts and streaming use‑cases (outside the scope of this study). |
- Data type impact: JSON (semi‑structured) incurred a modest overhead for all systems, but the relative ranking remained unchanged.
- Scale impact: As file size grew, the performance gap between Delta Lake and the others widened, confirming Delta’s scalability for high‑volume ingestion.
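The "system‑specific write API" behind these differences can be illustrated with the three Spark connectors. This is a hedged sketch: the table identifiers, paths, and key columns are hypothetical, and each connector's packages and catalog configuration must already be present on the Spark session.

```python
# How the same Spark DataFrame lands in each lakehouse: one write per
# connector. Paths, table identifiers, and key columns are placeholders.
def write_all(df, base: str) -> None:
    # Delta Lake: path-based write; the transaction log commits automatically.
    df.write.format("delta").mode("overwrite").save(f"{base}/orders_delta")

    # Apache Iceberg: catalog-table write via the DataFrameWriterV2 API.
    df.writeTo("local.db.orders_iceberg").createOrReplace()

    # Apache Hudi: requires a record key and precombine field, reflecting its
    # upsert-oriented design (and part of its batch-load overhead).
    (df.write.format("hudi")
       .option("hoodie.table.name", "orders_hudi")
       .option("hoodie.datasource.write.recordkey.field", "order_id")
       .option("hoodie.datasource.write.precombine.field", "event_time")
       .mode("overwrite")
       .save(f"{base}/orders_hudi"))
```

The extra bookkeeping Hudi performs per record is consistent with the study's finding that it trails the other two in pure batch ingestion.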
Practical Implications
- Data Engineers can adopt Delta Lake when building pipelines that prioritize fast, repeatable batch loads (e.g., nightly data warehouses, ETL refreshes).
- Architects designing cost‑sensitive analytical platforms (e.g., multi‑tenant lakehouses, long‑term data lakes) may favor Apache Iceberg to squeeze out storage savings without sacrificing reasonable ingest speed.
- Streaming / CDC workloads: Although Hudi lagged in this batch‑oriented benchmark, its upsert‑centric design makes it a strong candidate for real‑time change‑data‑capture pipelines, event‑driven analytics, or IoT streams.
- The study’s Spark‑centric benchmark suite can be dropped into CI pipelines to continuously validate performance after cluster upgrades or configuration tweaks.
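As a sketch of that CI idea: compare fresh benchmark numbers against a stored baseline and fail the pipeline on regressions beyond a tolerance. The 15 % threshold and the dictionary shape are illustrative assumptions, not part of the study's framework.

```python
def check_regression(current: dict, baseline: dict,
                     tolerance: float = 0.15) -> list:
    """Return the systems whose load time regressed beyond `tolerance`.

    `current` and `baseline` map a lakehouse name to its averaged
    loading time in seconds, as produced by the benchmark runs.
    """
    regressed = []
    for system, seconds in current.items():
        base = baseline.get(system)
        if base is not None and seconds > base * (1 + tolerance):
            regressed.append(system)
    return sorted(regressed)
```

A CI job would run the four ETL steps, feed the averaged timings in, and fail the build if the returned list is non-empty.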
Limitations & Future Work
- Scope limited to batch ETL – Incremental, streaming, and merge‑on‑read scenarios (where Hudi typically excels) were not evaluated.
- Experiments ran on a single Spark cluster configuration; results may vary with different hardware, cloud providers, or Spark tuning parameters.
- Only CSV and JSON formats were tested; other common sources (Parquet, Avro, ORC) could affect compression and layout behavior.
- Future research could extend the benchmark to concurrent query workloads, metadata scalability, and cost modeling across cloud storage tiers.
Authors
- Ivan Borodii
- Halyna Osukhivska
Paper Information
- arXiv ID: 2604.21449v1
- Categories: cs.DC, cs.DB
- Published: April 23, 2026