The Game-Changer Breaking Data Lake's Impossible Triangle
Source: Dev.to
A Brief Introduction to Data Lakes
1. Data Warehouse Recap
A data warehouse is a subject‑oriented data‑management system that aggregates data from different business systems for query and analysis.
- As data volumes grow and the number of source systems increases, a data warehouse becomes essential.
- To meet business requirements, raw data must be cleansed, transformed, and deeply prepared before it is loaded into the warehouse.
- The core task of a data warehouse is to answer pre‑defined business questions. Those questions must exist before a model is built.
Problem: What if a valuable business question has not yet been defined?
In a traditional warehouse, the workflow is:
- Identify a business question.
- Build a model to answer it.
This chain can be long, and because the warehouse stores highly prepared data, answering a new question often requires re‑processing raw data to achieve finer granularity—an expensive and inefficient operation, especially when many new questions arise.
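A toy Python illustration of this granularity problem (with made-up sales data): once events have been pre-aggregated into a warehouse model, a question at a finer grain can no longer be answered without going back to the raw data.

```python
from collections import defaultdict

# Raw events: (day, hour, amount) — the fine-grained source data.
raw_events = [(1, 9, 100), (1, 17, 50), (2, 9, 70), (2, 20, 30)]

# The warehouse stores only a prepared, coarser model: daily totals.
daily = defaultdict(int)
for day, hour, amount in raw_events:
    daily[day] += amount

# A pre-defined question ("total per day") is cheap to answer:
assert daily[1] == 150

# A NEW question ("sales during business hours, 9-18") cannot be answered
# from `daily` at all — it requires re-processing the raw events:
business_hours = sum(a for _, h, a in raw_events if 9 <= h < 18)
```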
2. Why Data Lakes Were Born
A data lake is a technology (or strategy) designed to store and analyze massive amounts of raw data. Its goals are:
- Store as much raw data as possible while preserving the highest possible fidelity.
- Extract any potential data value from the full dataset.
Thus, a data lake has two obvious roles:
| Role | Description |
|---|---|
| Data Storage | Keep all raw data (structured, semi‑structured, and unstructured) in its original state. |
| Data Analysis | Perform computing/value‑extraction on the stored data. |
3. Data Lake Performance in Two Aspects
Storage
- The lake stores full raw data—structured, semi‑structured, and unstructured—in its original form.
- This massive, diverse storage capability differentiates it from a data warehouse, which typically stores only structured data in databases.
- Loading data as early as possible enables:
- Full exploitation of cross‑domain associations.
- Better data security and integrity.
Implementation note: Modern storage and cloud technologies make massive raw‑data storage feasible. Enterprises can choose between self‑built storage clusters or cloud‑provider storage services.
Processing
The toughest challenge is data processing:
- The lake contains many data types, each requiring different processing methods.
- Structured data remains the central and most complex processing target, as both historical and newly generated business data are usually structured.
- In practice, semi‑structured and unstructured data are often transformed into structured formats for computation.
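As a rough sketch of that transformation in Python (field names are invented): a nested JSON document is flattened into fixed columns before it can be computed on relationally.

```python
import json

# A semi-structured record, e.g. from a NoSQL source or a REST API.
doc = json.loads(
    '{"order": 1,'
    ' "customer": {"name": "Ann", "city": "Oslo"},'
    ' "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'
)

# Flatten into structured rows: one row per item, parent fields repeated.
rows = [
    {"order": doc["order"],
     "customer_name": doc["customer"]["name"],
     "customer_city": doc["customer"]["city"],
     "sku": item["sku"],
     "qty": item["qty"]}
    for item in doc["items"]
]
```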
Current landscape
- SQL‑based databases (the same technologies used by data warehouses) dominate structured‑data processing.
- Consequently, most data‑lake solutions rely on a data warehouse (or a database) to compute structured data.
- The typical workflow:
- Store raw data in the lake.
- ETL (or ELT) the needed data into a warehouse for processing.
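A minimal sketch of that workflow in Python (file contents and table names are hypothetical), with SQLite standing in for the warehouse:

```python
import csv
import io
import sqlite3

# 1. Raw data "in the lake": here, a CSV kept in its original form.
raw_csv = "id,region,amount\n1,east,10\n2,west,25\n3,east,5\n"

# 2. ETL the needed slice into a warehouse table for processing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
for row in csv.DictReader(io.StringIO(raw_csv)):
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                 (int(row["id"]), row["region"], float(row["amount"])))

# 3. The warehouse answers the pre-defined question with SQL.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```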
An advanced approach automates part of this pipeline: it identifies the lake data that should be loaded into the warehouse and performs the loading during idle periods. This is the core idea behind the much-discussed Lakehouse concept.
Bottom line: Today’s data lake is effectively three components:
- Massive data storage.
- A data‑warehouse layer for structured‑data processing.
- A specialized engine for semi‑structured/unstructured data.
4. Key Requirements for Data Lakes
| Requirement | Description |
|---|---|
| Original‑state storage | Load high‑fidelity data into the lake to preserve maximum value. |
| Sufficient computing capacity | Extract the maximum possible data value. |
| Cost‑effective development | Keep implementation and operational costs reasonable. |
Reality check: The current technology stack cannot satisfy all three simultaneously.
5. Approaches to Storing Data in Its Original Form
One‑to‑One Storage Mapping
- The simplest way to preserve fidelity is to store each source’s data in the same type of storage medium as the source.
- Example: MySQL data → MySQL storage in the lake, MongoDB data → MongoDB storage, etc.
- Benefits:
- Near‑perfect fidelity.
- Leverages the source’s native computing capabilities for queries that involve only that source.
- Drawbacks:
- High development cost – you must provision and maintain many different storage systems.
- Heavy data‑migration workload (copying years of accumulated data).
- If a source uses commercial software, licensing adds further expense.
A cheaper variant is to store data in a different but similar system (e.g., Oracle data in MySQL). This reduces cost but can make some computations impossible or very hard.
Lowering the Bar
“Now, let’s lower the bar. We don’t demand that data be duplicated at loading but just sto…”
(The original text cuts off here. The intended continuation likely discusses a more pragmatic approach, such as storing raw files in a unified object store and applying schema‑on‑read techniques.)
6. Takeaways
- Data warehouses excel at answering pre‑defined questions on well‑prepared data.
- Data lakes aim to keep all raw data available for any future question, preserving fidelity and enabling broader analytics.
- The processing gap—especially for structured data—forces most lake implementations to lean on a warehouse or a lakehouse layer.
- Achieving high fidelity, strong compute, and low cost simultaneously remains an open challenge, driving ongoing innovation in lakehouse architectures and unified storage/computing platforms.
The “Impossible Triangle” of Data Lakes
When we try to load data into a relational database, we gain the database’s computing ability and satisfy the requirement of cheap development (as shown in part II of the figure).
However, this approach is often infeasible because it forces all data into a single relational database.
Why Direct Loading Fails
- Information loss can occur during the loading process, violating the first requirement of a data lake: loading high‑fidelity data.
- MongoDB → MySQL/Hive is a typical pain point: many MongoDB data types and relationships (nested structures, arrays, hashes, many‑to‑many links) simply do not exist in MySQL.
- To migrate, you must re‑structure the data, which involves a series of sophisticated re‑organization steps. This is cost‑ineffective, time‑consuming, and prone to hidden errors.
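A small Python sketch (with an invented document) of why this mapping is lossy: the nested structure must be broken into flat rows, and anything without a column is simply dropped.

```python
# A MongoDB-style document with nesting and an array.
doc = {"_id": 1,
       "name": "Ann",
       "tags": ["vip", "early-adopter"],                   # array: no MySQL equivalent
       "address": {"city": "Oslo", "geo": [59.9, 10.7]}}   # nested object

# Forcing it into one flat MySQL-style row loses structure:
flat = {"id": doc["_id"],
        "name": doc["name"],
        "city": doc["address"]["city"],
        "tags": ",".join(doc["tags"])}  # array collapsed into a string

# The geo coordinates and the array's element structure are gone.
```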
File‑Based Storage: A Partial Remedy
A common workaround is to store data unaltered in large files (or as large fields in a database).
Advantages
- Minimal information loss – data stays essentially intact.
- Greater flexibility, openness, and higher I/O efficiency.
- Cheaper storage (file systems are inexpensive).
Drawback
- Files (or large fields) lack computing capacity, making it impossible to meet the requirement of convenient/sufficient computing power.
- The “impossible triangle” (cost‑saving, high‑fidelity loading, and convenient computing) appears unbreakable.
The Root Cause
The conflict stems from the closed nature of traditional databases and their strict constraints:
- Data must be cleansed and transformed to satisfy schema rules before loading.
- This transformation inevitably leads to information loss.
Switching to pure file storage solves the fidelity issue but removes the computing engine, unless you resort to hard‑coding – which is far from convenient.
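"Hard-coding" here means writing the parsing and aggregation logic yourself for every query. A Python sketch (invented data) of what a single group-by costs when a file has no computing engine behind it:

```python
import csv
import io
from collections import defaultdict

raw = "product,qty\npen,7\nink,3\npen,5\n"

# With a database this is one line of SQL; over a bare file, every query
# must hand-roll parsing, type conversion, and aggregation:
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["product"]] += float(row["qty"])
```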
An Open Computing Engine Breaks the Triangle
An open computing engine can provide the missing piece: sufficient, convenient computing power that works directly on raw data stored in diverse sources, in real time.
SPL – Structured Data Computing Engine
- Open‑source and designed for data lakes.
- Offers diverse‑source mixed computing: it can compute raw data directly from its original storage (databases, files, NoSQL, RESTful APIs) without prior transformation.
- Works with any storage medium the lake uses – be it the same source types or plain files.
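As a rough Python analogue of mixed-source computation (with hypothetical data), joining a file-resident table with a database-resident one directly, without loading either into the other first:

```python
import csv
import io
import sqlite3

# Source 1: raw CSV "in the lake".
orders_csv = "order_id,cust_id,amount\n1,10,99.0\n2,11,45.0\n"
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Source 2: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "Ann"), (11, "Bo")])
names = dict(db.execute("SELECT cust_id, name FROM customers"))

# Mixed computation: join the two sources in place.
joined = [(names[int(o["cust_id"])], float(o["amount"])) for o in orders]
```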
Key Benefits
| Benefit | Description |
|---|---|
| Agility | Data services become available immediately after the lake is established, bypassing the long cycle of preparation, loading, and modeling. |
| Real‑time response | Flexible lake services can react instantly to business needs. |
| File support | SPL gives files powerful computing capabilities, making file‑based lakes as fast—or even faster—than databases. |
| Hierarchical data | Native handling of JSON and other hierarchical formats; NoSQL and RESTful data can be used without transformation. |
All‑Around Computing Capacity
Direct Access to Source Data
- Because SPL can join and compute across sources directly, a traditional data warehouse becomes optional.
- SPL provides high‑performance file storage strategies that are flexible and easy to parallelize.
High‑Performance Storage Formats
| Format | Features |
|---|---|
| Bin file | • Compressed (smaller footprint, faster retrieval) • Stores data types (no parsing needed) • Supports double‑increment segmentation for easy parallel processing |
| Composite table | • Column‑wise storage – ideal when only a few columns are needed • Includes a min‑max index • Also supports double‑increment segmentation for parallelism |
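The min-max index idea behind the composite table can be sketched in a few lines of Python (made-up data): keep the min and max of each stored block, and skip any block whose range cannot match the filter.

```python
# Blocks of a column, as they might sit in columnar storage.
blocks = [[3, 7, 5], [18, 22, 19], [40, 41, 44]]

# A min-max index: one (min, max) pair per block.
index = [(min(b), max(b)) for b in blocks]

# Filter: values > 35. Blocks whose max is too small are skipped unread.
scanned, hits = 0, []
for (lo, hi), block in zip(index, blocks):
    if hi <= 35:
        continue            # entire block pruned by the index
    scanned += 1
    hits += [v for v in block if v > 35]
```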
Parallel Processing Made Simple
Many SPL functions (file retrieval, filtering, sorting, etc.) support parallel execution.
To enable multithreading, just add the @m option to the function call:
```
// Hedged sketch (file name, fields, and exact options are illustrative):
// @m creates a multithreaded cursor over the file, then the aggregation
// below runs on its segments in parallel.
A1=file("orders.btx").cursor@m()
A2=A1.groups(region; sum(amount):total)
```
This automatic multithreading lets you fully exploit multiple CPUs with minimal effort.
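In Python terms (a conceptual analogue, not SPL), the same idea amounts to splitting the data into segments and aggregating them on a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

# Split into roughly equal segments — the role SPL's segmentation
# plays for files.
n = 4
segments = [data[i::n] for i in range(n)]

# Each thread aggregates its own segment; partial results are combined.
with ThreadPoolExecutor(max_workers=n) as pool:
    partial = list(pool.map(sum, segments))
total = sum(partial)
```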
Bottom Line
By leveraging an open, source‑agnostic computing engine like SPL, you can:
- Load high‑fidelity data quickly (cost‑saving).
- Maintain data in its original form (openness).
- Provide sufficient, convenient computing power (performance).
In other words, SPL breaks the impossible triangle and makes building a truly open, efficient data lake feasible.
Parallel Execution
- SPL allows developers to write parallel programs explicitly, boosting computing performance.
High‑Performance Algorithms
- SPL includes many algorithms that SQL cannot implement efficiently.
- A common example is the Top‑N operation:
  - SPL treats Top‑N as an aggregate operation, converting a costly sort into a low‑complexity aggregation.
  - The same statement works for retrieving the top N from an entire set or from grouped subsets, both delivering high performance.
- No sort‑related keywords appear in SPL statements, so a full sort is never triggered.
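The Top-N-as-aggregation idea can be mimicked in Python with a small heap: only N candidates are kept while scanning, so the full dataset is never sorted. A sketch for both whole-set and per-group Top-N (invented data):

```python
import heapq
from collections import defaultdict

values = [17, 3, 99, 42, 8, 75, 56, 23]

# Top-3 of the whole set: an O(n log N) aggregation, not an O(n log n) sort.
top3 = heapq.nlargest(3, values)

# The same "aggregate" applied to each grouped subset.
rows = [("east", 10), ("west", 40), ("east", 30), ("west", 5), ("east", 20)]
groups = defaultdict(list)
for region, amount in rows:
    groups[region].append(amount)
top2_per_group = {g: heapq.nlargest(2, v) for g, v in groups.items()}
```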
Performance Advantages
- With these mechanisms, SPL can deliver orders‑of‑magnitude higher performance than traditional data warehouses.
- Storage and computation challenges that arise after data transformation are resolved, eliminating the need to move data out of the lake into a separate data warehouse.
Mixed‑Data Computation
- SPL can compute directly on both transformed data and raw data, leveraging values from heterogeneous data sources without prior preparation.
- This capability enables highly agile data lakes.
Simultaneous Lake‑Building Phases
- Traditional pipelines require sequential steps: loading → transformation → computation.
- SPL allows these phases to run concurrently:
  - Data preparation and computation can be performed side‑by‑side.
  - Any type of raw, irregular data can be processed directly.
- Handling transformation and computation together, rather than in a serial order, is the key to building an ideal data lake.