The Game-Changer Breaking Data Lake's Impossible Triangle
Source: Dev.to
A Brief Introduction to Data Lakes
1. Data Warehouse Recap
A data warehouse is a subject‑oriented data‑management system that aggregates data from different business systems for query and analysis.
- As data volumes grow and the number of source systems increases, a data warehouse becomes essential.
- To meet business requirements, raw data must be cleansed, transformed, and deeply prepared before it is loaded into the warehouse.
- The core task of a data warehouse is to answer pre‑defined business questions. Those questions must exist before a model is built.
Problem: What if a valuable business question has not yet been defined?
In a traditional warehouse, the workflow is:
- Identify a business question.
- Build a model to answer it.
This chain can be long, and because the warehouse stores highly prepared data, answering a new question often requires re‑processing raw data to achieve finer granularity—an expensive and inefficient operation, especially when many new questions arise.
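A toy Python illustration of this granularity problem (with made-up sales data): once events have been pre-aggregated into a warehouse model, a question at a finer grain can no longer be answered without going back to the raw data.

```python
from collections import defaultdict

# Raw events: (day, hour, amount) — the fine-grained source data.
raw_events = [(1, 9, 100), (1, 17, 50), (2, 9, 70), (2, 20, 30)]

# The warehouse stores only a prepared, coarser model: daily totals.
daily = defaultdict(int)
for day, hour, amount in raw_events:
    daily[day] += amount

# A pre-defined question ("total per day") is cheap to answer:
assert daily[1] == 150

# A NEW question ("sales during business hours, 9-18") cannot be answered
# from `daily` at all — it requires re-processing the raw events:
business_hours = sum(a for _, h, a in raw_events if 9 <= h < 18)
```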
2. Why Data Lakes Were Born
A data lake is a technology (or strategy) designed to store and analyze massive amounts of raw data. Its goals are:
- Store as much raw data as possible while preserving the highest possible fidelity.
- Extract any potential data value from the full dataset.
Thus, a data lake has two obvious roles:
| Role | Description |
|---|---|
| Data Storage | Keep all raw data (structured, semi‑structured, and unstructured) in its original state. |
| Data Analysis | Perform computing/value‑extraction on the stored data. |
3. Data Lake Performance in Two Aspects
Storage
- The lake stores full raw data—structured, semi‑structured, and unstructured—in its original form.
- This massive, diverse storage capability differentiates it from a data warehouse, which typically stores only structured data in databases.
- Loading data as early as possible enables:
- Full exploitation of cross‑domain associations.
- Better data security and integrity.
Implementation note: Modern storage and cloud technologies make massive raw‑data storage feasible. Enterprises can choose between self‑built storage clusters or cloud‑provider storage services.
Processing
The toughest challenge is data processing:
- The lake contains many data types, each requiring different processing methods.
- Structured data remains the central and most complex processing target, as both historical and newly generated business data are usually structured.
- In practice, semi‑structured and unstructured data are often transformed into structured formats for computation.
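As a rough sketch of that transformation in Python (field names are invented): a nested JSON document is flattened into fixed columns before it can be computed on relationally.

```python
import json

# A semi-structured record, e.g. from a NoSQL source or a REST API.
doc = json.loads(
    '{"order": 1,'
    ' "customer": {"name": "Ann", "city": "Oslo"},'
    ' "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'
)

# Flatten into structured rows: one row per item, parent fields repeated.
rows = [
    {"order": doc["order"],
     "customer_name": doc["customer"]["name"],
     "customer_city": doc["customer"]["city"],
     "sku": item["sku"],
     "qty": item["qty"]}
    for item in doc["items"]
]
```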
Current landscape
- SQL‑based databases (the same technologies used by data warehouses) dominate structured‑data processing.
- Consequently, most data‑lake solutions rely on a data warehouse (or a database) to compute structured data.
- The typical workflow:
- Store raw data in the lake.
- ETL (or ELT) the needed data into a warehouse for processing.
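A minimal sketch of that workflow in Python (file contents and table names are hypothetical), with SQLite standing in for the warehouse:

```python
import csv
import io
import sqlite3

# 1. Raw data "in the lake": here, a CSV kept in its original form.
raw_csv = "id,region,amount\n1,east,10\n2,west,25\n3,east,5\n"

# 2. ETL the needed slice into a warehouse table for processing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
for row in csv.DictReader(io.StringIO(raw_csv)):
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                 (int(row["id"]), row["region"], float(row["amount"])))

# 3. The warehouse answers the pre-defined question with SQL.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```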
An advanced approach automates part of this pipeline: it identifies the lake data that should be loaded into the warehouse and performs the loading during idle periods. This is the core idea behind the much-discussed Lakehouse concept.
Bottom line: Today’s data lake is effectively three components:
- Massive data storage.
- A data‑warehouse layer for structured‑data processing.
- A specialized engine for semi‑structured/unstructured data.
4. Key Requirements for Data Lakes
| Requirement | Description |
|---|---|
| Original‑state storage | Load high‑fidelity data into the lake to preserve maximum value. |
| Sufficient computing capacity | Extract the maximum possible data value. |
| Cost‑effective development | Keep implementation and operational costs reasonable. |
Reality check: The current technology stack cannot satisfy all three simultaneously.
5. Approaches to Storing Data in Its Original Form
One‑to‑One Storage Mapping
- The simplest way to preserve fidelity is to store each source’s data in the same type of storage medium as the source.
- Example: MySQL data → MySQL storage in the lake, MongoDB data → MongoDB storage, etc.
- Benefits:
- Near‑perfect fidelity.
- Leverages the source’s native computing capabilities for queries that involve only that source.
- Drawbacks:
- High development cost – you must provision and maintain many different storage systems.
- Heavy data‑migration workload (copying years of accumulated data).
- If a source uses commercial software, licensing adds further expense.
A cheaper variant is to store data in a different but similar system (e.g., Oracle data in MySQL). This reduces cost but can make some computations impossible or very hard.
Lowering the Bar
“Now, let’s lower the bar. We don’t demand that data be duplicated at loading but just sto…”
(The original text cuts off here. The intended continuation likely discusses a more pragmatic approach, such as storing raw files in a unified object store and applying schema‑on‑read techniques.)
6. Takeaways
- Data warehouses excel at answering pre‑defined questions on well‑prepared data.
- Data lakes aim to keep all raw data available for any future question, preserving fidelity and enabling broader analytics.
- The processing gap—especially for structured data—forces most lake implementations to lean on a warehouse or a lakehouse layer.
- Achieving high fidelity, strong compute, and low cost simultaneously remains an open challenge, driving ongoing innovation in lakehouse architectures and unified storage/computing platforms.
The “Impossible Triangle” of Data Lakes
When we try to load data into a relational database, we gain the database’s computing ability and satisfy the requirement of cheap development (as shown in part II of the figure).
However, this approach is often infeasible because it forces all data into a single relational database.
Why Direct Loading Fails
- Information loss can occur during the loading process, violating the first requirement of a data lake: loading high‑fidelity data.
- MongoDB → MySQL/Hive is a typical pain point: many MongoDB data types and relationships (nested structures, arrays, hashes, many‑to‑many links) simply do not exist in MySQL.
- To migrate, you must re‑structure the data, which involves a series of sophisticated re‑organization steps. This is cost‑ineffective, time‑consuming, and prone to hidden errors.
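A small Python sketch (with an invented document) of why this mapping is lossy: the nested structure must be broken into flat rows, and anything without a column is simply dropped.

```python
# A MongoDB-style document with nesting and an array.
doc = {"_id": 1,
       "name": "Ann",
       "tags": ["vip", "early-adopter"],                   # array: no MySQL equivalent
       "address": {"city": "Oslo", "geo": [59.9, 10.7]}}   # nested object

# Forcing it into one flat MySQL-style row loses structure:
flat = {"id": doc["_id"],
        "name": doc["name"],
        "city": doc["address"]["city"],
        "tags": ",".join(doc["tags"])}  # array collapsed into a string

# The geo coordinates and the array's element structure are gone.
```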
File‑Based Storage: A Partial Remedy
A common workaround is to store data unaltered in large files (or as large fields in a database).
Advantages
- Minimal information loss – data stays essentially intact.
- Greater flexibility, openness, and higher I/O efficiency.
- Cheaper storage (file systems are inexpensive).
Drawback
- Files (or large fields) lack computing capacity, making it impossible to meet the requirement of convenient/sufficient computing power.
- The “impossible triangle” (cost‑saving, high‑fidelity loading, and convenient computing) appears unbreakable.
The Root Cause
The conflict stems from the closed nature of traditional databases and their strict constraints:
- Data must be cleansed and transformed to satisfy schema rules before loading.
- This transformation inevitably leads to information loss.
Switching to pure file storage solves the fidelity issue but removes the computing engine, unless you resort to hard‑coding – which is far from convenient.
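"Hard-coding" here means writing the parsing and aggregation logic yourself for every query. A Python sketch (invented data) of what a single group-by costs when a file has no computing engine behind it:

```python
import csv
import io
from collections import defaultdict

raw = "product,qty\npen,7\nink,3\npen,5\n"

# With a database this is one line of SQL; over a bare file, every query
# must hand-roll parsing, type conversion, and aggregation:
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["product"]] += float(row["qty"])
```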
An Open Computing Engine Breaks the Triangle
An open computing engine can provide the missing piece: sufficient, convenient computing power that works directly on raw data stored in diverse sources, in real time.
SPL – Structured Data Computing Engine
- Open‑source and designed for data lakes.
- Offers diverse‑source mixed computing: it can compute raw data directly from its original storage (databases, files, NoSQL, RESTful APIs) without prior transformation.
- Works with any storage medium the lake uses – be it the same source types or plain files.
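As a rough Python analogue of mixed-source computation (with hypothetical data), joining a file-resident table with a database-resident one directly, without loading either into the other first:

```python
import csv
import io
import sqlite3

# Source 1: raw CSV "in the lake".
orders_csv = "order_id,cust_id,amount\n1,10,99.0\n2,11,45.0\n"
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Source 2: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "Ann"), (11, "Bo")])
names = dict(db.execute("SELECT cust_id, name FROM customers"))

# Mixed computation: join the two sources in place.
joined = [(names[int(o["cust_id"])], float(o["amount"])) for o in orders]
```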
Key Benefits
| Benefit | Description |
|---|---|
| Agility | Data services become available immediately after the lake is established, bypassing the long cycle of preparation, loading, and modeling. |
| Real‑time response | Flexible lake services can react instantly to business needs. |
| File support | SPL gives files powerful computing capabilities, making file‑based lakes as fast—or even faster—than databases. |
| Hierarchical data | Native handling of JSON and other hierarchical formats; NoSQL and RESTful data can be used without transformation. |
All‑Around Computing Capacity
Direct Access to Source Data
- Because SPL can join and compute across sources directly, a traditional data warehouse becomes optional.
- SPL provides high‑performance file storage strategies that are flexible and easy to parallelize.
High‑Performance Storage Formats
| Format | Features |
|---|---|
| Bin file | • Compressed (smaller footprint, faster retrieval) • Stores data types (no parsing needed) • Supports double‑increment segmentation for easy parallel processing |
| Composite table | • Column‑wise storage – ideal when only a few columns are needed • Includes a min‑max index • Also supports double‑increment segmentation for parallelism |
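The min-max index idea behind the composite table can be sketched in a few lines of Python (made-up data): keep the min and max of each stored block, and skip any block whose range cannot match the filter.

```python
# Blocks of a column, as they might sit in columnar storage.
blocks = [[3, 7, 5], [18, 22, 19], [40, 41, 44]]

# A min-max index: one (min, max) pair per block.
index = [(min(b), max(b)) for b in blocks]

# Filter: values > 35. Blocks whose max is too small are skipped unread.
scanned, hits = 0, []
for (lo, hi), block in zip(index, blocks):
    if hi <= 35:
        continue            # entire block pruned by the index
    scanned += 1
    hits += [v for v in block if v > 35]
```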
Parallel Processing Made Simple
Many SPL functions (file retrieval, filtering, sorting, etc.) support parallel execution.
To enable multithreading, just add the @m option to the function call:
```
// Hedged sketch (file name, fields, and exact options are illustrative):
// @m creates a multithreaded cursor over the file, then the aggregation
// below runs on its segments in parallel.
A1=file("orders.btx").cursor@m()
A2=A1.groups(region; sum(amount):total)
```
This automatic multithreading lets you fully exploit multiple CPUs with minimal effort.
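In Python terms (a conceptual analogue, not SPL), the same idea amounts to splitting the data into segments and aggregating them on a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

# Split into roughly equal segments — the role SPL's segmentation
# plays for files.
n = 4
segments = [data[i::n] for i in range(n)]

# Each thread aggregates its own segment; partial results are combined.
with ThreadPoolExecutor(max_workers=n) as pool:
    partial = list(pool.map(sum, segments))
total = sum(partial)
```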
Bottom Line
By leveraging an open, source‑agnostic computing engine like SPL, you can:
- Load high‑fidelity data quickly (cost‑saving).
- Maintain data in its original form (openness).
- Provide sufficient, convenient computing power (performance).
In other words, SPL breaks the impossible triangle and makes building a truly open, efficient data lake feasible.
Parallel Execution
- SPL allows developers to write parallel programs explicitly, boosting computing performance.
High‑Performance Algorithms
- SPL includes many algorithms that SQL cannot implement efficiently.
- A common example is the Top‑N operation:
  - SPL treats Top‑N as an aggregate operation, converting a costly sort into a low‑complexity aggregation.
  - The same statement works for retrieving the top N from an entire set or from grouped subsets, both delivering high performance.
- No sort‑related keywords appear in SPL statements, so a full sort is never triggered.
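The Top-N-as-aggregation idea can be mimicked in Python with a small heap: only N candidates are kept while scanning, so the full dataset is never sorted. A sketch for both whole-set and per-group Top-N (invented data):

```python
import heapq
from collections import defaultdict

values = [17, 3, 99, 42, 8, 75, 56, 23]

# Top-3 of the whole set: an O(n log N) aggregation, not an O(n log n) sort.
top3 = heapq.nlargest(3, values)

# The same "aggregate" applied to each grouped subset.
rows = [("east", 10), ("west", 40), ("east", 30), ("west", 5), ("east", 20)]
groups = defaultdict(list)
for region, amount in rows:
    groups[region].append(amount)
top2_per_group = {g: heapq.nlargest(2, v) for g, v in groups.items()}
```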
Performance Advantages
- With these mechanisms, SPL can deliver orders‑of‑magnitude higher performance than traditional data warehouses.
- Storage and computation challenges that arise after data transformation are resolved, eliminating the need to move data out of the lake into a separate data warehouse.
Mixed‑Data Computation
- SPL can compute directly on both transformed data and raw data, leveraging values from heterogeneous data sources without prior preparation.
- This capability enables highly agile data lakes.
Simultaneous Lake‑Building Phases
- Traditional pipelines require sequential steps: loading → transformation → computation.
- SPL allows these phases to run concurrently:
  - Data preparation and computation can be performed side‑by‑side.
  - Any type of raw, irregular data can be processed directly.
- Handling transformation and computation together, rather than in a serial order, is the key to building an ideal data lake.