Tired of ETL Bottlenecks? Build a Logical Data Warehouse with SPL
Source: Dev.to
Logical Data Warehouse (DW)
A logical DW lets users integrate a variety of data sources logically, without moving the original data, while still presenting itself to users like a physical DW. This addresses a weakness of the traditional DW: moving data through a long pipeline leaves it unable to respond to real-time data-processing needs. As a result, a logical DW can keep up with fast-changing business scenarios and provide cross-source computing capability.
However, because it lacks physical storage, a logical DW must map each source's data to an SQL table to enable mixed computation across multiple sources.
1. Current Implementation Issues
- SQL-Centric Interfaces – Most logical DWs expose an SQL interface because traditional DWs are built on SQL. SQL is universal and lowers the barrier to learning and adoption.
- Weak Logical Ability – SQL cannot fully support diverse data sources.
- Mapping Limitations – Many data sources do not satisfy DW (SQL) constraints, making it impractical to map them as SQL tables. Physical DWs load data into a database to meet constraints; logical DWs must handle the diversity directly.
1.1. Limited Source Support
| Source Type | Support Level in Current Logical DWs |
|---|---|
| RDBMS (SQL) | Relatively easy |
| NoSQL (e.g., MongoDB) | Poor |
| Web services / JSON | Poor |
| File systems | Very poor |
Most logical DWs excel only with RDBMS and perform poorly with other source types.
1.2. Functional Gaps
- Different RDBMSs have dialects that expose unique capabilities.
- A logical DW that only understands generic SQL cannot leverage these dialect‑specific features.
- Non‑SQL databases (e.g., MongoDB) use completely different query syntaxes, which SQL cannot express.
- Ideally, a logical DW should allow direct use of the source’s native syntax in addition to any automatic translation.
1.3. Physical Computing Deficiency
- Reading large volumes of data from diverse sources incurs high I/O cost, leading to unacceptable latency.
- To guarantee performance, some logical DWs add physical computing abilities (e.g., temporary storage), but they still lag well behind physical DWs because of entrenched habits and limited adaptive storage mechanisms.
Bottom line: A purely logical DW works only for small data volumes and scenarios with low performance requirements. To go further, it must combine physical computing (for performance) with logical data-source flexibility.
2. Proposed Solution: SPL‑Based Logical DW
SPL, an open-source computing engine, offers:
- Open, extensible computing ability – can integrate multiple data‑source types for mixed computation.
- Powerful physical computing – high‑performance guarantees and adaptive storage.
- Logical cross‑source computing – enables true logical DW functionality.
2.1. Data‑Source Handling in SPL
- SPL treats sources as table sequences (small data) or cursors (big data) instead of mapping them to database tables.
- Generation of the table sequence/cursor is the responsibility of the data source itself (any source can expose such an interface, even if it cannot provide a unified SQL access layer).
- This approach fully utilizes each source’s native capabilities.
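The difference between a table sequence and a cursor can be illustrated outside SPL. The following is a minimal Python sketch, not SPL code: a list stands in for a table sequence (small data, fully in memory) and a generator stands in for a cursor (big data, fetched in batches). The function names `load_table_sequence` and `open_cursor`, the `orders` table, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for an arbitrary source.

```python
import sqlite3
from typing import Dict, Iterator, List

Row = Dict[str, object]


def make_source() -> sqlite3.Connection:
    """Create an in-memory database standing in for any data source (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, client TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, f"C{i % 3}", float(i * 10)) for i in range(1, 8)],
    )
    return conn


def load_table_sequence(conn: sqlite3.Connection, sql: str) -> List[Row]:
    """Small data: materialize every row in memory at once."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def open_cursor(conn: sqlite3.Connection, sql: str, batch_size: int = 3) -> Iterator[List[Row]]:
    """Big data: yield rows batch by batch, so only one batch of rows
    is materialized as Python objects at a time."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(cols, row)) for row in rows]


conn = make_source()
orders = load_table_sequence(conn, "SELECT * FROM orders ORDER BY id")
batch_sizes = [len(b) for b in open_cursor(conn, "SELECT * FROM orders ORDER BY id")]
```

Both functions are produced by the source side, and downstream logic only needs to know whether it is holding a full list or an iterator of batches.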
2.2. Cross‑Source Mixed Computation Example
-- Example: mixed computation across different databases
-- (pseudo-code, not actual SPL syntax; function names and signatures are illustrative)
-- Load a small‑size relational table as a sequence
seq_orders = source("jdbc:mysql://host/db1", "orders")
-- Load a large‑size NoSQL collection as a cursor
cur_events = source("mongodb://host/db2", "events")
-- Perform a join using SPL’s native operators
result = join(seq_orders, cur_events,
on = seq_orders.customer_id == cur_events.custId)
-- Output the result
output(result, "hdfs://path/to/result")
The pseudo-code illustrates how SPL can combine a relational table and a MongoDB collection without forcing either into a traditional SQL table.
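To make the idea concrete without guessing at SPL syntax, here is a small runnable Python sketch of the same pattern: a small in-memory table is indexed by key, and a large streamed source is joined against it one row at a time. This is an ordinary hash join written for illustration, not a description of how SPL implements joins. The helper `hash_join` and the sample data are hypothetical; the field names `customer_id` and `custId` follow the pseudo-code.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def hash_join(small: List[Row], big: Iterable[Row], small_key: str, big_key: str) -> Iterator[Row]:
    """Join a small in-memory table with a streamed source on equal keys."""
    # Build a hash index on the small side only.
    index: Dict[object, List[Row]] = {}
    for row in small:
        index.setdefault(row[small_key], []).append(row)
    # Stream the big side; each row is probed against the index and then discarded.
    for event in big:
        for order in index.get(event[big_key], []):
            yield {**order, **event}


# Stand-in for a small relational table.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "b"},
]


def events() -> Iterator[Row]:
    """Stand-in for a large collection read through a cursor."""
    for cust, kind in [("a", "click"), ("c", "view"), ("b", "buy"), ("a", "buy")]:
        yield {"custId": cust, "kind": kind}


joined = list(hash_join(orders, events(), "customer_id", "custId"))
```

Because only the small side is indexed, memory use is bounded by the small table regardless of how many rows the streamed side produces.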
2.3. Translation vs. Native Syntax
- SPL provides a SQL‑to‑native translation layer, similar to existing DWs, to handle dialect differences.
- More importantly, SPL supports direct use of each data source’s native syntax, allowing developers to exploit source‑specific features (e.g., MongoDB’s aggregation pipeline) while still participating in cross‑source workflows.
Cross‑Database Computation
SPL can work with any data source, whether a relational database with its own SQL dialect or a NoSQL store. Beyond cross-database computation, it can perform mixed computation across data sources of any type.
Example – real‑time query that combines cold data stored in a file system with hot data kept in a database:
/* SPL code goes here – example omitted for brevity */
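The SPL code for this example is not reproduced here. As a rough illustration of the cold-plus-hot pattern only, the Python sketch below reads historical rows from a CSV file, reads current rows from a database, chains the two streams, and aggregates over the combined result. The file name, table name, columns, and values are all made up, and SQLite stands in for the production database.

```python
import csv
import itertools
import os
import sqlite3
import tempfile
from collections import defaultdict
from typing import Iterator, Tuple

# Cold data: historical orders kept in a plain CSV file (hypothetical layout).
cold_path = os.path.join(tempfile.mkdtemp(), "orders_history.csv")
with open(cold_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["client", "amount"])
    writer.writerows([["acme", "100"], ["bolt", "50"], ["acme", "25"]])

# Hot data: today's orders kept in a database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_today (client TEXT, amount REAL)")
conn.executemany("INSERT INTO orders_today VALUES (?, ?)", [("acme", 10.0), ("core", 5.0)])


def cold_rows(path: str) -> Iterator[Tuple[str, float]]:
    """Stream (client, amount) pairs from the file without loading it whole."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["client"], float(row["amount"])


def hot_rows(db: sqlite3.Connection) -> Iterator[Tuple[str, float]]:
    """Stream (client, amount) pairs from the database."""
    yield from db.execute("SELECT client, amount FROM orders_today")


# Mixed computation: one aggregation over both sources.
totals = defaultdict(float)
for client, amount in itertools.chain(cold_rows(cold_path), hot_rows(conn)):
    totals[client] += amount
```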
SPL also integrates non‑relational sources. It provides strong support for multi‑layer data structures, making it convenient to process data from Web interfaces, IoT devices, and NoSQL stores.
Example – reading JSON multi‑layer data and performing an association query with a database:
/* SPL code for JSON → DB association */
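The SPL code is not reproduced here either. The Python sketch below shows the general shape of such a computation under made-up data: each JSON order keeps its nested `items` list, a total is computed per order from that nested level, and the customer name is looked up from a database table. The JSON structure, the `customers` table, and all values are hypothetical.

```python
import json
import sqlite3

# Multi-layer JSON, as might be returned by a web interface (made-up payload).
payload = json.loads("""
{"orders": [
  {"id": 1, "customer": {"id": "c1"},
   "items": [{"sku": "x", "qty": 2, "price": 5.0}, {"sku": "y", "qty": 1, "price": 3.0}]},
  {"id": 2, "customer": {"id": "c2"},
   "items": [{"sku": "x", "qty": 4, "price": 5.0}]}
]}
""")

# Dimension data kept in a database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("c1", "Alice"), ("c2", "Bob")])

# Load the small dimension table once as an id -> name lookup.
names = dict(conn.execute("SELECT id, name FROM customers"))

# Associate each nested order with its customer and aggregate the nested items in place.
report = [
    {
        "order_id": order["id"],
        "customer": names.get(order["customer"]["id"]),
        "total": sum(item["qty"] * item["price"] for item in order["items"]),
    }
    for order in payload["orders"]
]
```

The nested structure is processed as it is, without first flattening it into rows of a relational table.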
Example – working with MongoDB (a NoSQL database):
/* SPL code for MongoDB integration */
Example – mixed computing of RESTful data and plain‑text data:
/* SPL code for RESTful + text data */
Thus, SPL offers independent computing ability that is agnostic to the data source, while still allowing the source‑specific features to be leveraged. Users can decide where the calculation occurs—at the data‑source end or within the logical data‑warehouse (SPL)—which is the core of SPL’s flexibility.
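The choice of where a calculation runs can be sketched in a few lines. In the Python illustration below (not SPL), the same per-client total is computed twice: once by pushing a `GROUP BY` down to the source, and once by pulling raw rows and aggregating in the calling program. The table, data, and function names are hypothetical, and SQLite stands in for the data source.

```python
import sqlite3
from collections import defaultdict
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (client TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 100.0), ("bolt", 50.0), ("acme", 25.0)],
)


def totals_at_source(db: sqlite3.Connection) -> Dict[str, float]:
    """Let the source do the work: only aggregated rows cross the connection."""
    return dict(db.execute("SELECT client, SUM(amount) FROM orders GROUP BY client"))


def totals_in_engine(db: sqlite3.Connection) -> Dict[str, float]:
    """Pull raw rows and aggregate on the engine side."""
    totals: Dict[str, float] = defaultdict(float)
    for client, amount in db.execute("SELECT client, amount FROM orders"):
        totals[client] += amount
    return dict(totals)


at_source = totals_at_source(conn)
in_engine = totals_in_engine(conn)
```

Pushing work to the source reduces data transfer when the source is strong at that operation; computing on the engine side is the fallback when the source cannot express the operation or is already under load.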
Physical Computing Ability
SPL provides a dedicated structured data object, the table sequence, and a rich library of operations built on it, giving SPL complete yet simple structured-data processing capabilities.
Common SPL Calculation Snippets
| Operation | SPL Code |
|---|---|
| Sort | Orders.sort(Amount) |
| Filter | Orders.select(Amount*Quantity > 3000 && like(Client, "*S*")) |
| Group | Orders.groups(Client; sum(Amount)) |
| Distinct | Orders.id(Client) |
| Join | join(Orders:o, SellerId ; Employees:e, EId) |
Through procedural computation and table sequences, SPL can implement many more calculations, such as ordered operations, grouping that retains subsets (sets‑of‑sets), and further processing on grouped results. Compared with SQL, SPL’s syntax differs significantly—differences that become advantages (discussed later).
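"Grouping that retains subsets" means the result of grouping is a set of groups whose member rows stay available, instead of being reduced immediately to one aggregate per group. The Python sketch below illustrates the idea with made-up data; it is not SPL code. Orders are grouped by client, each group keeps its rows, and an order-dependent calculation (the largest day-over-day rise in amount) is then run on each subset. The function `largest_rise` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

orders = [
    {"client": "acme", "day": 1, "amount": 100},
    {"client": "bolt", "day": 1, "amount": 50},
    {"client": "acme", "day": 2, "amount": 140},
    {"client": "acme", "day": 3, "amount": 120},
    {"client": "bolt", "day": 2, "amount": 80},
]

# Step 1: group, but keep each group's rows (a set of sets).
groups: Dict[str, List[dict]] = defaultdict(list)
for order in orders:
    groups[order["client"]].append(order)


def largest_rise(rows: List[dict]) -> int:
    """Ordered operation on one subset: biggest increase between consecutive days."""
    ordered = sorted(rows, key=lambda r: r["day"])
    rises = [b["amount"] - a["amount"] for a, b in zip(ordered, ordered[1:])]
    return max(rises, default=0)


# Step 2: further processing on the grouped result.
summary = {
    client: {"count": len(rows), "largest_rise": largest_rise(rows)}
    for client, rows in groups.items()
}
```

In SQL, the second step typically requires window functions or a self-join, because `GROUP BY` forces aggregation at the moment of grouping.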
High‑Performance Guarantee Mechanisms
SPL combines logical DW (the abstraction layer) with physical DW (the execution engine). The following high‑performance algorithms are built into SPL:
- In‑memory computing – binary search, sequence‑number positioning, position index, hash index, multi‑layer sequence positioning, …
- External‑storage search – binary search, hash index, sorting index, index‑with‑values, full‑text retrieval, …
- Traversal computing – delayed cursor, multipurpose traversal, parallel multi‑cursor, ordered grouping & aggregating, sequence‑number grouping, …
- Foreign‑key association – foreign‑key addressization, foreign‑key sequence‑numberization, index reuse, aligned sequence, one‑side partitioning, …
- Merge & join – ordered merging, segment‑wise merge, association positioning, attached table, …
- Multidimensional analysis – partial pre‑aggregation, time‑period pre‑aggregation, redundant sorting, boolean‑dimension sequence, tag‑bit dimension, …
- Cluster computing – cluster multi‑zone composite table, duplicate dimension table, segmented dimension table, redundancy‑pattern fault tolerance, spare‑wheel‑pattern fault tolerance, load balancing, …
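As one example from the list, ordered grouping and aggregating exploits the fact that the input is already sorted by the grouping key: each group is then a contiguous run, so it can be aggregated in a single pass without a hash table over all keys. The Python sketch below illustrates that general technique with `itertools.groupby`; it does not describe SPL's internal implementation, and the data and function names are made up.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple


def sorted_stream() -> Iterator[Tuple[str, int]]:
    """Stand-in for a cursor over data stored in order of the grouping key."""
    yield from [("acme", 100), ("acme", 25), ("bolt", 50), ("core", 5), ("core", 7)]


def ordered_group_sum(rows: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Aggregate one contiguous run of equal keys at a time.
    Correct only when rows arrive sorted (or at least clustered) by key."""
    for key, run in groupby(rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in run)


result = list(ordered_group_sum(sorted_stream()))
```

Memory use depends on a single group at a time, not on the number of distinct keys, which matters when the data is too large to hold in memory.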
Storage‑Aware Optimizations
Logical and physical calculations cannot be separated from data storage. Organizing data for the computing goal (e.g., sorting by a specific field) can dramatically boost performance, and many high‑performance algorithms rely on storage support.
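A simple illustration of why storage order matters: when rows are stored sorted by the field being filtered, a range condition can be answered with two binary searches instead of a full scan. The Python sketch below uses the standard `bisect` module on made-up data; the row layout and the `range_query` helper are hypothetical and are not tied to SPL's file format.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Rows stored sorted by date (ISO date strings sort chronologically).
rows: List[Tuple[str, str]] = [
    ("2024-01-03", "a"),
    ("2024-01-10", "b"),
    ("2024-02-01", "c"),
    ("2024-02-15", "d"),
    ("2024-03-01", "e"),
]
dates = [r[0] for r in rows]  # the sort key, kept as its own column


def range_query(lo: str, hi: str) -> List[Tuple[str, str]]:
    """Return rows with lo <= date <= hi using two binary searches."""
    start = bisect_left(dates, lo)
    end = bisect_right(dates, hi)
    return rows[start:end]


def range_scan(lo: str, hi: str) -> List[Tuple[str, str]]:
    """Baseline for comparison: check every row."""
    return [r for r in rows if lo <= r[0] <= hi]


hits = range_query("2024-01-10", "2024-02-15")
```

Both functions return the same rows, but the binary-search version inspects a number of keys logarithmic in the table size, while the scan inspects every row.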
SPL therefore provides high‑performance file storage—distinct from the closed storage of traditional databases. From a logical perspective, SPL’s high‑performance files behave like any other data source, but SPL adds engineering methods (compression, columnar storage, indexing) to improve speed. Numerous high‑performance algorithms are built on top of this file storage.
Physical storage gives SPL computing ability that a purely logical DW lacks, and it also delivers a significant performance advantage over other physical DWs. In real-world scenarios, SPL often achieves speedups ranging from several times to dozens of times.
Performance‑Improvement Cases
- Open‑source SPL turned a pre‑association query on a bank’s mobile account system into a real‑time association.
Overall, SPL’s complete, high‑performance computing ability—combined with rich interfaces for diverse data sources—makes it a strong candidate for building logical data warehouses.
More Lightweight
- Low hardware requirements – SPL runs on any OS with a JVM (JDK 1.8+), including common VMs and containers.
- Small footprint – Installation size is …
SPL is a programming language in which code is written in a grid (see the SPL documentation).
The Bigger Picture
A logical data warehouse must balance:
- Logical ability – expressive, complete language features.
- Physical computing ability – efficient execution on underlying storage.
- Data‑source integration – seamless connectivity to varied sources.
- Data‑type support – handling structured, semi‑structured, and time‑series data.
- Performance guarantees – predictable, scalable query execution.
- Ease of use – low learning curve and intuitive tooling.
- Development & O&M cost – minimal overhead and operational complexity.
SPL checks all these boxes, making it a strong candidate for building a logical DW.
Get Started
SPL is open‑source. Grab the source code from GitHub and try it out for free.
https://github.com/your‑org/spl