Open Tables, Shared Truth: Architecting a Multi-Engine Lakehouse
Source: Dev.to
The Problem
- The same dataset is copied multiple times.
- The same metric produces different results.
- Governance logic is re‑implemented across systems.
Despite this, organizations confidently claim:
“We have a single source of truth.”
In practice, what exists are separate copies:
- A warehouse copy
- A lake copy
- A serving copy
Each is slightly different and “correct” only in its own context.
Why We Ended Up Here
Historically:
- Compute engines couldn’t agree on formats.
- Storage systems lacked transactional guarantees.
- Governance was tied to specific platforms.
Engineers responded by making pipelines the glue that held fragmented truth together. But every pipeline produces another copy, so pipelines multiply truth rather than scale it.
The Shift to Open Table Formats
Data lakes were introduced with the promise to “store everything in one place,” and they did deliver:
- Faster access
- Fewer ETL jobs
- Flexible analytics
But ownership didn’t change. Engines still kept their own copies, so the lake became accessible but not authoritative.
The New Unit of Ownership
The table, not the engine, is the unit of ownership.
Historically, engines owned data and pipelines moved data between them. Now, tables become shared, governed, and authoritative assets.
Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi bring database‑like guarantees to object storage:
- ACID transactions
- Schema evolution
- Time travel
- Snapshot isolation
- Concurrent reads and writes
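To make the guarantees above a little more concrete, here is a minimal in-memory sketch of the idea they share: a table is a log of immutable snapshots, a commit adds a new snapshot, and a reader pins one. `ToyTable`, `Snapshot`, and the file names are hypothetical and only illustrate the concept; this is not how Iceberg, Delta Lake, or Hudi lay out their metadata.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for an open table's snapshot log."""
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append(self, new_files: List[str]) -> Snapshot:
        # A commit never edits an old snapshot; it adds a new one.
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # Readers pin one snapshot; passing an older id is "time travel".
        if snapshot_id is None:
            return self.current().files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
table.append(["part-0001.parquet"])
pinned = table.current().snapshot_id   # a reader pins snapshot 1
table.append(["part-0002.parquet"])    # a writer commits snapshot 2
old_view = table.scan(pinned)          # the reader's view is unchanged
new_view = table.scan()                # new readers see both files
```

Because old snapshots are never modified, a long-running query and a new write do not interfere with each other, which is the property that lets several engines share one table.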
Multi‑Engine Compatibility
Multiple engines can read, and in many cases write, the same table reliably, e.g.:
- Apache Spark
- Trino
- Amazon Athena
- Snowflake
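No duplication, no translation layers. Read and write support still varies by table format and engine, so check the combination you plan to run.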
Modern Lakehouse Architecture
Traditional vs. Modern Approaches
| Aspect | Traditional | Modern |
|---|---|---|
| Data handling | Copied, pipelines everywhere, multiple versions of truth | Shared, minimal pipelines, one consistent truth |
| Storage role | Passive storage | System of record (authoritative) |
| Governance | Engine‑specific policies | Table‑level, centralized policies |
Layered Architecture
- Storage Layer – Object storage (e.g., Amazon S3)
- Table Layer – Open table formats (Iceberg / Delta / Hudi)
- Compute Layer – Multiple engines (Spark, Trino, Athena, etc.)
- Governance Layer – Centralized policy enforcement (e.g., AWS Lake Formation)
- Consumption Layer – BI, ML, APIs
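One way to picture the governance layer in this stack is a policy keyed by table and consulted by every engine. The sketch below is a hypothetical in-memory model (`TablePolicy`, `POLICIES`, `is_allowed`, `run_query`, and the principal names are all made up for illustration); it is not the AWS Lake Formation API.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class TablePolicy:
    readers: FrozenSet[str]
    writers: FrozenSet[str]


# Hypothetical policy store: one entry per table, not one per engine.
POLICIES: Dict[str, TablePolicy] = {
    "sales.orders": TablePolicy(
        readers=frozenset({"analyst", "ml_service", "etl_job"}),
        writers=frozenset({"etl_job"}),
    ),
}


def is_allowed(principal: str, table: str, action: str) -> bool:
    """Return True if the principal may perform the action on the table."""
    policy = POLICIES.get(table)
    if policy is None:
        return False  # default deny for tables without a policy
    if action == "read":
        return principal in policy.readers
    if action == "write":
        return principal in policy.writers
    return False


def run_query(engine: str, principal: str, table: str, action: str) -> str:
    # The engine name is deliberately not part of the decision.
    if not is_allowed(principal, table, action):
        return f"{engine}: denied {action} on {table} for {principal}"
    return f"{engine}: ok {action} on {table} for {principal}"


print(run_query("spark", "etl_job", "sales.orders", "write"))
print(run_query("trino", "analyst", "sales.orders", "write"))
```

The same principal gets the same answer from Spark, Trino, or Athena because the rule lives with the table rather than inside any one engine.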
Design Discipline
Even with open tables, teams can fall into traps:
- Treating tables like CSV files
- Lacking an ownership model
- Allowing unrestricted writes from every engine
- Ignoring cost and compaction strategies
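The compaction trap in particular is easy to underestimate: frequent small writes leave a table with many tiny files, and someone has to decide which ones to rewrite. As a rough sketch of what a compaction plan might look like, the hypothetical `plan_compaction` below groups undersized files into batches of at most a target size using a greedy first-fit. Real table formats ship their own maintenance procedures; this only illustrates the decision being made.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group files smaller than target_bytes into rewrite batches.

    Greedy first-fit over files sorted largest first. Files already at or
    above the target are left alone.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    groups: List[List[str]] = []
    totals: List[int] = []
    for name, size in small:
        for i, total in enumerate(totals):
            if total + size <= target_bytes:
                groups[i].append(name)
                totals[i] += size
                break
        else:
            # No existing batch has room, so start a new one.
            groups.append([name])
            totals.append(size)
    # A batch of one file gains nothing from being rewritten.
    return [group for group in groups if len(group) > 1]


# Sizes are in arbitrary units; 128 stands in for the target file size.
sizes = {"a": 10, "b": 60, "c": 50, "d": 128, "e": 40, "f": 100}
plan = plan_compaction(sizes, target_bytes=128)
```

With these example sizes, file `d` is already at the target and file `e` ends up alone, so only two batches are worth rewriting. Who runs this job, and how often, is the ownership question.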
Key Considerations
- Concurrent writers – Conflict‑resolution strategies
- Compaction ownership – Who maintains table performance?
- Performance tuning – Partitioning, indexing
- Failure domains – What breaks, and where?
These are platform decisions, not just engineering ones.
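For the concurrent-writers question, a common strategy is optimistic concurrency: each writer prepares its change against the version it read, then commits only if the table has not moved on, and retries otherwise. The sketch below models that with a hypothetical `ToyCatalog` and `commit_append`; the names and the retry count are assumptions for illustration, not any format's actual commit protocol.

```python
import threading
from typing import Tuple


class CommitConflict(Exception):
    """Raised when the table moved on since the writer read it."""


class ToyCatalog:
    """Hypothetical catalog holding one table's current version and file list."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 0
        self.files: Tuple[str, ...] = ()

    def load(self) -> Tuple[int, Tuple[str, ...]]:
        with self._lock:
            return self.version, self.files

    def compare_and_swap(self, expected_version: int, new_files: Tuple[str, ...]) -> None:
        # The swap succeeds only if nobody committed since expected_version.
        with self._lock:
            if self.version != expected_version:
                raise CommitConflict(
                    f"expected v{expected_version}, found v{self.version}"
                )
            self.version += 1
            self.files = new_files


def commit_append(catalog: ToyCatalog, new_file: str, max_retries: int = 3) -> int:
    """Append a file, re-reading and retrying if another writer got in first."""
    for _ in range(max_retries):
        version, files = catalog.load()
        try:
            catalog.compare_and_swap(version, files + (new_file,))
            return version + 1
        except CommitConflict:
            continue  # a plain append can be replayed on the newer version
    raise CommitConflict(f"gave up after {max_retries} attempts")


catalog = ToyCatalog()
stale_version, stale_files = catalog.load()       # writer B reads v0
commit_append(catalog, "spark-0001.parquet")      # writer A commits v1
try:
    # Writer B tries to commit against the version it read earlier.
    catalog.compare_and_swap(stale_version, stale_files + ("trino-0001.parquet",))
except CommitConflict:
    commit_append(catalog, "trino-0001.parquet")  # B retries against v1
```

Appends are the easy case because they can simply be replayed. Overwrites and deletes touching the same files need an explicit rule for which writer wins, and that rule is the platform decision.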
Organizational Change
The shift is bigger than technology; it’s an organizational transformation.
| From | To |
|---|---|
| Pipeline ownership | Data product ownership |
| System silos | Shared contracts |
| Tool‑centric thinking | Agreement‑centric thinking |
- Stop copying data.
- Start sharing truth.
- Design tables as products.
- Let engines be interchangeable.
“The most scalable analytics platforms are built around agreements, not tools.”
Conclusion
We’ve spent years optimizing how fast we process data. Now the crucial question is:
Where does truth live, and who owns it?
Until that’s solved, no amount of compute will fix your data platform.