Open Tables, Shared Truth: Architecting a Multi-Engine Lakehouse

Published: March 31, 2026 at 03:35 AM EDT
3 min read
Source: Dev.to

The Problem

Across many data platforms, the same symptoms recur:

  • The same dataset is copied multiple times.
  • The same metric produces different results.
  • Governance logic is re‑implemented across systems.

Despite this, organizations confidently claim:

“We have a single source of truth.”

In practice, what exists are separate copies:

  • A warehouse copy
  • A lake copy
  • A serving copy

Each is slightly different and “correct” only in its own context.

Why We Ended Up Here

Historically:

  1. Compute engines couldn’t agree on formats.
  2. Storage systems lacked transactional guarantees.
  3. Governance was tied to specific platforms.

Engineers responded by making pipelines the glue that held fragmented truth together. But every pipeline produces another copy, so pipelines multiply truth rather than scale it.

The Shift to Open Table Formats

Data lakes were introduced with the promise of “store everything in one place,” delivering:

  • Faster access
  • Fewer ETL jobs
  • Flexible analytics

But ownership didn’t change: each engine still owned its own copy, so the lake became accessible but not authoritative.

The New Unit of Ownership

The table, not the engine, is the unit of ownership.

Historically, engines owned data and pipelines moved data between them. Now, tables become shared, governed, and authoritative assets.

Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi bring database‑like guarantees to object storage:

  • ACID transactions
  • Schema evolution
  • Time travel
  • Snapshot isolation
  • Concurrent reads and writes
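
Snapshot isolation and time travel both follow from one idea: a commit never edits existing state, it publishes a new immutable snapshot. The following is a minimal in-memory sketch of that idea, not how Iceberg, Delta Lake, or Hudi actually lay out metadata. `ToyTable`, `Snapshot`, and the file names are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple[str, ...]  # data files visible in this snapshot (immutable)


@dataclass
class ToyTable:
    # Hypothetical toy model: the table is just an append-only list of snapshots.
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append(self, new_files: list[str]) -> Snapshot:
        # A commit never mutates an old snapshot; it publishes a new one.
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: Optional[int] = None) -> tuple[str, ...]:
        # Readers pin one snapshot; passing an older id is "time travel".
        if snapshot_id is None:
            return self.current().files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
table.append(["part-0001.parquet"])
pinned = table.current()               # a reader starts its query here
table.append(["part-0002.parquet"])    # a writer commits while the reader runs

print(table.scan(pinned.snapshot_id))  # the pinned reader still sees one file
print(table.scan())                    # a new reader sees both files
```

Because old snapshots stay intact, a long-running query is unaffected by later commits, and reading by an older snapshot id is all that time travel requires in this model.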

Multi‑Engine Compatibility

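Because the table format is an open specification, multiple engines can work against the same table (the exact read and write support varies by engine and format), e.g.: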

  • Apache Spark
  • Trino
  • Amazon Athena
  • Snowflake

No duplication, no translation layers.

Modern Lakehouse Architecture

Traditional vs. Modern Approaches

Aspect | Traditional | Modern
Data handling | Copied, pipelines everywhere, multiple versions of truth | Shared, minimal pipelines, one consistent truth
Storage role | Passive storage | System of record (authoritative)
Governance | Engine‑specific policies | Table‑level, centralized policies

Layered Architecture

  1. Storage Layer – Object storage (e.g., Amazon S3)
  2. Table Layer – Open table formats (Iceberg / Delta / Hudi)
  3. Compute Layer – Multiple engines (Spark, Trino, Athena, etc.)
  4. Governance Layer – Centralized policy enforcement (e.g., AWS Lake Formation)
  5. Consumption Layer – BI, ML, APIs
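
One way to picture the governance layer is as a single policy lookup keyed by table, which every engine calls in the same way before it scans. The sketch below is a toy illustration under that assumption. It is not the AWS Lake Formation API, and `TablePolicy`, `POLICIES`, `authorize_scan`, and the table and role names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TablePolicy:
    # Hypothetical policy shape: who may read, and which columns are hidden.
    readers: set[str]
    masked_columns: set[str] = field(default_factory=set)


# Policies are attached to tables, not to engines.
POLICIES = {
    "sales.orders": TablePolicy(readers={"analyst", "ml"}, masked_columns={"email"}),
}


def authorize_scan(table: str, role: str, columns: list[str]) -> list[str]:
    """Return the columns this role may see; raise if the role may not read."""
    policy = POLICIES.get(table)
    if policy is None or role not in policy.readers:
        raise PermissionError(f"{role} may not read {table}")
    return [c for c in columns if c not in policy.masked_columns]


# Every engine gets the same answer because the rule lives with the table.
for engine in ("spark", "trino", "athena"):
    visible = authorize_scan("sales.orders", "analyst", ["order_id", "email", "total"])
    print(engine, visible)
```

The design point is that the engine name never appears in the policy check, so adding or swapping an engine does not require re-implementing the rule.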

Design Discipline

Even with open tables, teams can fall into traps:

  • Treating tables like CSV files
  • Lacking an ownership model
  • Allowing unrestricted writes from every engine
  • Ignoring cost and compaction strategies
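
To make the compaction point concrete, here is a rough sketch of how a maintenance job might plan which small files to rewrite together. It is a simple greedy grouping, not the algorithm any particular table format uses. `plan_compaction`, the 128 MB target, and the file names are assumptions for illustration.

```python
from __future__ import annotations


def plan_compaction(file_sizes_mb: dict[str, int], target_mb: int = 128) -> list[list[str]]:
    """Greedily group files smaller than target_mb into batches of at most target_mb."""
    small = sorted((size, name) for name, size in file_sizes_mb.items() if size < target_mb)
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for size, name in small:
        if current and current_size + size > target_mb:
            if len(current) > 1:  # rewriting a single file alone gains nothing
                batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


files = {"a.parquet": 10, "b.parquet": 20, "c.parquet": 30, "d.parquet": 100, "e.parquet": 200}
print(plan_compaction(files))  # [['a.parquet', 'b.parquet', 'c.parquet']]
```

Whoever owns the table should also own this job, because every engine that reads the table pays for the small files nobody compacts.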

Key Considerations

  • Concurrent writers – Conflict‑resolution strategies
  • Compaction ownership – Who maintains table performance?
  • Performance tuning – Partitioning, indexing
  • Failure domains – What breaks, and where?

These are platform decisions, not just engineering ones.
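
For concurrent writers, a common conflict-resolution strategy is optimistic concurrency: each writer prepares its change against the version it read, then commits with an atomic compare-and-swap and retries if another writer got there first. The sketch below shows that pattern in a toy, single-process form. `ToyCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical names, and real catalogs implement the swap very differently.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed since our base version was read."""


class ToyCatalog:
    """Hypothetical catalog holding one pointer: the table's current version."""

    def __init__(self) -> None:
        self._version = 0
        self._lock = threading.Lock()

    def current_version(self) -> int:
        with self._lock:
            return self._version

    def compare_and_swap(self, expected: int) -> int:
        # The commit succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._version != expected:
                raise CommitConflict(f"expected v{expected}, found v{self._version}")
            self._version += 1
            return self._version


def commit_with_retry(catalog: ToyCatalog, max_attempts: int = 3) -> int:
    for _ in range(max_attempts):
        base = catalog.current_version()
        # A real writer would prepare new data files and metadata against `base` here.
        try:
            return catalog.compare_and_swap(base)
        except CommitConflict:
            continue  # re-read the latest version and try again
    raise CommitConflict("gave up after repeated conflicts")


catalog = ToyCatalog()
stale = catalog.current_version()    # writer A reads v0
commit_with_retry(catalog)           # writer B commits v1 first
try:
    catalog.compare_and_swap(stale)  # writer A's commit is now stale
except CommitConflict as exc:
    print("conflict:", exc)
print(commit_with_retry(catalog))    # writer A retries from v1 and commits v2
```

The retry limit is the platform decision here: it bounds how long a writer keeps trying before it surfaces the conflict to its owner.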

Organizational Change

The shift is bigger than technology; it’s an organizational transformation.

From | To
Pipeline ownership | Data product ownership
System silos | Shared contracts
Tool‑centric thinking | Agreement‑centric thinking

  • Stop copying data.
  • Start sharing truth.
  • Design tables as products.
  • Let engines be interchangeable.

“The most scalable analytics platforms are built around agreements, not tools.”

Conclusion

We’ve spent years optimizing how fast we process data. Now the crucial question is:

Where does truth live, and who owns it?

Until that’s solved, no amount of compute will fix your data platform.
