Why Does Real-Time UPDATE Synchronization Require Two Records? A Full-Chain Breakdown in Apache SeaTunnel

Published: December 4, 2025 at 03:03 AM EST
4 min read
Source: Dev.to

Introduction

In real‑time data platforms—real‑time warehouses, data lake ingestion, and distributed data replication—CDC (Change Data Capture) has become a standard capability for building modern pipelines. Whether you are loading data into StarRocks, Doris, ClickHouse, Iceberg, Paimon, or Hudi, or synchronizing across databases, CDC is the core foundation.

A set of basic CDC questions is often overlooked:

  • Why must a MySQL CDC UPDATE event output two records—one BEFORE and one AFTER?
  • Why can’t it output only the final new value?
  • If we only have AFTER, can’t the system still sync correctly?

At first glance it seems possible, but consistency, idempotency, replayability, primary‑key handling, data‑lake merge semantics, and out‑of‑order recovery reveal that splitting UPDATE into BEFORE + AFTER is not a “format choice”—it is fundamental to CDC correctness.

The Real Structure of UPDATE in MySQL Binlog

MySQL does not write an UPDATE as a single record. In row‑based binlog format the event looks like:

update_rows_event {
    before_image: {id: 1, price: 100}
    after_image:  {id: 1, price: 200}
}

When you execute:

UPDATE t SET price = 200 WHERE id = 1;

the binlog records both the old values (BEFORE) and the new values (AFTER). This pair fully describes the state transition, enabling correct replay, rollback, and verification of transactions.
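The pair above can be modeled as a tiny data structure. The following Python sketch mirrors the `before_image`/`after_image` shape shown in the pseudocode (the names are illustrative, not MySQL's actual wire format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdateRowsEvent:
    """Sketch of a row-based binlog UPDATE event: one event, two images."""
    before_image: dict  # row values prior to the UPDATE
    after_image: dict   # row values after the UPDATE

event = UpdateRowsEvent(
    before_image={"id": 1, "price": 100},
    after_image={"id": 1, "price": 200},
)
```

Both images travel together in a single event; CDC frameworks later split them into two downstream records.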

Why CDC Cannot Represent UPDATE with Only One Record

Scenario 1 – Detecting Real Changes

UPDATE t SET price = 200 WHERE id = 1;

If the original price was already 200, an AFTER‑only event ({id:1, price:200}) gives downstream systems no way to know whether the data actually changed. This leads to:

  • Unnecessary writes to data lakes (expensive Iceberg merges)
  • Incorrect metric recomputation
  • Wasted compute resources
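With the BEFORE image available, a no-op UPDATE can be filtered out before it triggers any downstream work. A minimal sketch:

```python
def is_real_change(before: dict, after: dict) -> bool:
    """Return True only if the UPDATE actually altered some column.

    A no-op UPDATE (new values equal to old values) can be dropped
    here instead of triggering an expensive downstream merge.
    """
    return before != after

# price was already 200: nothing actually changed, safe to skip
assert not is_real_change({"id": 1, "price": 200}, {"id": 1, "price": 200})
# genuine change: 100 -> 200
assert is_real_change({"id": 1, "price": 100}, {"id": 1, "price": 200})
```

An AFTER-only event makes this comparison impossible: the consumer must assume every event is a real change.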

Scenario 2 – Primary‑Key Updates

UPDATE user SET id = 2 WHERE id = 1;

An AFTER‑only event ({id:2}) lacks the old primary‑key value, so downstream systems cannot delete the original row. Consequences include:

  • Duplicate records
  • Unique‑key violations
  • Broken cross‑database replication
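The BEFORE image is what lets a sink delete the row under the old key. A sketch against a dict keyed by primary key (the table shape is hypothetical, for illustration):

```python
def apply_update(table: dict, before: dict, after: dict, key: str) -> None:
    """Apply a CDC UPDATE to a store keyed by primary key.

    The BEFORE image supplies the OLD key value, so the original row
    can be removed even when the primary key itself changed. An
    AFTER-only event would leave the stale row under key 1 behind.
    """
    table.pop(before[key], None)  # delete by the old primary key
    table[after[key]] = after     # insert the new row

users = {1: {"id": 1}}
apply_update(users, before={"id": 1}, after={"id": 2}, key="id")
assert users == {2: {"id": 2}}  # old row gone, no duplicate
```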

Scenario 3 – Missing Primary Key (Ambiguous Rows)

name | score
---- | -----
A    | 100
A    | 200

UPDATE t SET score = 300 WHERE name = 'A';

AFTER‑only yields two identical rows (A, 300). Without BEFORE, the system cannot determine which original row each update refers to.
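The BEFORE image disambiguates the rows: each per-row event carries the exact original values it applies to. A sketch over a keyless table represented as a list (the helper name is made up for illustration):

```python
def update_row_without_pk(rows: list, before: dict, after: dict) -> bool:
    """Update exactly one row in a keyless table by matching the full
    BEFORE image, keeping otherwise-identical rows distinguishable."""
    for i, row in enumerate(rows):
        if row == before:
            rows[i] = dict(after)
            return True
    return False  # no row matches the BEFORE image

rows = [{"name": "A", "score": 100}, {"name": "A", "score": 200}]
# The BEFORE image pins this event to the (A, 100) row only.
update_row_without_pk(rows, {"name": "A", "score": 100},
                            {"name": "A", "score": 300})
assert rows == [{"name": "A", "score": 300}, {"name": "A", "score": 200}]
```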

Scenario 4 – Exactly‑Once Guarantees (Idempotency)

CDC pipelines often resend events due to:

  • Distributed recovery
  • Network retries
  • Checkpoint replay
  • Consumer restarts

AFTER‑only events cannot be distinguished from duplicates, breaking idempotency guarantees.
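One way the BEFORE image restores idempotency: an event whose BEFORE no longer matches the current state has already been applied and can be skipped on redelivery. A minimal sketch (full-row state, not a real SeaTunnel API):

```python
def apply_once(state: dict, before: dict, after: dict) -> dict:
    """Replay-safe apply: only perform the transition if the current
    state still matches the BEFORE image; otherwise the event is a
    duplicate (or stale) and is ignored."""
    if state == before:
        return after  # first delivery: apply the transition
    return state      # redelivery: BEFORE no longer matches, skip

s = {"id": 1, "price": 100}
s = apply_once(s, {"id": 1, "price": 100}, {"id": 1, "price": 200})
s = apply_once(s, {"id": 1, "price": 100}, {"id": 1, "price": 200})  # replayed
assert s == {"id": 1, "price": 200}  # applied exactly once
```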

Scenario 5 – Out‑of‑Order Binlog Events

MySQL multi‑threaded replication may produce:

  • Thread 1: 100 → 120
  • Thread 2: 120 → 200

If the AFTER = 200 event arrives before the AFTER = 120 event, an AFTER‑only consumer blindly applies 120 last and ends with the wrong final value. The BEFORE images expose the transition chain (100 → 120 → 200), letting the consumer detect the out‑of‑order delivery and restore the correct order.
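The BEFORE image turns each event into a checkable transition: an event applies only when its old value matches the current state, so out-of-order arrivals can be buffered. A sketch over scalar values (the buffering strategy is illustrative):

```python
def apply_in_order(state, events):
    """Apply (before, after) transitions, buffering events whose BEFORE
    image does not match the current state until their turn comes."""
    pending = list(events)
    while pending:
        for ev in pending:
            before, after = ev
            if state == before:   # this transition is applicable now
                state = after
                pending.remove(ev)
                break
        else:
            break  # nothing applicable: events genuinely missing
    return state

# Thread 2's event (120 -> 200) arrives before Thread 1's (100 -> 120).
final = apply_in_order(100, [(120, 200), (100, 120)])
assert final == 200  # correct order recovered from the BEFORE images
```

With AFTER-only events there is no transition chain to check, so the consumer has no basis for reordering.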

Scenario 6 – Data‑Lake Delete Requirements

Data‑lake update semantics typically perform:

DELETE old_row
INSERT new_row

The DELETE must match the exact BEFORE image (e.g., WHERE id=1 AND price=100). Missing BEFORE makes the DELETE impossible, leading to inconsistent data—critical in regulated financial environments.
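The delete-then-insert sequence can be sketched directly; the list-of-dicts table and helper name below are illustrative, not an actual lake API:

```python
def apply_lake_update(rows: list, before: dict, after: dict) -> None:
    """Data-lake UPDATE semantics: DELETE the row matching the exact
    BEFORE image, then INSERT the AFTER image."""
    try:
        rows.remove(before)  # equivalent to WHERE id=1 AND price=100
    except ValueError:
        raise RuntimeError("BEFORE image not found; refusing partial UPDATE")
    rows.append(after)

table = [{"id": 1, "price": 100}]
apply_lake_update(table, {"id": 1, "price": 100}, {"id": 1, "price": 200})
assert table == [{"id": 1, "price": 200}]
```

Without the BEFORE image, the `rows.remove(...)` step has nothing to match against, which is exactly the failure mode described above.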

SeaTunnel’s CDC Architecture

SeaTunnel builds on Debezium’s log parsing and defines four RowKinds:

  • INSERT
  • DELETE
  • UPDATE_BEFORE
  • UPDATE_AFTER

The MySQL‑CDC source therefore emits two tightly coupled events:

UPDATE_BEFORE → old row (DELETE)
UPDATE_AFTER  → new row (INSERT)
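The split above can be sketched as follows; the enum models SeaTunnel's four RowKinds, but the short codes and function are illustrative, not the actual SeaTunnel API:

```python
from enum import Enum

class RowKind(Enum):
    """The four row kinds a CDC source can emit (modeled on SeaTunnel)."""
    INSERT = "+I"
    DELETE = "-D"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"

def split_update(before: dict, after: dict):
    """Split one binlog UPDATE into the tightly coupled pair of events."""
    return [(RowKind.UPDATE_BEFORE, before), (RowKind.UPDATE_AFTER, after)]

events = split_update({"id": 1, "price": 100}, {"id": 1, "price": 200})
# events[0] carries the old row, events[1] the new row
```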

Processing flow

MySQL Binlog (ROW)
    |
UpdateRowsEvent (before, after)
    |
SeaTunnel MySQL‑CDC Parser
    |-------------------|
UPDATE_BEFORE      UPDATE_AFTER
(old row)          (new row)

Downstream sinks decide the operation based on RowKind. This model ensures:

  • Replayability in distributed environments
  • Recoverability
  • Order preservation
  • Compatibility with data‑lake merge semantics

The same pattern applies to OLAP databases (Doris, StarRocks), data lakes (Iceberg, Paimon, Hudi), and messaging systems (Kafka). Omitting BEFORE would cause the entire pipeline to fail.

How Data Lakes and Warehouses Consume BEFORE

Iceberg, Paimon, and Hudi support ACID semantics, but an UPDATE is a composite operation:

UPDATE event
    |
------------------------------
|                            |
DELETE old_row          INSERT new_row

The DELETE step must match the exact BEFORE image; otherwise the UPDATE cannot be applied, resulting in data inconsistency.

Real Production Case Studies

Case 1 – Duplicate Customer Records (Primary‑Key Update)

A gaming company captured only AFTER events. When users updated their phone number, part of a composite primary key, downstream systems created duplicate customer records because the row under the old key could not be deleted.

Case 2 – Iceberg Merge Failures

A financial institution ingested data into Iceberg using AFTER‑only CDC. DELETE operations could not match old records, leading to large volumes of inconsistent data. The pipeline was rebuilt to include BEFORE images, restoring correctness.

Final Takeaways

  • An UPDATE is a transition from an old state to a new state; both images are required for accurate CDC.
  • BEFORE + AFTER enable:
    • Detection of real changes
    • Correct handling of primary‑key updates
    • Unambiguous row identification
    • Exactly‑once processing (idempotency)
    • Out‑of‑order recovery
    • Proper delete/insert semantics in data lakes

SeaTunnel (and similar CDC frameworks) emit paired UPDATE_BEFORE and UPDATE_AFTER events to guarantee replayability, recoverability, and compatibility with downstream systems.

Understanding and preserving the BEFORE image is essential for building reliable, consistent real‑time pipelines across databases, data warehouses, and data lakes.
