Dual write problem in distributed systems

Published: December 29, 2025 at 07:05 AM EST
3 min read
Source: Dev.to

Overview

The dual‑write problem occurs when a single logical operation must update two (or more) independent systems—e.g., persisting data in a database and publishing an event to a message broker such as Kafka. Because the systems do not share a transaction coordinator, achieving atomic “all‑or‑nothing” behavior is extremely difficult.

Distributed Transaction Protocols

  • Consensus‑based protocols (Paxos, Raft) provide strong consistency for state machines and are used in distributed databases and configuration stores.
  • TrueTime + Paxos (Google Spanner) offers global ACID guarantees.
  • In typical microservice architectures, such heavyweight protocols are rarely employed, leading to the dual‑write challenge.

Example Scenario

Suppose a user service needs to:

BEGIN;
INSERT INTO users (id, name) VALUES (...);
-- send a "UserCreated" event to Kafka (the broker cannot take part in this SQL transaction)
COMMIT;

If the database insert succeeds but the Kafka publish fails (or vice versa), the system ends up in an inconsistent state: one side reflects the change while the other does not. This is the dual-write problem in action.
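
In application code, the naive approach looks roughly like the sketch below. This is a minimal illustration, assuming psycopg2 and the kafka-python client; the connection parameters, topic name, and column values are placeholders:

import json
import uuid

import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=app user=app")            # assumed connection parameters
producer = KafkaProducer(bootstrap_servers="localhost:9092")

user_id = str(uuid.uuid4())

# Step 1: commit the row to the database.
with conn.cursor() as cur:
    cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user_id, "Alice"))
conn.commit()

# Step 2: publish the event. This happens OUTSIDE the database transaction.
# A crash or broker outage right here leaves the row committed but the
# event unpublished: the dual-write problem.
producer.send(
    "user-events",
    json.dumps({"type": "UserCreated", "user_id": user_id}).encode("utf-8"),
)
producer.flush()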

Core Problem

  • No global transaction manager coordinates the two operations.
  • Network glitches, process crashes, or retries can cause partial updates.
  • Retries may generate duplicate events or out‑of‑order processing.

Consequences

  • DB write succeeds, event not emitted → Downstream services never learn about the new entity.
  • Event emitted, DB write fails → Consumers act on data that does not exist.
  • Partial retries → Duplicate events or multiple DB inserts.

Common Solutions / Patterns

Transactional Outbox Pattern

  1. Write the business data and the event to the same database within a single transaction.
  2. A background process (or CDC tool) reads the “outbox” table and publishes events to Kafka.

Pros

  • The event is recorded in the same transaction as the business data, so the database and the message stream cannot diverge.
  • Simple when you control both storage and messaging.

Cons

  • Adds operational complexity (an extra outbox table plus a relay process).
  • Consumers must handle possible duplicate events.

Change Data Capture (CDC)

  • Use a CDC tool (e.g., Debezium, Oracle GoldenGate) to monitor database changes and automatically emit events; a consumer-side sketch follows below.

Pros

  • No dual‑write logic in application code.
  • Strong consistency if CDC pipeline is reliable.

Cons

  • Potential event lag.
  • Requires stable schema and reliable CDC infrastructure.
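
For completeness, here is a minimal sketch of a downstream consumer reading Debezium-style change events from Kafka, assuming the kafka-python client; the topic name and the exact envelope layout depend on how the connector is configured and are only illustrative:

import json

from kafka import KafkaConsumer

# Debezium conventionally names topics "<server>.<schema>.<table>" (assumed here).
consumer = KafkaConsumer(
    "appdb.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")) if raw else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    change = message.value
    if not change:
        continue  # tombstone or empty message
    payload = change.get("payload", change)  # envelope shape depends on converter settings
    if payload.get("op") == "c":             # "c" = row created
        row = payload.get("after", {})
        print("Observed new order via CDC:", row.get("id"))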

Idempotent & Retry‑Safe Design

  • Design operations to be idempotent (safe to repeat).
  • Use unique request IDs and deduplication on the consumer side (see the sketch below).

Pros

  • Works across heterogeneous systems.

Cons

  • Still needs careful design; does not solve ordering issues.
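
A minimal sketch of consumer-side deduplication, assuming psycopg2 and a hypothetical processed_events table whose primary key is the event's unique ID; the order_projections table is likewise illustrative:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection parameters

def handle_event(event: dict) -> None:
    """Apply an event's side effect at most once, even if it is redelivered."""
    with conn.cursor() as cur:
        # Record the event ID first; ON CONFLICT turns a redelivery into a no-op.
        cur.execute(
            "INSERT INTO processed_events (event_id) VALUES (%s) ON CONFLICT DO NOTHING",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            conn.rollback()  # already processed, skip the side effect
            return
        # The side effect commits in the same transaction as the dedup record.
        cur.execute(
            "INSERT INTO order_projections (order_id) VALUES (%s)",
            (event["order_id"],),
        )
    conn.commit()

Because the dedup record and the side effect share one transaction, a crash between them cannot leave the consumer half-applied.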

Transactional Outbox Solution (Detailed Example)

Schema

-- Business table
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    total NUMERIC NOT NULL,
    created_at TIMESTAMP DEFAULT now()
);

-- Outbox table
CREATE TABLE outbox (
    id UUID PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMP DEFAULT now(),
    published BOOLEAN DEFAULT FALSE
);

Application Transaction (Atomic Write)

import uuid, json
import psycopg2

conn = psycopg2.connect(...)
order_id = uuid.uuid4()
customer_id = uuid.uuid4()  # in a real service this comes from the incoming request
event = {
    "event_id": str(uuid.uuid4()),
    "type": "OrderCreated",
    "order_id": str(order_id)
}

try:
    with conn.cursor() as cur:
        # Insert business data; pass UUIDs as strings so psycopg2 can adapt them
        cur.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (str(order_id), str(customer_id), 123.45)
        )
        # Insert outbox event
        cur.execute(
            """
            INSERT INTO outbox (id, event_type, payload)
            VALUES (%s, %s, %s)
            """,
            (event["event_id"], event["type"], json.dumps(event))
        )
    conn.commit()  # ✅ both rows are persisted atomically
except Exception as e:
    conn.rollback()
    raise

Outbox Processor (Async Publisher)

A lightweight worker (or CDC tool) continuously scans outbox for rows where published = FALSE. For each row:

  1. Publish the payload to Kafka.
  2. Mark the row as published = TRUE.

If the worker crashes mid‑publish, the event remains unmarked and will be retried, guaranteeing at‑least‑once delivery without losing consistency.
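
A minimal sketch of such a polling worker, assuming psycopg2 and the kafka-python client; the topic name, batch size, and poll interval are illustrative:

import json
import time

import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=app user=app")      # assumed connection parameters
producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    with conn.cursor() as cur:
        # Fetch a batch of unpublished events, locking the rows so parallel
        # workers do not pick up the same ones in this pass.
        cur.execute(
            """
            SELECT id, event_type, payload FROM outbox
            WHERE published = FALSE
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED
            """
        )
        for event_id, event_type, payload in cur.fetchall():
            # 1. Publish to Kafka synchronously, so a broker failure raises here.
            producer.send("order-events", json.dumps(payload).encode("utf-8")).get(timeout=10)
            # 2. Mark the row as published within the same database transaction.
            cur.execute("UPDATE outbox SET published = TRUE WHERE id = %s", (event_id,))
    conn.commit()
    time.sleep(1)  # simple polling; a real worker might use LISTEN/NOTIFY or CDC instead

FOR UPDATE SKIP LOCKED lets several workers run in parallel without grabbing the same rows in one pass; duplicates can still occur across crashes, which is why consumers must remain idempotent.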

Goal: Prevent inconsistency when writing to a database and publishing events to a message broker, thereby solving the dual‑write problem.
