Dual write problem in distributed systems

Published: December 29, 2025 at 07:05 AM EST
3 min read
Source: Dev.to

Overview

The dual‑write problem occurs when a single logical operation must update two (or more) independent systems—e.g., persisting data in a database and publishing an event to a message broker such as Kafka. Because the systems do not share a transaction coordinator, achieving atomic “all‑or‑nothing” behavior is extremely difficult.

Distributed Transaction Protocols

  • Consensus‑based protocols (Paxos, Raft) provide strong consistency for state machines and are used in distributed databases and configuration stores.
  • TrueTime + Paxos (Google Spanner) offers global ACID guarantees.
  • In typical microservice architectures, such heavyweight protocols are rarely employed, leading to the dual‑write challenge.

Example Scenario

Suppose a user service needs to:

BEGIN;
INSERT INTO users (id, name) VALUES (...);
-- send a "UserCreated" event to Kafka (the broker cannot take part in this SQL transaction)
COMMIT;

If the database insert succeeds but the Kafka publish fails (or vice versa), the system ends up in an inconsistent state: one side reflects the change while the other does not. This is the dual-write problem in action.
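
In application code, the naive approach looks roughly like the sketch below. This is a minimal illustration, assuming psycopg2 and the kafka-python client; the connection parameters, topic name, and column values are placeholders:

import json
import uuid

import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=app user=app")            # assumed connection parameters
producer = KafkaProducer(bootstrap_servers="localhost:9092")

user_id = str(uuid.uuid4())

# Step 1: commit the row to the database.
with conn.cursor() as cur:
    cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user_id, "Alice"))
conn.commit()

# Step 2: publish the event. This happens OUTSIDE the database transaction.
# A crash or broker outage right here leaves the row committed but the
# event unpublished: the dual-write problem.
producer.send(
    "user-events",
    json.dumps({"type": "UserCreated", "user_id": user_id}).encode("utf-8"),
)
producer.flush()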

Core Problem

  • No global transaction manager coordinates the two operations.
  • Network glitches, process crashes, or retries can cause partial updates.
  • Retries may generate duplicate events or out‑of‑order processing.

Consequences

  • DB write succeeds, event not emitted → Downstream services never learn about the new entity.
  • Event emitted, DB write fails → Consumers act on data that does not exist.
  • Partial retries → Duplicate events or multiple DB inserts.

Common Solutions / Patterns

Transactional Outbox Pattern

  1. Write the business data and the event to the same database within a single transaction.
  2. A background process (or CDC tool) reads the “outbox” table and publishes events to Kafka.

Pros

  • The event is recorded in the same transaction as the business data, so the database and the message stream cannot diverge.
  • Simple when you control both storage and messaging.

Cons

  • Adds operational complexity (an extra outbox table plus a relay process).
  • Consumers must handle possible duplicate events.

Change Data Capture (CDC)

  • Use a CDC tool (e.g., Debezium, Oracle GoldenGate) to monitor database changes and automatically emit events; a consumer-side sketch follows below.

Pros

  • No dual‑write logic in application code.
  • Strong consistency if CDC pipeline is reliable.

Cons

  • Potential event lag.
  • Requires stable schema and reliable CDC infrastructure.
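
For completeness, here is a minimal sketch of a downstream consumer reading Debezium-style change events from Kafka, assuming the kafka-python client; the topic name and the exact envelope layout depend on how the connector is configured and are only illustrative:

import json

from kafka import KafkaConsumer

# Debezium conventionally names topics "<server>.<schema>.<table>" (assumed here).
consumer = KafkaConsumer(
    "appdb.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")) if raw else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    change = message.value
    if not change:
        continue  # tombstone or empty message
    payload = change.get("payload", change)  # envelope shape depends on converter settings
    if payload.get("op") == "c":             # "c" = row created
        row = payload.get("after", {})
        print("Observed new order via CDC:", row.get("id"))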

Idempotent & Retry‑Safe Design

  • Design operations to be idempotent (safe to repeat).
  • Use unique request IDs and deduplication on the consumer side (see the sketch below).

Pros

  • Works across heterogeneous systems.

Cons

  • Still needs careful design; does not solve ordering issues.
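
A minimal sketch of consumer-side deduplication, assuming psycopg2 and a hypothetical processed_events table whose primary key is the event's unique ID; the order_projections table is likewise illustrative:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection parameters

def handle_event(event: dict) -> None:
    """Apply an event's side effect at most once, even if it is redelivered."""
    with conn.cursor() as cur:
        # Record the event ID first; ON CONFLICT turns a redelivery into a no-op.
        cur.execute(
            "INSERT INTO processed_events (event_id) VALUES (%s) ON CONFLICT DO NOTHING",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            conn.rollback()  # already processed, skip the side effect
            return
        # The side effect commits in the same transaction as the dedup record.
        cur.execute(
            "INSERT INTO order_projections (order_id) VALUES (%s)",
            (event["order_id"],),
        )
    conn.commit()

Because the dedup record and the side effect share one transaction, a crash between them cannot leave the consumer half-applied.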

Transactional Outbox Solution (Detailed Example)

Schema

-- Business table
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    total NUMERIC NOT NULL,
    created_at TIMESTAMP DEFAULT now()
);

-- Outbox table
CREATE TABLE outbox (
    id UUID PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMP DEFAULT now(),
    published BOOLEAN DEFAULT FALSE
);

Application Transaction (Atomic Write)

import uuid, json
import psycopg2

conn = psycopg2.connect(...)
order_id = uuid.uuid4()
customer_id = uuid.uuid4()  # in a real service this comes from the incoming request
event = {
    "event_id": str(uuid.uuid4()),
    "type": "OrderCreated",
    "order_id": str(order_id)
}

try:
    with conn.cursor() as cur:
        # Insert business data; pass UUIDs as strings so psycopg2 can adapt them
        cur.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (str(order_id), str(customer_id), 123.45)
        )
        # Insert outbox event
        cur.execute(
            """
            INSERT INTO outbox (id, event_type, payload)
            VALUES (%s, %s, %s)
            """,
            (event["event_id"], event["type"], json.dumps(event))
        )
    conn.commit()  # ✅ both rows are persisted atomically
except Exception as e:
    conn.rollback()
    raise

Outbox Processor (Async Publisher)

A lightweight worker (or CDC tool) continuously scans outbox for rows where published = FALSE. For each row:

  1. Publish the payload to Kafka.
  2. Mark the row as published = TRUE.

If the worker crashes mid‑publish, the event remains unmarked and will be retried, guaranteeing at‑least‑once delivery without losing consistency.
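
A minimal sketch of such a polling worker, assuming psycopg2 and the kafka-python client; the topic name, batch size, and poll interval are illustrative:

import json
import time

import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=app user=app")      # assumed connection parameters
producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    with conn.cursor() as cur:
        # Fetch a batch of unpublished events, locking the rows so parallel
        # workers do not pick up the same ones in this pass.
        cur.execute(
            """
            SELECT id, event_type, payload FROM outbox
            WHERE published = FALSE
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED
            """
        )
        for event_id, event_type, payload in cur.fetchall():
            # 1. Publish to Kafka synchronously, so a broker failure raises here.
            producer.send("order-events", json.dumps(payload).encode("utf-8")).get(timeout=10)
            # 2. Mark the row as published within the same database transaction.
            cur.execute("UPDATE outbox SET published = TRUE WHERE id = %s", (event_id,))
    conn.commit()
    time.sleep(1)  # simple polling; a real worker might use LISTEN/NOTIFY or CDC instead

FOR UPDATE SKIP LOCKED lets several workers run in parallel without grabbing the same rows in one pass; duplicates can still occur across crashes, which is why consumers must remain idempotent.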

Goal: Prevent inconsistency when writing to a database and publishing events to a message broker, thereby solving the dual‑write problem.
