I Should Have Put Events in the Same Database as the Aggregate Root—Heres What Happened
Source: Dev.to
The Problem We Were Actually Solving
Our CQRS model kept Events in a separate Kafka cluster labeled event‑store while Aggregates lived in PostgreSQL. The outbox pattern wrote Events to Kafka via Debezium, then the read side consumed them to build materialized views. The promise was eventual consistency with zero data loss.
In practice we saw a 40 ms write path plus a 200 ms read path, and every time we scaled the read side the lag exploded because the offset‑commit cycle couldn’t keep up with the volume. At 800 RPS the materialized views were 2.3 s stale; at 250 k RPS the lag peaked at 4.2 million unprocessed events and the consumer restarted every 15 minutes with Zookeeper session timeouts. PagerDuty woke us at 3 a.m. for three nights in a row.
What We Tried First (And Why It Failed)
- Upgrade Kafka to 3.5 with transactional producers – hoped idempotent writes would tame the lag. Lag dropped 12 % but we still had 3.7 million unprocessed events.
- Move read‑side consumers to a tiered architecture – 12 k8s pods across three AZs. CPU steal climbed to 45 % on the nodes and the 99th‑percentile read latency increased to 900 ms.
- Switch from Debezium to Kafka Connect with a JDBC source – thought schema evolution was the bottleneck. Introduced 20‑second schema‑validation pauses and lag climbed to 5.1 million events.
Each attempt optimized one metric while breaking another; none addressed the fundamental latency tax of crossing two databases and two networks.
The Architecture Decision
We tore the boundary down:
- Events stored in the same PostgreSQL cluster as their aggregate roots, using
jsonbcolumns with a GIN index on{aggregate_id, event_sequence}. - Kafka replaced by logical replication slots feeding a single Go service that emits a compact binary format over internal gRPC streams for downstream consumers.
The write path became a single round‑trip: client → PostgreSQL → replication slot → gRPC.
The read path now reads events via a foreign table in the same cluster, so materialized views refresh in ~50 ms instead of 2.3 s.
Trade‑offs:
- Lost Kafka’s disk spill‑over and had to shard PostgreSQL at ~5 TB per node.
- Gained ~35 % lower p99 latency and the lag metric disappeared from the dashboard.
Total cost of ownership dropped by 28 % because we eliminated two managed services (Kafka and Debezium) and reduced the monitoring stack by six Prometheus exporters.
What The Numbers Said After
A 48‑hour load test against production traffic showed:
| Metric | Before | After |
|---|---|---|
| p99 write latency | 48 ms | 12 ms |
| p99 read latency (materialized views) | 900 ms | 45 ms |
| Replication slot lag | up to millions of events | zero |
| PostgreSQL CPU usage | 35 % | 60 % |
| Memory pressure | flat (shared buffers 25 % of RAM) | flat |
| Kafka brokers | 24 | — |
| PostgreSQL nodes | — | 12 |
| Managed‑service spend | $18k/mo | $13k/mo |
The only new failure mode was logical replication lag during a primary failover; we mitigated it by pinning one replica as a hot standby with pg_rewind, adding ~20 ms of failover time but keeping lag at zero during switchover.
What I Would Do Differently
I would not have started with separate systems. The inflection point was obvious once the numbers crossed 800 RPS; by then we had already burned six weeks and $42k in cloud bills chasing the wrong abstraction. Next time I’d put events in the same database from day one and use logical replication as the event bus rather than an external broker. I’d still keep raw event streams in an object store for replay, but I’d never again pay the network and latency tax of a second database.
The lesson isn’t that Kafka is bad—it’s that service boundaries must be justified by real data, not cargo‑cult architecture. We did the math after the fire; I’ll do the math before the next one.
Tool recommendation: https://payhip.com/ref/dev1