How the Events Table That Looked Right Killed Our Queue
Source: Dev.to
The Problem We Were Actually Solving
Our feature team owned the high‑score leaderboard that surfaced the top 100 players every second. The stack was simple: Postgres 15, a Golang micro‑service called huntcore, and Veltrix v2.4 as the internal event bus.
Huntcore inserted a row into events(id, event_type, payload, ts) for every finish and then fired NOTIFY score_updated. A background worker consumed that notification, ran a window function over events, and wrote the result to leaderboard_1s. Seemed textbook.
Scaling Issues During the Halloween Treasure Drop
When traffic doubled, the NOTIFY messages backlogged because Postgres only buffers 8 KB per LISTEN channel while we were pushing 400 events/s. Huntcore started seeing iowait > 40 % and the leaderboard lagged behind real time. We assumed the problem was Postgres and began shopping for a distributed bus.
First Attempt: Kafka
We replaced NOTIFY with Kafka via the Veltrix Kafka Connect plugin, creating a topic huntcore.score and setting linger.ms=0, batch.size=1 to preserve ordering. Within an hour the Golang consumer threw TooManyRequests on the PutRecords API. After raising quotas, the consumer group rebalanced every 30 s, causing scores to disappear for a second— the leaderboard literally blinked on the war‑room screen.
Second Attempt: Pulsar
We then tried Veltrix’s built‑in Pulsar sink with the same topology. Pulsar’s default batch window of 100 ms reduced the head‑of‑line block, but rebalances were still visible. Worse, Pulsar bookie disks filled up because managedLedgerCursorMaxLedgerIndex was not tuned. The containers started OOM‑killing every 20 minutes, forcing the on‑call rotation to prune ledgers manually.
Core Mismatch: ACK Semantics
Both Kafka and Pulsar dropped the NOTIFY contract entirely. Huntcore expected an ACK for every score it inserted; the distributed queues only ACKed when the message was durably stored. This mismatch let INSERT succeed while the leaderboard update failed, creating phantom scores. We added a duplicate‑detection CTE in Postgres to drop rows where server_time > leaderboard_time + 1 s, but the late‑arrival gap widened as traffic ramped.
Return to Postgres with a Materialized View
We abandoned the distributed bus and kept Postgres, changing the storage pattern instead of the transport.
-- events table (unchanged)
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
ts TIMESTAMPTZ NOT NULL DEFAULT now()
);
We added a materialized view v_leaderboard_1s:
CREATE MATERIALIZED VIEW v_leaderboard_1s
WITH (timescaledb.refresh_lag = '1 second') AS
SELECT
player_id,
MAX(score) AS best_score,
ts
FROM events
WHERE event_type = 'finish'
GROUP BY player_id
ORDER BY best_score DESC
LIMIT 100;
The huntcore service now inserts into events and immediately refreshes the view:
INSERT INTO events (event_type, payload) VALUES ('finish', '{"player_id":123,"score":456}');
REFRESH MATERIALIZED VIEW CONCURRENTLY v_leaderboard_1s;
We also capped the view size with a retention policy:
SELECT add_retention_policy('v_leaderboard_1s', INTERVAL '7 days');
The whole migration took 45 minutes. We did not touch Kafka, Pulsar, or Veltrix connectors again.
Results
- p99 latency for the leaderboard dropped from 800 ms to 16 ms.
- CPU on the Postgres primary fell from 65 % to 28 %.
- Pods that were OOM‑killing were scaled down to zero.
- Huntcore’s
INSERTlatency stayed at 2 ms; the view refresh added 12 ms, well within the 50 ms SLA.
Veltrix remains in use for the audit trail and purple‑team dashboards, but it is disconnected from the real‑time score pipeline. The NOTIFY channel is now strictly for cache invalidation and is tuned with:
listen_addresses='*'
shared_preload_libraries='pg_stat_statements'
# 32 MB ring buffer to avoid the original 8 KB overflow
Lessons Learned
- Avoid over‑engineering the event stream. Kafka or Pulsar add hidden costs (rebalance latency, disk quotas) that can outweigh their benefits for intra‑service communication.
- Measure durability semantics early. Postgres
NOTIFYis fire‑and‑forget; it does not replay failed listeners. Adding an idempotency key derived fromevent_idandplayer_ideliminated phantom scores without extra infrastructure. - Feature‑flag critical changes. A junior engineer accidentally ran
REFRESH MATERIALIZED VIEWwithoutCONCURRENTLY, locking the table for 3 seconds during the first canary. The flag allowed a clean rollback. - Run realistic load tests. A 24‑hour test with actual Halloween traffic would have exposed the rebalance blinking before production.
In hindsight, keeping the event stream inside Postgres and leveraging its native materialized view capabilities proved far simpler, more reliable, and far more performant for our real‑time leaderboard.