How the Events Table That Looked Right Killed Our Queue

Published: 2 weeks ago (May 25, 2026 at 06:37 PM EDT)

4 min read

Source: Dev.to

The Problem We Were Actually Solving

Our feature team owned the high‑score leaderboard that surfaced the top 100 players every second. The stack was simple: Postgres 15, a Golang micro‑service called huntcore, and Veltrix v2.4 as the internal event bus.

Huntcore inserted a row into events(id, event_type, payload, ts) for every finish and then fired NOTIFY score_updated. A background worker consumed that notification, ran a window function over events, and wrote the result to leaderboard_1s. Seemed textbook.

Scaling Issues During the Halloween Treasure Drop

When traffic doubled, the NOTIFY messages backlogged because Postgres only buffers 8 KB per LISTEN channel while we were pushing 400 events/s. Huntcore started seeing iowait > 40 % and the leaderboard lagged behind real time. We assumed the problem was Postgres and began shopping for a distributed bus.

First Attempt: Kafka

We replaced NOTIFY with Kafka via the Veltrix Kafka Connect plugin, creating a topic huntcore.score and setting linger.ms=0, batch.size=1 to preserve ordering. Within an hour the Golang consumer threw TooManyRequests on the PutRecords API. After raising quotas, the consumer group rebalanced every 30 s, causing scores to disappear for a second— the leaderboard literally blinked on the war‑room screen.

Second Attempt: Pulsar

We then tried Veltrix’s built‑in Pulsar sink with the same topology. Pulsar’s default batch window of 100 ms reduced the head‑of‑line block, but rebalances were still visible. Worse, Pulsar bookie disks filled up because managedLedgerCursorMaxLedgerIndex was not tuned. The containers started OOM‑killing every 20 minutes, forcing the on‑call rotation to prune ledgers manually.

Core Mismatch: ACK Semantics

Both Kafka and Pulsar dropped the NOTIFY contract entirely. Huntcore expected an ACK for every score it inserted; the distributed queues only ACKed when the message was durably stored. This mismatch let INSERT succeed while the leaderboard update failed, creating phantom scores. We added a duplicate‑detection CTE in Postgres to drop rows where server_time > leaderboard_time + 1 s, but the late‑arrival gap widened as traffic ramped.

Return to Postgres with a Materialized View

We abandoned the distributed bus and kept Postgres, changing the storage pattern instead of the transport.

-- events table (unchanged)
CREATE TABLE events (
    id          BIGSERIAL PRIMARY KEY,
    event_type  TEXT NOT NULL,
    payload     JSONB NOT NULL,
    ts          TIMESTAMPTZ NOT NULL DEFAULT now()
);

We added a materialized view v_leaderboard_1s:

CREATE MATERIALIZED VIEW v_leaderboard_1s
WITH (timescaledb.refresh_lag = '1 second') AS
SELECT
    player_id,
    MAX(score) AS best_score,
    ts
FROM events
WHERE event_type = 'finish'
GROUP BY player_id
ORDER BY best_score DESC
LIMIT 100;

The huntcore service now inserts into events and immediately refreshes the view:

INSERT INTO events (event_type, payload) VALUES ('finish', '{"player_id":123,"score":456}');
REFRESH MATERIALIZED VIEW CONCURRENTLY v_leaderboard_1s;

We also capped the view size with a retention policy:

SELECT add_retention_policy('v_leaderboard_1s', INTERVAL '7 days');

The whole migration took 45 minutes. We did not touch Kafka, Pulsar, or Veltrix connectors again.

Results

p99 latency for the leaderboard dropped from 800 ms to 16 ms.
CPU on the Postgres primary fell from 65 % to 28 %.
Pods that were OOM‑killing were scaled down to zero.
Huntcore’s INSERT latency stayed at 2 ms; the view refresh added 12 ms, well within the 50 ms SLA.

Veltrix remains in use for the audit trail and purple‑team dashboards, but it is disconnected from the real‑time score pipeline. The NOTIFY channel is now strictly for cache invalidation and is tuned with:

listen_addresses='*'
shared_preload_libraries='pg_stat_statements'
# 32 MB ring buffer to avoid the original 8 KB overflow

Lessons Learned

Avoid over‑engineering the event stream. Kafka or Pulsar add hidden costs (rebalance latency, disk quotas) that can outweigh their benefits for intra‑service communication.
Measure durability semantics early. Postgres NOTIFY is fire‑and‑forget; it does not replay failed listeners. Adding an idempotency key derived from event_id and player_id eliminated phantom scores without extra infrastructure.
Feature‑flag critical changes. A junior engineer accidentally ran REFRESH MATERIALIZED VIEW without CONCURRENTLY, locking the table for 3 seconds during the first canary. The flag allowed a clean rollback.
Run realistic load tests. A 24‑hour test with actual Halloween traffic would have exposed the rebalance blinking before production.

In hindsight, keeping the event stream inside Postgres and leveraging its native materialized view capabilities proved far simpler, more reliable, and far more performant for our real‑time leaderboard.

How the Events Table That Looked Right Killed Our Queue

The Problem We Were Actually Solving

Scaling Issues During the Halloween Treasure Drop

First Attempt: Kafka

Second Attempt: Pulsar

Core Mismatch: ACK Semantics

Return to Postgres with a Materialized View

Results

Lessons Learned

Related posts

Resilix: How I Revived a Half-Baked Rate Limiter into an Enterprise-Grade Distributed Resilience Engine with GitHub Copilot

Performance of Rust Language [pdf]

CSS Scroll-Driven Animations: No JavaScript Required

Zero Heap Allocations at 1.18 GB/s: Deep Dive into ForgeZero 4.0.x