The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

Published: May 26, 2026 at 11:36 PM EDT


Source: Dev.to

The Problem We Were Actually Solving

We ran Veltrix, a distributed event‑processing engine that powered real‑time treasure hunts across retail stores. The business needed sub‑50 ms latency for event ingestion and 99.99 % uptime during Black Friday sales.

Our first system was a Kafka Streams topology in Scala, carefully tuned with RocksDB state stores. Each pod had 32 vCPUs and a 16 GiB JVM heap, with G1GC configured with -XX:MaxGCPauseMillis=50. Even so, during a load test at 500 k events per second, p99 latency spiked to 1.2 s and the JVM ran out of memory twice.

What We Tried First (And Why It Failed)

Scaling out the Kafka Streams app to six pods introduced a 300 ms tail due to the shuffle phase in the repartition topic.
Switching to exactly‑once semantics and bumping the RocksDB cache to 4 GiB caused blocking fsync on every commit, pegging the disks at 100 % iowait.
Profiling with async-profiler showed:
- 42 % of time spent in JIT compilation stalls
- 28 % in GC pauses

The GC logs also contained lines such as “Promoted 12 GB in 2.1 s”, a sign of heavy old-generation pressure that preceded the out-of-memory crashes.

We then rewrote the heavy join in C++ using RocksDB's JNI bindings. The median latency dropped to 28 ms, but any fault inside the C++ library took down the whole JVM process with exit code 139 (128 + 11, that is, SIGSEGV), because native code runs in-process and bypasses Java exception handling. The ops team added a liveness probe that restarted the pod, but during each restart the treasure-hunt UI refreshed and showed stale leaderboards for 8-12 seconds. Marketing sent Slack messages reading “This is unacceptable.”

The Architecture Decision

I decided to port the entire hot path to Rust. We chose:

- Tokio for the async runtime
- sled for an embedded KV store
- flamegraph for profiling

The decision wasn’t about raw speed; it was about predictable latency and eliminating hidden GC pauses. We rewrote the event router, windowed aggregator, and leaderboard updater in ~2,800 lines of Rust. The sled store ran in‑memory with a disk flush every 500 ms to avoid the fsync disaster. The Scala layer remained for schema validation and REST endpoints, but the critical path became Rust.
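To make the windowed aggregator and leaderboard updater concrete, here is a minimal sketch using only the standard library. It is not our production code: the `Event` fields, the `WindowedAggregator` type, and the fixed (tumbling) window policy are assumptions chosen for illustration, and it leaves out Tokio, sled, and late-event handling entirely.

```rust
use std::collections::HashMap;

/// Hypothetical event shape; the real schema is not shown in this post.
#[derive(Debug, Clone)]
struct Event {
    player: String,
    points: u64,
    timestamp_ms: u64,
}

/// Sums points per player inside fixed-size (tumbling) windows.
struct WindowedAggregator {
    window_ms: u64,
    // window start (ms) -> player -> points scored in that window
    windows: HashMap<u64, HashMap<String, u64>>,
}

impl WindowedAggregator {
    fn new(window_ms: u64) -> Self {
        assert!(window_ms > 0, "window size must be positive");
        Self { window_ms, windows: HashMap::new() }
    }

    /// Routes an event to the window that contains its timestamp.
    fn ingest(&mut self, event: &Event) {
        let start = event.timestamp_ms - event.timestamp_ms % self.window_ms;
        let window = self.windows.entry(start).or_default();
        *window.entry(event.player.clone()).or_insert(0) += event.points;
    }

    /// Returns the top `n` players of one window, highest score first.
    /// Ties are broken by player name so the order is deterministic.
    fn leaderboard(&self, window_start: u64, n: usize) -> Vec<(String, u64)> {
        let mut rows: Vec<(String, u64)> = self
            .windows
            .get(&window_start)
            .map(|w| w.iter().map(|(p, s)| (p.clone(), *s)).collect())
            .unwrap_or_default();
        rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        rows.truncate(n);
        rows
    }
}

fn main() {
    let mut agg = WindowedAggregator::new(1_000);
    let events = [("ana", 5, 100), ("bo", 7, 250), ("ana", 4, 900), ("bo", 1, 1_200)];
    for (player, points, timestamp_ms) in events {
        agg.ingest(&Event { player: player.to_string(), points, timestamp_ms });
    }
    // Leaderboard for the window that starts at t = 0 ms.
    println!("{:?}", agg.leaderboard(0, 2));
}
```

Every value here is owned and dropped deterministically, so there is no collector to pause the hot path, which is the property the rewrite was after.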

What The Numbers Said After

Running the same 500 k events/sec load test after migration:

Metric	Before	After
p99 latency	1.2 s	38 ms
p99.9 latency	–	72 ms
Memory (sled peak)	–	2.1 GiB
JIT + GC share of profiled time	42 % (JIT) + 28 % (GC)	0.3 % (the Rust hot path itself has no GC)
CPU usage (Black Friday)	–	65 % per pod, zero OOMs, zero restarts

The join code running over the sled store, which LLVM compiled to SIMD instructions, halved CPU time. Flamegraph showed the remaining time was spent on network I/O and sled compaction. The UI stayed live throughout Black Friday, and marketing stopped messaging ops directly.
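For readers who want to reproduce tail-latency figures like p99 and p99.9 from raw samples, the sketch below uses the nearest-rank method. The `percentile` function and its per-mille argument are illustrative assumptions, not the tooling behind the numbers above; integer arithmetic is used so the rank never depends on floating-point rounding.

```rust
/// Nearest-rank percentile. `per_mille` is the percentile in tenths of a
/// percent, so 990 means p99 and 999 means p99.9.
/// Returns None for an empty sample set or a per_mille value above 1000.
fn percentile(samples: &[u64], per_mille: u32) -> Option<u64> {
    if samples.is_empty() || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // rank = ceil(n * per_mille / 1000), computed without floats.
    let rank = (n * per_mille as u64 + 999) / 1000;
    // Ranks are 1-based; clamp so per_mille = 0 maps to the smallest sample.
    let index = rank.max(1) as usize - 1;
    Some(sorted[index])
}

fn main() {
    // Synthetic latencies in microseconds: 1, 2, ..., 1000.
    let latencies_us: Vec<u64> = (1..=1000).collect();
    println!("p99   = {:?} us", percentile(&latencies_us, 990));
    println!("p99.9 = {:?} us", percentile(&latencies_us, 999));
}
```

Sorting a copy on every call is fine for a post-run report; a live system would more likely keep a histogram.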

What I Would Do Differently

- Replace sled with a custom sharded in-memory hash table using jemalloc to avoid occasional compaction-induced latency spikes.
- Compile with -C target-cpu=native and profile with perf on bare metal instead of Kubernetes, eliminating the 3-5 ms scheduling jitter introduced by cgroups.
- Select the global allocator through the stable #[global_allocator] attribute behind a Cargo feature, so that trying mimalloc instead of jemalloc is a build-flag change rather than a code change. It still requires a rebuild.
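The sharded hash table idea from the first item can be sketched as follows. This is an assumption-laden illustration, not something we built: `ShardedMap`, its method names, and the hash-modulo shard selection are hypothetical, and it uses the default allocator and `std::sync::Mutex` rather than jemalloc or a lock-free design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// A map split into independently locked shards, so writers touching
/// different keys rarely contend on the same lock.
struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect();
        Self { shards }
    }

    /// Picks the shard by hashing the key modulo the shard count.
    fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let index = (hasher.finish() % self.shards.len() as u64) as usize;
        &self.shards[index]
    }

    /// Inserts a value and returns the previous one, if any.
    fn insert(&self, key: K, value: V) -> Option<V> {
        let mut guard = self.shard_for(&key).lock().unwrap();
        guard.insert(key, value)
    }

    /// Returns a clone of the stored value, if present.
    fn get(&self, key: &K) -> Option<V> {
        let guard = self.shard_for(key).lock().unwrap();
        guard.get(key).cloned()
    }

    /// Inserts `initial` if the key is missing, then applies `apply`
    /// to the stored value while the shard lock is held.
    fn upsert(&self, key: K, initial: V, apply: impl FnOnce(&mut V)) {
        let mut guard = self.shard_for(&key).lock().unwrap();
        apply(guard.entry(key).or_insert(initial));
    }
}

fn main() {
    let scores = Arc::new(ShardedMap::<String, u64>::new(8));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let scores = Arc::clone(&scores);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    scores.upsert("ana".to_string(), 0, |s| *s += 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("{:?}", scores.get(&"ana".to_string()));
}
```

Unlike sled, nothing here is persisted or compacted, so durability would have to come from somewhere else, for example replaying the Kafka topic on restart.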

The learning curve was steep (we spent two weeks untangling lifetimes in the windowed aggregator), but the stability was worth every compile error.

