The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

Published: May 26, 2026 at 11:36 PM EDT
3 min read
Source: Dev.to

The Problem We Were Actually Solving

We ran Veltrix, a distributed event‑processing engine that powered real‑time treasure hunts across retail stores. The business needed sub‑50 ms latency for event ingestion and 99.99 % uptime during Black Friday sales.

Our first system was a Kafka Streams topology in Scala, carefully tuned with RocksDB state stores. The JVM heap was 16 GiB, G1GC was configured with -XX:MaxGCPauseMillis=50, and we had 32 vCPUs per pod. Yet, during a load test with 500 k events per second, the p99 latency spiked to 1.2 s and the JVM OOM‑d twice.

What We Tried First (And Why It Failed)

  • Scaling out the Kafka Streams app to six pods introduced a 300 ms tail due to the shuffle phase in the repartition topic.
  • Switching to exactly‑once semantics and bumping the RocksDB cache to 4 GiB caused blocking fsync on every commit, pegging the disks at 100 % iowait.
  • Profiling with async‑profiler showed:
    • 42 % of time spent in JIT compilation stalls
    • 28 % in GC pauses
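  • GC logs, read alongside the profiler output, printed lines like “Promoted 12 GB in 2.1 s”: objects were being promoted to the old generation far faster than it could be collected, a clear warning that another OOM was close.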

We then rewrote the heavy join in C++ using RocksDB’s JNI bindings. The median latency dropped to 28 ms, but any crash in the native code took the whole JVM process down with exit code 139 (a segmentation fault, SIGSEGV). The ops team added a liveness probe that restarted the pod, but each restart left the treasure‑hunt UI showing stale leaderboards for 8–12 seconds. Marketing sent Slack messages reading “This is unacceptable.”
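For reference, exit code 139 follows the shell convention of reporting a process killed by signal N as 128 + N, so 139 decodes to signal 11, which is SIGSEGV on Linux. A minimal sketch of that arithmetic; the helper names are made up for illustration, and the signal table is deliberately partial:

```rust
/// Decode a shell-style exit status into a signal number, if it encodes one.
/// Shells report "killed by signal N" as 128 + N; Linux signals run 1..=64.
fn signal_from_exit_code(code: i32) -> Option<i32> {
    if (129..=192).contains(&code) {
        Some(code - 128)
    } else {
        None
    }
}

/// Names for a few common Linux signal numbers (not exhaustive).
fn signal_name(signal: i32) -> &'static str {
    match signal {
        6 => "SIGABRT",
        9 => "SIGKILL",
        11 => "SIGSEGV",
        15 => "SIGTERM",
        _ => "unknown",
    }
}
```

Decoding 139 this way gives signal 11, while an OOM kill would typically show up as 137 (signal 9), which is a quick way to tell the two failure modes apart in pod status output.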

The Architecture Decision

I decided to port the entire hot path to Rust. We chose:

  • Tokio for the async runtime
  • sled for an embedded KV store
  • flamegraph for profiling

The decision wasn’t about raw speed; it was about predictable latency and eliminating hidden GC pauses. We rewrote the event router, windowed aggregator, and leaderboard updater in ~2,800 lines of Rust. The sled store ran in‑memory with a disk flush every 500 ms to avoid the fsync disaster. The Scala layer remained for schema validation and REST endpoints, but the critical path became Rust.
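To make the windowed aggregator and leaderboard updater concrete, here is a minimal sketch using only the standard library. It is not the production code: the `Event` fields, the tumbling (fixed-size, non-overlapping) window, and all names are assumptions made for illustration, and it leaves out Tokio and sled entirely.

```rust
use std::collections::HashMap;

/// Hypothetical event shape; the real schema is not shown in this post.
#[derive(Debug, Clone)]
struct Event {
    player: String,
    points: u64,
    timestamp_ms: u64,
}

/// Tumbling-window aggregator: sums points per player per fixed-size window.
struct WindowedAggregator {
    window_ms: u64,
    /// Window start (ms) -> (player -> accumulated points).
    windows: HashMap<u64, HashMap<String, u64>>,
}

impl WindowedAggregator {
    fn new(window_ms: u64) -> Self {
        assert!(window_ms > 0, "window size must be positive");
        Self { window_ms, windows: HashMap::new() }
    }

    /// Start of the window that contains `timestamp_ms`.
    fn window_start(&self, timestamp_ms: u64) -> u64 {
        timestamp_ms - timestamp_ms % self.window_ms
    }

    /// Route an event to its window and add its points to the player's total.
    fn ingest(&mut self, event: &Event) {
        let start = self.window_start(event.timestamp_ms);
        let totals = self.windows.entry(start).or_default();
        *totals.entry(event.player.clone()).or_insert(0) += event.points;
    }

    /// Close a window and return its leaderboard: highest score first,
    /// ties broken by player name. An unknown window yields an empty board.
    fn close_window(&mut self, start: u64) -> Vec<(String, u64)> {
        let mut board: Vec<(String, u64)> = self
            .windows
            .remove(&start)
            .unwrap_or_default()
            .into_iter()
            .collect();
        board.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        board
    }
}
```

Because every value is owned and dropped deterministically when a window closes, memory use tracks the number of open windows rather than the timing of a collector, which is the predictability argument made above.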

What The Numbers Said After

Running the same 500 k events/sec load test after migration:

Metric | Before | After
p99 latency | 1.2 s | 38 ms
p99.9 latency | (not given) | 72 ms
Memory (sled peak) | (not applicable) | 2.1 GiB
GC time | 42 % (JIT) + 28 % (GC) | 0.3 % (Rust has no GC)
CPU usage (Black Friday) | (not given) | 65 % per pod, zero OOMs, zero restarts
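For readers who want to reproduce figures like p99 and p99.9 from raw latency samples, here is a small sketch of the nearest-rank percentile method. This is an illustration, not necessarily how our load-test tooling computed them; the function name and the basis-point parameter (9 900 for p99, 9 990 for p99.9) are assumptions chosen to keep the arithmetic in integers.

```rust
/// Nearest-rank percentile over latency samples (any unit, e.g. microseconds).
/// `basis_points` is the percentile times 100: 9_900 = p99, 9_990 = p99.9.
/// Returns None for an empty slice or a percentile outside (0, 100].
fn percentile(samples: &[u64], basis_points: u32) -> Option<u64> {
    if samples.is_empty() || basis_points == 0 || basis_points > 10_000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(n * p), computed with integers to avoid float rounding.
    let n = sorted.len() as u64;
    let rank = (n * basis_points as u64 + 9_999) / 10_000;
    Some(sorted[(rank - 1) as usize])
}
```

With 1 000 samples, p99 is the 990th smallest value and p99.9 the 999th, so a handful of slow events is enough to move the tail even when the median looks healthy.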

The joins over sled‑backed state, which LLVM compiled to SIMD instructions, halved CPU time. Flamegraph showed the remaining time was spent on network I/O and sled compaction. The UI stayed live throughout Black Friday, and marketing stopped messaging ops directly.

What I Would Do Differently

  • Replace sled with a custom sharded in‑memory hash table using jemalloc to avoid occasional compaction‑induced latency spikes.
  • Compile with -C target-cpu=native and profile with perf on bare metal instead of Kubernetes, eliminating the 3–5 ms scheduling jitter introduced by cgroups.
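  • Select the global allocator with #[global_allocator] behind a Cargo feature, so that comparing jemalloc and mimalloc is a one‑flag rebuild. Swapping allocators still means recompiling, and the per‑collection allocator API was still unstable as of Rust 1.75.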
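The sharded in‑memory hash table from the first bullet could look roughly like the sketch below. It is a starting point under stated assumptions, not something we have run in production: the type and method names are invented, it uses the standard library's default hasher and allocator rather than jemalloc, and each shard is a plain `Mutex<HashMap>`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hash map split into independently locked shards so that writers touching
/// different keys rarely contend on the same lock.
struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Pick the shard for a key by hashing it and taking the remainder.
    fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.shards.len() as u64) as usize
    }

    /// Apply `update` to the value for `key`, inserting `default` first if absent.
    /// Only the owning shard is locked for the duration of the update.
    fn upsert(&self, key: K, default: V, update: impl FnOnce(&mut V)) {
        let idx = self.shard_index(&key);
        let mut shard = self.shards[idx].lock().unwrap();
        update(shard.entry(key).or_insert(default));
    }

    /// Return a copy of the value for `key`, if present.
    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let idx = self.shard_index(key);
        let shard = self.shards[idx].lock().unwrap();
        shard.get(key).cloned()
    }
}
```

There is no background compaction here, which is the point of the idea, but also no persistence: durability would have to come from somewhere else, such as replaying the event log after a restart.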

The learning curve was steep (we spent two weeks untangling lifetimes in the windowed aggregator), but the stability was worth every compile error.
