I Spent 3 Hours Watching My Benchmark Hang, Then 6 Seconds to Fix It

Published: 3 weeks ago (May 14, 2026 at 09:30 AM EDT)

Source: Dev.to

Three Hours. That’s How Long `bench_column_index` Ran Before I Realized It Wasn’t Going Anywhere.

I was preparing for moteDB v0.2.0 and running the usual performance suite: twelve DB instances in parallel, each doing SELECT WHERE col = ? queries while a background thread built indexes. Queries that should take milliseconds started taking minutes, then hours, then nothing.

The culprit was a single RwLock protecting every read and write to the column index. When the background thread grabbed the write lock to bulk‑insert, every query blocked. Simple as that—twelve threads fighting over one lock.

Below is what I did about it — and how I got that 3‑hour hang down to 6.6 seconds.

The Architecture That Was Killing Us

v0.1.7 had a straightforward design: one B‑Tree, one RwLock. Clean. Wrong.

SELECT WHERE col = ?      → acquire read lock  → traverse B‑Tree → return
Background index build    → acquire write lock → bulk insert    → release

When those two paths hit the same lock simultaneously, queries queued behind the writer. With twelve instances, the queue grew faster than it drained. The system looked alive — threads were running, memory was allocated — but nothing was making progress.

I needed a different model. Here’s what I landed on:

Two‑layer Architecture (RocksDB‑style)

Component	Purpose
`IndexMemBuffer`	In‑memory `BTreeMap` protected by `parking_lot::RwLock` (nanosecond‑level contention). Writes go here first.
`GenericBTree`	On‑disk B‑Tree. Reads go through here. Writes only happen during background drain.
`drain_lock`	`Mutex` that serializes the buffer‑to‑B‑Tree migration using `try_lock` so writers never block.
`tombstones`	`HashSet` tracking deleted keys so drained buffers don’t resurrect data.

When the memory buffer exceeds a threshold, it atomically flips to an immutable snapshot. The drain thread picks it up and builds the B‑Tree without blocking readers. New writes hit the new active buffer.
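A minimal sketch of the flip-and-drain flow described above, under several assumptions: `TwoLayerIndex`, its fields, and `try_drain` are hypothetical names, `std::sync::RwLock` stands in for `parking_lot::RwLock` so the example needs no external crate, and an in-memory `BTreeMap` stands in for the on-disk `GenericBTree`. It is not moteDB's actual code, and the lookup order in `get` is one plausible choice rather than something the post specifies.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical two-layer index: a small write buffer in front of a larger tree.
pub struct TwoLayerIndex {
    /// Active buffer: every write lands here first.
    active: RwLock<BTreeMap<u64, u64>>,
    /// Immutable snapshot waiting to be drained (None when nothing is pending).
    frozen: RwLock<Option<Arc<BTreeMap<u64, u64>>>>,
    /// Stand-in for the on-disk B-Tree.
    tree: RwLock<BTreeMap<u64, u64>>,
    /// Ensures at most one drain runs at a time.
    drain_lock: Mutex<()>,
    /// Buffer size that triggers the flip.
    threshold: usize,
}

impl TwoLayerIndex {
    pub fn new(threshold: usize) -> Self {
        TwoLayerIndex {
            active: RwLock::new(BTreeMap::new()),
            frozen: RwLock::new(None),
            tree: RwLock::new(BTreeMap::new()),
            drain_lock: Mutex::new(()),
            threshold,
        }
    }

    /// Writers only touch the active buffer. Once it reaches the threshold and
    /// no snapshot is pending, it is swapped for an empty map.
    pub fn insert(&self, key: u64, row_id: u64) {
        let mut active = self.active.write().unwrap();
        active.insert(key, row_id);
        if active.len() >= self.threshold {
            let mut frozen = self.frozen.write().unwrap();
            if frozen.is_none() {
                *frozen = Some(Arc::new(std::mem::take(&mut *active)));
            }
        }
    }

    /// Newest data wins: active buffer, then frozen snapshot, then the tree.
    pub fn get(&self, key: u64) -> Option<u64> {
        if let Some(v) = self.active.read().unwrap().get(&key) {
            return Some(*v);
        }
        if let Some(snapshot) = self.frozen.read().unwrap().as_ref() {
            if let Some(v) = snapshot.get(&key) {
                return Some(*v);
            }
        }
        self.tree.read().unwrap().get(&key).copied()
    }

    /// Meant for a background thread. Returns false when another drain is
    /// already running or there is no snapshot to drain.
    pub fn try_drain(&self) -> bool {
        let Ok(_guard) = self.drain_lock.try_lock() else {
            return false;
        };
        let Some(snapshot) = self.frozen.read().unwrap().clone() else {
            return false;
        };
        {
            let mut tree = self.tree.write().unwrap();
            for (k, v) in snapshot.iter() {
                tree.insert(*k, *v);
            }
        }
        // Clear the snapshot only after the tree contains its entries.
        *self.frozen.write().unwrap() = None;
        true
    }
}
```

In this sketch the tree's write lock is held only while the snapshot is copied in, and a writer never waits on `drain_lock`. A real on-disk drain would presumably build pages outside any lock that readers need.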

TOCTOU fix: get() now holds the same lock through both the tombstone filter and the LRU‑cache write, eliminating the race window where a key could be deleted between the two operations.
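To illustrate the idea behind that fix, here is a small sketch in which the tombstone check and the cache write happen under one lock acquisition. Everything in it is assumed for illustration: `CachedIndex`, `Inner`, and `is_cached` are hypothetical names, a plain `HashMap` stands in for the LRU cache, and a read-only `HashMap` stands in for the index itself.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// State that must change together, so it lives behind one lock.
struct Inner {
    tombstones: HashSet<u64>,
    cache: HashMap<u64, String>, // stand-in for the LRU cache
}

pub struct CachedIndex {
    inner: Mutex<Inner>,
    store: HashMap<u64, String>, // stand-in for the backing index
}

impl CachedIndex {
    pub fn new(store: HashMap<u64, String>) -> Self {
        CachedIndex {
            inner: Mutex::new(Inner {
                tombstones: HashSet::new(),
                cache: HashMap::new(),
            }),
            store,
        }
    }

    /// The guard is held from the tombstone check through the cache write,
    /// so a delete cannot slip in between the two steps.
    pub fn get(&self, key: u64) -> Option<String> {
        let mut inner = self.inner.lock().unwrap();
        if inner.tombstones.contains(&key) {
            return None;
        }
        if let Some(hit) = inner.cache.get(&key) {
            return Some(hit.clone());
        }
        let value = self.store.get(&key)?.clone();
        inner.cache.insert(key, value.clone());
        Some(value)
    }

    /// Takes the same lock, so it runs entirely before or entirely after a get().
    pub fn delete(&self, key: u64) {
        let mut inner = self.inner.lock().unwrap();
        inner.tombstones.insert(key);
        inner.cache.remove(&key);
    }

    pub fn is_cached(&self, key: u64) -> bool {
        self.inner.lock().unwrap().cache.contains_key(&key)
    }
}
```

If `get` released the lock after the tombstone check and re-acquired it for the cache write, a `delete` landing in that gap could leave a deleted key sitting in the cache.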

Result:

bench_column_index runtime: 3+ hours → 6.6 seconds

Three Phases of Performance Work

Beyond the core lock contention, I spent the release cycle on three performance phases.

Phase 1 – Memory Layout

Storing rows behind an `Arc` eliminated the full‑row memcpy on every get(): a read now clones a pointer instead of the row.
Non‑vector tables got their own BTreeMap instead of the generic wrapper, saving 24 bytes per row.
At 100 K rows, that’s roughly 10 MB of memory saved in total without touching any query logic; the 24 bytes per row alone come to about 2.4 MB.
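As a rough illustration of why `Arc` removes the copy, the sketch below stores each row as an `Arc<[u8]>`, so a read only bumps a reference count. `RowStore` and its methods are hypothetical and say nothing about moteDB's real row type.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical row store: each row is a shared, immutable byte slice.
pub struct RowStore {
    rows: BTreeMap<u64, Arc<[u8]>>,
}

impl RowStore {
    pub fn new() -> Self {
        RowStore { rows: BTreeMap::new() }
    }

    /// The bytes are copied once, when the row is stored.
    pub fn put(&mut self, id: u64, bytes: &[u8]) {
        self.rows.insert(id, Arc::from(bytes));
    }

    /// Cloning the Arc increments a counter; the row bytes are not copied.
    pub fn get(&self, id: u64) -> Option<Arc<[u8]>> {
        self.rows.get(&id).cloned()
    }
}
```

Two calls to `get` for the same id return handles to the same allocation, which `Arc::ptr_eq` can confirm.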

Phase 2 – Syscall Reduction

Optimized DiskANN insertion path.
Reused persistent file handles for SQ8Vectors.
Every cache miss previously triggered 2 syscalls; this phase eliminated that overhead at the I/O layer.

Phase 3 – Spatial Index & FTS

i‑Octree now uses Morton codes for batch loading; leaf nodes are filled in order to minimise tree splitting.
LSM scan_range() switched to streaming scan instead of materialising everything.
FTS switched to append‑only sharded writes, with a deferred merge that triggers once a shard reaches 5 segments.
Columnar predicate push‑down: decode the timestamp column first to locate rows, then decode target columns on demand — avoids decoding columns that were already filtered out.
Spatial query row cache + removal of per‑row HashMap allocation → 8 000× speedup on spatial range queries.
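For readers unfamiliar with Morton codes, the sketch below shows the general technique: interleave the bits of the x, y, and z coordinates into a single integer, then sort by that integer so that points close in space tend to end up close in load order. The function names and the 21-bit-per-axis layout are assumptions for illustration, not details of moteDB's i‑Octree.

```rust
/// Interleave the low 21 bits of x, y, and z into one 63-bit Morton (Z-order) code.
/// Bit i of x goes to position 3*i, bit i of y to 3*i + 1, bit i of z to 3*i + 2.
pub fn morton3(x: u32, y: u32, z: u32) -> u64 {
    let mut code = 0u64;
    for bit in 0..21u32 {
        code |= (((x >> bit) & 1) as u64) << (3 * bit);
        code |= (((y >> bit) & 1) as u64) << (3 * bit + 1);
        code |= (((z >> bit) & 1) as u64) << (3 * bit + 2);
    }
    code
}

/// Sort points by Morton code so a batch loader can fill leaves in order.
pub fn sort_for_batch_load(points: &mut [(u32, u32, u32)]) {
    points.sort_by_key(|&(x, y, z)| morton3(x, y, z));
}
```

The bit loop favours readability; a production encoder would more likely use shift-and-mask constants or lookup tables.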

The Audit That Found 28 Problems

I ran three rounds of adversarial auditing before this release. The findings were extensive:

Area	Issue
B‑Tree	Split leaf index out‑of‑bounds panic
Async index pipeline	Double‑insert causing text index panic
WAL compression + DiskANN	Three separate deadlocks
`close()`	Not notifying background threads before checkpoint
Column/text index	Querying before async pipeline finished building — no fallback
SUM precision loss	Switched from floating‑point accumulation to a two‑pass compensation algorithm
BTreeMap scan	Materialising all results at once causing memory spikes
Primary‑key query	Index missing after restart, no fallback to scan
glibc arena	Concurrent crash on explicit `db.close()` (malloc was not thread‑safe in the way the code used it). Fixed by serialising `close()` calls.

The glibc arena bug was particularly fun: under heavy concurrent close() calls the arena allocator would crash because malloc wasn’t thread‑safe in the way the code used it. The fix was simply to avoid calling close() from multiple threads simultaneously—obvious in hindsight.
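On the SUM precision row in the table above: the post does not spell out its two‑pass compensation algorithm, so the sketch below substitutes a different, well-known technique, Neumaier's single-pass compensated summation, purely to show how a compensation term recovers digits that plain floating‑point accumulation drops. It should not be read as moteDB's implementation.

```rust
/// Plain left-to-right accumulation, shown for comparison.
pub fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

/// Neumaier's compensated summation: `comp` collects the low-order bits that
/// each addition would otherwise discard.
pub fn neumaier_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64;
    for &v in values {
        let t = sum + v;
        if sum.abs() >= v.abs() {
            // Low-order bits of v were lost in the addition.
            comp += (sum - t) + v;
        } else {
            // Low-order bits of sum were lost in the addition.
            comp += (v - t) + sum;
        }
        sum = t;
    }
    sum + comp
}
```

For the input `[1.0, 1e100, 1.0, -1e100]` the naive sum returns 0.0, while the compensated version returns 2.0.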

Edge Devices Finally Get Love

moteDB targets embedded and edge hardware. v0.2.0 adds dedicated optimisations:

EdgeIndexConfig – DiskANN now has a bounded‑memory index configuration, limiting graph memory footprint.
FTS bounded shard counter + VersionStore eviction – Prevents memory growth during long‑running operations.
Dead‑code cleanup – Removed ~2 200 lines, reducing binary size.
Zero clippy warnings – Everything compiles clean.

Testing at Scale

The new test infrastructure handles the concurrency edge cases:

wait_for_indexes_ready() – polls pending_index_batches atomic counter for deterministic index readiness.
CI adaptive data scaling – detects CI environment and automatically reduces test data volume.
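A helper along the lines of wait_for_indexes_ready() could look roughly like the sketch below. The signature, the timeout parameter, and the 1 ms poll interval are assumptions; only the function name and the idea of polling a pending-batch atomic counter come from the description above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Poll `pending` until it reaches zero or `timeout` elapses.
/// Returns true if the counter hit zero, false on timeout.
pub fn wait_for_indexes_ready(pending: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while pending.load(Ordering::Acquire) != 0 {
        if Instant::now() >= deadline {
            return false;
        }
        // Short sleep so the test thread does not spin a core while waiting.
        thread::sleep(Duration::from_millis(1));
    }
    true
}
```

A test would call this after issuing writes and before asserting on query results. The timeout turns a stuck pipeline into a failed assertion instead of a hang.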

749 new test cases, running under 4‑thread concurrency, completing in ~3 minutes with zero hangs.

The Numbers

Metric	Value
Commits	35
Source files changed	89
Lines added	28 118
Lines deleted	14 815
Performance optimisations	11
Bug fixes	21
New features	3

The release is on crates.io. Add it to Cargo.toml with:

[dependencies]
motedb = "0.2.0"

or run `cargo add motedb`.

If you’re running moteDB on edge hardware or need a database that won’t stall your queries while a background index builds, give v0.2.0 a spin!

If you build indexes in the background, this one is worth upgrading to.

The benchmark suite no longer hangs.
