I Spent 3 Hours Watching My Benchmark Hang, Then 6 Seconds to Fix It

Published: May 14, 2026 at 09:30 AM EDT
5 min read
Source: Dev.to

Three Hours. That’s How Long bench_column_index Ran Before I Realized It Wasn’t Going Anywhere.

I was preparing for moteDB v0.2.0 and running the usual performance suite: twelve DB instances in parallel, each doing SELECT WHERE col = ? queries while a background thread built indexes. Queries that should take milliseconds started taking minutes, then hours, then nothing.

The culprit was a single RwLock protecting every read and write to the column index. When the background thread grabbed the write lock to bulk‑insert, every query blocked. Simple as that—twelve threads fighting over one lock.

Below is what I did about it — and how I got that 3‑hour hang down to 6.6 seconds.


The Architecture That Was Killing Us

v0.1.7 had a straightforward design: one B‑Tree, one RwLock. Clean. Wrong.

SELECT WHERE col = ?      → acquire read lock  → traverse B‑Tree → return
Background index build    → acquire write lock → bulk insert    → release

When those two paths hit the same lock simultaneously, queries queued behind the writer. With twelve instances, the queue grew faster than it drained. The system looked alive — threads were running, memory was allocated — but nothing was making progress.
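The stall is easy to reproduce in miniature. The sketch below is an illustration, not moteDB code: `ColumnIndex` and the function name are hypothetical, and `std::sync::RwLock` stands in for whatever lock v0.1.7 used. It shows that a query cannot take the read lock for as long as a bulk insert holds the write lock.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the v0.1.7 column index: one map, one lock.
type ColumnIndex = Arc<RwLock<BTreeMap<u64, u64>>>;

/// Returns true if a reader could get in while a bulk insert holds the write lock.
fn reader_gets_in_during_bulk_insert(index: &ColumnIndex) -> bool {
    // The background build takes the write lock for the whole bulk insert.
    let mut guard = index.write().unwrap();
    for k in 0..1_000 {
        guard.insert(k, k * 2);
    }
    // A query arriving now cannot take the read lock. `try_read` reports the
    // failure immediately, where a plain `read()` would block until release.
    let reader = Arc::clone(index);
    let got_in = std::thread::spawn(move || {
        let ok = reader.try_read().is_ok();
        ok
    })
    .join()
    .unwrap();
    drop(guard); // only now can readers proceed
    got_in
}
```

With a long enough bulk insert and enough queries arriving, every reader spends its time waiting at that `read()` call, which matches the "alive but not progressing" symptom.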

I needed a different model. Here’s what I landed on:

Two‑layer Architecture (RocksDB‑style)

| Component | Purpose |
| --- | --- |
| IndexMemBuffer | In‑memory BTreeMap protected by parking_lot::RwLock (nanosecond‑level contention). Writes go here first. |
| GenericBTree | On‑disk B‑Tree. Reads go through here. Writes only happen during background drain. |
| drain_lock | Mutex that serializes the buffer‑to‑B‑Tree migration using try_lock so writers never block. |
| tombstones | HashSet tracking deleted keys so drained buffers don’t resurrect data. |

When the memory buffer exceeds a threshold, it atomically flips to an immutable snapshot. The drain thread picks it up and builds the B‑Tree without blocking readers. New writes hit the new active buffer.
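A minimal sketch of that flip-and-drain flow, under stated assumptions: `TwoLayerIndex` and all of its fields are hypothetical names, `std::sync` locks replace `parking_lot`, and a plain `BTreeMap` stands in for the on-disk B‑Tree. It is meant to show the shape of the design, not moteDB's implementation.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};
use std::sync::{Arc, Mutex, RwLock};

type Buffer = BTreeMap<u64, u64>;

struct TwoLayerIndex {
    active: RwLock<Buffer>,               // writes land here first
    frozen: Mutex<VecDeque<Arc<Buffer>>>, // immutable snapshots awaiting drain
    btree: RwLock<Buffer>,                // stand-in for the on-disk B-Tree
    tombstones: RwLock<HashSet<u64>>,
    drain_lock: Mutex<()>,
    threshold: usize,
}

impl TwoLayerIndex {
    fn new(threshold: usize) -> Self {
        Self {
            active: RwLock::new(Buffer::new()),
            frozen: Mutex::new(VecDeque::new()),
            btree: RwLock::new(Buffer::new()),
            tombstones: RwLock::new(HashSet::new()),
            drain_lock: Mutex::new(()),
            threshold,
        }
    }

    fn insert(&self, key: u64, value: u64) {
        self.tombstones.write().unwrap().remove(&key);
        let mut active = self.active.write().unwrap();
        active.insert(key, value);
        if active.len() >= self.threshold {
            // Flip: the full buffer becomes an immutable snapshot and an
            // empty buffer takes over for new writes.
            let full = std::mem::take(&mut *active);
            self.frozen.lock().unwrap().push_back(Arc::new(full));
        }
    }

    fn delete(&self, key: u64) {
        self.tombstones.write().unwrap().insert(key);
        self.active.write().unwrap().remove(&key);
    }

    fn get(&self, key: u64) -> Option<u64> {
        if self.tombstones.read().unwrap().contains(&key) {
            return None;
        }
        if let Some(v) = self.active.read().unwrap().get(&key).copied() {
            return Some(v);
        }
        // Newest snapshot first, so later writes shadow earlier ones.
        for snap in self.frozen.lock().unwrap().iter().rev() {
            if let Some(v) = snap.get(&key).copied() {
                return Some(v);
            }
        }
        let btree = self.btree.read().unwrap();
        btree.get(&key).copied()
    }

    /// Returns false without waiting if another drain is already running.
    fn drain(&self) -> bool {
        let Ok(_guard) = self.drain_lock.try_lock() else {
            return false;
        };
        loop {
            // Peek instead of pop: readers keep seeing the snapshot until
            // its contents are visible in the B-Tree.
            let Some(snap) = self.frozen.lock().unwrap().front().cloned() else {
                return true;
            };
            {
                let tombs = self.tombstones.read().unwrap();
                let mut btree = self.btree.write().unwrap();
                for (k, v) in snap.iter() {
                    if !tombs.contains(k) {
                        btree.insert(*k, *v); // skip deleted keys
                    }
                }
            }
            self.frozen.lock().unwrap().pop_front();
        }
    }
}
```

In this sketch the write path only ever holds the short `active` lock, and a second caller of `drain()` returns immediately instead of queueing. How a real on-disk tree stays readable during the merge is outside what the sketch models.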

TOCTOU fix: get() now holds the same lock through both the tombstone filter and the LRU‑cache write, eliminating the race window where a key could be deleted between the two operations.
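To make the race window concrete, here is a hedged sketch of the "one lock across check and cache write" pattern. The types are hypothetical, a plain `HashMap` stands in for the LRU cache, and the `load` closure stands in for the real index lookup.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// Tombstones and cache live behind the same lock, so they change together.
struct ReadState {
    tombstones: HashSet<u64>,
    cache: HashMap<u64, u64>, // plain map standing in for the LRU cache
}

struct CachedIndex {
    state: Mutex<ReadState>,
}

impl CachedIndex {
    fn new() -> Self {
        Self {
            state: Mutex::new(ReadState {
                tombstones: HashSet::new(),
                cache: HashMap::new(),
            }),
        }
    }

    /// `load` is a hypothetical stand-in for the underlying index lookup.
    fn get(&self, key: u64, load: impl FnOnce(u64) -> Option<u64>) -> Option<u64> {
        // One guard covers the tombstone check AND the cache write, so a
        // delete cannot slip in between the two steps.
        let mut state = self.state.lock().unwrap();
        if state.tombstones.contains(&key) {
            return None;
        }
        if let Some(v) = state.cache.get(&key).copied() {
            return Some(v);
        }
        let value = load(key)?;
        state.cache.insert(key, value);
        Some(value)
    }

    fn delete(&self, key: u64) {
        let mut state = self.state.lock().unwrap();
        state.tombstones.insert(key);
        state.cache.remove(&key);
    }
}
```

The trade-off in this sketch is that the lock is held while `load` runs. Without it, a delete could land after the tombstone check and the stale value would still be written into the cache.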

Result:

bench_column_index runtime: 3+ hours → 6.6 seconds

Three Phases of Performance Work

Beyond the core lock contention, I spent the release cycle on three performance phases.

Phase 1 – Memory Layout

  • Returning rows behind an Arc eliminated the full‑row memcpy on every get().
  • Non‑vector tables got their own BTreeMap instead of the generic wrapper, saving 24 bytes per row.
  • At 100 K rows, that’s roughly 10 MB of memory saved without touching any query logic.
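The Arc point can be illustrated with a small sketch. It assumes rows are stored as `Arc<Row>`; `Row` and `RowStore` are hypothetical names, not moteDB types.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical row type; the payload is what a by-value get() would copy.
struct Row {
    payload: Vec<u8>,
}

struct RowStore {
    rows: BTreeMap<u64, Arc<Row>>,
}

impl RowStore {
    /// Cloning the Arc bumps a reference count; no row bytes are copied.
    fn get(&self, id: u64) -> Option<Arc<Row>> {
        self.rows.get(&id).cloned()
    }
}
```

Two calls to `get()` for the same id hand back pointers to the same allocation, which `Arc::ptr_eq` can confirm.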

Phase 2 – Syscall Reduction

  • Optimized DiskANN insertion path.
  • Reused persistent file handles for SQ8Vectors.
  • Every cache miss previously triggered 2 syscalls; this phase eliminated that overhead at the I/O layer.
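As an illustration of handle reuse, the sketch below keeps one `File` open and serves each miss with a positional read. Everything in it is an assumption: `VectorFile` is a hypothetical type, the layout is taken to be fixed-size records of `dim` bytes, and `read_exact_at` comes from the Unix-only `FileExt` trait.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // positional reads; Unix-only
use std::path::Path;

/// Hypothetical reader for fixed-size quantized vectors (`dim` bytes each).
struct VectorFile {
    file: File, // opened once, reused for every cache miss
    dim: usize,
}

impl VectorFile {
    fn open(path: &Path, dim: usize) -> io::Result<Self> {
        Ok(Self { file: File::open(path)?, dim })
    }

    /// Reads vector `idx` with a positional read on the shared handle:
    /// no open/close per miss and no separate seek call.
    fn read_vector(&self, idx: u64) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; self.dim];
        self.file.read_exact_at(&mut buf, idx * self.dim as u64)?;
        Ok(buf)
    }
}
```

Because the read takes `&self` and an explicit offset, several threads can share one `VectorFile` without coordinating a file cursor.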

Phase 3 – Space Index & FTS

  • i‑Octree now uses Morton codes for batch loading; leaf nodes are filled in order to minimise tree splitting.
  • LSM scan_range() switched to streaming scan instead of materialising everything.
  • FTS switched to append‑only sharded writes, with delayed merge triggering when a shard hits 5 segments.
  • Columnar predicate push‑down: decode the timestamp column first to locate rows, then decode target columns on demand — avoids decoding columns that were already filtered out.
  • Spatial query row cache + removal of per‑row HashMap allocation → 8 000× speedup on spatial range queries.
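For readers unfamiliar with Morton codes, here is a minimal sketch of the ordering idea behind the batch load. It assumes coordinates have already been quantised to unsigned integers of at most 21 bits per axis; the function names are hypothetical and the real i‑Octree loader may differ.

```rust
/// Interleaves the low 21 bits of x, y, z into one 63-bit Morton code:
/// bit i of x lands at position 3i, y at 3i + 1, z at 3i + 2.
fn morton3(x: u32, y: u32, z: u32) -> u64 {
    let mut code = 0u64;
    for bit in 0..21 {
        code |= (((x as u64) >> bit) & 1) << (3 * bit);
        code |= (((y as u64) >> bit) & 1) << (3 * bit + 1);
        code |= (((z as u64) >> bit) & 1) << (3 * bit + 2);
    }
    code
}

/// Sorts points along the Z-order curve so that spatial neighbours tend to
/// be inserted together and leaves fill in order.
fn sort_for_batch_load(points: &mut [(u32, u32, u32)]) {
    points.sort_by_key(|&(x, y, z)| morton3(x, y, z));
}
```

For example, `morton3(1, 0, 0)` is 1, `morton3(0, 1, 0)` is 2, and `morton3(0, 0, 1)` is 4, so points that share high-order coordinate bits end up adjacent after sorting.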

The Audit That Found 28 Problems

I ran three rounds of adversarial auditing before this release. The findings were extensive:

| Area | Issue |
| --- | --- |
| B‑Tree | Split leaf index out‑of‑bounds panic |
| Async index pipeline | Double‑insert causing text index panic |
| WAL compression + DiskANN | Three separate deadlocks |
| close() | Not notifying background threads before checkpoint |
| Column/text index | Querying before async pipeline finished building — no fallback |
| SUM precision loss | Switched from floating‑point accumulation to a two‑pass compensation algorithm |
| BTreeMap scan | Materialising all results at once causing memory spikes |
| Primary‑key query | Index missing after restart, no fallback to scan |
| glibc arena | Concurrent crash on explicit db.close() (malloc not thread‑safe). Fixed by serialising close() calls. |
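The SUM row is worth a small illustration. The sketch below uses Neumaier's single-pass compensated summation, a different but related technique that is swapped in here for illustration; the exact two‑pass algorithm used in the release is not specified above. It shows the kind of precision loss that plain accumulation suffers.

```rust
/// Plain left-to-right floating-point accumulation.
fn naive_sum(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

/// Neumaier's compensated summation: tracks the low-order bits that each
/// addition rounds away and adds them back at the end.
fn neumaier_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // running total of lost low-order bits
    for &x in xs {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x; // low bits of x were lost
        } else {
            comp += (x - t) + sum; // low bits of sum were lost
        }
        sum = t;
    }
    sum + comp
}
```

On the input `[1.0, 1e100, 1.0, -1e100]` the naive sum returns 0.0 because both 1.0 terms are absorbed by 1e100, while the compensated version returns 2.0.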

The glibc arena bug was particularly fun: under heavy concurrent close() calls the arena allocator would crash because malloc wasn’t thread‑safe in the way the code used it. The fix was simply to avoid calling close() from multiple threads simultaneously—obvious in hindsight.
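One way to serialise close() calls is a process-wide lock, sketched below. This is an assumption about the shape of the fix, not moteDB's code: `Db`, `CLOSE_LOCK`, and the two counters are hypothetical, and the counters exist only to make the serialisation observable.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;

/// Process-wide lock: at most one close() runs its teardown at a time.
static CLOSE_LOCK: Mutex<()> = Mutex::new(());

// Instrumentation for the sketch only: how many teardowns overlap.
static IN_TEARDOWN: AtomicUsize = AtomicUsize::new(0);
static MAX_IN_TEARDOWN: AtomicUsize = AtomicUsize::new(0);

struct Db {
    closed: AtomicBool,
}

impl Db {
    fn new() -> Self {
        Self { closed: AtomicBool::new(false) }
    }

    /// Returns true if this call performed the teardown.
    fn close(&self) -> bool {
        let _guard = CLOSE_LOCK.lock().unwrap();
        if self.closed.swap(true, Ordering::SeqCst) {
            return false; // already closed; nothing to do
        }
        let now = IN_TEARDOWN.fetch_add(1, Ordering::SeqCst) + 1;
        MAX_IN_TEARDOWN.fetch_max(now, Ordering::SeqCst);
        // ... flush, checkpoint, and free resources here ...
        IN_TEARDOWN.fetch_sub(1, Ordering::SeqCst);
        true
    }
}
```

Closing many instances from many threads at once leaves `MAX_IN_TEARDOWN` at 1, and a second close() on the same instance is a no-op.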


Edge Devices Finally Get Love

moteDB targets embedded and edge hardware. v0.2.0 adds dedicated optimisations:

  • EdgeIndexConfig – DiskANN now has a bounded‑memory index configuration, limiting graph memory footprint.
  • FTS bounded shard counter + VersionStore eviction – Prevents memory growth during long‑running operations.
  • Dead‑code cleanup – Removed ~2 200 lines, reducing binary size.
  • Zero clippy warnings – Everything compiles clean.

Testing at Scale

The new test infrastructure handles the concurrency edge cases:

  • wait_for_indexes_ready() – polls the pending_index_batches atomic counter so tests wait for index readiness deterministically instead of sleeping.
  • CI adaptive data scaling – detects CI environment and automatically reduces test data volume.
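A rough sketch of both helpers follows. The signatures, the timeout parameter, the `CI` environment variable check, and the 10x reduction factor are all assumptions made for illustration; only the name wait_for_indexes_ready and the idea of polling an atomic counter come from the list above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `pending` until it reaches zero or `timeout` elapses.
/// Returns true if the indexes became ready in time.
fn wait_for_indexes_ready(pending: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while pending.load(Ordering::Acquire) != 0 {
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(1));
    }
    true
}

/// Many CI systems set the `CI` environment variable; treating its presence
/// as "running in CI" is an assumption of this sketch.
fn running_in_ci() -> bool {
    std::env::var_os("CI").is_some()
}

/// Hypothetical scaling rule: use a tenth of the rows in CI, at least one.
fn scaled_rows(full: usize, in_ci: bool) -> usize {
    if in_ci { (full / 10).max(1) } else { full }
}
```

A test would call `scaled_rows(100_000, running_in_ci())` to size its dataset and then `wait_for_indexes_ready(&pending, Duration::from_secs(30))` before issuing indexed queries, failing fast on a `false` return instead of hanging.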

749 new test cases, running under 4‑thread concurrency, completing in ~3 minutes with zero hangs.


The Numbers

| Metric | Value |
| --- | --- |
| Commits | 35 |
| Source files changed | 89 |
| Lines added | 28 118 |
| Lines deleted | 14 815 |
| Performance optimisations | 11 |
| Bug fixes | 21 |
| New features | 3 |

The release is on crates.io – add it with:

motedb = "0.2.0"

or run cargo add motedb.

If you’re running moteDB on edge hardware or need a database that won’t stall your queries while a background index builds, give v0.2.0 a spin!

If you build indexes in the background, this release is worth the upgrade.

The benchmark suite no longer hangs.