I Spent 3 Hours Watching My Benchmark Hang, Then 6 Seconds to Fix It
Source: Dev.to
Three Hours. That’s How Long bench_column_index Ran Before I Realized It Wasn’t Going Anywhere.
I was preparing for moteDB v0.2.0 and running the usual performance suite: twelve DB instances in parallel, each doing SELECT WHERE col = ? queries while a background thread built indexes. Queries that should take milliseconds started taking minutes, then hours, then nothing.
The culprit was a single RwLock protecting every read and write to the column index. When the background thread grabbed the write lock to bulk‑insert, every query blocked. Simple as that—twelve threads fighting over one lock.
Below is what I did about it — and how I got that 3‑hour hang down to 6.6 seconds.
The Architecture That Was Killing Us
v0.1.7 had a straightforward design: one B‑Tree, one RwLock. Clean. Wrong.
SELECT WHERE col = ? → acquire read lock → traverse B‑Tree → return
Background index build → acquire write lock → bulk insert → release
When those two paths hit the same lock simultaneously, queries queued behind the writer. With twelve instances, the queue grew faster than it drained. The system looked alive — threads were running, memory was allocated — but nothing was making progress.
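To make the failure mode concrete, here is a minimal sketch of a one-tree, one-lock index. It is not moteDB's code: `SingleLockIndex`, `try_get`, and `bulk_insert` are hypothetical names, and a `std::sync::RwLock` around a `BTreeMap` stands in for the real B-Tree.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Hypothetical stand-in for the v0.1.7 column index: one tree, one lock.
pub struct SingleLockIndex {
    pub tree: RwLock<BTreeMap<u64, u64>>,
}

impl SingleLockIndex {
    pub fn new() -> Self {
        Self { tree: RwLock::new(BTreeMap::new()) }
    }

    /// Query path: needs the read lock for the whole lookup.
    pub fn get(&self, key: u64) -> Option<u64> {
        self.tree.read().unwrap().get(&key).copied()
    }

    /// Non-blocking probe, used only to show when a query would stall.
    /// Returns None if the lock is currently unavailable.
    pub fn try_get(&self, key: u64) -> Option<Option<u64>> {
        match self.tree.try_read() {
            Ok(guard) => Some(guard.get(&key).copied()),
            Err(_) => None,
        }
    }

    /// Index-build path: holds the write lock for the entire bulk insert,
    /// so every reader waits until the last row is in.
    pub fn bulk_insert(&self, rows: impl IntoIterator<Item = (u64, u64)>) {
        let mut tree = self.tree.write().unwrap();
        for (k, v) in rows {
            tree.insert(k, v);
        }
    }
}
```

While a writer holds `tree.write()`, `try_get` returns `None` and a plain `get` would simply block. The longer the bulk insert, the longer every query sits in the queue.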
I needed a different model. Here’s what I landed on:
Two‑layer Architecture (RocksDB‑style)
| Component | Purpose |
|---|---|
| IndexMemBuffer | In‑memory BTreeMap protected by parking_lot::RwLock (nanosecond‑level contention). Writes go here first. |
| GenericBTree | On‑disk B‑Tree. Reads go through here. Writes only happen during background drain. |
| drain_lock | Mutex that serializes the buffer‑to‑B‑Tree migration using try_lock so writers never block. |
| tombstones | HashSet tracking deleted keys so drained buffers don’t resurrect data. |
When the memory buffer exceeds a threshold, it atomically flips to an immutable snapshot. The drain thread picks it up and builds the B‑Tree without blocking readers. New writes hit the new active buffer.
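The flip-and-drain flow can be sketched roughly as follows. This is an illustration under assumptions, not the moteDB implementation: `TwoLayerIndex`, `try_drain`, and `has_frozen_snapshot` are hypothetical names, `std::sync::RwLock` is swapped in for `parking_lot::RwLock` to keep the example dependency-free, an in-memory `BTreeMap` stands in for the on-disk `GenericBTree`, and tombstones are left out.

```rust
use std::collections::BTreeMap;
use std::mem;
use std::sync::{Mutex, RwLock};

/// Hypothetical two-layer index: an active write buffer, an optional frozen
/// snapshot awaiting drain, and a base map standing in for the on-disk tree.
pub struct TwoLayerIndex {
    active: RwLock<BTreeMap<u64, u64>>,
    frozen: RwLock<Option<BTreeMap<u64, u64>>>,
    base: RwLock<BTreeMap<u64, u64>>,
    drain_lock: Mutex<()>,
    threshold: usize,
}

impl TwoLayerIndex {
    pub fn new(threshold: usize) -> Self {
        Self {
            active: RwLock::new(BTreeMap::new()),
            frozen: RwLock::new(None),
            base: RwLock::new(BTreeMap::new()),
            drain_lock: Mutex::new(()),
            threshold,
        }
    }

    /// Writes only ever touch the small in-memory buffer.
    pub fn insert(&self, key: u64, value: u64) {
        let full = {
            let mut active = self.active.write().unwrap();
            active.insert(key, value);
            active.len() >= self.threshold
        };
        if full {
            self.freeze();
        }
    }

    /// Flip the active buffer into the frozen slot if that slot is free.
    fn freeze(&self) {
        let mut frozen = self.frozen.write().unwrap();
        if frozen.is_none() {
            let mut active = self.active.write().unwrap();
            *frozen = Some(mem::take(&mut *active));
        }
    }

    /// Look in the newest layer first so the latest write wins.
    pub fn get(&self, key: u64) -> Option<u64> {
        if let Some(&v) = self.active.read().unwrap().get(&key) {
            return Some(v);
        }
        if let Some(snapshot) = &*self.frozen.read().unwrap() {
            if let Some(&v) = snapshot.get(&key) {
                return Some(v);
            }
        }
        self.base.read().unwrap().get(&key).copied()
    }

    /// Drain the frozen snapshot into the base layer. Uses try_lock so a
    /// caller never waits behind a drain that is already running.
    pub fn try_drain(&self) -> bool {
        let _guard = match self.drain_lock.try_lock() {
            Ok(guard) => guard,
            Err(_) => return false,
        };
        let snapshot = self.frozen.read().unwrap().clone();
        let Some(snapshot) = snapshot else { return false };
        // Publish to the base layer first, then clear the snapshot, so a
        // concurrent reader always finds the key in at least one layer.
        self.base.write().unwrap().extend(snapshot);
        *self.frozen.write().unwrap() = None;
        true
    }

    pub fn has_frozen_snapshot(&self) -> bool {
        self.frozen.read().unwrap().is_some()
    }
}
```

Data only ever moves in one direction (active, then frozen, then base), and `get` checks the layers in that same order, so a key that is mid-migration is still found. In this sketch the drain takes a brief write lock on the base map; a real on-disk tree would need its own strategy to stay reader-friendly.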
TOCTOU fix: get() now holds the same lock through both the tombstone filter and the LRU‑cache write, eliminating the race window where a key could be deleted between the two operations.
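A rough sketch of that idea, with assumed names throughout: `TombstoneAwareIndex`, `ReadState`, and `is_cached` are hypothetical, a plain `HashMap` stands in for the LRU cache, and a `BTreeMap` stands in for the drained B-Tree. The point is only that the tombstone check and the cache write happen under one guard.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};
use std::sync::Mutex;

/// State that must be observed and updated together.
struct ReadState {
    tombstones: HashSet<u64>,
    cache: HashMap<u64, u64>, // stands in for the LRU cache
}

/// Hypothetical index whose get() cannot race with delete().
pub struct TombstoneAwareIndex {
    tree: BTreeMap<u64, u64>, // stands in for the drained B-Tree
    state: Mutex<ReadState>,
}

impl TombstoneAwareIndex {
    pub fn new(tree: BTreeMap<u64, u64>) -> Self {
        Self {
            tree,
            state: Mutex::new(ReadState {
                tombstones: HashSet::new(),
                cache: HashMap::new(),
            }),
        }
    }

    /// Record the tombstone and evict the cached value in one critical section.
    pub fn delete(&self, key: u64) {
        let mut st = self.state.lock().unwrap();
        st.tombstones.insert(key);
        st.cache.remove(&key);
    }

    /// The guard is held from the tombstone check through the cache write,
    /// so a delete cannot slip in between the two steps.
    pub fn get(&self, key: u64) -> Option<u64> {
        let mut st = self.state.lock().unwrap();
        if st.tombstones.contains(&key) {
            return None;
        }
        if let Some(&v) = st.cache.get(&key) {
            return Some(v);
        }
        let v = *self.tree.get(&key)?;
        st.cache.insert(key, v);
        Some(v)
    }

    pub fn is_cached(&self, key: u64) -> bool {
        self.state.lock().unwrap().cache.contains_key(&key)
    }
}
```

If the check and the cache write used separate lock acquisitions, a delete landing between them could leave a deleted key sitting in the cache. That is the race window described above.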
Result:
bench_column_index runtime: 3+ hours → 6.6 seconds
Three Phases of Performance Work
Beyond the core lock contention, I spent the release cycle on three performance phases.
Phase 1 – Memory Layout
- Arc eliminated full‑row memcpy on every get().
- Non‑vector tables got their own BTreeMap instead of the generic wrapper, saving 24 bytes per row.
- At 100 K rows, that’s roughly 10 MB of memory saved without touching any query logic.
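As an illustration of the first bullet, here is a minimal sketch of returning shared handles instead of copies. `RowStore` and its methods are hypothetical, and rows are modelled as raw bytes; moteDB's actual row type is not shown in this post.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical row store that hands out shared handles instead of copies.
pub struct RowStore {
    rows: BTreeMap<u64, Arc<[u8]>>,
}

impl RowStore {
    pub fn new() -> Self {
        Self { rows: BTreeMap::new() }
    }

    /// The bytes are moved into shared storage once, at write time.
    pub fn put(&mut self, id: u64, row: Vec<u8>) {
        let shared: Arc<[u8]> = Arc::from(row);
        self.rows.insert(id, shared);
    }

    /// Cloning the Arc bumps a reference count; the row bytes are not copied.
    pub fn get(&self, id: u64) -> Option<Arc<[u8]>> {
        self.rows.get(&id).cloned()
    }
}
```

Two calls to `get` for the same id return handles to the same allocation, which `Arc::ptr_eq` can confirm. The cost per read is an atomic increment rather than a copy proportional to the row size.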
Phase 2 – Syscall Reduction
- Optimized DiskANN insertion path.
- Reused persistent file handles for SQ8Vectors.
- Every cache miss previously triggered 2 syscalls; this phase eliminated that overhead at the I/O layer.
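The handle-reuse idea looks roughly like this. The sketch is built on assumptions: `VectorFile` is a hypothetical name, the file layout (fixed-size vectors stored back to back, one byte per dimension) is a guess at what an SQ8 store might look like, and the post does not say which two syscalls were removed.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical reader for fixed-size quantized vectors stored back to back.
pub struct VectorFile {
    file: File, // opened once, reused for every read
    dim: usize, // bytes per vector (assumed: one byte per dimension)
}

impl VectorFile {
    pub fn open(path: impl AsRef<Path>, dim: usize) -> io::Result<Self> {
        Ok(Self { file: File::open(path)?, dim })
    }

    /// Read vector number `index` through the long-lived handle, with no
    /// per-read open or close.
    pub fn read_vector(&mut self, index: u64) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; self.dim];
        self.file.seek(SeekFrom::Start(index * self.dim as u64))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

On Unix, `std::os::unix::fs::FileExt::read_at` would fold the seek and the read into a single positioned read, which is one way to trim the per-miss cost further.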
Phase 3 – Space Index & FTS
- i‑Octree now uses Morton codes for batch loading; leaf nodes are filled in order to minimise tree splitting.
- LSM scan_range() switched to streaming scan instead of materialising everything.
- FTS switched to append‑only sharded writes, with delayed merge triggering when a shard hits 5 segments.
- Columnar predicate push‑down: decode the timestamp column first to locate rows, then decode target columns on demand — avoids decoding columns that were already filtered out.
- Spatial query row cache + removal of per‑row HashMap allocation → 8 000× speedup on spatial range queries.
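For readers unfamiliar with Morton codes, here is a small sketch of the batch-loading trick from the first bullet. It assumes points have already been quantized to unsigned integer grid coordinates of at most 21 bits per axis; `morton3` and `sort_for_batch_load` are hypothetical names, not moteDB APIs.

```rust
/// Spread the low 21 bits of `v` so that bit i lands at position 3 * i.
fn spread_bits(v: u32) -> u64 {
    let mut out = 0u64;
    for bit in 0..21 {
        out |= (((v >> bit) & 1) as u64) << (3 * bit);
    }
    out
}

/// Interleave x, y and z into one 63-bit Morton (Z-order) code.
pub fn morton3(x: u32, y: u32, z: u32) -> u64 {
    spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2)
}

/// Sort points by Morton code so that spatial neighbours end up adjacent,
/// letting a tree builder fill leaves in order.
pub fn sort_for_batch_load(points: &mut [[u32; 3]]) {
    points.sort_by_key(|p| morton3(p[0], p[1], p[2]));
}
```

Because each group of three code bits selects one octant at one tree level, a Morton-sorted batch visits octants one after another. Inserting in that order is what keeps leaf splits rare.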
The Audit That Found 28 Problems
I ran three rounds of adversarial auditing before this release. The findings were extensive:
| Area | Issue |
|---|---|
| B‑Tree | Split leaf index out‑of‑bounds panic |
| Async index pipeline | Double‑insert causing text index panic |
| WAL compression + DiskANN | Three separate deadlocks |
| close() | Not notifying background threads before checkpoint |
| Column/text index | Querying before async pipeline finished building — no fallback |
| SUM precision loss | Switched from floating‑point accumulation to a two‑pass compensation algorithm |
| BTreeMap scan | Materialising all results at once causing memory spikes |
| Primary‑key query | Index missing after restart, no fallback to scan |
| glibc arena | Concurrent crash on explicit db.close() (malloc not thread‑safe). Fixed by serialising close() calls. |
The glibc arena bug was particularly fun: under heavy concurrent close() calls the arena allocator would crash because malloc wasn’t thread‑safe in the way the code used it. The fix was simply to avoid calling close() from multiple threads simultaneously—obvious in hindsight.
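On the SUM precision row in the table: the post does not spell out the two-pass algorithm, so the sketch below shows a different, well-known compensated technique (Neumaier's variant of Kahan summation) purely to illustrate why plain floating-point accumulation loses digits. `naive_sum` and `compensated_sum` are hypothetical names.

```rust
/// Plain left-to-right accumulation.
pub fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

/// Neumaier's compensated summation: track the rounding error of each
/// addition in `comp` and add it back at the end.
pub fn compensated_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64;
    for &x in values {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of x were lost in the addition.
            comp += (sum - t) + x;
        } else {
            // Low-order bits of sum were lost in the addition.
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}
```

With the input `[1e16, 1.0, -1e16]`, the naive sum returns `0.0` because the `1.0` is rounded away when added to `1e16`, while the compensated version returns `1.0`.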
Edge Devices Finally Get Love
moteDB targets embedded and edge hardware. v0.2.0 adds dedicated optimisations:
- EdgeIndexConfig – DiskANN now has a bounded‑memory index configuration, limiting graph memory footprint.
- FTS bounded shard counter + VersionStore eviction – Prevents memory growth during long‑running operations.
- Dead‑code cleanup – Removed ~2 200 lines, reducing binary size.
- Zero clippy warnings – Everything compiles clean.
Testing at Scale
The new test infrastructure handles the concurrency edge cases:
- wait_for_indexes_ready() – polls the pending_index_batches atomic counter for deterministic index readiness.
- CI adaptive data scaling – detects CI environment and automatically reduces test data volume.
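A polling helper of this kind might look like the sketch below. The signature is an assumption: the post names `wait_for_indexes_ready()` and the `pending_index_batches` counter, but the parameters, the timeout, and the 1 ms poll interval here are made up for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Poll `pending` until it reaches zero or `timeout` elapses.
/// Returns true if every pending batch finished in time.
pub fn wait_for_indexes_ready(pending: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while pending.load(Ordering::Acquire) != 0 {
        if Instant::now() >= deadline {
            return false;
        }
        // Short sleep so the test thread does not spin at full speed.
        thread::sleep(Duration::from_millis(1));
    }
    true
}
```

A test would call this right after kicking off an async index build and assert on the return value, instead of sleeping for a fixed interval and hoping the build has finished.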
749 new test cases, running under 4‑thread concurrency, completing in ~3 minutes with zero hangs.
The Numbers
| Metric | Value |
|---|---|
| Commits | 35 |
| Source files changed | 89 |
| Lines added | 28 118 |
| Lines deleted | 14 815 |
| Performance optimisations | 11 |
| Bug fixes | 21 |
| New features | 3 |
The release is on crates.io – add it with:
motedb = "0.2.0"
or run cargo add motedb.
If you’re running moteDB on edge hardware or need a database that won’t stall your queries while a background index builds, give v0.2.0 a spin!
If you build indexes in the background, this one's worth upgrading to.
The benchmark suite no longer hangs.