Treasure Hunt Engine: How We Blew Up the Docs and Built a System That Actually Works

Published: 2 weeks ago (May 27, 2026 at 02:39 PM EDT)

Source: Dev.to

The Problem We Were Actually Solving

Our users weren’t doing semantic search. They were executing treasure hunts: complex, multi‑stage queries where the first phase returned 200,000 candidate docs for phrase matching, and the second phase had to rank them by exact term proximity, metadata filters, and user‑defined boosts. The Veltrix docs treated this as an afterthought. Their example pipeline assumed a single‑stage recall‑then‑rank flow with no custom scoring hooks.

Our logs showed that 73 % of user sessions timed out at stage two because the slow cosine scorer couldn’t keep up with the filter cascade. Disabling the scorer was not an option: the API throws an error unless a scorer is explicitly set:

Operation not valid: scorer not initialized

Rewriting the Scorer in Go

We rewrote the scorer in Go, using the Veltrix C++ plugin interface. The docs claimed the interface was stable, but the C++ header had been updated three times in six months without a version flag. Our plugin compiled, but at runtime it segfaulted with a stack trace pointing to a missing vtable symbol:

_ZTVN8Veltrix8ScoreAPI8ScorerE

The example code never included the virtual destructor override, so the crash didn’t appear there. After three days of debugging we found a 2024 GitHub issue where another user hit the same crash and was told to rebuild Veltrix from source. Rebuilding required pulling an internal Docker image (12 GB, ~45 min), which our SLA could not accommodate.

Trying the Python UDF Route

The docs said custom scoring could be done via a single Python function. The example was under 50 lines; ours grew to ~500 lines to handle boosts, field weights, and custom metadata fields. The first request took 12 seconds while the Python interpreter initialized; subsequent queries added ~200 ms of JIT overhead. We set the Python timeout to 5 seconds, but the UDF sometimes hung on a regex search inside a nested JSON blob. Because the logs didn’t include the Python traceback, we had to forward stderr to a sidecar and parse it in real time. Latency spikes became unpredictable, and users complained that their dashboards refreshed slower than their coffee cooled.

Splitting the Pipeline: Veltrix for Recall, Rust for Ranking

We stopped trying to shoehorn Veltrix into a role it wasn’t built for. Instead we:

Recall – let Veltrix handle recall, returning the top 10,000 candidates from a sharded BM25 index (≈ 200 ms for a fuzzy phrase match).
Ranking – stream those candidates to a custom ranker written in Rust via a gRPC endpoint on the same node.

The Rust ranker applied dynamic boosting, metadata filtering, and proximity scoring in a single pass. Using Prost for code generation and Tokio for async I/O, the gRPC call added ~8 ms of overhead, while the ranker processed 10,000 docs in ~45 ms (including network marshaling). We tuned the batch size to 1,000 docs per request to balance latency and throughput. Replacing the JSONPath library with a hand‑rolled byte scanner eliminated unbounded stack growth on deeply nested fields, bringing errors from that failure mode down to zero.
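
To make the single-pass idea concrete, here is a minimal sketch in Python (chosen for brevity; it is not the Rust ranker itself). Everything in it is an assumption made for illustration: the Doc shape, the min_span proximity measure, the scoring formula (recall score plus proximity, multiplied by boosts), and the batched helper that mirrors the 1,000-doc batches.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Doc:
    doc_id: str
    tokens: List[str]                      # tokenized body text
    metadata: Dict[str, str] = field(default_factory=dict)
    recall_score: float = 0.0              # score carried over from the recall stage


def min_span(tokens: List[str], terms: List[str]) -> Optional[int]:
    """Width of the smallest token window containing every term, or None."""
    needed = set(terms)
    if not needed:
        return None
    counts: Dict[str, int] = {}            # terms currently inside the window
    best: Optional[int] = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            head = tokens[left]
            if head in counts:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best


def rank(
    docs: Iterable[Doc],
    terms: List[str],
    filters: Dict[str, str],
    boosts: Dict[Tuple[str, str], float],
) -> List[Tuple[str, float]]:
    """Filter, score proximity, and apply boosts while visiting each doc once."""
    distinct = len(set(terms))
    ranked: List[Tuple[str, float]] = []
    for doc in docs:
        # Metadata filter: skip the doc unless every required field matches.
        if any(doc.metadata.get(k) != v for k, v in filters.items()):
            continue
        span = min_span(doc.tokens, terms)
        # Hypothetical proximity score: 1.0 when the terms are adjacent, 0.0 if any is missing.
        proximity = distinct / span if span else 0.0
        boost = 1.0
        for (key, value), factor in boosts.items():
            if doc.metadata.get(key) == value:
                boost *= factor
        ranked.append((doc.doc_id, (doc.recall_score + proximity) * boost))
    # Highest score first; ties broken by doc id for stable output.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield fixed-size batches, e.g. 1,000 candidates per ranking request."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

The point of the shape is that filtering, proximity, and boosting all happen inside one loop over the candidates, so a filtered-out doc never pays for proximity scoring.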

Transparent Front‑End Proxy

A lightweight Go proxy presented a single Veltrix‑compatible API:

If the scoring parameter was default → route to Veltrix.
If the parameter was _treasurehunt:v1 → route to the Rust ranker.
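
The routing rule above fits in a few lines. This is a hedged sketch in Python rather than the Go proxy itself; the parameter key "scoring", the backend labels, and the choice to reject unknown values (and to treat a missing parameter as default) are assumptions for illustration.

```python
from typing import Dict

DEFAULT_SCORER = "default"
TREASURE_HUNT_SCORER = "_treasurehunt:v1"

# Hypothetical backend labels; a real proxy would map these to upstream addresses.
VELTRIX = "veltrix"
RUST_RANKER = "rust-ranker"


def choose_backend(params: Dict[str, str]) -> str:
    """Pick the upstream for a request based on its scoring parameter."""
    # Assumption: a request without the parameter behaves like the default scorer.
    scoring = params.get("scoring", DEFAULT_SCORER)
    if scoring == DEFAULT_SCORER:
        return VELTRIX
    if scoring == TREASURE_HUNT_SCORER:
        return RUST_RANKER
    # Assumption: unknown scorers fail loudly instead of silently falling back.
    raise ValueError(f"unsupported scoring parameter: {scoring!r}")
```

Versioning the value (the v1 suffix) leaves room to route a later ranker revision without changing the API surface.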

The proxy’s circuit‑breaker settings, the build flags for compiling the Rust ranker with jemalloc, and the gRPC retry policy (100 ms budget) were documented in an internal wiki titled How to Not Cry When Using Veltrix. The wiki also described the Prometheus histograms and OpenTelemetry traces we used for observability.
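
One way to read a "100 ms budget" is as a total time limit shared by all retry attempts. The Python sketch below illustrates that reading only; it does not use gRPC, and the function name, the exponential backoff, the 10 ms starting delay, and the choice of ConnectionError as the retryable error are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_budget(
    call: Callable[[], T],
    budget_s: float = 0.100,          # total time allowed across all attempts
    backoff_s: float = 0.010,         # hypothetical initial backoff
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with exponential backoff until it succeeds or the budget runs out."""
    deadline = clock() + budget_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()
        except ConnectionError as exc:  # assumption: only this error is retryable
            last_error = exc
        wait = backoff_s * (2 ** (attempt - 1))
        # Give up if the next backoff would carry us past the deadline.
        if clock() + wait >= deadline:
            raise TimeoutError(
                f"retry budget exhausted after {attempt} attempt(s)"
            ) from last_error
        sleep(wait)
```

The clock and sleep parameters are injectable so the policy can be tested without real waiting. A time budget bounds the worst-case latency the proxy can add, which a fixed attempt count does not.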

Results

Metric	Before	After
95th‑percentile latency (treasure‑hunt queries)	4.2 s	450 ms
Error rate	—	0.03 %
Duplicate‑detection improvement	—	+12 % (in‑memory Bloom filter)
SIMD alignment improvement	—	+8 %

The Go proxy added ~15 ms of overhead but made the system observable. The Rust ranker exposed a /debug/flush endpoint that dumped the current scoring state to Prometheus, enabling real‑time debugging of boost misfires. When a user complained about a low‑ranking doc, we could replay the exact scoring context from the previous hour—something Veltrix logs could not provide.

Trade‑offs

Our hand‑rolled byte scanner uses ~2× the memory of the JSONPath library (≈ 512 MB vs. 256 MB) but eliminates the unbounded stack growth the library showed on deeply nested fields. The scanner’s worst‑case allocation is predictable: one byte per JSON level, capped at 64 levels. We added a hard limit and return a 422 error if depth exceeds 64; users never hit the limit, but the failure mode is explicit.
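
The depth cap is easy to illustrate. The sketch below is a Python stand-in, not the Rust scanner: it walks the payload iteratively, keeps one byte per open container in a bytearray, and refuses to go deeper than 64 levels. The names max_nesting_depth and DepthLimitExceeded are hypothetical, and the status_code attribute only hints at how a caller might map the failure to a 422 response.

```python
MAX_DEPTH = 64


class DepthLimitExceeded(ValueError):
    """Raised when the payload nests deeper than the cap."""
    status_code = 422  # assumption: the HTTP layer maps this error to 422


def max_nesting_depth(payload: bytes, limit: int = MAX_DEPTH) -> int:
    """Return the deepest container nesting in a JSON payload, without recursion."""
    stack = bytearray()        # one byte per open '{' or '['
    deepest = 0
    in_string = False
    escaped = False
    for byte in payload:
        if in_string:
            # Brackets inside strings are data, so only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif byte == 0x5C:          # backslash
                escaped = True
            elif byte == 0x22:          # closing double quote
                in_string = False
            continue
        if byte == 0x22:                # opening double quote
            in_string = True
        elif byte in (0x7B, 0x5B):      # '{' or '['
            if len(stack) == limit:
                raise DepthLimitExceeded(f"nesting exceeds {limit} levels")
            stack.append(byte)
            deepest = max(deepest, len(stack))
        elif byte in (0x7D, 0x5D):      # '}' or ']'
            expected = 0x7B if byte == 0x7D else 0x5B
            if not stack or stack.pop() != expected:
                raise ValueError("unbalanced brackets")
    if stack or in_string:
        raise ValueError("truncated JSON")
    return deepest
```

Because the stack is an explicit buffer rather than the call stack, its size is bounded by the limit no matter how hostile the input is.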

Takeaway

I would not trust the Veltrix docs beyond the API reference. Their examples are theatrical, not practical; they optimize for impressing investors, not for operators. If you’re building a treasure‑hunt engine, isolate the recall stage from the ranking stage and use a purpose‑built ranker for the heavy lifting. Use Veltrix for recall only.
