Sift: Local Hybrid Search Without the Infrastructure Tax
Source: Dev.to
Overview
sift is a local Rust CLI for document retrieval. Point it at a directory, ask a question, and it runs a full hybrid search pipeline—BM25, dense vector, fusion, optional reranking—and returns ranked results. No daemon, no background indexer, no cloud. One binary.
It’s built for agents and developers who need reliable, repeatable search over raw codebases, docs, and mixed‑format corpora without spinning up infrastructure.
You can install it now on macOS, Windows, and Linux.
The retrieval pipeline
Every query runs through four stages:
- Expansion – query variants are generated to broaden recall before retrieval begins.
- Retrieval – BM25 (keyword), phrase match, and dense vector retrieval run against the corpus. Each method captures different signal.
- Fusion – results are merged using Reciprocal Rank Fusion (RRF), balancing signal across retrieval methods without manual weight tuning.
- Reranking – optional local LLM reranking via Qwen applies semantic disambiguation on the fused candidate set.
Each stage is independently tunable. Skip the vector pass if you only need BM25 speed. Run the full stack for best precision.
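The fusion stage is the glue between the retrieval methods. Here is a minimal sketch of Reciprocal Rank Fusion, assuming 1‑based ranks and the commonly used k = 60 damping constant; the function name and signature are illustrative, not sift's internals:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each ranked list contributes
/// 1 / (k + rank) per document, with 1-based ranks.
/// k (commonly 60) damps the dominance of top-ranked hits,
/// so no per-method weight tuning is needed.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A document ranked first by both BM25 and the dense retriever accumulates two large contributions and rises to the top, which is the whole point: agreement between methods outweighs any single method's score scale.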
Architecture
The implementation is split into domain and adapters:
- Domain objects model search plans, candidates, and scoring outputs.
- Adapters implement the concrete BM25, phrase, vector, and reranking backends.
A shared search service executes the same strategy model for CLI, benchmark, and evaluation flows—nothing changes between a dev run and a CI eval pass.
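In Rust terms, that split is naturally a trait boundary. The sketch below is illustrative only—`Retriever`, `Candidate`, and the toy keyword adapter are placeholder names, not sift's actual types:

```rust
/// Domain type: a scored retrieval candidate (name is illustrative).
#[derive(Debug, Clone)]
struct Candidate {
    doc_id: String,
    score: f64,
}

/// Port: any retrieval backend the shared search service can drive.
trait Retriever {
    fn retrieve(&self, query: &str, top_k: usize) -> Vec<Candidate>;
}

/// Adapter: a trivial substring matcher standing in for a real BM25 backend.
struct KeywordRetriever {
    docs: Vec<(String, String)>, // (doc_id, text)
}

impl Retriever for KeywordRetriever {
    fn retrieve(&self, query: &str, top_k: usize) -> Vec<Candidate> {
        let mut hits: Vec<Candidate> = self
            .docs
            .iter()
            .filter(|(_, text)| text.contains(query))
            .map(|(id, _)| Candidate { doc_id: id.clone(), score: 1.0 })
            .collect();
        hits.truncate(top_k);
        hits
    }
}

/// The service depends only on the trait, so CLI, benchmark,
/// and eval flows can execute the same strategy with any adapter.
fn run_search(retriever: &dyn Retriever, query: &str) -> Vec<Candidate> {
    retriever.retrieve(query, 10)
}
```

Because `run_search` only sees the trait, swapping a dev-run adapter for a CI-eval adapter changes nothing in the service itself.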
Performance highlights
- SIMD‑accelerated dot‑product for vector scoring on CPU‑heavy workloads.
- Zig‑inspired incremental cache – a two‑layer design borrowed from Zig's build system. A manifest store tracks filesystem metadata (inode, mtime, size) mapped to BLAKE3 content hashes, so sift knows exactly which files have changed without re‑reading them. A content‑addressable blob store holds pre‑extracted text, pre‑computed BM25 term frequencies, and pre‑embedded dense vectors—meaning repeat queries never touch the neural network at all. Identical files across different projects share a single blob entry.
- Per‑query embedding reuse across multi‑stage pipelines.
- Mapped I/O and tight tokenization hot loops to keep latency low on large corpora.
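The manifest-store idea can be sketched in a few lines. This is a simplified stand-in, not sift's code: the real system keys on inode + mtime + size and hashes with BLAKE3, while this sketch drops inode for portability and uses the stdlib hasher in place of BLAKE3:

```rust
use std::collections::HashMap;
use std::fs;
use std::time::SystemTime;

/// Manifest entry: filesystem metadata mapped to a content hash.
/// (sift uses BLAKE3 and also tracks the inode; simplified here.)
#[derive(Debug, Clone, PartialEq)]
struct ManifestEntry {
    mtime: SystemTime,
    size: u64,
    content_hash: u64,
}

/// Stand-in for BLAKE3: the stdlib hasher, just for the sketch.
fn hash_bytes(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Return the file's content hash, re-reading the file only when
/// its metadata no longer matches the manifest entry.
fn lookup_or_hash(
    manifest: &mut HashMap<String, ManifestEntry>,
    path: &str,
) -> std::io::Result<u64> {
    let meta = fs::metadata(path)?;
    let (mtime, size) = (meta.modified()?, meta.len());
    if let Some(e) = manifest.get(path) {
        if e.mtime == mtime && e.size == size {
            return Ok(e.content_hash); // cache hit: no file read at all
        }
    }
    // Metadata changed (or file unseen): read, hash, update manifest.
    let hash = hash_bytes(&fs::read(path)?);
    manifest.insert(path.to_string(), ManifestEntry { mtime, size, content_hash: hash });
    Ok(hash)
}
```

The returned hash is what keys the content-addressable blob store: unchanged files resolve to their cached extracted text, term frequencies, and embeddings without any re-reading or re-embedding, and identical content deduplicates across projects for free.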
One concrete trade‑off during development: lowering the embedding max_length from 48 to 40 recovered latency budget while keeping quality above the BM25 baseline—a good example of evidence‑driven tuning beating guesswork.
Full internals are documented in ARCHITECTURE.md.
Evaluation
Comparative strategy run over 5,185 SciFact documents (~7.8 MB) on an AMD Ryzen Threadripper 3960X:
| Strategy | nDCG@10 | MRR@10 | Recall@10 | p50 (ms) |
|---|---|---|---|---|
| bm25 | 0.7262 | 0.7000 | 0.8000 | 5.41 |
| legacy‑hybrid | 0.7893 | 0.7250 | 1.0000 | 50.29 |
| page‑index | 0.7000 | 0.6667 | 0.8000 | 16.79 |
| page‑index‑hybrid | 0.5701 | 0.4367 | 1.0000 | 41.09 |
| page‑index‑llm | 0.7893 | 0.7250 | 1.0000 | 41.28 |
| page‑index‑qwen | 0.7893 | 0.7250 | 1.0000 | 41.18 |
| vector | 0.8262 | 0.7667 | 1.0000 | 25.94 |
Key takeaways
- BM25 at 5.41 ms p50 is the right default for latency‑constrained cases where keyword recall is sufficient.
- Vector achieves the best nDCG@10 (0.8262) and perfect recall at 25.94 ms — the most balanced strategy for most workloads.
- LLM reranking (page‑index‑llm, page‑index‑qwen) matches legacy‑hybrid quality at comparable speed, validating the local Qwen path as a practical alternative to heavier hybrid pipelines.
- page‑index‑hybrid is the only strategy that underperforms BM25 on nDCG, reminding us that added complexity doesn’t always improve quality.
Cache hit rates (100/0/100 %) confirm the caching layer works correctly across all strategies. Verbose output (-v, -vv) surfaces cache hit rates, phase timings, and ranking metadata directly in the CLI.
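For readers less familiar with the headline metric, nDCG@k is the discounted cumulative gain of the ranked list divided by the gain of an ideal reordering of the same relevance labels. A generic sketch (not sift's evaluation code), using binary relevance:

```rust
/// nDCG@k: DCG of the ranked list divided by the DCG of an
/// ideal (descending-relevance) ordering of the same labels.
/// `relevance[i]` is the relevance of the document at rank i+1.
fn ndcg_at_k(relevance: &[f64], k: usize) -> f64 {
    let dcg = |rels: &[f64]| -> f64 {
        rels.iter()
            .take(k)
            .enumerate()
            // Rank i (0-based) is discounted by log2(i + 2).
            .map(|(i, r)| r / (i as f64 + 2.0).log2())
            .sum()
    };
    let mut ideal = relevance.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let idcg = dcg(&ideal);
    if idcg == 0.0 { 0.0 } else { dcg(relevance) / idcg }
}
```

A run that puts the one relevant document at rank 1 scores 1.0; pushing it to rank 2 discounts it by log2(3), which is why small ranking differences between strategies show up clearly in the table above.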
Why this matters for agents
For agents, latency and reliability are requirements, not nice‑to‑haves. Tooling loops fail hard when search is slow, drops context, or depends on services that may be unavailable.
sift removes that friction: retrieval is local, deterministic, and cheap to repeat. No daemon to health‑check. No embedding service to rate‑limit against. No cloud dependency to manage. The binary ships with Homebrew and static Linux artifact support, so agents can rely on a pinned version without environment drift.
How it was built
The project shipped in a focused, nearly uninterrupted 24‑hour push—implementation, evaluation design, benchmarking, performance tightening, packaging, and release preparation in one sustained flow. Every major unit had acceptance criteria and measurable evidence attached before it was marked done.
What made that pace possible is something I’m not ready to discuss in detail yet. But sift is the first real proof that it works at speed, under real constraints, without cutting corners. More on that soon.
Get started
- README — installation and basic usage
- CONFIGURATION — strategy and model settings
- EVALUATION — running your own corpus evals
- ARCHITECTURE — internals deep dive
- Code: