Sift: Local Hybrid Search Without the Infrastructure Tax
Source: Dev.to
Overview
sift is a local Rust CLI for document retrieval. Point it at a directory, ask a question, and it runs a full hybrid search pipeline—BM25, dense vector, fusion, optional reranking—and returns ranked results. No daemon, no background indexer, no cloud. One binary.
It’s built for agents and developers who need reliable, repeatable search over raw codebases, docs, and mixed‑format corpora without spinning up infrastructure.
You can install it now on macOS, Windows, and Linux.
The retrieval pipeline
Every query runs through four stages:
- Expansion – query variants are generated to broaden recall before retrieval begins.
- Retrieval – BM25 (keyword), phrase match, and dense vector retrieval run against the corpus. Each method captures different signal.
- Fusion – results are merged using Reciprocal Rank Fusion (RRF), balancing signal across retrieval methods without manual weight tuning.
- Reranking – optional local LLM reranking via Qwen applies semantic disambiguation on the fused candidate set.
Each stage is independently tunable. Skip the vector pass if you only need BM25 speed. Run the full stack for best precision.
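The fusion stage is the glue between the retrieval methods. Here is a minimal sketch of Reciprocal Rank Fusion, assuming 1‑based ranks and the commonly used k = 60 damping constant; the function name and signature are illustrative, not sift's internals:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each ranked list contributes
/// 1 / (k + rank) per document, with 1-based ranks.
/// k (commonly 60) damps the dominance of top-ranked hits,
/// so no per-method weight tuning is needed.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A document ranked first by both BM25 and the dense retriever accumulates two large contributions and rises to the top, which is the whole point: agreement between methods outweighs any single method's score scale.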
Architecture
The implementation is split into domain and adapters:
- Domain objects model search plans, candidates, and scoring outputs.
- Adapters implement the concrete BM25, phrase, vector, and reranking backends.
A shared search service executes the same strategy model for CLI, benchmark, and evaluation flows—nothing changes between a dev run and a CI eval pass.
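In Rust terms, that split is naturally a trait boundary. The sketch below is illustrative only—`Retriever`, `Candidate`, and the toy keyword adapter are placeholder names, not sift's actual types:

```rust
/// Domain type: a scored retrieval candidate (name is illustrative).
#[derive(Debug, Clone)]
struct Candidate {
    doc_id: String,
    score: f64,
}

/// Port: any retrieval backend the shared search service can drive.
trait Retriever {
    fn retrieve(&self, query: &str, top_k: usize) -> Vec<Candidate>;
}

/// Adapter: a trivial substring matcher standing in for a real BM25 backend.
struct KeywordRetriever {
    docs: Vec<(String, String)>, // (doc_id, text)
}

impl Retriever for KeywordRetriever {
    fn retrieve(&self, query: &str, top_k: usize) -> Vec<Candidate> {
        let mut hits: Vec<Candidate> = self
            .docs
            .iter()
            .filter(|(_, text)| text.contains(query))
            .map(|(id, _)| Candidate { doc_id: id.clone(), score: 1.0 })
            .collect();
        hits.truncate(top_k);
        hits
    }
}

/// The service depends only on the trait, so CLI, benchmark,
/// and eval flows can execute the same strategy with any adapter.
fn run_search(retriever: &dyn Retriever, query: &str) -> Vec<Candidate> {
    retriever.retrieve(query, 10)
}
```

Because `run_search` only sees the trait, swapping a dev-run adapter for a CI-eval adapter changes nothing in the service itself.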
Performance highlights
- SIMD‑accelerated dot‑product for vector scoring on CPU‑heavy workloads.
- Zig‑inspired incremental cache – a two‑layer design borrowed from Zig's build system. A manifest store tracks filesystem metadata (inode, mtime, size) mapped to BLAKE3 content hashes, so sift knows exactly which files have changed without re‑reading them. A content‑addressable blob store holds pre‑extracted text, pre‑computed BM25 term frequencies, and pre‑embedded dense vectors—meaning repeat queries never touch the neural network at all. Identical files across different projects share a single blob entry.
- Per‑query embedding reuse across multi‑stage pipelines.
- Mapped I/O and tight tokenization hot loops to keep latency low on large corpora.
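The manifest-store idea can be sketched in a few lines. This is a simplified stand-in, not sift's code: the real system keys on inode + mtime + size and hashes with BLAKE3, while this sketch drops inode for portability and uses the stdlib hasher in place of BLAKE3:

```rust
use std::collections::HashMap;
use std::fs;
use std::time::SystemTime;

/// Manifest entry: filesystem metadata mapped to a content hash.
/// (sift uses BLAKE3 and also tracks the inode; simplified here.)
#[derive(Debug, Clone, PartialEq)]
struct ManifestEntry {
    mtime: SystemTime,
    size: u64,
    content_hash: u64,
}

/// Stand-in for BLAKE3: the stdlib hasher, just for the sketch.
fn hash_bytes(bytes: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Return the file's content hash, re-reading the file only when
/// its metadata no longer matches the manifest entry.
fn lookup_or_hash(
    manifest: &mut HashMap<String, ManifestEntry>,
    path: &str,
) -> std::io::Result<u64> {
    let meta = fs::metadata(path)?;
    let (mtime, size) = (meta.modified()?, meta.len());
    if let Some(e) = manifest.get(path) {
        if e.mtime == mtime && e.size == size {
            return Ok(e.content_hash); // cache hit: no file read at all
        }
    }
    // Metadata changed (or file unseen): read, hash, update manifest.
    let hash = hash_bytes(&fs::read(path)?);
    manifest.insert(path.to_string(), ManifestEntry { mtime, size, content_hash: hash });
    Ok(hash)
}
```

The returned hash is what keys the content-addressable blob store: unchanged files resolve to their cached extracted text, term frequencies, and embeddings without any re-reading or re-embedding, and identical content deduplicates across projects for free.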
One concrete trade‑off during development: lowering the embedding max_length from 48 to 40 recovered latency budget while keeping quality above the BM25 baseline—a good example of evidence‑driven tuning beating guesswork.
Full internals are documented in ARCHITECTURE.md.
Evaluation
Comparative strategy run over 5,185 SciFact documents (~7.8 MB) on an AMD Ryzen Threadripper 3960X:
| Strategy | nDCG@10 | MRR@10 | Recall@10 | p50 (ms) |
|---|---|---|---|---|
| bm25 | 0.7262 | 0.7000 | 0.8000 | 5.41 |
| legacy‑hybrid | 0.7893 | 0.7250 | 1.0000 | 50.29 |
| page‑index | 0.7000 | 0.6667 | 0.8000 | 16.79 |
| page‑index‑hybrid | 0.5701 | 0.4367 | 1.0000 | 41.09 |
| page‑index‑llm | 0.7893 | 0.7250 | 1.0000 | 41.28 |
| page‑index‑qwen | 0.7893 | 0.7250 | 1.0000 | 41.18 |
| vector | 0.8262 | 0.7667 | 1.0000 | 25.94 |
Key takeaways
- BM25 at 5.41 ms p50 is the right default for latency‑constrained cases where keyword recall is sufficient.
- Vector achieves the best nDCG@10 (0.8262) and perfect recall at 25.94 ms — the most balanced strategy for most workloads.
- LLM reranking (page‑index‑llm, page‑index‑qwen) matches legacy‑hybrid quality at comparable speed, validating the local Qwen path as a practical alternative to heavier hybrid pipelines.
- page‑index‑hybrid is the only strategy that underperforms BM25 on nDCG, reminding us that added complexity doesn’t always improve quality.
Cache hit rates (100/0/100 %) confirm the caching layer works correctly across all strategies. Verbose output (-v, -vv) surfaces cache hit rates, phase timings, and ranking metadata directly in the CLI.
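For readers less familiar with the headline metric, nDCG@k is the discounted cumulative gain of the ranked list divided by the gain of an ideal reordering of the same relevance labels. A generic sketch (not sift's evaluation code), using binary relevance:

```rust
/// nDCG@k: DCG of the ranked list divided by the DCG of an
/// ideal (descending-relevance) ordering of the same labels.
/// `relevance[i]` is the relevance of the document at rank i+1.
fn ndcg_at_k(relevance: &[f64], k: usize) -> f64 {
    let dcg = |rels: &[f64]| -> f64 {
        rels.iter()
            .take(k)
            .enumerate()
            // Rank i (0-based) is discounted by log2(i + 2).
            .map(|(i, r)| r / (i as f64 + 2.0).log2())
            .sum()
    };
    let mut ideal = relevance.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let idcg = dcg(&ideal);
    if idcg == 0.0 { 0.0 } else { dcg(relevance) / idcg }
}
```

A run that puts the one relevant document at rank 1 scores 1.0; pushing it to rank 2 discounts it by log2(3), which is why small ranking differences between strategies show up clearly in the table above.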
Why this matters for agents
For agents, latency and reliability are requirements, not nice‑to‑haves. Tooling loops fail hard when search is slow, drops context, or depends on services that may be unavailable.
sift removes that friction: retrieval is local, deterministic, and cheap to repeat. No daemon to health‑check. No embedding service to rate‑limit against. No cloud dependency to manage. The binary ships with Homebrew and static Linux artifact support, so agents can rely on a pinned version without environment drift.
How it was built
The project shipped in a focused, nearly uninterrupted 24‑hour push—implementation, evaluation design, benchmarking, performance tightening, packaging, and release preparation in one sustained flow. Every major unit had acceptance criteria and measurable evidence attached before it was marked done.
What made that pace possible is something I’m not ready to discuss in detail yet. But sift is the first real proof that it works at speed, under real constraints, without cutting corners. More on that soon.
Get started
- README — installation and basic usage
- CONFIGURATION — strategy and model settings
- EVALUATION — running your own corpus evals
- ARCHITECTURE — internals deep dive
- Code: