[Paper] Passing the Baton: High Throughput Distributed Disk-Based Vector Search with BatANN
Source: arXiv - 2512.09331v1
Overview
The paper introduces BatANN, a distributed, disk‑based approximate nearest‑neighbor (ANN) engine that lets you search billions of high‑dimensional vectors across many machines without giving up the low latency and short, roughly logarithmic search paths of a single‑node graph index. By “passing the baton” – handing the full query state off to the node that owns the next part of the graph – BatANN achieves near‑linear throughput scaling while keeping mean latency in the single‑digit‑millisecond range.
Key Contributions
- Single Global Graph Across Nodes – First open‑source system that shards a single ANN graph over multiple servers, preserving the logarithmic search path length.
- Baton‑Passing Query Model – When a search step reaches a neighbor stored on another machine, the entire query context is transferred, so the remote node continues the walk locally instead of exchanging a round trip per hop (see the message‑count sketch after this list).
- Near‑Linear Throughput Scaling – Demonstrated 2.5‑6.5× higher throughput than a naïve scatter‑gather baseline on 100 M‑ and 1 B‑vector datasets using only standard TCP.
- Low Latency on Disk‑Based Indexes – Maintains sub‑6 ms mean latency even when the index resides on SSDs and the dataset far exceeds RAM.
- Open‑Source Release – Provides the community with a production‑ready codebase for building large‑scale vector search services.
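To make the contrast with scatter‑gather concrete, here is a back‑of‑envelope sketch of per‑query message counts under each model. The formulas and the assumed four shard‑boundary crossings per query are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope message counts for one query. These formulas are
# illustrative assumptions for intuition, not measurements from the paper.

def scatter_gather_messages(n_servers: int) -> int:
    """A coordinator fans the query out to every shard and gathers every
    partial result: roughly 2 messages per shard."""
    return 2 * n_servers

def baton_passing_messages(n_boundary_hops: int) -> int:
    """The query state is forwarded once per shard-boundary crossing,
    plus one final message returning results to the client."""
    return n_boundary_hops + 1

if __name__ == "__main__":
    for n in (2, 5, 10):
        # Assume (hypothetically) the walk crosses shard boundaries about
        # four times regardless of cluster size.
        hops = 4
        print(f"{n:2d} servers: scatter-gather={scatter_gather_messages(n):3d} msgs, "
              f"baton={baton_passing_messages(hops):3d} msgs")
```

Under these assumptions, scatter‑gather traffic grows with the cluster while baton‑passing traffic stays roughly flat, which is the intuition behind the near‑linear throughput scaling.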
Methodology
- Graph Construction – Build a hierarchical navigable small world (HNSW)‑style graph on the full dataset, then partition the graph’s vertices across servers using a simple hash‑based sharding scheme.
- Query State Packaging – A query’s state consists of the current node ID, the query vector, and a small priority queue of candidate neighbors. When the walk steps onto a vertex owned by another server, this entire state packet is sent to that server over TCP (sketched in the code after this list).
- Remote Execution – The receiving server resumes the graph walk locally, expanding neighbors, updating the candidate queue, and possibly forwarding the baton again. No central coordinator is needed after the initial request.
- Termination – The walk stops when the candidate queue stabilizes, i.e., when no unexpanded candidate can improve the current top‑k, and the final top‑k results are returned to the client.
- Evaluation – Benchmarks were run on commodity servers with SSD storage, measuring recall (≈0.95), throughput (queries per second), and mean latency across varying cluster sizes (1‑10 nodes) and dataset scales (100 M and 1 B vectors).
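Below is a minimal single‑process sketch of this pipeline. The names (`owner_of`, `QueryState`, `search`) and the toy random graph are illustrative assumptions, not BatANN’s actual code: the simulated “servers” are plain dictionaries, so the baton handoff is a local lookup where the real system would serialize the state over TCP.

```python
# Minimal sketch of a hash-sharded, baton-passing best-first graph walk.
# All names and parameters here are hypothetical, for illustration only.
import heapq
import random
from dataclasses import dataclass, field

import numpy as np

N_SERVERS = 4

def owner_of(node_id: int) -> int:
    """Hash-based sharding: each vertex lives on exactly one server."""
    return node_id % N_SERVERS

@dataclass
class QueryState:
    """The 'baton': everything a server needs to resume the walk."""
    query: np.ndarray
    frontier: list = field(default_factory=list)   # min-heap of (dist, node)
    visited: set = field(default_factory=set)
    results: list = field(default_factory=list)    # max-heap of (-dist, node)

def search(shards, vectors, entry, query, k):
    """Best-first walk; each expansion happens on the owning 'server'."""
    state = QueryState(query=query)
    heapq.heappush(state.frontier,
                   (float(np.linalg.norm(vectors[entry] - query)), entry))
    while state.frontier:
        dist, node = heapq.heappop(state.frontier)
        # Terminate once the closest unexpanded candidate can no longer
        # improve the current top-k (the queue has "stabilized").
        if len(state.results) >= k and dist > -state.results[0][0]:
            break
        if node in state.visited:
            continue
        state.visited.add(node)
        heapq.heappush(state.results, (-dist, node))
        if len(state.results) > k:
            heapq.heappop(state.results)
        # Baton handoff: in the real system, QueryState would be serialized
        # and sent over TCP to owner_of(node); here it is a dict lookup.
        for nbr in shards[owner_of(node)][node]:
            if nbr not in state.visited:
                d = float(np.linalg.norm(vectors[nbr] - query))
                heapq.heappush(state.frontier, (d, nbr))
    return sorted((-d, n) for d, n in state.results)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(200, 16)).astype(np.float32)
    # Toy random graph standing in for the HNSW-style index.
    graph = {i: random.Random(i).sample(range(200), 8) for i in range(200)}
    shards = {s: {i: nbrs for i, nbrs in graph.items() if owner_of(i) == s}
              for s in range(N_SERVERS)}
    print(search(shards, vecs, entry=0, query=rng.normal(size=16), k=5))
```

The design point mirrored here is that the full search state travels with the query, so any server can resume the walk without consulting a central coordinator.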
Results & Findings
| Dataset | Nodes | Recall | Throughput (× scatter‑gather baseline) | Mean Latency |
|---|---|---|---|---|
| 100 M vectors | 10 | 0.95 | 6.21–6.49× | < 6 ms |
| 1 B vectors | 10 | 0.95 | 2.50–5.10× | < 6 ms |
- Scalability: Throughput grows almost linearly with added servers, confirming the effectiveness of baton passing.
- Network Efficiency: Using plain TCP (no RDMA or custom protocols) still yields high performance, showing the approach is robust to typical data‑center networking.
- Recall vs. Speed Trade‑off: BatANN matches the recall of a single‑node HNSW index while sustaining far higher aggregate query rates on datasets too large for any single node to serve (a recall@k sketch follows this list).
Practical Implications
- RAG & LLM Pipelines: Developers can attach a truly massive vector store (billions of embeddings) to retrieval‑augmented generation systems without incurring prohibitive latency, enabling retrieval over far larger corpora.
- Search‑as‑a‑Service: SaaS providers can offer high‑throughput similarity search for images, audio, or code snippets while keeping hardware costs low by offloading most of the index to cheap SSD‑backed nodes.
- Hybrid Cloud Deployments: Because BatANN works over standard TCP, it can be deployed across on‑premise clusters, public clouds, or edge locations without special networking stacks.
- Open‑Source Extensibility: Teams can fork the repository to integrate custom distance metrics, security layers (TLS, auth), or combine with existing vector databases (e.g., Milvus, Vespa) for a unified stack.
Limitations & Future Work
- Sharding Simplicity: The current hash‑based partitioning may lead to load imbalance for skewed data distributions; more sophisticated graph‑aware partitioners could improve uniformity.
- Fault Tolerance: The paper focuses on performance; handling node failures while preserving query correctness is left as future engineering work.
- GPU Acceleration: All experiments run on CPU‑only nodes; exploring GPU‑based distance calculations could further boost throughput for compute‑heavy metrics.
- Dynamic Updates: Insertion and deletion of vectors require rebuilding or complex rebalancing; incremental update mechanisms are an open research direction.
BatANN shows that with a clever query‑handoff strategy, distributed disk‑based vector search can be both fast and scalable, opening the door for developers to build next‑generation retrieval services on billion‑scale embedding stores.
Authors
- Nam Anh Dang
- Ben Landrum
- Ken Birman
Paper Information
- arXiv ID: 2512.09331v1
- Categories: cs.DC, cs.IR
- Published: December 10, 2025
- PDF: https://arxiv.org/pdf/2512.09331v1