[Paper] Passing the Baton: High Throughput Distributed Disk-Based Vector Search with BatANN

Published: December 10, 2025 at 12:38 AM EST
4 min read
Source: arXiv - 2512.09331v1

Overview

The paper introduces BatANN, a distributed, disk‑based approximate nearest‑neighbor (ANN) engine that lets you search billions of high‑dimensional vectors across many machines without sacrificing the low latency and logarithmic search path length of a single‑node graph index. By “passing the baton” – handing off the full query state to the node that owns the next part of the graph – BatANN achieves near‑linear throughput scaling while keeping mean latency in the single‑digit‑millisecond range.
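
Concretely, the “baton” is the query’s in‑flight search state. Here is a minimal sketch of what such a state packet might contain; the field names are guesses based on the paper’s description, not BatANN’s actual wire format:

```python
# Hypothetical baton: the per-query state handed to the next server.
# Field names are illustrative; this is not BatANN's actual wire format.
from dataclasses import dataclass, field

@dataclass
class QueryBaton:
    query_vector: list[float]      # the vector being searched for
    current_vertex: int            # where the graph walk currently stands
    candidates: list[tuple[float, int]] = field(default_factory=list)
    #   ^ small priority queue of (distance, vertex_id) pairs
    visited: set[int] = field(default_factory=set)  # vertices already expanded
    client_addr: str = ""          # where the final top-k should be sent

baton = QueryBaton(query_vector=[0.1] * 128, current_vertex=42)
```

When the walk reaches a vertex stored on another server, this packet is all that travels; the receiving server needs nothing else to continue the search.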

Key Contributions

  • Single Global Graph Across Nodes – First open‑source system that shards a single ANN graph over multiple servers, preserving the logarithmic search path length.
  • Baton‑Passing Query Model – When a search step reaches a neighbor stored on another machine, the entire query context is transferred, allowing the remote node to continue the walk locally instead of paying a round trip per hop (see the cost sketch after this list).
  • Near‑Linear Throughput Scaling – Demonstrated 2.5‑6.5× higher throughput than a naïve scatter‑gather baseline on 100 M‑ and 1 B‑vector datasets using only standard TCP.
  • Low Latency on Disk‑Based Indexes – Maintains sub‑6 ms mean latency even when the index resides on SSDs and the dataset far exceeds RAM.
  • Open‑Source Release – Provides the community with a production‑ready codebase for building large‑scale vector search services.
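
To see why the handoff matters for throughput, consider a back‑of‑the‑envelope cost model (a rough simplification, not the paper’s analysis): a scatter‑gather baseline runs a full local search on every shard for every query, whereas baton passing performs a single walk over the one global graph. The beam_width and logarithmic path‑length assumptions below are illustrative:

```python
# Hypothetical cost model (not from the paper): vertex expansions per query.
import math

def work_per_query(n_vectors, n_shards, beam_width=64):
    # Scatter-gather: every shard runs its own full search over n/S vectors.
    scatter_gather = n_shards * beam_width * math.log2(n_vectors / n_shards)
    # Baton passing: one walk over the single global graph of n vectors.
    baton = beam_width * math.log2(n_vectors)
    return scatter_gather, baton

for shards in (1, 5, 10):
    sg, bp = work_per_query(1_000_000_000, shards)
    print(f"{shards:2d} shards: scatter-gather ~ {sg:>9,.0f} "
          f"vs baton ~ {bp:>7,.0f} expansions/query")
```

Under this toy model, aggregate work per query stays roughly constant for baton passing as shards are added, which is why throughput can scale near‑linearly; the measured 2.5–6.5× gains are end‑to‑end numbers that also reflect disk and network effects.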

Methodology

  1. Graph Construction – Build a hierarchical navigable small world (HNSW)‑style graph on the full dataset, then partition the graph’s vertices across servers using a simple hash‑based sharding scheme.
  2. Query State Packaging – A query consists of the current node ID, the query vector, and a small priority queue of candidate neighbors. When the walk steps onto a vertex owned by another server, this entire state packet is sent over TCP (modeled in the sketch after this list).
  3. Remote Execution – The receiving server resumes the graph walk locally, expanding neighbors, updating the candidate queue, and possibly forwarding the baton again. No central coordinator is needed after the initial request.
  4. Termination – The walk stops when the candidate queue stabilizes (i.e., no better neighbors are found) and the final top‑k results are returned to the client.
  5. Evaluation – Benchmarks were run on commodity servers with SSD storage, measuring recall (≈0.95), throughput (queries per second), and mean latency across varying cluster sizes (1‑10 nodes) and dataset scales (100 M and 1 B vectors).
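
To make the five steps concrete, here is a minimal, self‑contained Python simulation of the walk. It is an illustrative sketch rather than BatANN’s implementation: the “shards” are hash partitions inside one process, the graph is random rather than HNSW‑built, and a baton transfer is modeled as a counter instead of a TCP message.

```python
# Self-contained simulation of the five steps above (illustrative sketch,
# not BatANN's code).
import heapq
import random

random.seed(0)
DIM, N, DEGREE, NUM_SHARDS, K = 8, 2000, 12, 4, 5

# Step 1: build a stand-in graph and hash-partition its vertices.
vectors = [[random.random() for _ in range(DIM)] for _ in range(N)]
neighbors = [random.sample(range(N), DEGREE) for _ in range(N)]

def shard_of(v):
    return v % NUM_SHARDS      # simple hash-based sharding of vertex IDs

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, entry=0):
    # Step 2: the baton -- query vector, candidate queue, visited set.
    frontier = [(dist2(vectors[entry], query), entry)]   # min-heap
    topk = []                                            # max-heap of best K
    visited = {entry}
    transfers, current_shard = 0, shard_of(entry)
    while frontier:
        d, v = heapq.heappop(frontier)
        # Step 4: terminate once the closest candidate cannot improve top-k.
        if len(topk) == K and d > -topk[0][0]:
            break
        if shard_of(v) != current_shard:
            # Step 3: the walk crossed a shard boundary -- in BatANN the
            # whole state packet would now be forwarded over TCP.
            transfers += 1
            current_shard = shard_of(v)
        heapq.heappush(topk, (-d, v))
        if len(topk) > K:
            heapq.heappop(topk)
        for u in neighbors[v]:          # expand locally on the owning shard
            if u not in visited:
                visited.add(u)
                heapq.heappush(frontier, (dist2(vectors[u], query), u))
    return sorted((-nd, v) for nd, v in topk), transfers

results, transfers = search([random.random() for _ in range(DIM)])
print(f"top-{K} ids: {[v for _, v in results]}, baton transfers: {transfers}")
```

With hash‑based sharding most hops cross a shard boundary, but each crossing is a single one‑way state transfer rather than a round trip back to a coordinator, and every expansion’s disk reads stay local to the vertex’s owner.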

Results & Findings

Dataset         Nodes   Recall   Throughput (× baseline)   Mean Latency
100 M vectors   10      0.95     6.21 – 6.49×              < 6 ms
1 B vectors     10      0.95     2.50 – 5.10×              < 6 ms
  • Scalability: Throughput grows almost linearly with added servers, confirming the effectiveness of baton passing.
  • Network Efficiency: Using plain TCP (no RDMA or custom protocols) still yields high performance, showing the approach is robust to typical data‑center networking.
  • Recall vs. Speed Trade‑off: BatANN matches the ≈0.95 recall of a single‑node HNSW index while delivering 2.5–6.5× the query throughput of the scatter‑gather baseline on billion‑scale datasets.

Practical Implications

  • RAG & LLM Pipelines: Developers can now attach a truly massive vector store (billions of embeddings) to retrieval‑augmented generation systems without incurring prohibitive latency, enabling richer context windows.
  • Search‑as‑a‑Service: SaaS providers can offer high‑throughput similarity search for images, audio, or code snippets while keeping hardware costs low by offloading most of the index to cheap SSD‑backed nodes.
  • Hybrid Cloud Deployments: Because BatANN works over standard TCP, it can be deployed across on‑premise clusters, public clouds, or edge locations without special networking stacks.
  • Open‑Source Extensibility: Teams can fork the repository to integrate custom distance metrics, security layers (TLS, auth), or combine with existing vector databases (e.g., Milvus, Vespa) for a unified stack.

Limitations & Future Work

  • Sharding Simplicity: The current hash‑based partitioning may lead to load imbalance for skewed data distributions; more sophisticated graph‑aware partitioners could improve uniformity.
  • Fault Tolerance: The paper focuses on performance; handling node failures while preserving query correctness is left as future engineering work.
  • GPU Acceleration: All experiments run on CPU‑only nodes; exploring GPU‑based distance calculations could further boost throughput for compute‑heavy metrics.
  • Dynamic Updates: Insertion and deletion of vectors require rebuilding or complex rebalancing; incremental update mechanisms are an open research direction.

BatANN shows that with a clever query‑handoff strategy, distributed disk‑based vector search can be both fast and scalable, opening the door for developers to build next‑generation retrieval services on billion‑scale embedding stores.

Authors

  • Nam Anh Dang
  • Ben Landrum
  • Ken Birman

Paper Information

  • arXiv ID: 2512.09331v1
  • Categories: cs.DC, cs.IR
  • Published: December 10, 2025
  • PDF: https://arxiv.org/pdf/2512.09331v1