[Paper] Fantasy: Efficient Large-scale Vector Search on GPU Clusters with GPUDirect Async
Source: arXiv - 2512.02278v1
Overview
The paper introduces Fantasy, a system that lets you run massive vector similarity searches across a GPU cluster without being bottlenecked by data movement. By tightly coupling GPU computation with GPUDirect‑Async networking, Fantasy keeps GPUs fed with data even when the index is far larger than a single GPU’s memory, delivering high‑recall, low‑latency results at scale.
Key Contributions
- GPU‑cluster‑wide search pipeline – a design that overlaps index loading, network transfer, and similarity computation to minimize idle GPU time.
- GPUDirect‑Async integration – uses direct NIC‑to‑GPU memory transfers that bypass the host CPU and system RAM, sharply cutting data‑movement latency (a registration sketch follows this list).
- Scalable graph handling – supports graph‑based indexes (e.g., HNSW) that exceed the memory of any single GPU, distributing them across many nodes.
- Large‑batch query support – enables processing of thousands of queries per batch, improving throughput for real‑time AI services.
- Open‑source prototype & evaluation – provides a reference implementation and extensive benchmarks on multi‑node GPU clusters.
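To make the data‑plane idea concrete, here is a minimal sketch of the standard GPUDirect RDMA registration pattern that systems in this space build on: a buffer allocated in GPU memory is registered with the NIC so that remote peers can write graph shards into it directly. This is not Fantasy's code; the device selection, buffer size, and error handling are illustrative assumptions.

```cpp
// Minimal sketch (not the paper's code): register a GPU-resident buffer with an
// InfiniBand NIC so the NIC can DMA graph shards straight into device memory.
// Assumes an RDMA-capable NIC and the GPUDirect RDMA kernel module; connection
// setup and most error handling are omitted.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Open the first RDMA device and allocate a protection domain.
    int num_devices = 0;
    ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }
    ibv_context *ctx = ibv_open_device(devs[0]);
    ibv_pd *pd = ibv_alloc_pd(ctx);

    // Allocate the shard buffer directly in GPU memory (size is illustrative).
    const size_t shard_bytes = 256ull << 20;
    void *gpu_buf = nullptr;
    cudaMalloc(&gpu_buf, shard_bytes);

    // With GPUDirect RDMA enabled, ibv_reg_mr accepts the device pointer;
    // remote peers can then RDMA-write shards into GPU memory, bypassing the
    // host CPU and system RAM entirely.
    ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, shard_bytes,
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "GPU memory registration failed\n"); return 1; }
    printf("registered %zu bytes of GPU memory, rkey=0x%x\n", shard_bytes, mr->rkey);

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

GPUDirect Async goes a step further than this data‑path registration: it also moves the control path onto the GPU, so network operations can be triggered and polled without a CPU round trip.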
Methodology
Fantasy treats vector search as a two‑stage pipeline:
- Data‑plane (GPUDirect‑Async) – The NIC streams required graph partitions directly into GPU memory, avoiding the host CPU and system RAM.
- Compute‑plane (GPU kernels) – While one batch of queries is being processed, the graph data for the next batch is already being fetched, so the GPU rarely stalls waiting for data (see the double‑buffering sketch below).
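As a rough illustration of how the two planes overlap, the following double‑buffering sketch uses plain CUDA streams and events. The kernel, buffer sizes, and the host‑to‑device copy standing in for a GPUDirect‑Async receive are all assumptions for illustration, not the paper's implementation.

```cpp
// Minimal sketch (illustrative names, not Fantasy's code): a transfer stream
// prefetches the shard for batch i+1 while a compute stream searches the shard
// for batch i. A host-to-device copy (ideally from pinned memory) stands in
// for the GPUDirect-Async receive the real system would use.
#include <cuda_runtime.h>

// Placeholder for the graph-walk / distance kernel.
__global__ void search_shard(const float *shard, const float *queries,
                             float *results, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) results[i] = shard[i] + queries[i % 128];   // dummy work
}

void pipelined_search(const float *host_shards, const float *d_queries,
                      float *d_results, int num_batches, size_t shard_elems) {
    cudaStream_t xfer, compute;
    cudaStreamCreate(&xfer);
    cudaStreamCreate(&compute);

    // Double buffer: one shard being searched, one being filled.
    float *buf[2];
    cudaEvent_t ready[2], done[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&buf[i], shard_elems * sizeof(float));
        cudaEventCreate(&ready[i]);
        cudaEventCreate(&done[i]);
        cudaEventRecord(done[i], compute);      // mark both buffers as free
    }

    // Prefetch the first shard before any compute is issued.
    cudaMemcpyAsync(buf[0], host_shards, shard_elems * sizeof(float),
                    cudaMemcpyHostToDevice, xfer);
    cudaEventRecord(ready[0], xfer);

    const int threads = 256;
    const int blocks = (int)((shard_elems + threads - 1) / threads);
    for (int b = 0; b < num_batches; ++b) {
        int cur = b & 1, nxt = (b + 1) & 1;

        // Search batch b as soon as its shard has landed.
        cudaStreamWaitEvent(compute, ready[cur], 0);
        search_shard<<<blocks, threads, 0, compute>>>(buf[cur], d_queries,
                                                      d_results, shard_elems);
        cudaEventRecord(done[cur], compute);

        // Prefetch the shard for batch b+1 into the other buffer, but only
        // once the kernel that last read it (batch b-1) has finished.
        if (b + 1 < num_batches) {
            cudaStreamWaitEvent(xfer, done[nxt], 0);
            cudaMemcpyAsync(buf[nxt], host_shards + (size_t)(b + 1) * shard_elems,
                            shard_elems * sizeof(float), cudaMemcpyHostToDevice, xfer);
            cudaEventRecord(ready[nxt], xfer);
        }
    }
    cudaStreamSynchronize(compute);
    for (int i = 0; i < 2; ++i) {
        cudaFree(buf[i]);
        cudaEventDestroy(ready[i]);
        cudaEventDestroy(done[i]);
    }
    cudaStreamDestroy(xfer);
    cudaStreamDestroy(compute);
}
```

In the real system the prefetch is a GPUDirect‑Async network receive rather than a host copy, but the double‑buffering and event discipline that keep the compute stream from waiting are the same idea.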
The authors built a scheduler that partitions the global graph into shards, assigns each shard to a GPU, and orchestrates asynchronous transfers using CUDA streams and NCCL for inter‑GPU communication. The search algorithm itself remains a standard graph‑based nearest‑neighbor walk (e.g., HNSW), but the surrounding infrastructure ensures that the walk never waits for data.
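A single‑process, multi‑GPU sketch of that orchestration might look as follows, assuming one shard per local GPU, an NCCL broadcast of the query batch, and an all‑gather of per‑shard candidates. The buffer names, sizes, and the omitted graph‑walk step are illustrative, not the authors' scheduler.

```cpp
// Single-process, multi-GPU sketch (illustrative, not the paper's scheduler):
// each local GPU owns one graph shard, the query batch is broadcast to all
// shard owners, each GPU searches its shard, and per-shard candidates are
// all-gathered for a final re-rank.
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // One NCCL rank per local GPU (a multi-node deployment would use
    // ncclCommInitRank with a shared unique id instead).
    std::vector<int> devs(ndev);
    for (int i = 0; i < ndev; ++i) devs[i] = i;
    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, devs.data());

    const size_t batch = 4096, dim = 128, k = 32;   // illustrative sizes
    std::vector<float*> d_queries(ndev), d_cand(ndev), d_all_cand(ndev);
    std::vector<cudaStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_queries[i], batch * dim * sizeof(float));
        cudaMalloc(&d_cand[i], batch * k * sizeof(float));                  // local top-k
        cudaMalloc(&d_all_cand[i], (size_t)ndev * batch * k * sizeof(float));
    }
    // (Assume the query batch has already been uploaded to d_queries[0].)

    // 1) Broadcast the query batch from rank 0 to every shard owner (in place).
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclBroadcast(d_queries[i], d_queries[i], batch * dim, ncclFloat,
                      /*root=*/0, comms[i], streams[i]);
    ncclGroupEnd();

    // 2) (Omitted) each GPU walks its own shard on streams[i] and writes its
    //    local top-k candidate distances into d_cand[i].

    // 3) Gather every shard's candidates onto all GPUs for final re-ranking.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllGather(d_cand[i], d_all_cand[i], batch * k, ncclFloat,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

When the index exceeds aggregate GPU memory, the paper's scheduler additionally streams shards in and out of device memory, which is where the GPUDirect‑Async data plane shown earlier takes over.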
Results & Findings
- Throughput boost: Fantasy achieved up to 5× more queries per second (QPS) than a baseline CPU‑GPU hybrid that loads data synchronously.
- Latency reduction: End‑to‑end latency dropped from ~12 ms to < 3 ms for 128‑dimensional vectors on a 4‑node (8‑GPU) cluster.
- Scalability: The system maintained > 80 % of peak GPU compute utilization as the index grew from 10 M to 200 M vectors, far beyond a single‑GPU memory limit (~24 GB).
- Batch size impact: With batch sizes of 4 K–8 K queries, Fantasy’s pipeline kept the GPUs saturated, whereas smaller batches suffered from frequent stalls.
Practical Implications
- LLM‑powered retrieval: Services that fetch relevant document embeddings (e.g., RAG pipelines) can sustain far higher query rates without large CPU farms.
- Recommendation & search engines: Real‑time similarity look‑ups on product or user embeddings become feasible on existing GPU clusters, reducing infrastructure cost.
- Edge‑to‑cloud hybrid: By offloading only the graph shards needed for a request, developers can design “elastic” retrieval services that scale with demand.
- Simplified stack: Eliminating the CPU‑side loading step means fewer moving parts, easier deployment, and lower latency variance—critical for SLA‑bound applications.
Limitations & Future Work
- Network dependency: The gains hinge on high‑bandwidth, low‑latency interconnects (e.g., InfiniBand). Clusters with slower Ethernet may see diminished benefits.
- Graph‑type focus: Fantasy is evaluated primarily on HNSW‑style indexes; extending the pipeline to other ANN structures (e.g., IVF‑PQ) remains an open question.
- Fault tolerance: The current prototype assumes a stable cluster; handling node failures or dynamic scaling would require additional coordination logic.
- Memory fragmentation: As shards are streamed in/out, GPU memory fragmentation can affect long‑running workloads; smarter memory management is a possible improvement.
Fantasy demonstrates that with the right orchestration, GPUs can handle truly massive vector search workloads without being throttled by data movement—a promising direction for any developer building next‑generation AI‑driven retrieval systems.
Authors
- Yi Liu
- Chen Qian
Paper Information
- arXiv ID: 2512.02278v1
- Categories: cs.DC
- Published: December 1, 2025