[Paper] Optimizing Bloom Filters for Modern GPU Architectures

Published: December 17, 2025 at 12:01 PM EST
4 min read
Source: arXiv - 2512.15595v1

Overview

This paper tackles a surprisingly under‑explored problem: how to make Bloom filters run at near‑optimal speed on modern GPUs. By systematically studying vectorization, thread cooperation, and latency‑hiding techniques, the authors deliver a GPU‑native design that breaks the traditional speed‑vs‑accuracy trade‑off, achieving more than an order‑of‑magnitude speed‑up over prior GPU implementations while keeping false‑positive rates low.

Key Contributions

  • Design space exploration of Bloom filter implementations on GPUs across three orthogonal axes: SIMD‑style vectorization, intra‑warp/thread‑block cooperation, and latency‑hiding strategies.
  • Cache‑aware layout that keeps the entire filter inside the GPU’s L2 cache domain, unlocking the highest throughput (a sizing sketch follows this list).
  • A unified implementation that delivers both high‑precision (low false‑positive) and high‑throughput performance, eliminating the usual compromise between the two.
  • Performance breakthrough: 11.35× faster bulk lookups and 15.4× faster construction than the previous state‑of‑the‑art, reaching >92 % of the theoretical “speed‑of‑light” bound on an NVIDIA B200 GPU.
  • Open‑source modular CUDA/C++ library (to be released) that can be dropped into existing data‑processing pipelines with minimal changes.
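
The cache‑aware layout hinges on straightforward sizing arithmetic. Below is a minimal sketch (not the authors' code) using the standard Bloom filter formulas m = −n·ln p / (ln 2)² and k = (m/n)·ln 2, then checking the resulting footprint against the device's reported L2 capacity; the helper name `size_bloom_filter` and the example values are illustrative.

```cuda
// Minimal sizing sketch (illustrative, not the paper's API).
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

struct BloomParams {
    std::uint64_t num_bits;   // m: total filter size in bits
    int           num_hashes; // k: hash functions per key
};

// Standard sizing: m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2.
BloomParams size_bloom_filter(std::uint64_t num_keys, double target_fpr) {
    const double ln2  = std::log(2.0);
    const double bits = -static_cast<double>(num_keys) * std::log(target_fpr) / (ln2 * ln2);
    BloomParams p;
    p.num_bits   = static_cast<std::uint64_t>(std::ceil(bits));
    p.num_hashes = static_cast<int>(std::lround(bits / num_keys * ln2));
    return p;
}

int main() {
    // Example: 100 M keys at a 0.5 % false-positive target.
    BloomParams p = size_bloom_filter(100000000ULL, 0.005);

    int l2_bytes = 0;
    cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, /*device=*/0);

    std::uint64_t filter_bytes = (p.num_bits + 7) / 8;
    std::printf("filter: %.1f MiB, k = %d, L2: %.1f MiB, fits L2: %s\n",
                filter_bytes / 1048576.0, p.num_hashes, l2_bytes / 1048576.0,
                filter_bytes <= static_cast<std::uint64_t>(l2_bytes) ? "yes" : "no");
    return 0;
}
```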

Methodology

  1. Baseline & Prior Art: The authors start from a classic GPU Bloom filter implementation (single‑thread per hash, naive memory accesses) and from the best published GPU variants.
  2. Three‑dimensional optimization grid (a simplified kernel sketch follows this list):
    • Vectorization: Pack multiple hash results into a single 32‑ or 64‑bit word and use CUDA warp‑wide primitives to apply them in parallel.
    • Thread Cooperation: Organize threads into warps and thread blocks that collaboratively load, update, and query the filter, reducing redundant memory traffic.
    • Latency Hiding: Overlap memory loads with computation by issuing asynchronous loads and employing shared‑memory staging buffers.
  3. Cache‑Fit Experiments: Vary filter size, number of hash functions, and block dimensions to keep the whole filter resident in the GPU’s L2 cache, measuring the impact on memory‑bandwidth utilization.
  4. Benchmark Suite: Synthetic workloads (bulk insert, bulk query) spanning error rates from 0.1 % to 10 % and real‑world traces (network packet filtering, k‑mer lookup in genomics). All experiments run on an NVIDIA B200 (Blackwell) GPU with CUDA 12.x.
  5. Analytical Model: The paper includes a simple performance model linking cache hit‑rate, warp occupancy, and instruction‑level parallelism to observed throughput, validating the model against empirical data.
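
To make these axes concrete, here is a minimal CUDA sketch of a register‑blocked query kernel in the spirit of the design above; it is not the authors' implementation. All k probe bits of a key are packed into one 64‑bit mask so a single word load answers the whole query (vectorization), and a grid‑stride loop keeps many independent queries in flight. The splitmix64‑style hash and the one‑word blocking are illustrative choices; the warp‑cooperative variants and asynchronous shared‑memory staging described in the paper are omitted for brevity.

```cuda
#include <cstddef>
#include <cstdint>

// splitmix64 finalizer, used here as a cheap illustrative hash.
__device__ __forceinline__ std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Blocked layout: every key maps to exactly one 64-bit filter word, and all k
// probe bits for the key are packed into a single 64-bit mask.
__device__ __forceinline__ void key_to_probe(std::uint64_t key,
                                             std::uint64_t num_words,
                                             int num_hashes,
                                             std::uint64_t* word_idx,
                                             std::uint64_t* mask) {
    std::uint64_t h = mix64(key);
    *word_idx = h % num_words;
    std::uint64_t m = 0;
    for (int i = 0; i < num_hashes; ++i) {
        h = mix64(h + 1);
        m |= 1ULL << (h & 63);   // one of the 64 bit positions in the word
    }
    *mask = m;
}

// Bulk query: one thread per key, grid-stride loop so many independent
// queries are in flight at once. A single 64-bit load answers all k probes;
// when the filter is L2-resident, that load usually hits in cache.
__global__ void bloom_query(const std::uint64_t* __restrict__ filter,
                            std::uint64_t num_words,
                            const std::uint64_t* __restrict__ keys,
                            bool* __restrict__ out,
                            std::size_t n,
                            int num_hashes) {
    const std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
    for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        std::uint64_t w, mask;
        key_to_probe(keys[i], num_words, num_hashes, &w, &mask);
        out[i] = (filter[w] & mask) == mask;
    }
}
```

A bulk‑construction kernel would follow the same pattern, replacing the load‑and‑compare with a single 64‑bit atomicOr on the target word; packing all k bits into one word trades a slightly higher false‑positive rate for a single memory access per key.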

Results & Findings

| Operation | Speed‑up vs. Prior GPU | Throughput | Error Rate | Cache Fit? |
|---|---|---|---|---|
| Bulk Lookup | 11.35× | 1,850 M ops/s | 0.5 % | Yes |
| Bulk Construction | 15.4× | 2,300 M ops/s | 0.5 % | Yes |
| Small Filter (fits L2) | 12–16× | up to 2.5 B ops/s | 0.1–5 % | Yes |
| Large Filter (spills to DRAM) | 3–5× | 600–800 M ops/s | 0.1–5 % | No |

  • Cache residency is the dominant factor: when the filter stays inside L2, memory bandwidth is no longer the bottleneck; the kernel becomes compute‑bound and reaches >92 % of the theoretical maximum given the GPU’s clock rate and instruction mix (a roofline‑style illustration follows this list).
  • No accuracy penalty: The same optimized kernel can be configured with many hash functions (high precision) without losing throughput, disproving the “high‑precision = slow” myth.
  • Scalability: Performance scales linearly with the number of active SMs, confirming that the design does not suffer from contention or serialization.
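
A roofline‑style argument makes the cache‑residency result intuitive. The bound below is a generic illustration, not the paper's exact analytical model, and all symbols are assumptions introduced here:

```latex
T_{\max} \;\approx\; \min\!\left( \frac{P_{\mathrm{inst}}}{I_{\mathrm{key}}},\; \frac{B_{\mathrm{eff}}}{b_{\mathrm{key}}} \right),
\qquad
B_{\mathrm{eff}} \;\approx\; h \, B_{\mathrm{L2}} + (1-h) \, B_{\mathrm{DRAM}}
```

Here P_inst is the GPU's instruction throughput, I_key the instructions executed per key, b_key the bytes touched per key, and h the L2 hit rate. As h approaches 1, the memory roof rises well above the compute roof, so the kernel becomes compute‑bound, which is the regime behind the >92 % “speed‑of‑light” figure above.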

Practical Implications

  • Databases & Stream Processing: Systems that rely on Bloom filters for cache invalidation, join‑filtering, or deduplication can now offload these checks to GPUs and keep up with multi‑TB/s data streams without sacrificing false‑positive guarantees.
  • Network Security: Real‑time packet‑filtering appliances can embed the GPU‑accelerated filter to handle billions of lookups per second, enabling finer‑grained rule sets (more hash functions) without latency spikes.
  • Genomics & Bioinformatics: K‑mer existence queries, a classic Bloom‑filter use case, can be accelerated dramatically, shrinking pipeline runtimes from hours to minutes on commodity GPU servers.
  • Hybrid CPU‑GPU Architectures: The modular CUDA/C++ library can be called from existing C++ codebases, allowing developers to keep the filter in GPU memory while the rest of the pipeline runs on the CPU, achieving a clean separation of concerns (a host‑side sketch follows this list).
  • Cost‑Efficiency: By squeezing more work out of a single GPU, organizations can reduce the number of required accelerator cards, lowering both capital and operational expenses (power, cooling).
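
As a concrete picture of that CPU/GPU split, the host‑side sketch below reuses the illustrative `bloom_query` kernel from the methodology section. It is not the paper's released API: `query_on_gpu` and the launch configuration are invented for illustration. The filter stays resident in device memory; the host only ships probe keys in and one result byte per key out.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Assumed to be defined as in the methodology sketch above.
__global__ void bloom_query(const std::uint64_t*, std::uint64_t,
                            const std::uint64_t*, bool*, std::size_t, int);

void query_on_gpu(const std::uint64_t* d_filter,      // filter, resident in GPU memory
                  std::uint64_t num_words,
                  const std::vector<std::uint64_t>& probes,
                  std::vector<std::uint8_t>& hits,
                  int num_hashes) {
    std::uint64_t* d_keys = nullptr;
    bool*          d_out  = nullptr;
    cudaMalloc(&d_keys, probes.size() * sizeof(std::uint64_t));
    cudaMalloc(&d_out,  probes.size() * sizeof(bool));
    cudaMemcpy(d_keys, probes.data(), probes.size() * sizeof(std::uint64_t),
               cudaMemcpyHostToDevice);

    // Grid-stride loop inside the kernel covers any probe count.
    bloom_query<<<1024, 256>>>(d_filter, num_words, d_keys, d_out,
                               probes.size(), num_hashes);

    hits.resize(probes.size());
    // sizeof(bool) == 1 on CUDA-supported platforms, so a byte copy is safe.
    cudaMemcpy(hits.data(), d_out, probes.size() * sizeof(bool),
               cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    cudaFree(d_out);
}
```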

Limitations & Future Work

  • Memory Footprint: The biggest gains require the entire filter to fit within the GPU’s L2 cache (tens of megabytes on current hardware). Very large filters still fall back to DRAM‑bound performance.
  • Hardware Specificity: The evaluation focuses on NVIDIA Blackwell (B200). While the design principles are portable, exact speed‑ups may differ on older architectures or on AMD GPUs.
  • Dynamic Updates: The current implementation excels at bulk construction and bulk queries; supporting high‑frequency incremental inserts/deletes would need additional synchronization mechanisms.
  • Future Directions: The authors plan to (1) explore hierarchical filters (e.g., cuckoo‑filter hybrids) that can spill gracefully to DRAM, (2) integrate the design into popular data‑processing frameworks (Apache Flink, Spark), and (3) benchmark on upcoming GPU generations with larger caches and tensor‑core‑style SIMD units.

Authors

  • Daniel Jünger
  • Kevin Kristensen
  • Yunsong Wang
  • Xiangyao Yu
  • Bertil Schmidt

Paper Information

  • arXiv ID: 2512.15595v1
  • Categories: cs.DC
  • Published: December 17, 2025