You Can't Resize a Bloom Filter. Here's What To Do Instead.
Source: Dev.to
What Are Bloom Filters?
Bloom filters are a probabilistic data structure that uses a bit array and a handful of hash functions to answer set‑membership queries, most commonly to reduce the load on a primary database.
You might wonder why not just use a caching system like Redis. The answer comes down to a trade‑off between accuracy and memory efficiency:
| System | Accuracy | Memory Efficiency |
|---|---|---|
| Basic Redis (or similar caches) | ✅ Exact | ❌ Not size‑efficient |
| Bloom filter | ❓ Probabilistic (possible false positives) | ✅ Very size‑efficient |
Planning a Bloom Filter
Before using a bloom filter you must decide:
- Size of the bit array (m) – once set, it cannot be changed.
- Number of hash functions (k) – also fixed after creation.

Both parameters depend on:
- n – the expected number of items to store.
- p – the acceptable false‑positive probability.
Choosing the Number of Hash Functions
The number of hash functions cannot be too small or too large. With too few, each item sets only a handful of bits, so unrelated items collide easily; with too many, the array fills with 1s quickly, which also drives up false positives.
The optimal value is given by the formula:
\[ k = \frac{m}{n}\,\ln 2 \]
Computing Optimal Parameters in Go
```go
package main

import (
	"fmt"
	"math"
)

// CalculateOptimalParams returns the optimal bit-array size (m) and
// number of hash functions (k) for a given item count (n) and false-positive
// rate (p).
func CalculateOptimalParams(n int, p float64) (uint, uint) {
	// 1. Ideal bit-array size (m)
	//    m = - (n * ln(p)) / (ln(2))²
	m := -float64(n) * math.Log(p) / math.Pow(math.Log(2), 2)

	// 2. Ideal number of hash functions (k)
	//    k = (m / n) * ln(2)
	k := (m / float64(n)) * math.Log(2)

	return uint(math.Ceil(m)), uint(math.Ceil(k))
}

func main() {
	n := 1000 // Expected number of items
	p := 0.01 // Desired false-positive rate (1%)

	m, k := CalculateOptimalParams(n, p)

	// Print the recommended parameters
	fmt.Println("Recommended Bit Array Size (m):", m)
	fmt.Println("Recommended Hash Functions (k):", k)
}
```
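Running this with n = 1000 and p = 0.01 should print roughly m = 9586 and k = 7 – about 1.2 KB of bits to track a thousand items at a 1 % error rate.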
Core Logic of a Bloom Filter
A bloom filter answers the question “Does this item possibly exist in the set?”
If the answer is definitely NO, we can skip a database lookup.
If the answer is maybe, we must verify against the real data source.
Example Scenario: Username Availability
- Traditional approach – run a SQL query:
  `SELECT 1 FROM users WHERE username = 'ayush' LIMIT 1;`
  Even with an index, the query cost is roughly O(log n).
- Bloom‑filter approach – check the filter first:
  - If it says NO, accept the username immediately.
  - If it says maybe, fall back to the database query.
This can dramatically reduce database load.
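As a rough sketch of that flow in Go (the IsUsernameTaken function, the MembershipFilter interface, and the query placeholder syntax are illustrative, not part of the article's code; any filter with an Exists method, such as the one implemented later in this article, would fit):

```go
package availability

import "database/sql"

// MembershipFilter is any filter that can answer "maybe present?" queries,
// such as the BloomFilter built later in this article.
type MembershipFilter interface {
	Exists(item string) bool
}

// IsUsernameTaken consults the filter before touching the database.
func IsUsernameTaken(filter MembershipFilter, db *sql.DB, username string) (bool, error) {
	// Definitely not in the filter: the name was never inserted,
	// so we can skip the database round-trip entirely.
	if !filter.Exists(username) {
		return false, nil
	}

	// "Maybe" present: the filter can be wrong, so confirm against the
	// source of truth. (Placeholder syntax depends on your SQL driver.)
	var one int
	err := db.QueryRow("SELECT 1 FROM users WHERE username = ? LIMIT 1", username).Scan(&one)
	if err == sql.ErrNoRows {
		return false, nil // the filter gave a false positive
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

The database is only touched on the "maybe" path, which for a well‑sized filter is a small fraction of all checks.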
How a Bloom Filter Works
A bloom filter consists of two components:
- Bit Array – an array of 0s and 1s (e.g., [0,0,0,0,0]).
- Hash Functions – deterministic functions that map a value to an index in the bit array.
Step‑by‑Step Walkthrough
Assume a bit array of 10 bits, all initially 0.
| Step | Action | Result |
|---|---|---|
| 1. Add “Ayush” | hash1("Ayush") = 2, hash2("Ayush") = 7 → set bits 2 and 7 | [0,0,1,0,0,0,0,1,0,0] |
| 2. Check “Kumar” | hash1("Kumar") = 3, hash2("Kumar") = 4 | Bits 3 and 4 are 0 → definitely NOT present |
| 3. Check “Anand” (a false positive) | hash1("Anand") = 2, hash2("Anand") = 7 | Both bits are 1 → maybe present (actually not) → fall back to DB |
Go Implementation of a Simple Bloom Filter
```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter holds the bit array and its size.
type BloomFilter struct {
	bitset []bool // 0/1 array
	size   int    // length of the array
}

// NewBloomFilter creates a filter with the given size.
func NewBloomFilter(size int) *BloomFilter {
	return &BloomFilter{
		bitset: make([]bool, size),
		size:   size,
	}
}

// hash1 – first hash function (FNV-1a).
func (bf *BloomFilter) hash1(s string) int {
	h := fnv.New32a()
	h.Write([]byte(s))
	return int(h.Sum32()) % bf.size
}

// hash2 – second hash function (FNV-1a with a salt).
func (bf *BloomFilter) hash2(s string) int {
	h := fnv.New32a()
	h.Write([]byte(s + "salt")) // simple salt to get a different hash
	return int(h.Sum32()) % bf.size
}

// Add inserts an item into the filter.
func (bf *BloomFilter) Add(item string) {
	i1 := bf.hash1(item)
	i2 := bf.hash2(item)
	bf.bitset[i1] = true
	bf.bitset[i2] = true
	fmt.Printf("Added: %s (indices %d, %d)\n", item, i1, i2)
}

// Exists checks whether an item *might* be in the set.
func (bf *BloomFilter) Exists(item string) bool {
	i1 := bf.hash1(item)
	i2 := bf.hash2(item)
	// Both bits must be true for a "maybe" answer.
	if bf.bitset[i1] && bf.bitset[i2] {
		return true // maybe
	}
	return false // definitely not
}

func main() {
	filter := NewBloomFilter(100)

	// 1. Add "Ayush"
	filter.Add("Ayush")

	// 2. Query "Ayush"
	fmt.Println("Does Ayush exist?", filter.Exists("Ayush")) // true (maybe)

	// 3. Query "Kumar"
	fmt.Println("Does Kumar exist?", filter.Exists("Kumar")) // false (definitely not)
}
```
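The sketch above hard‑codes exactly two hash functions. To honour the k value that CalculateOptimalParams recommends, a common trick is double hashing: derive all k indices from two base hashes. The indexesFor method below is an illustrative extension of the BloomFilter above (it reuses the same hash/fnv import), not part of the original code:

```go
// indexesFor derives k bit positions for an item from two FNV-1a hashes.
// Index i is (h1 + i*h2) mod size – the standard double-hashing scheme.
func (bf *BloomFilter) indexesFor(item string, k int) []int {
	h1 := fnv.New64a()
	h1.Write([]byte(item))
	a := h1.Sum64()

	h2 := fnv.New64a()
	h2.Write([]byte(item + "salt")) // same simple salt trick as hash2
	b := h2.Sum64()

	indices := make([]int, k)
	for i := 0; i < k; i++ {
		indices[i] = int((a + uint64(i)*b) % uint64(bf.size))
	}
	return indices
}
```

Add and Exists would then loop over these indices (setting or testing each bit) instead of calling hash1 and hash2 directly.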
Real‑World Uses
| Service | How It Uses Bloom Filters |
|---|---|
| Medium.com | Avoid recommending articles a user has already read. If the filter says “no”, the article is shown immediately; if “maybe”, a DB check confirms. |
| Google Chrome | Checks URLs against a list of known malicious sites. A “no” result means the URL is definitely not on the list, so the page loads immediately; a “maybe” triggers a deeper verification. |
| Cassandra, HBase, Redis (Bloom module) | Reduce disk I/O for read‑heavy workloads by quickly ruling out non‑existent keys. |
| Distributed caches (e.g., Memcached) | Prevent cache‑miss storms by pre‑filtering keys that are unlikely to be present. |
Takeaways
- Bloom filters trade a small, controllable false‑positive rate for massive memory savings and fast membership checks.
- Standard Bloom filters support inserts and membership checks but not deletions, and their size is fixed after creation.
- Properly sizing m and k using the formulas above is crucial to achieving the desired error rate.
Feel free to adapt the Go code to your own projects, adjust the number of hash functions, or replace the simple FNV hashes with stronger alternatives (e.g., MurmurHash, SipHash) for better distribution. Happy filtering!
Bloom Filters in Practice
- Web browsers – the list of known malicious URLs is too large to query remotely on every page load, so Chrome keeps it locally as a Bloom filter.
  - On a visit, Chrome checks the filter.
  - If the filter returns “maybe dangerous”, Chrome contacts Google’s servers for confirmation.
- Databases (Cassandra / PostgreSQL) – before scanning an on‑disk block for a row, the engine checks a Bloom filter.
  - If the filter says the row is absent, the expensive disk I/O is avoided.
Scaling Bloom Filters
The size of a Bloom filter (array length m and number of hash functions k) is usually chosen based on an assumed maximum number of items n and a target false‑positive rate p.
When the actual data set grows beyond the original assumption, the filter must be scaled.
Scalable Bloom Filters (SBF)
Redis Stack implements stacked Bloom filters (also called Scalable Bloom Filters).
Each new filter added to the stack is larger and strictly more accurate than the previous one.
How the Stacking Algorithm Works
A. Write Rule – Adding Data
| Step | Action |
|---|---|
| 1 | Identify the newest (active) filter (e.g., Filter 3). |
| 2 | Insert the new element (e.g., “Ayush”) only into this active filter. |
| 3 | Older filters (Filter 1, Filter 2) remain frozen – they are never modified. |
Why? Frozen filters preserve their original false‑positive guarantees; only the newest filter handles fresh inserts.
B. Read Rule – Checking Data
To answer the query “Does Ayush exist?”:
- Check Filter 3 – result: No (Ayush isn’t in the newest filter).
- Check Filter 2 – result: No.
- Check Filter 1 – result: Yes (Ayush was added two months ago).
The final answer is the logical OR of all checks:
Result = F3.Exists() || F2.Exists() || F1.Exists()
Problems with Naïve Stacking
- Every additional filter introduces its own false‑positive probability.
- If each filter has a 1 % error rate, two filters give ≈ 2 % overall error and ten filters give ≈ 10 % error – an unacceptable increase, as the quick estimate below shows.
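A rough estimate (assuming the layers misfire independently) shows why the per‑layer rates roughly add up: a query is a false positive if any layer misfires, so

\[ P_{\text{total}} = 1 - \prod_i (1 - p_i) \approx \sum_i p_i \quad \text{for small } p_i \]

With ten layers at 1 % each, 1 - 0.99^10 ≈ 9.6 %, which is essentially the 10 % figure above.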
How Redis Solves the Problem
Redis makes each successive filter tighter (lower false‑positive rate) than the one before it:
| Filter | Allowed Error Rate |
|---|---|
| Filter 1 | 0.1 % |
| Filter 2 | 0.01 % (10× stricter) |
| Filter 3 | 0.001 % (100× stricter) |
| … | … |
Because later filters are dramatically more precise, even a stack of 20 filters keeps the overall false‑positive rate well below 1 %.
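To make the stacking idea concrete, here is a rough Go sketch. It is not Redis's actual implementation: it assumes the BloomFilter type and the CalculateOptimalParams function from the earlier examples live in the same package, and the growth factor and tightening ratio are illustrative values.

```go
// ScalableBloomFilter stacks fixed-size filters: only the newest layer
// accepts writes, and each new layer is larger and stricter than the last.
type ScalableBloomFilter struct {
	layers   []*BloomFilter // layers[len(layers)-1] is the active filter
	counts   []int          // items inserted into each layer
	capacity int            // designed capacity of the active layer
	errRate  float64        // target error rate of the active layer
	growth   int            // capacity multiplier for each new layer
	tighten  float64        // error-rate multiplier for each new layer
}

func NewScalableBloomFilter(n int, p float64) *ScalableBloomFilter {
	m, _ := CalculateOptimalParams(n, p)
	return &ScalableBloomFilter{
		layers:   []*BloomFilter{NewBloomFilter(int(m))},
		counts:   []int{0},
		capacity: n,
		errRate:  p,
		growth:   2,   // double the capacity each time (illustrative)
		tighten:  0.1, // make each new layer 10x stricter (illustrative)
	}
}

// Add follows the write rule: insert only into the newest layer, creating a
// larger, stricter layer once the current one reaches its designed capacity.
func (s *ScalableBloomFilter) Add(item string) {
	last := len(s.layers) - 1
	if s.counts[last] >= s.capacity {
		s.capacity *= s.growth
		s.errRate *= s.tighten
		m, _ := CalculateOptimalParams(s.capacity, s.errRate)
		s.layers = append(s.layers, NewBloomFilter(int(m)))
		s.counts = append(s.counts, 0)
		last++
	}
	s.layers[last].Add(item)
	s.counts[last]++
}

// Exists follows the read rule: OR the answers from every layer.
func (s *ScalableBloomFilter) Exists(item string) bool {
	for _, layer := range s.layers {
		if layer.Exists(item) {
			return true // maybe present in at least one layer
		}
	}
	return false // definitely not present anywhere
}
```

Because the per‑layer error budgets form a geometric series (p, p·r, p·r², …), the total false‑positive rate stays bounded by roughly p / (1 - r) no matter how many layers accumulate – about 0.11 % for the 0.1 % / 10× schedule in the table above.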
Takeaway
- Bloom filters are widely used to avoid costly look‑ups.
- When the data set grows, Scalable Bloom Filters (stacked, increasingly strict filters) let you maintain a low overall error rate without rebuilding the entire structure.
- Redis’s implementation demonstrates that progressively tighter filters are the key to practical, low‑error scaling.