Making an ANN Like Faiss Is Not Everyone’s Cup of Tea
Source: Dev.to
(A survival guide you didn’t ask for)
Building an ANN system like Faiss isn’t hard.
Building a fast ANN system like Faiss will make you question every life decision you’ve ever made.
If you’re thinking, “How hard can vector search be?”… congratulations—this article is for you.
Act 1: The Innocent Beginning (Python Era)
You start in Python.
Life is good:
- NumPy works.
- Accuracy looks decent.
- Latency is… acceptable.
You tell yourself: “I’ll just prototype it. Later I’ll optimize.” – classic rookie mistake.
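To be fair, the Act 1 prototype really can be this small. Here's a minimal sketch of exact brute-force search in NumPy (the function name and shapes are my own, not from any particular library):

```python
import numpy as np

def brute_force_search(db: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact k-nearest-neighbor search over an (n, d) float32 matrix.

    Returns the indices of the k database rows closest to `query` in L2
    distance, nearest first. This is the ground truth every ANN index
    gets measured against.
    """
    # ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2; the ||q||^2 term is the
    # same for every row, so it can be dropped without changing the ranking.
    dists = (db * db).sum(axis=1) - 2.0 * (db @ query)
    top = np.argpartition(dists, k)[:k]   # top-k candidates, unordered
    return top[np.argsort(dists[top])]    # order them nearest-first
```

Using argpartition keeps the selection roughly O(n) instead of fully sorting all n distances, which is about the only optimization this stage needs.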
Act 2: “Let’s Rewrite It in C++” (Boss Music Starts)
At some point queries feel slow, and you say the forbidden words: “Let’s rewrite it in C++ for speed.”
Suddenly:
- You’re not debugging logic, you’re debugging existence.
- Segfaults, undefined behavior, memory crashes… for reasons you swear are illegal.
- Fix one bug → three new ones spawn.
Act 3: Speed‑Bound → Memory‑Bound (The Plot Twist)
Initially you’re speed‑bound:
- Bad loops
- Poor data layout
- Unoptimized math
You fix those, latency drops, you feel powerful—then nothing improves.
Realization: you’re no longer speed‑bound; you’re memory‑bound, and that’s where real suffering begins.
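A concrete taste of what "memory-bound" means, sketched in Python with made-up sizes: the fix is usually not smarter math but a better layout. A list of per-vector arrays scatters data across the heap, so every distance pass chases pointers; one contiguous block streams through the cache instead.

```python
import numpy as np

# Toy 10k x 64 database, purely for illustration.
n, d = 10_000, 64

# n separate float64 allocations scattered across the heap:
scattered = [np.random.rand(d) for _ in range(n)]

# One contiguous (n, d) float32 block: neighboring vectors are adjacent
# in memory, and float32 halves the bytes moved per distance pass.
packed = np.ascontiguousarray(np.stack(scattered), dtype=np.float32)

assert packed.flags["C_CONTIGUOUS"]
assert packed.nbytes == n * d * 4
```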
Act 4: Milliseconds Matter (You Finally Understand Big Tech)
Seconds were easy. Milliseconds are war.
- Change one file → latency spikes, QPS drops.
- Cache misses explode.
Lessons learned
- Cache misses can cost hundreds of QPS.
- Memory access latency dominates CPU speed.
- “Fast code” means nothing if data is in the wrong place.
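The "wrong place" part is visible even from NumPy. Here's a tiny sketch (array sizes are arbitrary) of how a column slice of a row-major array turns sequential reads into strided ones, which is exactly the access pattern that murders the cache at scale:

```python
import numpy as np

a = np.zeros((1024, 64), dtype=np.float32)  # C order: each row is contiguous

row = a[0]      # contiguous: elements 4 bytes apart, cache-line friendly
col = a[:, 0]   # strided: each element a full 256-byte row away from the next

assert row.flags["C_CONTIGUOUS"]
assert not col.flags["C_CONTIGUOUS"]
assert col.strides == (64 * 4,)  # 256-byte jump between consecutive elements
```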
Act 5: SIMD, AVX, OpenMP (False Hope Arc)
You go full try‑hard:
- SIMD, AVX2/AVX‑512
- OpenMP, BLAS
- Hand‑tuned loops
Reality hits again:
- Small batches → OpenMP overhead > benefit.
- Threads fight for cache; more cores ≠ more speed.
- Optimizations now need optimization.
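One trick that does tend to survive contact with reality is batching: pay the fixed overhead once per batch instead of once per query. The same idea sketched in NumPy (shapes invented for the example), where one matrix-matrix product replaces many matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32)).astype(np.float32)
queries = rng.standard_normal((8, 32)).astype(np.float32)

# One (8, 32) x (32, 1000) GEMM instead of 8 separate matrix-vector
# products: the same amortization that makes thread/SIMD setup worth paying.
batched = queries @ db.T                      # (8, 1000) scores in one call
looped = np.stack([db @ q for q in queries])  # same math, 8 separate calls

assert np.allclose(batched, looped, atol=1e-4)
```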
Act 6: Python Bindings (New Boss, Same Pain)
“Fine, I’ll just expose this with Python bindings.”
Welcome to pybind11 + CMake hell:
- CMake can’t find pybind11, or pretends it doesn’t exist.
- Cryptic compiler messages.
Additional headaches:
- Managing memory ownership across Python, C++, and NumPy.
- Recall drops, speed lies.
You realize NumPy’s math is already optimized C under the hood, your C++ doesn’t automatically beat it, and you briefly consider throwing your CPU out the window.
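For what it's worth, a minimal CMakeLists.txt that usually escapes the "can't find pybind11" stage looks roughly like this (project and file names are placeholders; assumes pybind11 was installed via `pip install pybind11`):

```cmake
cmake_minimum_required(VERSION 3.15)
project(myann LANGUAGES CXX)

# pybind11 installed via pip ships its own CMake config; point CMake at it:
#   cmake -Dpybind11_DIR=$(python -m pybind11 --cmakedir) ..
find_package(pybind11 CONFIG REQUIRED)

# Builds a Python extension module named `myann` from the binding source.
pybind11_add_module(myann src/bindings.cpp)
target_compile_options(myann PRIVATE -O3 -march=native)
```

The `pybind11_DIR` hint is the piece most tutorials skip, and it is where most of the "denies it" errors come from.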
Act 7: Scalar C++ Reality Check
You try pure scalar C++.
Surprise: well‑optimized NumPy/Cython can beat naïve C++.
Your ego segfaults.
Now you:
- Learn data alignment, cache lines, prefetching.
- Understand why “just C++” is not enough.
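The same gap is visible inside Python itself: a naive per-element loop (morally what untuned scalar C++ does) versus one vectorized pass over contiguous memory. A small sketch with invented data, just to show both compute the identical thing:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((500, 16)).astype(np.float32)
q = rng.standard_normal(16).astype(np.float32)

# Naive elementwise loop: one scalar operation at a time.
naive = [sum((x - qi) ** 2 for x, qi in zip(row, q)) for row in db]

# Vectorized: one streaming pass through contiguous memory.
vectorized = ((db - q) ** 2).sum(axis=1)

assert np.allclose(naive, vectorized, atol=1e-3)
```

Same answer, wildly different cost per element; naive C++ without alignment, prefetching, and SIMD sits closer to the first line than its author expects.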
Final Act: The Faiss Reality Check
After all the tuning—memory, cache, layout, QPS, latency—you benchmark against Faiss and are nowhere near it.
Faiss isn’t just algorithms; it’s years of low‑level pain, tuning, and memory mastery.
Advice From Someone Who Survived (Barely)
Step 1: Start in Python.
- Build the algorithm first.
- Validate accuracy.
- If it’s good enough, stop here and be happy.
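"Validate accuracy" in practice means measuring recall@k against the exact brute-force answer. A minimal sketch of that check (the example IDs are made up):

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the true k nearest neighbors the ANN actually returned."""
    k = exact_ids.shape[0]
    return len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / k

exact = np.array([3, 9, 14, 2, 41])    # ground truth from brute force
approx = np.array([3, 14, 7, 2, 50])   # a hypothetical ANN answer
assert recall_at_k(approx, exact) == 0.6  # found 3 of the true 5
```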
Step 2: Move to C++ only if:
- You hit real memory limits.
- You hit real latency ceilings.
- You understand what you’re signing up for.
Step 3: Optimization Hell
- SIMD, AVX, OpenMP (carefully)
- Cache‑aware design
- Memory‑first thinking
If you reach this stage… congrats. This is where hating your life officially begins.
- Writing an ANN engine is fun.
- Writing a fast ANN engine is pain.
- Writing one that competes with Faiss? That’s a boss‑fight marathon.
If you’re still here – respect. 🫡
If you’re thinking of starting – I warned you.
Now excuse me while I benchmark again and cry over cache misses.
So yeah… this project is already halfway there. There is unfinished business. The ANN is coming. It will be open‑sourced, not “soon™”, not “startup soon”, but soon: the kind of soon where the code already exists and the pain is already paid for.