'Vector Sharding': How to Organize a Library Without an Alphabet
Source: Dev.to
Welcome back to our AI at Scale series!
In our last post we explored Semantic Caching, the "brainy" way to save money and time by remembering what we've already asked our AI. As your application grows from a few thousand users to millions, you hit a massive wall: the memory limit.
The Challenge of Vector Databases
Imagine you are the librarian of the world's most advanced library. Instead of books being organized by title, they are organized by "vibe" (vectors). If someone wants a book about "lonely robots in space," you have to search the entire library to find the closest match.
- Memory: You can't fit the index of 1 billion "vibes" in a single server's RAM.
- Speed: Searching through a billion items for every user request is slow, even for a computer.
Sharding: Splitting the Library
When one machine is too small for the job, we shard.
Sharding is the process of splitting a massive database into smaller, manageable chunks called shards. Each shard lives on a different server.
Traditional vs. Vector Sharding
| Traditional DB | Vector DB |
|---|---|
| Shards by a deterministic key (e.g., User ID) | Shards by similarity (more complex) |
Two Main Approaches
1. Uniform Distribution
- Spread your 1 billion vectors across 10 servers (~100 million each).
- Aggregator sends each query to all 10 servers simultaneously.
- Merge: Each server returns its top 5 matches (50 total). The aggregator picks the best of the best.
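The scatter-gather pattern above can be sketched in a few lines. This is a minimal in-process simulation: the "shards" are NumPy arrays standing in for remote servers, and brute-force distance search stands in for each shard's ANN index. The names (`search_shard`, `scatter_gather`) are illustrative, not from any particular library.

```python
import heapq
import numpy as np

# Hypothetical in-memory "shards": each holds a slice of the full vector set.
# In production these would be separate servers answering over the network.
rng = np.random.default_rng(0)
shards = [rng.normal(size=(1000, 64)) for _ in range(10)]

def search_shard(shard, query, k=5):
    """Brute-force nearest neighbors within one shard (stand-in for HNSW)."""
    dists = np.linalg.norm(shard - query, axis=1)
    idx = np.argsort(dists)[:k]
    return [(float(dists[i]), int(i)) for i in idx]

def scatter_gather(query, k=5):
    """Fan the query out to every shard, then merge the partial top-k lists."""
    partials = []
    for shard_id, shard in enumerate(shards):
        for dist, local_idx in search_shard(shard, query, k):
            partials.append((dist, shard_id, local_idx))
    # Aggregator keeps only the global best k out of the 10 * k candidates.
    return heapq.nsmallest(k, partials)

query = rng.normal(size=64)
top5 = scatter_gather(query)
```

Note that each shard must return a full top-k list, not just its single best hit: all k global winners could live on the same shard.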
2. Metadata-Based Sharding
If your data has clear categories (e.g., "Language" or "Product Category"), shard based on those metadata tags.
- Benefit: If a user searches only within "Medical Research," you query only the "Medical" shards, leaving the "Sports" and "Cooking" shards free for other traffic.
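A metadata router can be as simple as a dictionary lookup in front of the aggregator. This sketch assumes a hypothetical shard map (`SHARDS_BY_CATEGORY`); the shard names are made up for illustration.

```python
from typing import List, Optional

# Hypothetical shard map keyed by a metadata tag.
SHARDS_BY_CATEGORY = {
    "medical": ["medical-shard-0", "medical-shard-1"],
    "sports": ["sports-shard-0"],
    "cooking": ["cooking-shard-0"],
}

def route_query(category: Optional[str]) -> List[str]:
    """Return only the shards that can contain matches for this query."""
    if category is None:
        # No filter: fall back to fanning out across every shard.
        return [s for group in SHARDS_BY_CATEGORY.values() for s in group]
    return SHARDS_BY_CATEGORY.get(category, [])
```

A filtered query touches only its category's shards, while an unfiltered one degrades gracefully to the uniform scatter-gather path.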
HNSW and Memory Constraints
Most modern vector databases use HNSW (Hierarchical Navigable Small World), a "six degrees of separation" map for high-dimensional data.
- RAM Requirement: HNSW needs to live in RAM to be fast.
- Problem: A 500 GB index on a server with 128 GB of RAM forces swapping to disk, turning a 50 ms search into several seconds.
Sharding keeps each HNSW index small enough to stay entirely in highโspeed memory.
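You can estimate how many shards you need with back-of-the-envelope arithmetic. The numbers below are rough assumptions (4-byte floats, ~32 graph links per node, 70% of RAM usable for the index); real HNSW overhead varies by library and build parameters.

```python
import math

def hnsw_ram_bytes(n_vectors, dim, bytes_per_float=4, links_per_node=32):
    """Rough RAM estimate: raw vectors plus graph links (4-byte ids).
    Real overhead varies by library and parameters; this is a ballpark."""
    vectors = n_vectors * dim * bytes_per_float
    graph = n_vectors * links_per_node * 4
    return vectors + graph

def shards_needed(n_vectors, dim, ram_per_server_gb=128, headroom=0.7):
    """How many 128 GB servers keep every shard's index fully in RAM."""
    budget = ram_per_server_gb * (1024 ** 3) * headroom
    return math.ceil(hnsw_ram_bytes(n_vectors, dim) / budget)

# 1 billion 768-dim float32 vectors is ~3 TB of raw vectors alone,
# far beyond a single 128 GB server, hence sharding.
n = shards_needed(1_000_000_000, 768)
```

Under these assumptions the billion-vector index needs a few dozen servers; the point is that the shard count falls out of a RAM budget, not a guess.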
Tradeโoffs and Engineering Considerations
- Replication: If a shard server fails, you lose that portion of the index. Replicas of every shard are required for resilience.
- Rebalancing: As data grows, some shards become "hotter." Moving millions of vectors between servers while the system is live is a major engineering challenge.
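One classic way to bound how much data moves during rebalancing is consistent hashing, sketched below. This is my illustration, not something the post prescribes: it applies to id-based placement (e.g., routing vectors to shards by document id) rather than similarity-based placement, but the rebalancing principle carries over. When a shard is added, only the keys that hash near its ring positions change owners.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards, vnodes=100):
        self.ring = []  # sorted list of (hash, shard) points on the ring
        for s in shards:
            self.add(s, vnodes)

    @staticmethod
    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, shard, vnodes=100):
        # Each shard owns many small arcs of the ring, smoothing the load.
        for i in range(vnodes):
            bisect.insort(self.ring, (self._h(f"{shard}#{i}"), shard))

    def owner(self, key):
        # A key belongs to the first ring point at or after its hash.
        i = bisect.bisect(self.ring, (self._h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
before = {f"vec-{i}": ring.owner(f"vec-{i}") for i in range(1000)}
ring.add("shard-3")
moved = sum(before[k] != ring.owner(k) for k in before)
# Only roughly a quarter of the vectors change owners, not all of them.
```

With naive modulo hashing (`hash(key) % num_shards`), adding a server would reshuffle almost every vector; here only the slice claimed by the new shard moves.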
Why Vector Sharding Matters
Vector sharding is the difference between a cool AI demo and a top-tier AI platform. It forces high-dimensional math to work within the physical limits of hardware.
Next in the "AI at Scale" series: Rate Limiting for LLM APIs, or how to keep your API keys from melting under pressure.