Design a Distributed Key-Value Store
Source: Dev.to
Key‑Value Stores: Design and Architecture
A key‑value store is a non‑relational database that maps unique keys to values (usually a string or an object).
It supports two core operations:
put(key, value)
get(key)
The concept is simple, but serious trade‑offs appear at scale – every design must balance read performance, write performance, and memory usage.
1. Single‑node vs. Distributed
| Aspect | Single node | Distributed (multiple nodes) |
|---|---|---|
| Atomicity | All operations are atomic – reads always return the latest state, writes are immediately consistent. | Requires coordination; network partitions are inevitable. |
| Scaling limit | Hits capacity limits as data grows. | Data is sharded across many nodes; design becomes more complex. |
| Coordination overhead | None. | Must handle partitions, replication, quorum, etc. |
2. The CAP Theorem
When a network partition occurs you can guarantee at most two of the following three properties:
| Property | Description |
|---|---|
| Consistency | Every read receives the most recent write. |
| Availability | Every request receives a (non‑error) response. |
| Partition Tolerance | The system continues to operate despite network partitions. |
Ideal scenario: no partition → full consistency and availability.
When a partition occurs you must choose:
- Consistency – block writes until the failed node recovers (e.g., banking).
- Availability – keep serving reads/writes; stale reads may occur.
3. Data Partitioning – Consistent Hashing
- Hash ring – both servers and keys are mapped onto a circular space.
- Key placement – a key is stored on the first server encountered when moving clockwise from the key’s position.
Problems with naïve consistent hashing
- Uneven key distribution.
- Unequal partition sizes.
Solution – Virtual Nodes
Each physical server occupies multiple positions on the ring, smoothing distribution proportionally to its capacity.
Benefits
- Automatic scaling.
- Higher storage capacity.
- Improved data locality.
- Support for heterogeneous servers.
4. Replication & Quorum
Replication stores copies of each key on N distinct nodes (ideally across separate data‑centers).
If a node fails, other replicas can still serve the request.
When many nodes fail simultaneously, operations may block – this is where quorum helps.
4.1 Quorum Basics
Given:
- N – total replicas.
- R – number of replicas that must acknowledge a read.
- W – number of replicas that must acknowledge a write.
| Configuration | Effect |
|---|---|
R = 1, W = N | Optimized for fast reads. |
W = 1, R = N | Optimized for fast writes. |
W + R > N | Strong consistency guaranteed (e.g., N=3, W=R=2). |
W + R ≤ N | Strong consistency not guaranteed. |
Key insight: W + R > N forces every read set and every write set to overlap, guaranteeing at least one node holds the latest data.
4.2 Sloppy Quorum
Instead of requiring acknowledgements from a strict set of N nodes, any available nodes may temporarily handle the request when the designated nodes are down.
The system remains responsive during partial failures and still satisfies W + R > N once the cluster stabilizes.
4.3 Versioning & Conflict Resolution
- Each write increments a version number (or uses a vector clock).
- On a subsequent write, the client first fetches the latest version, increments it, and writes back.
- Readers compare versions and discard stale data.
DynamoDB uses this reconciliation technique on the read path.
5. Consistency Models
| Model | Guarantees |
|---|---|
| Strong consistency | All reads return the most up‑to‑date value. |
| Weak consistency | Reads may return stale data. |
| Eventual consistency | Reads may be stale, but replicas converge over time. |
6. Failure Detection – Gossip Protocol
- Each node keeps a membership list with heartbeat counters.
- Nodes periodically increment their counters and gossip them to random peers.
- If a node’s heartbeat stops incrementing beyond a threshold, it is marked offline.
Advantages
- Decentralized, low‑overhead, and scales well.
Handling failures
| Failure type | Technique |
|---|---|
| Temporary | Hinted handoff – a neighbor stores writes for the unavailable node and syncs them back later. |
| Permanent | Anti‑entropy with Merkle trees – efficiently identifies differing data portions between replicas, minimizing transfer. |
| Data‑center outage | Cross‑data‑center replication. |
7. Core Design Principles
- Client API –
get(key)andput(key, value). - Coordinator node – acts as a proxy between client and cluster.
- Consistent‑hashing ring – distributes keys across nodes.
- Full decentralisation – no single point of failure; every node is symmetric.
Write path
- Append to a commit log (durability).
- Write to an in‑memory cache (e.g., MemTable).
- When the cache exceeds a threshold, flush to an SSTable on disk.
Read path
- Check the in‑memory cache – return immediately on a hit.
- On a miss:
- Consult the Bloom filter to identify which SSTables may contain the key.
- Query the relevant SSTables and return the result.
8. Component Overview
| Component | Purpose |
|---|---|
| Consistent hashing | Evenly distribute keys across nodes. |
| Virtual nodes | Fix uneven distribution; map physical servers to multiple ring positions. |
| Commit log | Provide durability for writes. |
| MemTable (cache) | Fast in‑memory writes; later flushed to disk. |
| SSTable | Immutable on‑disk storage; enables efficient reads. |
| Bloom filter | Quickly test whether an SSTable might contain a key. |
| Merkle tree | Efficient anti‑entropy for replica synchronization. |
| Gossip protocol | Decentralised failure detection and membership management. |
| Hinted handoff | Temporary write buffering for unavailable nodes. |
| Quorum (R/W) | Control read/write consistency guarantees. |
This cleaned‑up markdown preserves the original structure and content while improving readability and formatting.
Consistent Hashing – Key Concepts
| Concept | Purpose |
|---|---|
| Replication | Prevent data loss; ensure availability |
| Quorum (W + R > N) | Balance consistency and availability |
| Versioning | Resolve conflicts between replicas |
| Gossip protocol | Decentralized failure detection |
| Hinted handoff | Handle temporary node failures |
| Merkle trees | Efficient sync after permanent failures |
| Bloom filters | Fast disk lookups on the read path |