Design a Distributed Key-Value Store

Published: 2 weeks ago (May 21, 2026 at 11:57 PM EDT)

6 min read

Source: Dev.to

Key‑Value Stores: Design and Architecture

A key‑value store is a non‑relational database that maps unique keys to values (usually a string or an object).
It supports two core operations:

put(key, value)
get(key)

The concept is simple, but serious trade‑offs appear at scale – every design must balance read performance, write performance, and memory usage.

1. Single‑node vs. Distributed

Aspect	Single node	Distributed (multiple nodes)
Atomicity	All operations are atomic – reads always return the latest state, writes are immediately consistent.	Requires coordination; network partitions are inevitable.
Scaling limit	Hits capacity limits as data grows.	Data is sharded across many nodes; design becomes more complex.
Coordination overhead	None.	Must handle partitions, replication, quorum, etc.

2. The CAP Theorem

When a network partition occurs you can guarantee at most two of the following three properties:

Property	Description
Consistency	Every read receives the most recent write.
Availability	Every request receives a (non‑error) response.
Partition Tolerance	The system continues to operate despite network partitions.

Ideal scenario: no partition → full consistency and availability.
When a partition occurs you must choose:

Consistency – block writes until the failed node recovers (e.g., banking).
Availability – keep serving reads/writes; stale reads may occur.

3. Data Partitioning – Consistent Hashing

Hash ring – both servers and keys are mapped onto a circular space.
Key placement – a key is stored on the first server encountered when moving clockwise from the key’s position.

Problems with naïve consistent hashing

Uneven key distribution.
Unequal partition sizes.

Solution – Virtual Nodes
Each physical server occupies multiple positions on the ring, smoothing distribution proportionally to its capacity.

Benefits

Automatic scaling.
Higher storage capacity.
Improved data locality.
Support for heterogeneous servers.

4. Replication & Quorum

Replication stores copies of each key on N distinct nodes (ideally across separate data‑centers).

If a node fails, other replicas can still serve the request.
When many nodes fail simultaneously, operations may block – this is where quorum helps.

4.1 Quorum Basics

Given:

N – total replicas.
R – number of replicas that must acknowledge a read.
W – number of replicas that must acknowledge a write.

Configuration	Effect
`R = 1, W = N`	Optimized for fast reads.
`W = 1, R = N`	Optimized for fast writes.
`W + R > N`	Strong consistency guaranteed (e.g., N=3, W=R=2).
`W + R ≤ N`	Strong consistency not guaranteed.

Key insight: W + R > N forces every read set and every write set to overlap, guaranteeing at least one node holds the latest data.

4.2 Sloppy Quorum

Instead of requiring acknowledgements from a strict set of N nodes, any available nodes may temporarily handle the request when the designated nodes are down.
The system remains responsive during partial failures and still satisfies W + R > N once the cluster stabilizes.

4.3 Versioning & Conflict Resolution

Each write increments a version number (or uses a vector clock).
On a subsequent write, the client first fetches the latest version, increments it, and writes back.
Readers compare versions and discard stale data.

DynamoDB uses this reconciliation technique on the read path.

5. Consistency Models

Model	Guarantees
Strong consistency	All reads return the most up‑to‑date value.
Weak consistency	Reads may return stale data.
Eventual consistency	Reads may be stale, but replicas converge over time.

6. Failure Detection – Gossip Protocol

Each node keeps a membership list with heartbeat counters.
Nodes periodically increment their counters and gossip them to random peers.
If a node’s heartbeat stops incrementing beyond a threshold, it is marked offline.

Advantages

Decentralized, low‑overhead, and scales well.

Handling failures

Failure type	Technique
Temporary	Hinted handoff – a neighbor stores writes for the unavailable node and syncs them back later.
Permanent	Anti‑entropy with Merkle trees – efficiently identifies differing data portions between replicas, minimizing transfer.
Data‑center outage	Cross‑data‑center replication.

7. Core Design Principles

Client API – get(key) and put(key, value).
Coordinator node – acts as a proxy between client and cluster.
Consistent‑hashing ring – distributes keys across nodes.
Full decentralisation – no single point of failure; every node is symmetric.

Write path

Append to a commit log (durability).
Write to an in‑memory cache (e.g., MemTable).
When the cache exceeds a threshold, flush to an SSTable on disk.

Read path

Check the in‑memory cache – return immediately on a hit.
On a miss:
- Consult the Bloom filter to identify which SSTables may contain the key.
- Query the relevant SSTables and return the result.

8. Component Overview

Component	Purpose
Consistent hashing	Evenly distribute keys across nodes.
Virtual nodes	Fix uneven distribution; map physical servers to multiple ring positions.
Commit log	Provide durability for writes.
MemTable (cache)	Fast in‑memory writes; later flushed to disk.
SSTable	Immutable on‑disk storage; enables efficient reads.
Bloom filter	Quickly test whether an SSTable might contain a key.
Merkle tree	Efficient anti‑entropy for replica synchronization.
Gossip protocol	Decentralised failure detection and membership management.
Hinted handoff	Temporary write buffering for unavailable nodes.
Quorum (R/W)	Control read/write consistency guarantees.

This cleaned‑up markdown preserves the original structure and content while improving readability and formatting.

Consistent Hashing – Key Concepts

Concept	Purpose
Replication	Prevent data loss; ensure availability
Quorum (W + R > N)	Balance consistency and availability
Versioning	Resolve conflicts between replicas
Gossip protocol	Decentralized failure detection
Hinted handoff	Handle temporary node failures
Merkle trees	Efficient sync after permanent failures
Bloom filters	Fast disk lookups on the read path