Design a Distributed Key-Value Store

Published: (May 21, 2026 at 11:57 PM EDT)
6 min read
Source: Dev.to

Source: Dev.to

Key‑Value Stores: Design and Architecture

A key‑value store is a non‑relational database that maps unique keys to values (usually a string or an object).
It supports two core operations:

put(key, value)
get(key)

The concept is simple, but serious trade‑offs appear at scale – every design must balance read performance, write performance, and memory usage.


1. Single‑node vs. Distributed

AspectSingle nodeDistributed (multiple nodes)
AtomicityAll operations are atomic – reads always return the latest state, writes are immediately consistent.Requires coordination; network partitions are inevitable.
Scaling limitHits capacity limits as data grows.Data is sharded across many nodes; design becomes more complex.
Coordination overheadNone.Must handle partitions, replication, quorum, etc.

2. The CAP Theorem

When a network partition occurs you can guarantee at most two of the following three properties:

PropertyDescription
ConsistencyEvery read receives the most recent write.
AvailabilityEvery request receives a (non‑error) response.
Partition ToleranceThe system continues to operate despite network partitions.

Ideal scenario: no partition → full consistency and availability.
When a partition occurs you must choose:

  • Consistency – block writes until the failed node recovers (e.g., banking).
  • Availability – keep serving reads/writes; stale reads may occur.

3. Data Partitioning – Consistent Hashing

  1. Hash ring – both servers and keys are mapped onto a circular space.
  2. Key placement – a key is stored on the first server encountered when moving clockwise from the key’s position.

Problems with naïve consistent hashing

  • Uneven key distribution.
  • Unequal partition sizes.

Solution – Virtual Nodes
Each physical server occupies multiple positions on the ring, smoothing distribution proportionally to its capacity.

Benefits

  • Automatic scaling.
  • Higher storage capacity.
  • Improved data locality.
  • Support for heterogeneous servers.

4. Replication & Quorum

Replication stores copies of each key on N distinct nodes (ideally across separate data‑centers).

If a node fails, other replicas can still serve the request.
When many nodes fail simultaneously, operations may block – this is where quorum helps.

4.1 Quorum Basics

Given:

  • N – total replicas.
  • R – number of replicas that must acknowledge a read.
  • W – number of replicas that must acknowledge a write.
ConfigurationEffect
R = 1, W = NOptimized for fast reads.
W = 1, R = NOptimized for fast writes.
W + R > NStrong consistency guaranteed (e.g., N=3, W=R=2).
W + R ≤ NStrong consistency not guaranteed.

Key insight: W + R > N forces every read set and every write set to overlap, guaranteeing at least one node holds the latest data.

4.2 Sloppy Quorum

Instead of requiring acknowledgements from a strict set of N nodes, any available nodes may temporarily handle the request when the designated nodes are down.
The system remains responsive during partial failures and still satisfies W + R > N once the cluster stabilizes.

4.3 Versioning & Conflict Resolution

  • Each write increments a version number (or uses a vector clock).
  • On a subsequent write, the client first fetches the latest version, increments it, and writes back.
  • Readers compare versions and discard stale data.

DynamoDB uses this reconciliation technique on the read path.


5. Consistency Models

ModelGuarantees
Strong consistencyAll reads return the most up‑to‑date value.
Weak consistencyReads may return stale data.
Eventual consistencyReads may be stale, but replicas converge over time.

6. Failure Detection – Gossip Protocol

  • Each node keeps a membership list with heartbeat counters.
  • Nodes periodically increment their counters and gossip them to random peers.
  • If a node’s heartbeat stops incrementing beyond a threshold, it is marked offline.

Advantages

  • Decentralized, low‑overhead, and scales well.

Handling failures

Failure typeTechnique
TemporaryHinted handoff – a neighbor stores writes for the unavailable node and syncs them back later.
PermanentAnti‑entropy with Merkle trees – efficiently identifies differing data portions between replicas, minimizing transfer.
Data‑center outageCross‑data‑center replication.

7. Core Design Principles

  1. Client APIget(key) and put(key, value).
  2. Coordinator node – acts as a proxy between client and cluster.
  3. Consistent‑hashing ring – distributes keys across nodes.
  4. Full decentralisation – no single point of failure; every node is symmetric.

Write path

  1. Append to a commit log (durability).
  2. Write to an in‑memory cache (e.g., MemTable).
  3. When the cache exceeds a threshold, flush to an SSTable on disk.

Read path

  1. Check the in‑memory cache – return immediately on a hit.
  2. On a miss:
    • Consult the Bloom filter to identify which SSTables may contain the key.
    • Query the relevant SSTables and return the result.

8. Component Overview

ComponentPurpose
Consistent hashingEvenly distribute keys across nodes.
Virtual nodesFix uneven distribution; map physical servers to multiple ring positions.
Commit logProvide durability for writes.
MemTable (cache)Fast in‑memory writes; later flushed to disk.
SSTableImmutable on‑disk storage; enables efficient reads.
Bloom filterQuickly test whether an SSTable might contain a key.
Merkle treeEfficient anti‑entropy for replica synchronization.
Gossip protocolDecentralised failure detection and membership management.
Hinted handoffTemporary write buffering for unavailable nodes.
Quorum (R/W)Control read/write consistency guarantees.

This cleaned‑up markdown preserves the original structure and content while improving readability and formatting.

Consistent Hashing – Key Concepts

ConceptPurpose
ReplicationPrevent data loss; ensure availability
Quorum (W + R > N)Balance consistency and availability
VersioningResolve conflicts between replicas
Gossip protocolDecentralized failure detection
Hinted handoffHandle temporary node failures
Merkle treesEfficient sync after permanent failures
Bloom filtersFast disk lookups on the read path
0 views
Back to Blog

Related posts

Read more »