Design a Distributed Key-Value Store
Building a distributed key-value store from scratch is a deep system design question that tests your knowledge of distributed systems fundamentals. It covers consistent hashing, replication, consensus, and the CAP theorem.
Requirements
Functional:
- put(key, value): store a key-value pair
- get(key) → value: retrieve a value by key
- delete(key): remove a key
- Keys and values are strings (values up to 10 MB)
Non-functional:
- 10B keys, average value size: 1 KB
- 100,000 reads/sec, 10,000 writes/sec
- High availability (99.99%)
- Tunable consistency
- Eventual consistency acceptable for reads
Scale Estimation
Total data: 10B keys × 1 KB avg value = 10 TB
With 3× replication: 30 TB total storage
If each node holds 500 GB → need 60 nodes
Reads: 100,000/sec distributed across nodes → ~1,700 reads/node/sec
Writes: 10,000/sec → ~167 writes/node/sec (manageable)
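The arithmetic above can be sanity-checked in a few lines (a sketch using decimal units, 1 KB = 1,000 bytes; the capacity and replication figures are the ones assumed above):

```javascript
// Back-of-envelope capacity estimate for the key-value store.
const keys = 10e9;            // 10 billion keys
const valueBytes = 1000;      // 1 KB average value (decimal units)
const replication = 3;        // 3x replication
const nodeCapacityGB = 500;   // storage per node

const rawTB = (keys * valueBytes) / 1e12;                         // 10 TB
const totalTB = rawTB * replication;                              // 30 TB
const nodesNeeded = Math.ceil((totalTB * 1000) / nodeCapacityGB); // 60 nodes
```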
Key Design Decisions
Data Partitioning: Consistent Hashing
With 60 nodes, how do you decide which node stores key="user:123"?
Naive approach: node = hash(key) % num_nodes
Problem: adding/removing a node remaps almost every key → massive data movement.
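To make the problem concrete, here is a small demo (using a stand-in FNV-1a hash, since no particular hash function is mandated here) that counts how many of 10,000 keys change owners when a 4-node cluster grows to 5 under modulo placement:

```javascript
// FNV-1a: a simple 32-bit hash, good enough for this demo.
function fnv1a(key) {
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h = Math.imul(h ^ key.charCodeAt(i), 16777619);
  }
  return h >>> 0;
}

// Count keys whose owner changes when num_nodes goes from 4 to 5.
let moved = 0;
const total = 10000;
for (let i = 0; i < total; i++) {
  const h = fnv1a(`key:${i}`);
  if (h % 4 !== h % 5) moved++;
}
console.log(`${moved}/${total} keys moved`); // roughly 80% of keys move
```

With consistent hashing, only about 1/5 of the keys would move in the same scenario.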
Consistent hashing:
- Place nodes on a "ring" of hash space (0 to 2^32)
- Each key maps to the next node clockwise on the ring
- Adding/removing a node only remaps ~1/N of keys
Ring positions (clockwise): Node A (0) → Node D (90) → Node B (100) → Node C (200) → back to A

Key "foo" hashes to 85:
→ walk clockwise from position 85
→ the first node encountered is Node D (hash: 90)
→ so "foo" is stored on Node D
Virtual nodes: each physical node occupies many positions on the ring (e.g., 150 virtual nodes per physical node). This smooths out the key distribution, and heterogeneous hardware can be accommodated by assigning more virtual nodes to larger machines.
// JavaScript has no built-in sorted map, so we keep a sorted array of
// vnode hashes and binary-search it. `murmur3` is assumed to be provided
// by a hashing library (any uniform 32-bit hash, e.g. FNV-1a, also works).
class ConsistentHashRing {
  constructor(vnodesPerNode = 150) {
    this.vnodesPerNode = vnodesPerNode;
    this.ring = new Map();      // hash → node name
    this.sortedHashes = [];     // ring positions, ascending
    this.nodes = new Map();     // node name → node info
  }
  addNode(nodeName, nodeInfo) {
    for (let i = 0; i < this.vnodesPerNode; i++) {
      const hash = this.hash(`${nodeName}:vnode:${i}`);
      this.ring.set(hash, nodeName);
    }
    this.nodes.set(nodeName, nodeInfo);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }
  removeNode(nodeName) {
    for (let i = 0; i < this.vnodesPerNode; i++) {
      this.ring.delete(this.hash(`${nodeName}:vnode:${i}`));
    }
    this.nodes.delete(nodeName);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }
  // Binary search: index of the first ring position >= keyHash,
  // wrapping to 0 past the end of the ring.
  firstIndexAtOrAfter(keyHash) {
    let lo = 0, hi = this.sortedHashes.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.sortedHashes[mid] < keyHash) lo = mid + 1;
      else hi = mid;
    }
    return lo % this.sortedHashes.length;
  }
  getNodesForKey(key, count = 3) {
    const start = this.firstIndexAtOrAfter(this.hash(key));
    const nodes = [];
    const seen = new Set();
    // Walk clockwise, collecting `count` distinct physical nodes;
    // the modulo index handles wrap-around automatically.
    for (let i = 0; i < this.sortedHashes.length; i++) {
      const idx = (start + i) % this.sortedHashes.length;
      const nodeName = this.ring.get(this.sortedHashes[idx]);
      if (!seen.has(nodeName)) {
        seen.add(nodeName);
        nodes.push({ nodeName, nodeInfo: this.nodes.get(nodeName) });
        if (nodes.length === count) break;
      }
    }
    return nodes;
  }
  hash(key) {
    return murmur3(key) >>> 0; // unsigned 32-bit integer
  }
}
Replication
Store each key on N nodes (typically N=3) for durability and availability.
Client writes key "user:123"
             │
             ▼
Coordinator node (primary replica)
  ├── Writes locally
  ├── Replicates to Replica 1 (next node on ring)
  └── Replicates to Replica 2 (node after that)
Quorum-based reads and writes:
With N=3 replicas:
- W = 2 (write quorum): a write must succeed on 2 nodes → strong durability
- R = 2 (read quorum): read from 2 nodes → detect stale data
- W + R > N (2 + 2 > 3) → read and write sets are guaranteed to overlap → strong consistency
For availability-first (eventual consistency):
- W = 1, R = 1 → fast, but may read stale data
- DynamoDB-style: W = 1, R = 1 by default, with R = N available for strong reads
// `writeToNode` / `readFromNode` are network RPCs to a storage node;
// they are left abstract here.
class KVStore {
  constructor(ring, W = 2, R = 2, N = 3) {
    this.ring = ring;
    this.W = W; // write quorum
    this.R = R; // read quorum
    this.N = N; // replication factor
  }
  async put(key, value) {
    const nodes = this.ring.getNodesForKey(key, this.N);
    const timestamp = Date.now();
    // Send the write to all N replicas, require W acknowledgments
    const results = await Promise.allSettled(
      nodes.map(({ nodeInfo }) =>
        this.writeToNode(nodeInfo, key, value, timestamp)
      )
    );
    const acks = results.filter(r => r.status === 'fulfilled').length;
    if (acks < this.W) {
      throw new Error(`Write failed: only ${acks}/${this.W} acks received`);
    }
    return { success: true, timestamp };
  }
  async get(key) {
    const nodes = this.ring.getNodesForKey(key, this.N);
    // Ask all N replicas; require at least R successful responses
    const settled = await Promise.allSettled(
      nodes.map(({ nodeInfo }) => this.readFromNode(nodeInfo, key))
    );
    const results = settled
      .filter(r => r.status === 'fulfilled' && r.value)
      .map(r => r.value);
    if (results.length < this.R) {
      throw new Error(`Read failed: only ${results.length}/${this.R} replicas responded`);
    }
    // Return the most recent version (by timestamp)
    const latest = results.reduce((max, curr) =>
      curr.timestamp > max.timestamp ? curr : max
    );
    // Read repair: fire-and-forget update of stale replicas
    this.readRepair(key, latest, results, nodes);
    return latest.value;
  }
  readRepair(key, latest, readResults, allNodes) {
    for (const node of allNodes) {
      const replica = readResults.find(r => r.nodeId === node.nodeInfo.id);
      // Only repair replicas we heard from and that returned stale data
      if (replica && replica.timestamp < latest.timestamp) {
        this.writeToNode(node.nodeInfo, key, latest.value, latest.timestamp)
          .catch(console.error);
      }
    }
  }
}
Conflict Resolution: Vector Clocks
When network partitions cause divergent updates, how do you resolve conflicts?
Problem:
Time 0: value = "Alice" (version 1)
Network partition occurs
Node A updates: value = "Alice Smith" (version 2a)
Node B updates: value = "A. Smith" (version 2b)
Partition heals: which version wins?
Vector clocks track causality:
{A: 1, B: 0} → Alice (initial write on node A)
{A: 2, B: 0} → Alice Smith (A updated)
{A: 1, B: 1} → A. Smith (B updated, didn't see A's update)
These are concurrent versions (neither is "before" the other):
→ Application-level conflict resolution needed
→ Amazon's Dynamo (the original paper) returns both versions and lets the client resolve
→ Last-Write-Wins (using timestamps) is simpler but loses data
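The causality check itself is small. A sketch, representing a clock as a plain object mapping node id to counter (the function name is illustrative):

```javascript
// Compare two vector clocks: 'after' / 'before' if one causally
// dominates the other, 'concurrent' if each is ahead on some node,
// 'equal' if identical. Missing entries count as 0.
function compareClocks(a, b) {
  const ids = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false, bAhead = false;
  for (const id of ids) {
    const av = a[id] || 0, bv = b[id] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // conflict: resolve at app level
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}
```

For the example above, `compareClocks({A: 2, B: 0}, {A: 1, B: 1})` reports the two versions as concurrent.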
Storage Engine: LSM Tree
For write-heavy workloads, use a Log-Structured Merge Tree (like RocksDB):
Write path:
Write → WAL (Write-Ahead Log) → MemTable (in-memory sorted table)
                                    ↓ when MemTable is full
                                SSTable (Sorted String Table, immutable on disk)
Read path:
Check MemTable → Check recent SSTables → Check older SSTables
(Bloom filters tell us which SSTables to skip)
        ┌─────────────────┐
        │    MemTable     │  ← All writes go here first (fast)
        │   (in-memory)   │
        └────────┬────────┘
                 │ flush when full
                 ▼
┌─────────────────────────────────────────┐
│ Level 0 SSTables                        │  ← Recently flushed
│   SST1.sst   SST2.sst   SST3.sst        │
└────────────────────┬────────────────────┘
                     │ compaction
┌────────────────────▼────────────────────┐
│ Level 1 SSTables                        │  ← Compacted, sorted
└────────────────────┬────────────────────┘
                     │ compaction
┌────────────────────▼────────────────────┐
│ Level 2 SSTables                        │  ← Long-term storage
└─────────────────────────────────────────┘
Bloom filter: A probabilistic data structure that tells you "this key is definitely NOT in this SSTable" or "this key might be in this SSTable." Eliminates most unnecessary disk reads.
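A toy version makes the one-sided guarantee concrete (the sizes and the FNV-style mixing below are illustrative, not what a production engine like RocksDB uses):

```javascript
// Toy Bloom filter: k seeded hashes set bits in a fixed-size array.
// Lookups can report false positives but never false negatives.
class BloomFilter {
  constructor(bits = 1024, k = 3) {
    this.bits = bits;
    this.k = k;
    this.array = new Uint8Array(bits);
  }
  hash(key, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < key.length; i++) {
      h = Math.imul(h ^ key.charCodeAt(i), 16777619) >>> 0; // FNV-1a-style mix
    }
    return h % this.bits;
  }
  add(key) {
    for (let s = 0; s < this.k; s++) this.array[this.hash(key, s)] = 1;
  }
  mightContain(key) {
    for (let s = 0; s < this.k; s++) {
      if (!this.array[this.hash(key, s)]) return false; // definitely absent
    }
    return true; // possibly present (false positives allowed)
  }
}
```

In the read path, a negative answer lets the engine skip that SSTable's disk read entirely; a positive answer still requires checking the SSTable's index.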
Full System Design
┌───────────────────────────────────────────┐
│                Client SDK                 │
│  - Consistent hashing                     │
│  - Request routing                        │
│  - Retry logic                            │
└─────────────────────┬─────────────────────┘
                      │
      ┌───────────────┼───────────────┐
      │               │               │
┌─────▼─────────┐ ┌───▼───────────┐ ┌─▼─────────────┐
│    Node 1     │ │    Node 2     │ │    Node 3     │
│ - LSM store   │ │ - LSM store   │ │ - LSM store   │
│ - Replication │ │ - Replication │ │ - Replication │
│ - Gossip      │ │ - Gossip      │ │ - Gossip      │
└───────────────┘ └───────────────┘ └───────────────┘
Each node runs:
- Storage engine (LSM/RocksDB)
- Replication handler (sync writes to N-1 replicas)
- Gossip protocol (cluster membership, failure detection)
- Anti-entropy (background repair of inconsistencies)
Interview Follow-up Questions
Q: How do you handle node failures?
Hinted handoff: if a target replica is down, another node temporarily stores the data with a hint that it should be forwarded to the failed node when it recovers. Anti-entropy (Merkle tree comparison) repairs divergent replicas.
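A sketch of the hint bookkeeping, with illustrative names (a real implementation would also need retry, expiry, and durability for the hints themselves):

```javascript
// Hinted handoff: when a replica is down, the coordinator stores the
// write locally with a "hint" naming the intended node, then replays
// the hints once the failure detector reports that node healthy again.
class HintedHandoff {
  constructor() {
    this.hints = new Map(); // downNodeId → [{ key, value, timestamp }]
  }
  storeHint(downNodeId, key, value, timestamp) {
    if (!this.hints.has(downNodeId)) this.hints.set(downNodeId, []);
    this.hints.get(downNodeId).push({ key, value, timestamp });
  }
  // `writeToNode(nodeId, key, value, timestamp)` is an assumed RPC helper.
  async replay(nodeId, writeToNode) {
    const pending = this.hints.get(nodeId) || [];
    this.hints.delete(nodeId);
    for (const { key, value, timestamp } of pending) {
      await writeToNode(nodeId, key, value, timestamp);
    }
  }
}
```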
Q: How do you detect failures?
Use a gossip protocol. Each node periodically sends heartbeats to random peers. If a node misses several consecutive heartbeats, it's marked as suspect, then failed. Gossip dissemination converges in O(log N) rounds, where N here is the cluster size.
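A minimal heartbeat-based detector in that spirit (the 3-missed/6-missed thresholds are illustrative, not from any specific system):

```javascript
// Track the last heartbeat time per peer; derive a status from how
// many heartbeat intervals have elapsed since then.
class FailureDetector {
  constructor(intervalMs = 1000) {
    this.intervalMs = intervalMs;
    this.lastSeen = new Map(); // peerId → last heartbeat time (ms)
  }
  heartbeat(peerId, now = Date.now()) {
    this.lastSeen.set(peerId, now);
  }
  status(peerId, now = Date.now()) {
    const last = this.lastSeen.get(peerId);
    if (last === undefined) return 'unknown';
    const missed = (now - last) / this.intervalMs;
    if (missed >= 6) return 'failed';
    if (missed >= 3) return 'suspect';
    return 'alive';
  }
}
```

The suspect state matters: declaring a slow-but-alive node failed too eagerly triggers unnecessary data movement.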
Q: What's the difference between strong and eventual consistency?
Strong consistency: every read sees the most recent write (requires W+R > N). Eventual consistency: replicas may temporarily diverge but will converge given time (W=1, R=1). Eventual consistency offers lower latency and higher availability at the cost of stale reads.
Q: How is this different from Redis?
Redis is primarily single-threaded, in-memory, with optional persistence. Our design is a distributed, durable key-value store with tunable consistency (more like DynamoDB or Apache Cassandra). Redis is optimized for low-latency caching; our store is optimized for durability and distribution.