Design a Distributed Key-Value Store

Building a distributed key-value store from scratch is a deep system design question that tests your knowledge of distributed systems fundamentals. It covers consistent hashing, replication, consensus, and the CAP theorem.

Requirements

Functional:

  • put(key, value): store a key-value pair
  • get(key) → value: retrieve a value by key
  • delete(key): remove a key
  • Keys and values are strings (values up to 10 MB)

Non-functional:

  • 10B keys, average value size: 1 KB
  • 100,000 reads/sec, 10,000 writes/sec
  • High availability (99.99%)
  • Tunable consistency
  • Eventual consistency acceptable for reads

Scale Estimation

Total data: 10B keys × 1 KB avg value = 10 TB
With 3× replication: 30 TB total storage

If each node holds 500 GB → need 60 nodes
Reads: 100,000/sec distributed across nodes → ~1,700 reads/node/sec
Writes: 10,000/sec → ~167 writes/node/sec (manageable)
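The arithmetic above can be sanity-checked in a few lines (using a decimal 1 KB = 1,000 bytes to match the round numbers; constant names are illustrative):

```javascript
// Back-of-envelope check of the capacity estimate above
const KEYS = 10e9;           // 10 billion keys
const VALUE_BYTES = 1000;    // ~1 KB average value (decimal, for round numbers)
const REPLICATION = 3;
const NODE_CAPACITY = 500e9; // 500 GB usable per node

const rawBytes = KEYS * VALUE_BYTES;       // 1e13 = 10 TB
const totalBytes = rawBytes * REPLICATION; // 3e13 = 30 TB
const nodeCount = Math.ceil(totalBytes / NODE_CAPACITY);
console.log(nodeCount); // 60
```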

Key Design Decisions

Data Partitioning: Consistent Hashing

With 60 nodes, how do you decide which node stores key="user:123"?

Naive approach: node = hash(key) % num_nodes

Problem: Adding/removing a node remaps almost every key → massive data movement.

Consistent hashing:

  • Place nodes on a "ring" of hash space (0 to 2^32)
  • Each key maps to the next node clockwise on the ring
  • Adding/removing a node only remaps ~1/N of keys
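The cost of the naive scheme is easy to demonstrate. The sketch below (illustrative key names, FNV-1a as a stand-in hash) counts how many keys change owner under `hash(key) % num_nodes` when a 10-node cluster grows to 11:

```javascript
// How many of 10,000 keys move when a modulo-placed cluster grows from 10 to 11 nodes?
function fnv1a(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV-1a prime, 32-bit multiply
  }
  return h >>> 0;
}

const TOTAL = 10000;
let moved = 0;
for (let i = 0; i < TOTAL; i++) {
  const h = fnv1a(`key:${i}`);
  if (h % 10 !== h % 11) moved++; // key lands on a different node
}
console.log(`${((100 * moved) / TOTAL).toFixed(1)}% of keys moved`);
// Typically ~90% move; consistent hashing would move only ~1/11 ≈ 9%
```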
Ring positions (clockwise): Node A at 0, Node B at 100,
Node C at 200, Node D at 300

When key "foo" is hashed to 85:
→ Walking clockwise from 85, the first node reached is Node B (hash: 100)
→ So "foo" is stored on Node B (and its replicas on the nodes after it)

Virtual nodes: each physical node owns multiple positions on the ring (e.g., 150 virtual nodes per physical node). This smooths out the key distribution and lets heterogeneous hardware carry proportionally more or fewer vnodes.

class ConsistentHashRing {
  constructor(virtualNodes = 150) {
    this.virtualNodes = virtualNodes; // vnodes per physical node
    this.ring = new Map();            // hash → node name
    this.sortedHashes = [];           // ring positions, kept sorted
    this.nodes = new Map();           // node name → node info
  }

  addNode(nodeName, nodeInfo) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${nodeName}:vnode:${i}`);
      this.ring.set(hash, nodeName);
    }
    this.nodes.set(nodeName, nodeInfo);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  removeNode(nodeName) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${nodeName}:vnode:${i}`);
      this.ring.delete(hash);
    }
    this.nodes.delete(nodeName);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getNodesForKey(key, count = 3) {
    const keyHash = this.hash(key);
    const nodes = [];
    const seen = new Set();

    // Binary search for the first ring position >= keyHash
    let lo = 0, hi = this.sortedHashes.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.sortedHashes[mid] < keyHash) lo = mid + 1;
      else hi = mid;
    }

    // Walk clockwise from there (wrapping around the ring),
    // collecting `count` unique physical nodes
    for (let i = 0; i < this.sortedHashes.length; i++) {
      const pos = this.sortedHashes[(lo + i) % this.sortedHashes.length];
      const nodeName = this.ring.get(pos);
      if (!seen.has(nodeName)) {
        seen.add(nodeName);
        nodes.push({ nodeName, nodeInfo: this.nodes.get(nodeName) });
        if (nodes.length === count) break;
      }
    }
    return nodes;
  }

  // FNV-1a: simple with uniform distribution; production systems
  // often use murmur3 or xxHash instead
  hash(key) {
    let h = 0x811c9dc5;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    return h >>> 0; // unsigned 32-bit integer
  }
}

Replication

Store each key on N nodes (typically N=3) for durability and availability.

Client writes key "user:123"
        │
        ▼
Coordinator node (primary replica)
  ├── Writes locally
  ├── Replicates to Replica 1 (next node on ring)
  └── Replicates to Replica 2 (node after that)

Quorum-based reads and writes:

With N=3 replicas:

  • W = 2 (write quorum): write must succeed on 2 nodes → strong durability
  • R = 2 (read quorum): read from 2 nodes → detect stale data
  • W + R > N (2 + 2 > 3) → read and write quorums always overlap → strong consistency
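The overlap guarantee is just arithmetic: any write set of W replicas and any read set of R replicas drawn from the same N must intersect in at least W + R - N nodes. A tiny helper (hypothetical name) makes the two configurations concrete:

```javascript
// With W + R > N, every read quorum intersects every write quorum
function quorumsOverlap(N, W, R) {
  // Worst case, the two sets are as disjoint as possible;
  // they still share max(0, W + R - N) nodes
  return W + R - N > 0;
}

console.log(quorumsOverlap(3, 2, 2)); // true: a read always sees the latest write
console.log(quorumsOverlap(3, 1, 1)); // false: stale reads possible
```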

For availability-first (eventual consistency):

  • W = 1, R = 1 → fast, but may read stale data
  • DynamoDB-style: eventually consistent reads by default, with strongly consistent reads available on request

class KVStore {
  constructor(ring, W = 2, R = 2) {
    this.ring = ring;
    this.W = W; // Write quorum
    this.R = R; // Read quorum
    this.N = 3; // Replication factor
  }

  async put(key, value) {
    const nodes = this.ring.getNodesForKey(key, this.N);
    const timestamp = Date.now();

    // Send the write to all N replicas, then require W acknowledgments
    const results = await Promise.allSettled(
      nodes.map(({ nodeInfo }) => this.writeToNode(nodeInfo, key, value, timestamp))
    );
    const successes = results.filter(r => r.status === 'fulfilled');

    if (successes.length < this.W) {
      throw new Error(`Write failed: only ${successes.length}/${this.W} acks received`);
    }
    return { success: true, timestamp };
  }

  async get(key) {
    const nodes = this.ring.getNodesForKey(key, this.N);

    // Send reads to all N replicas. (For simplicity this waits for all of
    // them to settle; a production client returns as soon as R succeed.)
    const results = await Promise.allSettled(
      nodes.map(({ nodeInfo }) => this.readFromNode(nodeInfo, key))
    );
    const responses = results
      .filter(r => r.status === 'fulfilled' && r.value != null)
      .map(r => r.value);

    if (responses.length < this.R) {
      throw new Error(`Read failed: only ${responses.length}/${this.R} replicas responded`);
    }

    // Return the most recent version (by timestamp)
    const latest = responses.reduce((max, curr) =>
      curr.timestamp > max.timestamp ? curr : max
    );

    // Read repair: bring stale replicas up to date in the background
    this.readRepair(key, latest, responses, nodes);

    return latest.value;
  }

  readRepair(key, latest, responses, allNodes) {
    // Fire-and-forget: rewrite the latest version to any replica that
    // returned an older timestamp (or failed to respond)
    for (const node of allNodes) {
      const resp = responses.find(r => r.nodeId === node.nodeInfo.id);
      if (!resp || resp.timestamp < latest.timestamp) {
        this.writeToNode(node.nodeInfo, key, latest.value, latest.timestamp)
          .catch(console.error);
      }
    }
  }

  // writeToNode / readFromNode are RPCs to individual storage nodes
  // (network transport omitted)
}

Conflict Resolution: Vector Clocks

When network partitions cause divergent updates, how do you resolve conflicts?

Problem:

Time 0: value = "Alice" (version 1)
Network partition occurs
Node A updates: value = "Alice Smith" (version 2a)
Node B updates: value = "A. Smith" (version 2b)
Partition heals: which version wins?

Vector clocks track causality:

{A: 1, B: 0} → Alice (initial write on node A)
{A: 2, B: 0} → Alice Smith (A updated)
{A: 1, B: 1} → A. Smith (B updated, didn't see A's update)

These are concurrent versions (neither happened "before" the other)
→ Application-level conflict resolution needed
→ Amazon's Dynamo returns both versions and lets the client resolve
→ Last-Write-Wins (using timestamps) is simpler but silently drops updates
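The before/after/concurrent classification above can be sketched in a few lines, representing each clock as a plain object mapping node name → counter (the function name is illustrative):

```javascript
// Compare two vector clocks: 'before', 'after', 'concurrent', or 'equal'
function compare(vc1, vc2) {
  const names = new Set([...Object.keys(vc1), ...Object.keys(vc2)]);
  let less = false, greater = false;
  for (const n of names) {
    const a = vc1[n] || 0, b = vc2[n] || 0;
    if (a < b) less = true;
    if (a > b) greater = true;
  }
  if (less && greater) return 'concurrent'; // conflict: app must resolve
  if (less) return 'before';
  if (greater) return 'after';
  return 'equal';
}

console.log(compare({ A: 1, B: 0 }, { A: 2, B: 0 })); // 'before'
console.log(compare({ A: 2, B: 0 }, { A: 1, B: 1 })); // 'concurrent'
```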

Storage Engine: LSM Tree

For write-heavy workloads, use a Log-Structured Merge Tree (like RocksDB):

Write path:
Write → WAL (Write-Ahead Log) → MemTable (in-memory sorted table)
                                    ↓ when MemTable is full
                                SSTable (Sorted String Table, immutable on disk)

Read path:
Check MemTable → Check recent SSTables → Check older SSTables
(Bloom filters tell us which SSTables to skip)
┌─────────────────┐
│    MemTable     │ ← All writes go here first (fast)
│   (in-memory)   │
└────────┬────────┘
         │ flush when full
         ▼
┌──────────────────────────────────────┐
│  Level 0 SSTables                    │ ← Recently flushed
│  SST1.sst   SST2.sst   SST3.sst      │
└──────────────────┬───────────────────┘
                   │ compaction
┌──────────────────▼───────────────────┐
│  Level 1 SSTables                    │ ← Compacted, sorted
└──────────────────┬───────────────────┘
                   │ compaction
┌──────────────────▼───────────────────┐
│  Level 2 SSTables                    │ ← Long-term storage
└──────────────────────────────────────┘
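The write and read paths above can be sketched with a toy in-memory model: a Map stands in for the MemTable, and sorted arrays stand in for SSTables (WAL, Bloom filters, and compaction omitted; all names are illustrative):

```javascript
// Toy LSM sketch: writes hit the MemTable; full MemTables flush to
// immutable sorted "SSTables"; reads check newest data first
class TinyLSM {
  constructor(memLimit = 3) {
    this.memLimit = memLimit;
    this.memtable = new Map();
    this.sstables = []; // newest first; each is a sorted [key, value] array
  }
  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memLimit) this.flush();
  }
  flush() {
    const sst = [...this.memtable.entries()].sort((a, b) => (a[0] < b[0] ? -1 : 1));
    this.sstables.unshift(sst); // immutable once written
    this.memtable = new Map();
  }
  get(key) {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (const sst of this.sstables) {          // newest to oldest
      const hit = sst.find(([k]) => k === key); // binary search in practice
      if (hit) return hit[1];
    }
    return undefined;
  }
}

const db = new TinyLSM();
db.put('a', 1); db.put('b', 2); db.put('c', 3); // triggers a flush
db.put('a', 99);                                // newer value in MemTable
console.log(db.get('a')); // 99 (MemTable shadows the older SSTable entry)
console.log(db.get('b')); // 2  (found in the flushed SSTable)
```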

Bloom filter: A probabilistic data structure that tells you "this key is definitely NOT in this SSTable" or "this key might be in this SSTable." Eliminates most unnecessary disk reads.
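A toy version of the idea, assuming two hash functions and a fixed bit array (real filters size both parameters from the expected key count and target false-positive rate):

```javascript
// Toy Bloom filter: set k bits per key; a key "might" be present only
// if all its bits are set. False positives possible, false negatives not.
class BloomFilter {
  constructor(bits = 1024) {
    this.bits = bits;
    this.bitset = new Uint8Array(bits);
  }
  hashes(key) {
    let h1 = 0x811c9dc5, h2 = 0x1003f; // two independent seeds
    for (let i = 0; i < key.length; i++) {
      h1 = Math.imul(h1 ^ key.charCodeAt(i), 0x01000193);
      h2 = Math.imul(h2 ^ key.charCodeAt(i), 0x5bd1e995);
    }
    return [(h1 >>> 0) % this.bits, (h2 >>> 0) % this.bits];
  }
  add(key) {
    for (const h of this.hashes(key)) this.bitset[h] = 1;
  }
  mightContain(key) {
    return this.hashes(key).every(h => this.bitset[h] === 1);
  }
}

const bf = new BloomFilter();
bf.add('user:123');
console.log(bf.mightContain('user:123')); // true (added keys are never missed)
```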


Full System Design

┌──────────────────────────────────────┐
│              Client SDK              │
│  - Consistent hashing                │
│  - Request routing                   │
│  - Retry logic                       │
└──────────────────┬───────────────────┘
                   │
     ┌─────────────┼───────────────┐
     │             │               │
┌────▼─────────┐ ┌─▼────────────┐ ┌▼──────────────┐
│    Node 1    │ │    Node 2    │ │    Node 3     │
│ - LSM store  │ │ - LSM store  │ │ - LSM store   │
│ - Replication│ │ - Replication│ │ - Replication │
│ - Gossip     │ │ - Gossip     │ │ - Gossip      │
└──────────────┘ └──────────────┘ └───────────────┘

Each node runs:
- Storage engine (LSM/RocksDB)
- Replication handler (sync writes to N-1 replicas)
- Gossip protocol (cluster membership, failure detection)
- Anti-entropy (background repair of inconsistencies)

Interview Follow-up Questions

Q: How do you handle node failures?

Hinted handoff: if a target replica is down, another node temporarily stores the data with a hint that it should be forwarded to the failed node when it recovers. Anti-entropy (Merkle tree comparison) repairs divergent replicas.

Q: How do you detect failures?

Use a gossip protocol: each node periodically exchanges heartbeat and membership information with a few random peers. If a node misses several consecutive heartbeats, it's marked as suspect, then failed. Membership changes propagate to the whole cluster in O(log N) gossip rounds.
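The suspect-then-failed escalation can be sketched as a missed-heartbeat counter per peer (thresholds and names here are illustrative, not from any particular implementation):

```javascript
// Mark a peer 'suspect' after missing a few heartbeats, 'failed' after more
class FailureDetector {
  constructor(suspectAfter = 3, failAfter = 6) {
    this.suspectAfter = suspectAfter;
    this.failAfter = failAfter;
    this.missed = new Map(); // peer → consecutive missed heartbeats
  }
  heartbeat(peer) {
    this.missed.set(peer, 0); // any heartbeat resets the counter
  }
  tick(peer) {
    // called once per interval in which no heartbeat arrived from peer
    this.missed.set(peer, (this.missed.get(peer) || 0) + 1);
  }
  status(peer) {
    const m = this.missed.get(peer) || 0;
    if (m >= this.failAfter) return 'failed';
    if (m >= this.suspectAfter) return 'suspect';
    return 'alive';
  }
}

const fd = new FailureDetector();
fd.heartbeat('node-2');
console.log(fd.status('node-2')); // 'alive'
for (let i = 0; i < 3; i++) fd.tick('node-2');
console.log(fd.status('node-2')); // 'suspect'
```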

Q: What's the difference between strong and eventual consistency?

Strong consistency: every read sees the most recent write (requires W+R > N). Eventual consistency: replicas may temporarily diverge but will converge given time (W=1, R=1). Eventual consistency offers lower latency and higher availability at the cost of stale reads.

Q: How is this different from Redis?

Redis is primarily single-threaded, in-memory, with optional persistence. Our design is a distributed, durable key-value store with tunable consistency (more like DynamoDB or Apache Cassandra). Redis is optimized for low-latency caching; our store is optimized for durability and distribution.