Design a Distributed Key-Value Store

Building a distributed key-value store from scratch is a deep system design question that tests your knowledge of distributed systems fundamentals. It covers consistent hashing, replication, consensus, and the CAP theorem.

Requirements

Functional:

  • put(key, value): store a key-value pair
  • get(key) → value: retrieve a value by key
  • delete(key): remove a key
  • Keys and values are strings (values up to 10 MB)

Non-functional:

  • 10B keys, average value size: 1 KB
  • 100,000 reads/sec, 10,000 writes/sec
  • High availability (99.99%)
  • Tunable consistency
  • Eventual consistency acceptable for reads

Scale Estimation

Total data: 10B keys × 1 KB avg value = 10 TB
With 3× replication: 30 TB total storage

If each node holds 500 GB → need 60 nodes
Reads: 100,000/sec distributed across nodes → ~1,700 reads/node/sec
Writes: 10,000/sec → ~167 writes/node/sec (manageable)
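The arithmetic above can be sanity-checked in a few lines (using a decimal 1 KB = 1,000 bytes to match the round numbers; constant names are illustrative):

```javascript
// Back-of-envelope check of the capacity estimate above
const KEYS = 10e9;           // 10 billion keys
const VALUE_BYTES = 1000;    // ~1 KB average value (decimal, for round numbers)
const REPLICATION = 3;
const NODE_CAPACITY = 500e9; // 500 GB usable per node

const rawBytes = KEYS * VALUE_BYTES;       // 1e13 = 10 TB
const totalBytes = rawBytes * REPLICATION; // 3e13 = 30 TB
const nodeCount = Math.ceil(totalBytes / NODE_CAPACITY);
console.log(nodeCount); // 60
```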

Key Design Decisions

Data Partitioning: Consistent Hashing

With 60 nodes, how do you decide which node stores key="user:123"?

Naive approach: node = hash(key) % num_nodes

Problem: Adding/removing a node remaps almost every key → massive data movement.

Consistent hashing:

  • Place nodes on a "ring" of hash space (0 to 2^32)
  • Each key maps to the next node clockwise on the ring
  • Adding/removing a node only remaps ~1/N of keys
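The cost of the naive scheme is easy to demonstrate. The sketch below (illustrative key names, FNV-1a as a stand-in hash) counts how many keys change owner under `hash(key) % num_nodes` when a 10-node cluster grows to 11:

```javascript
// How many of 10,000 keys move when a modulo-placed cluster grows from 10 to 11 nodes?
function fnv1a(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV-1a prime, 32-bit multiply
  }
  return h >>> 0;
}

const TOTAL = 10000;
let moved = 0;
for (let i = 0; i < TOTAL; i++) {
  const h = fnv1a(`key:${i}`);
  if (h % 10 !== h % 11) moved++; // key lands on a different node
}
console.log(`${((100 * moved) / TOTAL).toFixed(1)}% of keys moved`);
// Typically ~90% move; consistent hashing would move only ~1/11 ≈ 9%
```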
Ring positions (clockwise): Node A at 0, Node B at 100,
Node C at 200, Node D at 300

When key "foo" is hashed to 85:
→ Walking clockwise from 85, the first node reached is Node B (hash: 100)
→ So "foo" is stored on Node B (and its replicas on the nodes after it)

Virtual nodes: each physical node owns multiple positions on the ring (e.g., 150 virtual nodes per physical node). This smooths out the key distribution and lets heterogeneous hardware carry proportionally more or fewer vnodes.

class ConsistentHashRing {
  constructor(virtualNodes = 150) {
    this.virtualNodes = virtualNodes; // vnodes per physical node
    this.ring = new Map();            // hash → node name
    this.sortedHashes = [];           // ring positions, kept sorted
    this.nodes = new Map();           // node name → node info
  }

  addNode(nodeName, nodeInfo) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${nodeName}:vnode:${i}`);
      this.ring.set(hash, nodeName);
    }
    this.nodes.set(nodeName, nodeInfo);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  removeNode(nodeName) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${nodeName}:vnode:${i}`);
      this.ring.delete(hash);
    }
    this.nodes.delete(nodeName);
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getNodesForKey(key, count = 3) {
    const keyHash = this.hash(key);
    const nodes = [];
    const seen = new Set();

    // Binary search for the first ring position >= keyHash
    let lo = 0, hi = this.sortedHashes.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.sortedHashes[mid] < keyHash) lo = mid + 1;
      else hi = mid;
    }

    // Walk clockwise from there (wrapping around the ring),
    // collecting `count` unique physical nodes
    for (let i = 0; i < this.sortedHashes.length; i++) {
      const pos = this.sortedHashes[(lo + i) % this.sortedHashes.length];
      const nodeName = this.ring.get(pos);
      if (!seen.has(nodeName)) {
        seen.add(nodeName);
        nodes.push({ nodeName, nodeInfo: this.nodes.get(nodeName) });
        if (nodes.length === count) break;
      }
    }
    return nodes;
  }

  // FNV-1a: simple with uniform distribution; production systems
  // often use murmur3 or xxHash instead
  hash(key) {
    let h = 0x811c9dc5;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    return h >>> 0; // unsigned 32-bit integer
  }
}

Replication

Store each key on N nodes (typically N=3) for durability and availability.

Client writes key "user:123"
        │
        ▼
Coordinator node (primary replica)
  ├── Writes locally
  ├── Replicates to Replica 1 (next node on ring)
  └── Replicates to Replica 2 (node after that)

Quorum-based reads and writes:

With N=3 replicas:

  • W = 2 (write quorum): write must succeed on 2 nodes → strong durability
  • R = 2 (read quorum): read from 2 nodes → detect stale data
  • W + R > N (2 + 2 > 3) → read and write quorums always overlap → strong consistency
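The overlap guarantee is just arithmetic: any write set of W replicas and any read set of R replicas drawn from the same N must intersect in at least W + R - N nodes. A tiny helper (hypothetical name) makes the two configurations concrete:

```javascript
// With W + R > N, every read quorum intersects every write quorum
function quorumsOverlap(N, W, R) {
  // Worst case, the two sets are as disjoint as possible;
  // they still share max(0, W + R - N) nodes
  return W + R - N > 0;
}

console.log(quorumsOverlap(3, 2, 2)); // true: a read always sees the latest write
console.log(quorumsOverlap(3, 1, 1)); // false: stale reads possible
```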

For availability-first (eventual consistency):

  • W = 1, R = 1 → fast, but may read stale data
  • DynamoDB-style: eventually consistent reads by default, with strongly consistent reads available on request

class KVStore {
  constructor(ring, W = 2, R = 2) {
    this.ring = ring;
    this.W = W; // Write quorum
    this.R = R; // Read quorum
    this.N = 3; // Replication factor
  }

  async put(key, value) {
    const nodes = this.ring.getNodesForKey(key, this.N);
    const timestamp = Date.now();

    // Send the write to all N replicas, then require W acknowledgments
    const results = await Promise.allSettled(
      nodes.map(({ nodeInfo }) => this.writeToNode(nodeInfo, key, value, timestamp))
    );
    const successes = results.filter(r => r.status === 'fulfilled');

    if (successes.length < this.W) {
      throw new Error(`Write failed: only ${successes.length}/${this.W} acks received`);
    }
    return { success: true, timestamp };
  }

  async get(key) {
    const nodes = this.ring.getNodesForKey(key, this.N);

    // Send reads to all N replicas. (For simplicity this waits for all of
    // them to settle; a production client returns as soon as R succeed.)
    const results = await Promise.allSettled(
      nodes.map(({ nodeInfo }) => this.readFromNode(nodeInfo, key))
    );
    const responses = results
      .filter(r => r.status === 'fulfilled' && r.value != null)
      .map(r => r.value);

    if (responses.length < this.R) {
      throw new Error(`Read failed: only ${responses.length}/${this.R} replicas responded`);
    }

    // Return the most recent version (by timestamp)
    const latest = responses.reduce((max, curr) =>
      curr.timestamp > max.timestamp ? curr : max
    );

    // Read repair: bring stale replicas up to date in the background
    this.readRepair(key, latest, responses, nodes);

    return latest.value;
  }

  readRepair(key, latest, responses, allNodes) {
    // Fire-and-forget: rewrite the latest version to any replica that
    // returned an older timestamp (or failed to respond)
    for (const node of allNodes) {
      const resp = responses.find(r => r.nodeId === node.nodeInfo.id);
      if (!resp || resp.timestamp < latest.timestamp) {
        this.writeToNode(node.nodeInfo, key, latest.value, latest.timestamp)
          .catch(console.error);
      }
    }
  }

  // writeToNode / readFromNode are RPCs to individual storage nodes
  // (network transport omitted)
}

Conflict Resolution: Vector Clocks

When network partitions cause divergent updates, how do you resolve conflicts?

Problem:

Time 0: value = "Alice" (version 1)
Network partition occurs
Node A updates: value = "Alice Smith" (version 2a)
Node B updates: value = "A. Smith" (version 2b)
Partition heals: which version wins?

Vector clocks track causality:

{A: 1, B: 0} → Alice (initial write on node A)
{A: 2, B: 0} → Alice Smith (A updated)
{A: 1, B: 1} → A. Smith (B updated, didn't see A's update)

These are concurrent versions (neither happened "before" the other)
→ Application-level conflict resolution needed
→ Amazon's Dynamo returns both versions and lets the client resolve
→ Last-Write-Wins (using timestamps) is simpler but silently drops updates
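The before/after/concurrent classification above can be sketched in a few lines, representing each clock as a plain object mapping node name → counter (the function name is illustrative):

```javascript
// Compare two vector clocks: 'before', 'after', 'concurrent', or 'equal'
function compare(vc1, vc2) {
  const names = new Set([...Object.keys(vc1), ...Object.keys(vc2)]);
  let less = false, greater = false;
  for (const n of names) {
    const a = vc1[n] || 0, b = vc2[n] || 0;
    if (a < b) less = true;
    if (a > b) greater = true;
  }
  if (less && greater) return 'concurrent'; // conflict: app must resolve
  if (less) return 'before';
  if (greater) return 'after';
  return 'equal';
}

console.log(compare({ A: 1, B: 0 }, { A: 2, B: 0 })); // 'before'
console.log(compare({ A: 2, B: 0 }, { A: 1, B: 1 })); // 'concurrent'
```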

Storage Engine: LSM Tree

For write-heavy workloads, use a Log-Structured Merge Tree (like RocksDB):

Write path:
Write → WAL (Write-Ahead Log) → MemTable (in-memory sorted table)
                                    ↓ when MemTable is full
                                SSTable (Sorted String Table, immutable on disk)

Read path:
Check MemTable → Check recent SSTables → Check older SSTables
(Bloom filters tell us which SSTables to skip)
┌─────────────────┐
│    MemTable     │ ← All writes go here first (fast)
│   (in-memory)   │
└────────┬────────┘
         │ flush when full
         ▼
┌──────────────────────────────────────┐
│  Level 0 SSTables                    │ ← Recently flushed
│  SST1.sst   SST2.sst   SST3.sst      │
└──────────────────┬───────────────────┘
                   │ compaction
┌──────────────────▼───────────────────┐
│  Level 1 SSTables                    │ ← Compacted, sorted
└──────────────────┬───────────────────┘
                   │ compaction
┌──────────────────▼───────────────────┐
│  Level 2 SSTables                    │ ← Long-term storage
└──────────────────────────────────────┘
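The write and read paths above can be sketched with a toy in-memory model: a Map stands in for the MemTable, and sorted arrays stand in for SSTables (WAL, Bloom filters, and compaction omitted; all names are illustrative):

```javascript
// Toy LSM sketch: writes hit the MemTable; full MemTables flush to
// immutable sorted "SSTables"; reads check newest data first
class TinyLSM {
  constructor(memLimit = 3) {
    this.memLimit = memLimit;
    this.memtable = new Map();
    this.sstables = []; // newest first; each is a sorted [key, value] array
  }
  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memLimit) this.flush();
  }
  flush() {
    const sst = [...this.memtable.entries()].sort((a, b) => (a[0] < b[0] ? -1 : 1));
    this.sstables.unshift(sst); // immutable once written
    this.memtable = new Map();
  }
  get(key) {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (const sst of this.sstables) {          // newest to oldest
      const hit = sst.find(([k]) => k === key); // binary search in practice
      if (hit) return hit[1];
    }
    return undefined;
  }
}

const db = new TinyLSM();
db.put('a', 1); db.put('b', 2); db.put('c', 3); // triggers a flush
db.put('a', 99);                                // newer value in MemTable
console.log(db.get('a')); // 99 (MemTable shadows the older SSTable entry)
console.log(db.get('b')); // 2  (found in the flushed SSTable)
```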

Bloom filter: A probabilistic data structure that tells you "this key is definitely NOT in this SSTable" or "this key might be in this SSTable." Eliminates most unnecessary disk reads.
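A toy version of the idea, assuming two hash functions and a fixed bit array (real filters size both parameters from the expected key count and target false-positive rate):

```javascript
// Toy Bloom filter: set k bits per key; a key "might" be present only
// if all its bits are set. False positives possible, false negatives not.
class BloomFilter {
  constructor(bits = 1024) {
    this.bits = bits;
    this.bitset = new Uint8Array(bits);
  }
  hashes(key) {
    let h1 = 0x811c9dc5, h2 = 0x1003f; // two independent seeds
    for (let i = 0; i < key.length; i++) {
      h1 = Math.imul(h1 ^ key.charCodeAt(i), 0x01000193);
      h2 = Math.imul(h2 ^ key.charCodeAt(i), 0x5bd1e995);
    }
    return [(h1 >>> 0) % this.bits, (h2 >>> 0) % this.bits];
  }
  add(key) {
    for (const h of this.hashes(key)) this.bitset[h] = 1;
  }
  mightContain(key) {
    return this.hashes(key).every(h => this.bitset[h] === 1);
  }
}

const bf = new BloomFilter();
bf.add('user:123');
console.log(bf.mightContain('user:123')); // true (added keys are never missed)
```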


Full System Design

┌──────────────────────────────────────┐
│              Client SDK              │
│  - Consistent hashing                │
│  - Request routing                   │
│  - Retry logic                       │
└──────────────────┬───────────────────┘
                   │
     ┌─────────────┼───────────────┐
     │             │               │
┌────▼─────────┐ ┌─▼────────────┐ ┌▼──────────────┐
│    Node 1    │ │    Node 2    │ │    Node 3     │
│ - LSM store  │ │ - LSM store  │ │ - LSM store   │
│ - Replication│ │ - Replication│ │ - Replication │
│ - Gossip     │ │ - Gossip     │ │ - Gossip      │
└──────────────┘ └──────────────┘ └───────────────┘

Each node runs:
- Storage engine (LSM/RocksDB)
- Replication handler (sync writes to N-1 replicas)
- Gossip protocol (cluster membership, failure detection)
- Anti-entropy (background repair of inconsistencies)

Interview Follow-up Questions

Q: How do you handle node failures?

Hinted handoff: if a target replica is down, another node temporarily stores the data with a hint that it should be forwarded to the failed node when it recovers. Anti-entropy (Merkle tree comparison) repairs divergent replicas.

Q: How do you detect failures?

Use a gossip protocol: each node periodically exchanges heartbeat and membership information with a few random peers. If a node misses several consecutive heartbeats, it's marked as suspect, then failed. Membership changes propagate to the whole cluster in O(log N) gossip rounds.
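The suspect-then-failed escalation can be sketched as a missed-heartbeat counter per peer (thresholds and names here are illustrative, not from any particular implementation):

```javascript
// Mark a peer 'suspect' after missing a few heartbeats, 'failed' after more
class FailureDetector {
  constructor(suspectAfter = 3, failAfter = 6) {
    this.suspectAfter = suspectAfter;
    this.failAfter = failAfter;
    this.missed = new Map(); // peer → consecutive missed heartbeats
  }
  heartbeat(peer) {
    this.missed.set(peer, 0); // any heartbeat resets the counter
  }
  tick(peer) {
    // called once per interval in which no heartbeat arrived from peer
    this.missed.set(peer, (this.missed.get(peer) || 0) + 1);
  }
  status(peer) {
    const m = this.missed.get(peer) || 0;
    if (m >= this.failAfter) return 'failed';
    if (m >= this.suspectAfter) return 'suspect';
    return 'alive';
  }
}

const fd = new FailureDetector();
fd.heartbeat('node-2');
console.log(fd.status('node-2')); // 'alive'
for (let i = 0; i < 3; i++) fd.tick('node-2');
console.log(fd.status('node-2')); // 'suspect'
```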

Q: What's the difference between strong and eventual consistency?

Strong consistency: every read sees the most recent write (requires W+R > N). Eventual consistency: replicas may temporarily diverge but will converge given time (W=1, R=1). Eventual consistency offers lower latency and higher availability at the cost of stale reads.

Q: How is this different from Redis?

Redis is primarily single-threaded, in-memory, with optional persistence. Our design is a distributed, durable key-value store with tunable consistency (more like DynamoDB or Apache Cassandra). Redis is optimized for low-latency caching; our store is optimized for durability and distribution.