Design a Search Engine
Building a search engine like Google is one of the most complex system design problems. It involves web crawling, inverted indexes, ranking algorithms, and serving billions of queries per day.
Requirements
Functional:
- Crawl the web and index web pages
- Users can search and get ranked results
- Results should be fresh (recently crawled)
- Support phrase queries and boolean operators
Non-functional:
- 5B web pages indexed
- 8.5B searches/day (~100,000 queries/sec)
- Query latency < 200ms
- Index freshness: major sites within hours, long tail within days
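The numbers above can be sanity-checked with quick back-of-envelope arithmetic. The per-page size and index ratio below are illustrative assumptions, not Google figures:

```javascript
// Back-of-envelope capacity estimate from the stated requirements.
const searchesPerDay = 8.5e9;
const qps = Math.round(searchesPerDay / 86400); // ~98K average; peaks are higher

// Assume ~100 KB of HTML per page, with ~10% of that surviving into the
// inverted index after tokenization and compression (both assumptions).
const pages = 5e9;
const rawBytes = pages * 100 * 1024; // ~500 TB of raw HTML
const indexBytes = rawBytes * 0.1;   // ~50 TB of index
```

Even under these rough assumptions, neither the corpus nor the index fits on one machine, which motivates the sharding discussed below.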
The Three Core Components
1. Web Crawler
The crawler discovers and downloads web pages.
class WebCrawler {
  constructor() {
    this.frontier = new PriorityQueue(); // URLs to crawl, sorted by priority
    this.visited = new BloomFilter();    // Probabilistic duplicate detection
    this.robots = new RobotsCache();     // Respect robots.txt
  }

  async crawl() {
    while (!this.frontier.isEmpty()) {
      const url = this.frontier.dequeue();
      if (this.visited.has(url)) continue;
      if (!this.robots.isAllowed(url)) continue;

      try {
        const { content, links, headers } = await this.fetchPage(url);

        // Store page content
        await pageStore.save({
          url,
          content,
          contentHash: md5(content),
          crawledAt: new Date(),
          contentType: headers['content-type']
        });

        // Extract and enqueue links
        for (const link of links) {
          if (!this.visited.has(link)) {
            const priority = this.calculatePriority(link);
            this.frontier.enqueue(link, priority);
          }
        }
        this.visited.add(url);
      } catch (error) {
        // Log error, move on
      }
    }
  }

  calculatePriority(url) {
    // Higher priority for:
    // - Pages linked to from many other pages (link count)
    // - Fresh pages (news sites, etc.)
    // - Previously high-traffic pages
    return linkCount(url) * 0.5 + freshnessScore(url) * 0.3 + trafficScore(url) * 0.2;
  }
}
Politeness: Respect robots.txt, add delays between requests to same domain, use proper User-Agent.
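The per-domain delay can be enforced with a small gate in front of the fetcher. This is a minimal sketch; the class name and the 1-second default are illustrative choices, not part of any standard:

```javascript
// Per-host politeness: wait at least `delayMs` between requests to the
// same domain, tracked via a map of last-fetch timestamps.
class PolitenessGate {
  constructor(delayMs = 1000) {
    this.delayMs = delayMs;
    this.lastFetch = new Map(); // hostname → timestamp of last request
  }

  async waitTurn(url) {
    const host = new URL(url).hostname;
    const last = this.lastFetch.get(host) || 0;
    const wait = Math.max(0, last + this.delayMs - Date.now());
    if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
    this.lastFetch.set(host, Date.now());
  }
}
```

The crawler would call `await gate.waitTurn(url)` before each fetch; requests to different hosts proceed without delay, so politeness does not serialize the whole crawl.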
Scale: production crawlers such as Googlebot run as large fleets of distributed worker instances across many datacenters, partitioning the URL frontier among them.
2. Inverted Index
An inverted index maps each word to the list of documents that contain it.
Forward index (what a document contains):
doc1 → ["the", "quick", "brown", "fox"]
doc2 → ["the", "fox", "jumps"]
Inverted index (which documents contain a word):
"the"   → [doc1, doc2]
"fox"   → [doc1, doc2]
"quick" → [doc1]
"brown" → [doc1]
"jumps" → [doc2]
class InvertedIndex {
  constructor() {
    this.index = new Map(); // word → PostingList
  }

  addDocument(docId, text) {
    const tokens = tokenize(text);                            // Split, lowercase
    const stemmed = tokens.map(stem);                         // "running" → "run"
    const filtered = stemmed.filter(t => !STOP_WORDS.has(t)); // Remove "the", "a", etc.

    // Count term frequency
    const termFreq = {};
    for (const token of filtered) {
      termFreq[token] = (termFreq[token] || 0) + 1;
    }

    // Add to posting lists
    for (const [term, freq] of Object.entries(termFreq)) {
      if (!this.index.has(term)) this.index.set(term, []);
      this.index.get(term).push({
        docId,
        freq,         // Term frequency in this doc
        positions: [] // For phrase queries
      });
    }
  }

  search(query) {
    // Apply the same pipeline as indexing; otherwise stop words in the
    // query would never match anything in the index
    const terms = tokenize(query).map(stem).filter(t => !STOP_WORDS.has(t));
    if (terms.length === 1) {
      return this.index.get(terms[0]) || [];
    }
    // AND query: intersect posting lists
    const postingLists = terms.map(t => this.index.get(t) || []);
    return this.intersect(postingLists);
  }

  intersect(postingLists) {
    // Intersect posting lists (like the merge step in merge sort).
    // Start with the shortest list for efficiency.
    const sorted = [...postingLists].sort((a, b) => a.length - b.length);
    let result = new Set(sorted[0].map(p => p.docId));
    for (let i = 1; i < sorted.length; i++) {
      const docIds = new Set(sorted[i].map(p => p.docId));
      result = new Set([...result].filter(id => docIds.has(id)));
    }
    return [...result];
  }
}
At Google's scale: The index is too large for one machine. It's sharded:
- Document shards: Split the web into chunks, each shard indexes a subset of pages
- Term shards: Split by term → all documents containing "apple" are on shard A
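With document sharding, every query fans out to all shards and the local top-K lists are merged. A minimal sketch, where the shard count, hash scheme, and `topK` interface are all illustrative assumptions:

```javascript
// Document sharding: each shard indexes a disjoint subset of pages.
const NUM_SHARDS = 4;

// Stable assignment: the same document always lands on the same shard.
function shardFor(docId) {
  let h = 0;
  for (const c of String(docId)) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % NUM_SHARDS;
}

// Query every shard, then merge the local top-K lists by score.
function fanOut(query, shards, k = 10) {
  const local = shards.map(s => s.topK(query, k));
  return local.flat().sort((a, b) => b.score - a.score).slice(0, k);
}
```

Term sharding avoids the fan-out for single-term queries but makes multi-term intersection a cross-shard operation, which is why document sharding is the more common choice.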
3. Ranking (PageRank + ML)
Step 1: PageRank (link analysis)
A page's rank is the sum of the ranks of the pages that link to it, each divided by that page's number of outgoing links, plus a small damping term.
function pageRank(graph, numIterations = 20, dampingFactor = 0.85) {
  // graph: page → array of pages it links to
  const pages = Object.keys(graph);
  const N = pages.length;

  // Build the reverse mapping once: page → pages that link to it
  const inLinks = {};
  for (const page of pages) inLinks[page] = [];
  for (const page of pages) {
    for (const target of graph[page]) {
      if (inLinks[target]) inLinks[target].push(page);
    }
  }

  // Initialize all pages with equal rank
  let ranks = {};
  for (const page of pages) {
    ranks[page] = 1 / N;
  }

  for (let i = 0; i < numIterations; i++) {
    const newRanks = {};
    for (const page of pages) {
      let rank = (1 - dampingFactor) / N;
      // Sum contributions from all pages linking to this one
      for (const linker of inLinks[page]) {
        const linkerOutLinks = graph[linker].length;
        rank += dampingFactor * (ranks[linker] / linkerOutLinks);
      }
      newRanks[page] = rank;
    }
    ranks = newRanks;
  }
  return ranks;
}
Step 2: TF-IDF scoring (query relevance)
TF-IDF = Term Frequency × Inverse Document Frequency
TF = (occurrences of term in document) / (total terms in document)
IDF = log(total documents / documents containing term)
High TF-IDF: Term appears often in this doc but rarely across all docs → very relevant
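The two formulas above translate directly into code. This sketch assumes documents arrive already tokenized, and that the term appears in at least one document (no smoothing):

```javascript
// TF-IDF for one term in one document, given a tokenized corpus
// (an array of token arrays).
function tfIdf(term, docTokens, corpus) {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const docsWithTerm = corpus.filter(d => d.includes(term)).length;
  const idf = Math.log(corpus.length / docsWithTerm); // assumes docsWithTerm > 0
  return tf * idf;
}
```

A term that appears in every document gets IDF = log(1) = 0, so ubiquitous words contribute nothing to the score, which is exactly the intent.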
Step 3: Modern ML ranking
Google's actual ranking uses neural networks (RankBrain, BERT, MUM) trained on:
- Click-through data (what do users actually click?)
- Dwell time (how long do they stay?)
- Back-button clicks (did they immediately return?)
- User satisfaction signals
Serving Architecture
User Query
    │
    ▼
[Query Parser] ← Spell check, query expansion ("car" → also "automobile")
    │
    ▼
[Index Server Cluster] ← Each server handles a shard of the index
    │   Query broadcast to ALL shards
    │   Each shard returns top-K local results
    ▼
[Aggregator] ← Merge and re-rank results from all shards
    │
    ▼
[Document Fetcher] ← Fetch snippet, title, URL for top results
    │
    ▼
[Ranking Model] ← Final ML reranking (personalization, context)
    │
    ▼
[Results] ← Served to user in < 200ms
Interview Follow-up Questions
Q: How do you keep the index fresh?
Use a tiered approach: hot crawl (news sites, every few hours), regular crawl (popular sites, every few days), slow crawl (long-tail sites, weeks). Track page change frequency and schedule recrawls accordingly.
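One common way to implement the tiered schedule is multiplicative backoff on the recrawl interval: shrink it when the page changed since the last visit, grow it when it did not. The tier bounds below are illustrative, not Google's actual values:

```javascript
// Adaptive recrawl scheduling via multiplicative backoff.
const MIN_INTERVAL_H = 1;       // "hot" floor: news-like pages
const MAX_INTERVAL_H = 24 * 14; // "slow" ceiling: long-tail pages

function nextInterval(currentHours, changedSinceLastCrawl) {
  const next = changedSinceLastCrawl ? currentHours / 2 : currentHours * 2;
  return Math.min(MAX_INTERVAL_H, Math.max(MIN_INTERVAL_H, next));
}
```

Frequently changing pages converge to the hot tier and static pages drift toward the ceiling, so the crawl budget concentrates where freshness matters.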
Q: How do you handle 100,000 queries/sec?
Replicate the index across many servers. Each query goes to one replica set. Within a replica, fan out to all shards, collect top-K per shard, merge and rerank. Cache results for popular queries, since a large fraction of daily queries are repeats of earlier ones.
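A result cache can be sketched as an LRU keyed on the normalized query string. This minimal version relies on JavaScript `Map` preserving insertion order; the class name and capacity are illustrative:

```javascript
// LRU result cache for popular queries, using Map insertion order.
class QueryCache {
  constructor(capacity = 10000) {
    this.capacity = capacity;
    this.cache = new Map(); // normalized query → cached results
  }

  get(query) {
    const key = query.trim().toLowerCase();
    if (!this.cache.has(key)) return null;
    const results = this.cache.get(key);
    this.cache.delete(key);      // re-insert to mark as most recently used
    this.cache.set(key, results);
    return results;
  }

  put(query, results) {
    const key = query.trim().toLowerCase();
    this.cache.delete(key);
    if (this.cache.size >= this.capacity) {
      // Evict the least recently used entry (first key in insertion order)
      this.cache.delete(this.cache.keys().next().value);
    }
    this.cache.set(key, results);
  }
}
```

In production this layer also needs an invalidation story (e.g. TTLs tied to index freshness), since cached results go stale as the index updates.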
Q: How do you prevent spam and web manipulation?
Link analysis (PageRank) naturally discourages spam. Also: sandboxing new sites, detecting link farms, manual quality raters, and ML models trained on quality signals.