
Design a Search Engine

Building a search engine like Google is one of the most complex system design problems. It involves web crawling, inverted indexes, ranking algorithms, and serving billions of queries per day.

Requirements

Functional:

  • Crawl the web and index web pages
  • Users can search and get ranked results
  • Results should be fresh (recently crawled)
  • Support phrase queries and boolean operators

Non-functional:

  • 5B web pages indexed
  • 8.5B searches/day (~100,000 queries/sec)
  • Query latency < 200ms
  • Index freshness: major sites within hours, long tail within days
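A quick back-of-the-envelope check on these numbers. The bytes-per-page index footprint below is an assumed average for illustration, not a measured figure:

```javascript
// Back-of-the-envelope sizing for the requirements above.
const searchesPerDay = 8.5e9;
const qps = searchesPerDay / 86_400;            // average queries/sec
const peakQps = qps * 2;                        // rough 2x peak factor (assumed)

const pages = 5e9;
const bytesPerPageInIndex = 500;                // assumed: compressed postings per page
const indexBytes = pages * bytesPerPageInIndex; // total posting-list storage

console.log(Math.round(qps));     // ~98,380 average QPS
console.log(indexBytes / 1e12);   // ~2.5 TB of postings
```

Even under these generous assumptions the index runs to terabytes, which is why sharding (covered below under the inverted index) is unavoidable.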

The Three Core Components

1. Web Crawler

The crawler discovers and downloads web pages.

class WebCrawler {
  constructor() {
    this.frontier = new PriorityQueue(); // URLs to crawl, sorted by priority
    this.visited = new BloomFilter();    // Probabilistic duplicate detection
    this.robots = new RobotsCache();     // Respect robots.txt
  }

  async crawl() {
    while (!this.frontier.isEmpty()) {
      const url = this.frontier.dequeue();

      if (this.visited.has(url)) continue;
      if (!this.robots.isAllowed(url)) continue;

      // Mark visited before fetching so a failing URL is not retried forever
      this.visited.add(url);

      try {
        const { content, links, headers } = await this.fetchPage(url);

        // Store page content
        await pageStore.save({
          url,
          content,
          contentHash: md5(content), // For change detection on recrawl
          crawledAt: new Date(),
          contentType: headers['content-type']
        });

        // Extract and enqueue newly discovered links
        for (const link of links) {
          if (!this.visited.has(link)) {
            const priority = this.calculatePriority(link);
            this.frontier.enqueue(link, priority);
          }
        }
      } catch (error) {
        // Log the error and move on; the crawler must tolerate dead links
      }
    }
  }

  calculatePriority(url) {
    // Higher priority for:
    // - Pages linked to from many other pages (link count)
    // - Frequently changing pages (news sites, etc.)
    // - Previously high-traffic pages
    return linkCount(url) * 0.5 + freshnessScore(url) * 0.3 + trafficScore(url) * 0.2;
  }
}

Politeness: Respect robots.txt, add delays between requests to same domain, use proper User-Agent.
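The per-domain delay can be sketched as a next-allowed-time map. The 1-second default here is an assumption; a real crawler would honor each site's Crawl-delay directive and back off when servers respond slowly:

```javascript
// Minimal per-domain politeness gate: track the earliest time each
// domain may be fetched again, and delay requests that arrive early.
class PolitenessGate {
  constructor(delayMs = 1000) {   // Assumed default; tune per robots.txt Crawl-delay
    this.delayMs = delayMs;
    this.nextAllowed = new Map(); // domain → earliest next fetch timestamp (ms)
  }

  // Returns how long the caller should sleep before fetching this URL.
  waitTimeFor(url, now = Date.now()) {
    const domain = new URL(url).hostname;
    const earliest = this.nextAllowed.get(domain) || 0;
    const wait = Math.max(0, earliest - now);
    this.nextAllowed.set(domain, Math.max(earliest, now) + this.delayMs);
    return wait;
  }
}
```

The crawler calls `waitTimeFor` before each fetch and sleeps for the returned duration; URLs on different domains proceed without blocking each other.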

Scale: Production crawlers like Googlebot run as very large distributed fleets spread across many datacenters.


2. Inverted Index

An inverted index maps each word to the list of documents that contain it.

Forward index (what a document contains):
doc1 → ["the", "quick", "brown", "fox"]
doc2 → ["the", "fox", "jumps"]

Inverted index (which documents contain a word):
"the" → [doc1, doc2]
"fox" → [doc1, doc2]
"quick" → [doc1]
"brown" → [doc1]
"jumps" → [doc2]

class InvertedIndex {
  constructor() {
    this.index = new Map(); // word → PostingList
  }

  addDocument(docId, text) {
    const tokens = tokenize(text);                            // Split, lowercase
    const stemmed = tokens.map(stem);                         // "running" → "run"
    const filtered = stemmed.filter(t => !STOP_WORDS.has(t)); // Remove "the", "a", etc.

    // Count term frequency and record token positions (needed for phrase queries)
    const termFreq = {};
    const termPositions = {};
    filtered.forEach((token, pos) => {
      termFreq[token] = (termFreq[token] || 0) + 1;
      (termPositions[token] = termPositions[token] || []).push(pos);
    });

    // Add to posting lists
    for (const [term, freq] of Object.entries(termFreq)) {
      if (!this.index.has(term)) this.index.set(term, []);
      this.index.get(term).push({
        docId,
        freq,                           // Term frequency in this doc
        positions: termPositions[term]  // Token offsets, for phrase queries
      });
    }
  }

  search(query) {
    const terms = tokenize(query).map(stem);

    if (terms.length === 1) {
      return this.index.get(terms[0]) || [];
    }

    // AND query: intersect posting lists
    const postingLists = terms.map(t => this.index.get(t) || []);
    return this.intersect(postingLists);
  }

  intersect(postingLists) {
    // Intersect doc-ID sets, starting with the shortest list for efficiency
    const sorted = [...postingLists].sort((a, b) => a.length - b.length);
    let result = new Set(sorted[0].map(p => p.docId));

    for (let i = 1; i < sorted.length; i++) {
      const docIds = new Set(sorted[i].map(p => p.docId));
      result = new Set([...result].filter(id => docIds.has(id)));
    }

    return [...result];
  }
}

At Google's scale, the index is far too large for one machine, so it is sharded:

  • Document shards: Split the web into chunks; each shard indexes a subset of pages
  • Term shards: Split by term, so every posting for "apple" lives on shard A
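With document sharding, a query fans out to every shard, each shard returns its local top-K, and an aggregator merges the lists. A minimal scatter-gather sketch (shard contents and scores are made up):

```javascript
// Scatter-gather over document shards: each shard returns its local
// top-K scored hits; the aggregator merges them into a global top-K.
function searchShard(shard, k) {
  return [...shard].sort((a, b) => b.score - a.score).slice(0, k);
}

function federatedSearch(shards, k) {
  const partials = shards.map(shard => searchShard(shard, k)); // in production: parallel RPCs
  return partials.flat().sort((a, b) => b.score - a.score).slice(0, k);
}

const shards = [
  [{ docId: 1, score: 0.9 }, { docId: 2, score: 0.4 }],
  [{ docId: 3, score: 0.7 }, { docId: 4, score: 0.8 }],
];
federatedSearch(shards, 2); // → docs 1 and 4, the two best across all shards
```

Because each shard only returns K results, the aggregator merges at most shards × K candidates regardless of corpus size, which keeps tail latency bounded.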

3. Ranking (PageRank + ML)

Step 1: PageRank (link analysis)

A page's rank = sum of the ranks of pages linking to it.

function pageRank(graph, numIterations = 20, dampingFactor = 0.85) {
  const pages = Object.keys(graph);
  const N = pages.length;
  let ranks = {};

  // Initialize all pages with equal rank
  for (const page of pages) {
    ranks[page] = 1 / N;
  }

  // Precompute incoming links: which pages link to each page?
  const incoming = {};
  for (const page of pages) incoming[page] = [];
  for (const linker of pages) {
    for (const target of graph[linker]) incoming[target].push(linker);
  }

  for (let i = 0; i < numIterations; i++) {
    const newRanks = {};

    for (const page of pages) {
      let rank = (1 - dampingFactor) / N;

      // Sum contributions from all pages linking to this one
      for (const linker of incoming[page]) {
        const linkerOutLinks = graph[linker].length;
        rank += dampingFactor * (ranks[linker] / linkerOutLinks);
      }

      newRanks[page] = rank;
    }

    ranks = newRanks;
  }

  return ranks;
}
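As a worked example, here is the same power iteration condensed into a self-contained function and run on a made-up three-page graph:

```javascript
// Tiny self-contained power iteration, run on a three-page graph:
// A links to B and C; B links to C; C links to A.
function tinyPageRank(graph, iters = 50, d = 0.85) {
  const pages = Object.keys(graph);
  const N = pages.length;
  let ranks = Object.fromEntries(pages.map(p => [p, 1 / N]));

  for (let i = 0; i < iters; i++) {
    const next = {};
    for (const p of pages) {
      let r = (1 - d) / N;
      for (const q of pages) {
        if (graph[q].includes(p)) r += d * ranks[q] / graph[q].length;
      }
      next[p] = r;
    }
    ranks = next;
  }
  return ranks;
}

const ranks = tinyPageRank({ A: ['B', 'C'], B: ['C'], C: ['A'] });
// C collects links from both A and B, so it ends up ranked highest;
// the ranks sum to 1 because every page here has at least one outlink.
```

Note that C outranks A even though both have similar in-link counts: C's inbound links come from pages that are themselves well ranked, which is exactly the recursive idea behind PageRank.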

Step 2: TF-IDF scoring (query relevance)

TF-IDF = Term Frequency × Inverse Document Frequency

TF = (occurrences of term in document) / (total terms in document)
IDF = log(total documents / documents containing term)

High TF-IDF: the term appears often in this doc but rarely across all docs → very relevant
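Plugging made-up numbers into the formulas above: suppose "fox" appears twice in a 10-term document, and occurs in 10 of 1,000 documents overall:

```javascript
// TF-IDF for one (term, document) pair, using the formulas above.
function tfIdf(termCountInDoc, docLength, totalDocs, docsWithTerm) {
  const tf = termCountInDoc / docLength;          // 2 / 10 = 0.2
  const idf = Math.log(totalDocs / docsWithTerm); // ln(1000 / 10) ≈ 4.61
  return tf * idf;
}

tfIdf(2, 10, 1000, 10); // ≈ 0.92
```

A common word like "the" would appear in nearly every document, driving IDF (and therefore the score) toward zero, which is why stop words contribute almost nothing to relevance.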

Step 3: Modern ML ranking

Google's actual ranking uses neural networks (RankBrain, BERT, MUM) trained on:

  • Click-through data (what do users actually click?)
  • Dwell time (how long do they stay?)
  • Back-button clicks (did they immediately return?)
  • User satisfaction signals

Serving Architecture

User Query
    │
    ▼
[Query Parser] → Spell check, query expansion ("car" → also "automobile")
    │
    ▼
[Index Server Cluster] → Each server handles a shard of the index
    │   Query broadcast to ALL shards
    │   Each shard returns top-K local results
    ▼
[Aggregator] → Merge and re-rank results from all shards
    │
    ▼
[Document Fetcher] → Fetch snippet, title, URL for top results
    │
    ▼
[Ranking Model] → Final ML reranking (personalization, context)
    │
    ▼
[Results] → Served to user in < 200ms

Interview Follow-up Questions

Q: How do you keep the index fresh?

Use a tiered approach: hot crawl (news sites, every few hours), regular crawl (popular sites, every few days), slow crawl (long-tail sites, weeks). Track page change frequency and schedule recrawls accordingly.
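One common policy for mapping pages into these tiers: track whether the content hash actually changed on recent crawls, and interpolate the next crawl interval from that change rate. The constants here are illustrative assumptions, not production values:

```javascript
// Adaptive recrawl scheduling: pages that change often get short
// intervals; stable pages drift toward the slow-crawl tier.
function nextCrawlIntervalHours(history) {
  // history: array of booleans, true = content hash changed at that crawl
  const changes = history.filter(Boolean).length;
  const changeRate = history.length ? changes / history.length : 0.5;

  const MIN_HOURS = 2;       // hot tier: news-like pages (assumed)
  const MAX_HOURS = 24 * 14; // slow tier: long-tail pages, two weeks (assumed)
  // Interpolate: changeRate 1 → MIN_HOURS, changeRate 0 → MAX_HOURS
  return MAX_HOURS - changeRate * (MAX_HOURS - MIN_HOURS);
}

nextCrawlIntervalHours([true, true, true, true]);    // → 2 (crawl every 2 hours)
nextCrawlIntervalHours([false, false, false, true]); // → 252.5 (~10.5 days)
```

The contentHash stored at crawl time is what makes this possible: comparing hashes across crawls is how the scheduler knows a page "changed".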

Q: How do you handle 100,000 queries/sec?

Replicate the index across many servers. Each query goes to one replica set. Within a replica, fan out to all shards, collect top-K per shard, merge and rerank. Use result caching for popular queries (60% of queries have been seen before).
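Because so many queries repeat, a result cache in front of the index cluster absorbs much of the load. A minimal LRU sketch, keyed by normalized query (capacity is an assumed knob, and TTL-based invalidation for freshness is omitted):

```javascript
// Minimal LRU result cache keyed by normalized query string.
// A Map preserves insertion order, so the first key is the least recently used.
class ResultCache {
  constructor(capacity = 10_000) { // Assumed capacity; size by memory budget
    this.capacity = capacity;
    this.cache = new Map();        // normalized query → cached results
  }

  get(query) {
    const key = query.trim().toLowerCase();
    if (!this.cache.has(key)) return null;
    const results = this.cache.get(key);
    this.cache.delete(key);        // Re-insert to mark as most recently used
    this.cache.set(key, results);
    return results;
  }

  set(query, results) {
    const key = query.trim().toLowerCase();
    if (this.cache.has(key)) this.cache.delete(key);
    else if (this.cache.size >= this.capacity) {
      this.cache.delete(this.cache.keys().next().value); // Evict least recent
    }
    this.cache.set(key, results);
  }
}
```

A cache hit skips the entire fan-out, merge, and rerank pipeline, so even a modest hit rate dramatically cuts load on the index servers.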

Q: How do you prevent spam and web manipulation?

Link analysis (PageRank) naturally discourages spam. Also: sandboxing new sites, detecting link farms, manual quality raters, and ML models trained on quality signals.