Design a Search Engine
Building a search engine like Google is one of the most complex system design problems. It involves web crawling, inverted indexes, ranking algorithms, and serving billions of queries per day.
Requirements
Functional:
- Crawl the web and index web pages
- Users can search and get ranked results
- Results should be fresh (recently crawled)
- Support phrase queries and boolean operators
Non-functional:
- 5B web pages indexed
- 8.5B searches/day (~100,000 queries/sec)
- Query latency < 200ms
- Index freshness: major sites within hours, long tail within days
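The numbers above can be sanity-checked with quick back-of-envelope arithmetic. The per-page size and index ratio below are illustrative assumptions, not Google figures:

```javascript
// Back-of-envelope capacity estimate from the stated requirements.
const searchesPerDay = 8.5e9;
const qps = Math.round(searchesPerDay / 86400); // ~98K average; peaks are higher

// Assume ~100 KB of HTML per page, with ~10% of that surviving into the
// inverted index after tokenization and compression (both assumptions).
const pages = 5e9;
const rawBytes = pages * 100 * 1024; // ~500 TB of raw HTML
const indexBytes = rawBytes * 0.1;   // ~50 TB of index
```

Even under these rough assumptions, neither the corpus nor the index fits on one machine, which motivates the sharding discussed below.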
The Three Core Components
1. Web Crawler
The crawler discovers and downloads web pages.
class WebCrawler {
  constructor() {
    this.frontier = new PriorityQueue(); // URLs to crawl, sorted by priority
    this.visited = new BloomFilter();    // Probabilistic duplicate detection
    this.robots = new RobotsCache();     // Respect robots.txt
  }

  async crawl() {
    while (!this.frontier.isEmpty()) {
      const url = this.frontier.dequeue();
      if (this.visited.has(url)) continue;
      if (!this.robots.isAllowed(url)) continue;

      try {
        const { content, links, headers } = await this.fetchPage(url);

        // Store page content
        await pageStore.save({
          url,
          content,
          contentHash: md5(content),
          crawledAt: new Date(),
          contentType: headers['content-type']
        });

        // Extract and enqueue links
        for (const link of links) {
          if (!this.visited.has(link)) {
            const priority = this.calculatePriority(link);
            this.frontier.enqueue(link, priority);
          }
        }
        this.visited.add(url);
      } catch (error) {
        // Log error, move on
      }
    }
  }

  calculatePriority(url) {
    // Higher priority for:
    // - Pages linked to from many other pages (link count)
    // - Fresh pages (news sites, etc.)
    // - Previously high-traffic pages
    return linkCount(url) * 0.5 + freshnessScore(url) * 0.3 + trafficScore(url) * 0.2;
  }
}
Politeness: Respect robots.txt, add delays between requests to same domain, use proper User-Agent.
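The per-domain delay can be enforced with a small gate in front of the fetcher. This is a minimal sketch; the class name and the 1-second default are illustrative choices, not part of any standard:

```javascript
// Per-host politeness: wait at least `delayMs` between requests to the
// same domain, tracked via a map of last-fetch timestamps.
class PolitenessGate {
  constructor(delayMs = 1000) {
    this.delayMs = delayMs;
    this.lastFetch = new Map(); // hostname → timestamp of last request
  }

  async waitTurn(url) {
    const host = new URL(url).hostname;
    const last = this.lastFetch.get(host) || 0;
    const wait = Math.max(0, last + this.delayMs - Date.now());
    if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
    this.lastFetch.set(host, Date.now());
  }
}
```

The crawler would call `await gate.waitTurn(url)` before each fetch; requests to different hosts proceed without delay, so politeness does not serialize the whole crawl.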
Scale: production crawlers such as Googlebot run as large fleets of distributed worker instances across many datacenters, partitioning the URL frontier among them.
2. Inverted Index
An inverted index maps each word to the list of documents that contain it.
Forward index (what a document contains):
doc1 → ["the", "quick", "brown", "fox"]
doc2 → ["the", "fox", "jumps"]
Inverted index (which documents contain a word):
"the"   → [doc1, doc2]
"fox"   → [doc1, doc2]
"quick" → [doc1]
"brown" → [doc1]
"jumps" → [doc2]
class InvertedIndex {
  constructor() {
    this.index = new Map(); // word → PostingList
  }

  addDocument(docId, text) {
    const tokens = tokenize(text);                            // Split, lowercase
    const stemmed = tokens.map(stem);                         // "running" → "run"
    const filtered = stemmed.filter(t => !STOP_WORDS.has(t)); // Remove "the", "a", etc.

    // Count term frequency
    const termFreq = {};
    for (const token of filtered) {
      termFreq[token] = (termFreq[token] || 0) + 1;
    }

    // Add to posting lists
    for (const [term, freq] of Object.entries(termFreq)) {
      if (!this.index.has(term)) this.index.set(term, []);
      this.index.get(term).push({
        docId,
        freq,         // Term frequency in this doc
        positions: [] // For phrase queries
      });
    }
  }

  search(query) {
    // Apply the same pipeline as indexing; otherwise stop words in the
    // query would never match anything in the index
    const terms = tokenize(query).map(stem).filter(t => !STOP_WORDS.has(t));
    if (terms.length === 1) {
      return this.index.get(terms[0]) || [];
    }
    // AND query: intersect posting lists
    const postingLists = terms.map(t => this.index.get(t) || []);
    return this.intersect(postingLists);
  }

  intersect(postingLists) {
    // Intersect posting lists (like the merge step in merge sort).
    // Start with the shortest list for efficiency.
    const sorted = [...postingLists].sort((a, b) => a.length - b.length);
    let result = new Set(sorted[0].map(p => p.docId));
    for (let i = 1; i < sorted.length; i++) {
      const docIds = new Set(sorted[i].map(p => p.docId));
      result = new Set([...result].filter(id => docIds.has(id)));
    }
    return [...result];
  }
}
At Google's scale: The index is too large for one machine. It's sharded:
- Document shards: Split the web into chunks, each shard indexes a subset of pages
- Term shards: Split by term → all documents containing "apple" are on shard A
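With document sharding, every query fans out to all shards and the local top-K lists are merged. A minimal sketch, where the shard count, hash scheme, and `topK` interface are all illustrative assumptions:

```javascript
// Document sharding: each shard indexes a disjoint subset of pages.
const NUM_SHARDS = 4;

// Stable assignment: the same document always lands on the same shard.
function shardFor(docId) {
  let h = 0;
  for (const c of String(docId)) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % NUM_SHARDS;
}

// Query every shard, then merge the local top-K lists by score.
function fanOut(query, shards, k = 10) {
  const local = shards.map(s => s.topK(query, k));
  return local.flat().sort((a, b) => b.score - a.score).slice(0, k);
}
```

Term sharding avoids the fan-out for single-term queries but makes multi-term intersection a cross-shard operation, which is why document sharding is the more common choice.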
3. Ranking (PageRank + ML)
Step 1: PageRank (link analysis)
A page's rank is the sum of the ranks of the pages that link to it, each divided by that page's number of outgoing links, plus a small damping term.
function pageRank(graph, numIterations = 20, dampingFactor = 0.85) {
  // graph: page → array of pages it links to
  const pages = Object.keys(graph);
  const N = pages.length;

  // Build the reverse mapping once: page → pages that link to it
  const inLinks = {};
  for (const page of pages) inLinks[page] = [];
  for (const page of pages) {
    for (const target of graph[page]) {
      if (inLinks[target]) inLinks[target].push(page);
    }
  }

  // Initialize all pages with equal rank
  let ranks = {};
  for (const page of pages) {
    ranks[page] = 1 / N;
  }

  for (let i = 0; i < numIterations; i++) {
    const newRanks = {};
    for (const page of pages) {
      let rank = (1 - dampingFactor) / N;
      // Sum contributions from all pages linking to this one
      for (const linker of inLinks[page]) {
        const linkerOutLinks = graph[linker].length;
        rank += dampingFactor * (ranks[linker] / linkerOutLinks);
      }
      newRanks[page] = rank;
    }
    ranks = newRanks;
  }
  return ranks;
}
Step 2: TF-IDF scoring (query relevance)
TF-IDF = Term Frequency × Inverse Document Frequency
TF = (occurrences of term in document) / (total terms in document)
IDF = log(total documents / documents containing term)
High TF-IDF: Term appears often in this doc but rarely across all docs → very relevant
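The two formulas above translate directly into code. This sketch assumes documents arrive already tokenized, and that the term appears in at least one document (no smoothing):

```javascript
// TF-IDF for one term in one document, given a tokenized corpus
// (an array of token arrays).
function tfIdf(term, docTokens, corpus) {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const docsWithTerm = corpus.filter(d => d.includes(term)).length;
  const idf = Math.log(corpus.length / docsWithTerm); // assumes docsWithTerm > 0
  return tf * idf;
}
```

A term that appears in every document gets IDF = log(1) = 0, so ubiquitous words contribute nothing to the score, which is exactly the intent.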
Step 3: Modern ML ranking
Google's actual ranking uses neural networks (RankBrain, BERT, MUM) trained on:
- Click-through data (what do users actually click?)
- Dwell time (how long do they stay?)
- Back-button clicks (did they immediately return?)
- User satisfaction signals
Serving Architecture
User Query
    │
    ▼
[Query Parser] ← Spell check, query expansion ("car" → also "automobile")
    │
    ▼
[Index Server Cluster] ← Each server handles a shard of the index
    │   Query broadcast to ALL shards
    │   Each shard returns top-K local results
    ▼
[Aggregator] ← Merge and re-rank results from all shards
    │
    ▼
[Document Fetcher] ← Fetch snippet, title, URL for top results
    │
    ▼
[Ranking Model] ← Final ML reranking (personalization, context)
    │
    ▼
[Results] ← Served to user in < 200ms
Interview Follow-up Questions
Q: How do you keep the index fresh?
Use a tiered approach: hot crawl (news sites, every few hours), regular crawl (popular sites, every few days), slow crawl (long-tail sites, weeks). Track page change frequency and schedule recrawls accordingly.
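One common way to implement the tiered schedule is multiplicative backoff on the recrawl interval: shrink it when the page changed since the last visit, grow it when it did not. The tier bounds below are illustrative, not Google's actual values:

```javascript
// Adaptive recrawl scheduling via multiplicative backoff.
const MIN_INTERVAL_H = 1;       // "hot" floor: news-like pages
const MAX_INTERVAL_H = 24 * 14; // "slow" ceiling: long-tail pages

function nextInterval(currentHours, changedSinceLastCrawl) {
  const next = changedSinceLastCrawl ? currentHours / 2 : currentHours * 2;
  return Math.min(MAX_INTERVAL_H, Math.max(MIN_INTERVAL_H, next));
}
```

Frequently changing pages converge to the hot tier and static pages drift toward the ceiling, so the crawl budget concentrates where freshness matters.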
Q: How do you handle 100,000 queries/sec?
Replicate the index across many servers. Each query goes to one replica set. Within a replica, fan out to all shards, collect top-K per shard, merge and rerank. Cache results for popular queries, since a large fraction of daily queries are repeats of earlier ones.
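A result cache can be sketched as an LRU keyed on the normalized query string. This minimal version relies on JavaScript `Map` preserving insertion order; the class name and capacity are illustrative:

```javascript
// LRU result cache for popular queries, using Map insertion order.
class QueryCache {
  constructor(capacity = 10000) {
    this.capacity = capacity;
    this.cache = new Map(); // normalized query → cached results
  }

  get(query) {
    const key = query.trim().toLowerCase();
    if (!this.cache.has(key)) return null;
    const results = this.cache.get(key);
    this.cache.delete(key);      // re-insert to mark as most recently used
    this.cache.set(key, results);
    return results;
  }

  put(query, results) {
    const key = query.trim().toLowerCase();
    this.cache.delete(key);
    if (this.cache.size >= this.capacity) {
      // Evict the least recently used entry (first key in insertion order)
      this.cache.delete(this.cache.keys().next().value);
    }
    this.cache.set(key, results);
  }
}
```

In production this layer also needs an invalidation story (e.g. TTLs tied to index freshness), since cached results go stale as the index updates.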
Q: How do you prevent spam and web manipulation?
Link analysis (PageRank) naturally discourages spam. Also: sandboxing new sites, detecting link farms, manual quality raters, and ML models trained on quality signals.