Inverted Indexes
TL;DR
An inverted index maps each term to the list of documents that contain it, enabling fast full-text search. Instead of scanning every document, you look up the term and get the matching documents directly. This is the foundation of Lucene and of the search engines built on it, such as Elasticsearch and Solr.
The Problem Inverted Indexes Solve
Forward Index (Traditional Database)
Document → Terms
┌─────────────────────────────────────────────────────────────┐
│ Doc1: "The quick brown fox jumps over the lazy dog" │
│ Doc2: "The quick brown dog runs in the park" │
│ Doc3: "A lazy cat sleeps on the couch" │
└─────────────────────────────────────────────────────────────┘
Query: Find documents containing "lazy"
Process:
1. Scan Doc1: "The quick brown fox..." → Contains "lazy"? YES
2. Scan Doc2: "The quick brown dog..." → Contains "lazy"? NO
3. Scan Doc3: "A lazy cat sleeps..." → Contains "lazy"? YES
Time complexity: O(N × M) where N = documents, M = avg document length
For 1 billion documents: SLOW

Inverted Index (Search Engine)
Term → Documents
┌─────────────────────────────────────────────────────────────┐
│ "lazy" → [Doc1, Doc3] │
│ "quick" → [Doc1, Doc2] │
│ "brown" → [Doc1, Doc2] │
│ "dog" → [Doc1, Doc2] │
│ "cat" → [Doc3] │
│ "fox" → [Doc1] │
│ ... │
└─────────────────────────────────────────────────────────────┘
Query: Find documents containing "lazy"
Process:
1. Look up "lazy" in index → [Doc1, Doc3]
Time complexity: O(1) lookup + O(k) where k = matching documents
For 1 billion documents: FAST

Index Structure
Basic Inverted Index
┌─────────────────────────────────────────────────────────────────┐
│ INVERTED INDEX │
│ │
│ ┌─────────────┐ ┌──────────────────────────────────────┐ │
│ │ Dictionary │ │ Postings Lists │ │
│ │ (Terms) │ │ │ │
│ ├─────────────┤ ├──────────────────────────────────────┤ │
│ │ "apple" ──┼────►│ [1, 5, 23, 45, 67, 89, ...] │ │
│ │ "banana" ──┼────►│ [2, 34, 56, 78, ...] │ │
│ │ "cherry" ──┼────►│ [3, 12, 45, ...] │ │
│ │ "date" ──┼────►│ [7, 89, 234, ...] │ │
│ │ ... │ │ │ │
│ └─────────────┘ └──────────────────────────────────────┘ │
│ │
│ Dictionary: B-tree or Hash table for O(1) or O(log n) lookup │
│ Postings: Sorted document IDs for efficient intersection │
└─────────────────────────────────────────────────────────────────┘

Enhanced Postings List
Term: "search"
Simple posting: [1, 5, 23, 45]
Enhanced posting with positions (for phrase queries):
[
    (doc_id=1, positions=[5, 23, 89]),    # "search" appears at word 5, 23, 89
    (doc_id=5, positions=[12]),
    (doc_id=23, positions=[1, 45, 67]),
    (doc_id=45, positions=[33])
]
Enhanced posting with term frequency (for relevance scoring):
[
    (doc_id=1, tf=3, positions=[5, 23, 89]),
    (doc_id=5, tf=1, positions=[12]),
    (doc_id=23, tf=3, positions=[1, 45, 67]),
    (doc_id=45, tf=1, positions=[33])
]
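These enhanced entries map naturally onto a small record type; a minimal sketch in Python (here tf is derived from the stored positions rather than stored separately):
python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Posting:
    """One enhanced postings entry for a (term, document) pair."""
    doc_id: int
    positions: List[int] = field(default_factory=list)

    @property
    def tf(self) -> int:
        # term frequency = number of recorded occurrences
        return len(self.positions)

entry = Posting(doc_id=1, positions=[5, 23, 89])
print(entry.tf)  # 3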
Index Construction
Basic Algorithm
python
from collections import defaultdict
import re
class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(list)   # term → postings list
        self.documents = {}              # doc_id → original text

    def add_document(self, doc_id, text):
        # Store the original document
        self.documents[doc_id] = text
        # Tokenize
        tokens = self.tokenize(text)
        # Build the index
        for position, token in enumerate(tokens):
            # Normalize
            term = self.normalize(token)
            # Add to the postings list
            self.index[term].append({
                'doc_id': doc_id,
                'position': position
            })

    def tokenize(self, text):
        """Split text into lowercase word tokens"""
        return re.findall(r'\b\w+\b', text.lower())

    def normalize(self, token):
        """Normalize a token (lowercase here; stemming could be added)"""
        return token.lower()

    def search(self, query):
        """Return the IDs of documents containing the query term"""
        term = self.normalize(query)
        postings = self.index.get(term, [])
        # Dedupe: a term can occur several times in one document
        return sorted({p['doc_id'] for p in postings})
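Trying it on the example documents from earlier:
python
index = InvertedIndex()
index.add_document(1, "The quick brown fox jumps over the lazy dog")
index.add_document(2, "The quick brown dog runs in the park")
index.add_document(3, "A lazy cat sleeps on the couch")

print(index.search("lazy"))   # [1, 3]
print(index.search("QUICK"))  # [1, 2] (normalization makes search case-insensitive)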
Scalable Index Construction (MapReduce)
python
# Map phase: emit (term, (doc_id, position)) pairs
# (emit() is the MapReduce framework primitive)
def map(doc_id, document):
    tokens = tokenize(document)
    for position, token in enumerate(tokens):
        term = normalize(token)
        emit(term, (doc_id, position))

# Reduce phase: build the postings list for each term
def reduce(term, doc_positions):
    postings = sorted(doc_positions, key=lambda x: x[0])  # sort by doc_id
    emit(term, postings)
# Example:
# Doc1: "the cat sat"
# Doc2: "the dog ran"
#
# Map output:
# ("the", (1, 0)), ("cat", (1, 1)), ("sat", (1, 2))
# ("the", (2, 0)), ("dog", (2, 1)), ("ran", (2, 2))
#
# Shuffle (group by key):
# "the" → [(1, 0), (2, 0)]
# "cat" → [(1, 1)]
# ...
#
# Reduce output:
# "the" → [(1, 0), (2, 0)]
# "cat" → [(1, 1)]Query Processing
Query Processing
Single Term Query
python
def search_single_term(self, term):
    normalized = self.normalize(term)
    postings = self.index.get(normalized, [])
    # Return deduped doc IDs so the Boolean operators below can combine them
    return sorted({p['doc_id'] for p in postings})

# Query: "lazy"
# Result: [Doc1, Doc3]

Boolean Queries (AND, OR, NOT)
python
def search_boolean(self, terms, operator='AND'):
    if not terms:
        return []
    # Get postings for the first term
    result = set(self.search_single_term(terms[0]))
    for term in terms[1:]:
        term_docs = set(self.search_single_term(term))
        if operator == 'AND':
            result = result.intersection(term_docs)
        elif operator == 'OR':
            result = result.union(term_docs)
        elif operator == 'NOT':
            result = result.difference(term_docs)
    return list(result)
# Query: "quick AND brown"
# Postings: quick → [Doc1, Doc2], brown → [Doc1, Doc2]
# Result: [Doc1, Doc2] (intersection)
# Query: "lazy OR quick"
# Postings: lazy → [Doc1, Doc3], quick → [Doc1, Doc2]
# Result: [Doc1, Doc2, Doc3] (union)
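The set operations above are easy to read but materialize full sets. Since postings lists are kept sorted (see the index structure above), engines intersect them directly with a two-pointer merge; a minimal sketch:
python
def intersect_sorted(p1, p2):
    """Intersect two sorted doc-ID lists in O(len(p1) + len(p2))."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect_sorted([1, 2], [1, 2, 3]))  # [1, 2]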
Phrase Queries
python
def search_phrase(self, phrase):
    """Find documents where the terms appear consecutively"""
    terms = self.tokenize(phrase)
    if not terms:
        return []
    # Get postings with positions: one {doc_id: [positions]} map per term
    postings = [self.get_postings_with_positions(t) for t in terms]
    # Find documents containing all terms
    candidate_docs = set.intersection(*[set(p.keys()) for p in postings])
    results = []
    for doc_id in candidate_docs:
        # Check whether the terms appear consecutively in this document
        if self.terms_consecutive(doc_id, postings):
            results.append(doc_id)
    return results

def terms_consecutive(self, doc_id, postings):
    """Check if the term positions include a consecutive run"""
    # Positions for each term in this document
    positions_list = [p[doc_id] for p in postings]
    # Try each occurrence of the first term as the start of the phrase
    for start_pos in positions_list[0]:
        found = True
        current_pos = start_pos
        for positions in positions_list[1:]:
            current_pos += 1
            if current_pos not in positions:
                found = False
                break
        if found:
            return True
    return False
# Query: "quick brown"
# Doc1 positions: quick→[1], brown→[2]
# Positions differ by 1 → Match!
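search_phrase assumes a get_postings_with_positions helper that returns a {doc_id: [positions]} map for a term; a possible implementation against the add_document layout shown earlier:
python
# a possible implementation, as a method on the InvertedIndex class above
def get_postings_with_positions(self, term):
    """Map each doc_id to the positions of `term` in that document."""
    by_doc = {}
    for entry in self.index.get(self.normalize(term), []):
        # positions arrive in increasing order from add_document's enumerate()
        by_doc.setdefault(entry['doc_id'], []).append(entry['position'])
    return by_doc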
Wildcard Queries
python
# Permuterm index for wildcard queries
# For term "hello", store every rotation of "hello$":
#   hello$, ello$h, llo$he, lo$hel, o$hell, $hello
# Query "hel*" → rotate "hel*$" until * is at the end ("$hel*"),
# then prefix-search for "$hel"
class PermutermIndex:
    def __init__(self):
        self.permuterm = {}  # rotated form → original term

    def add_term(self, term):
        term_with_marker = term + '$'
        for i in range(len(term_with_marker)):
            rotated = term_with_marker[i:] + term_with_marker[:i]
            self.permuterm[rotated] = term

    def search_wildcard(self, pattern):
        """Search for a pattern containing a single * wildcard"""
        # Rotate the pattern until * is at the end
        pattern = pattern + '$'
        while not pattern.endswith('*'):
            pattern = pattern[-1] + pattern[:-1]
        # Drop the * and prefix-search the rotated terms
        prefix = pattern[:-1]
        # Linear scan for clarity; a real index keeps the rotations
        # sorted (e.g., in a B-tree) so prefix lookup is O(log n)
        matches = []
        for permuterm, term in self.permuterm.items():
            if permuterm.startswith(prefix):
                matches.append(term)
        return matches
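For example (the term list is illustrative):
python
pidx = PermutermIndex()
for t in ["hello", "help", "held", "apple"]:
    pidx.add_term(t)

print(sorted(pidx.search_wildcard("hel*")))  # ['held', 'hello', 'help']
print(pidx.search_wildcard("*pple"))         # ['apple']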
Index Compression
Gap Encoding
Original posting list: [1, 5, 23, 45, 67, 89]
Gap encoded: [1, 4, 18, 22, 22, 22]
First value stored as-is, subsequent values as difference from previous
1, 5-1=4, 23-5=18, 45-23=22, 67-45=22, 89-67=22
Benefits:
- Gaps are usually small numbers
- Small numbers compress better

Variable Byte Encoding
Encode integers using a variable number of bytes:
high bit = 1 means more bytes follow.

Number 5:   00000101            (1 byte)
Number 130: 10000010 00000001   (2 bytes, low-order 7 bits first: 2 + 1×128 = 130)
            ↑        ↑
            continue final

Savings for posting list [1, 5, 23, 45, 67, 89]:
Fixed 4-byte integers:  24 bytes
Variable byte encoded:   6 bytes (each gap fits in 1 byte)

Posting List Compression Example
python
class CompressedPostings:
    def compress(self, doc_ids):
        """Compress sorted doc IDs using gap encoding + variable byte"""
        if not doc_ids:
            return bytes()
        result = []
        prev = 0
        for doc_id in sorted(doc_ids):
            gap = doc_id - prev
            result.extend(self.encode_vbyte(gap))
            prev = doc_id
        return bytes(result)

    def encode_vbyte(self, n):
        """Variable byte encoding (low-order 7-bit groups first)"""
        bytes_list = []
        while n >= 128:
            bytes_list.append((n & 0x7F) | 0x80)  # set high bit: more bytes follow
            n >>= 7
        bytes_list.append(n)  # final byte (high bit = 0)
        return bytes_list

    def decode_vbyte(self, data, start):
        """Decode one variable-byte integer; returns (value, bytes_read)"""
        n, shift, i = 0, 0, start
        while True:
            byte = data[i]
            n |= (byte & 0x7F) << shift
            i += 1
            if not byte & 0x80:  # high bit clear → final byte
                return n, i - start
            shift += 7

    def decompress(self, compressed):
        """Decompress back to doc IDs"""
        doc_ids = []
        current_doc = 0
        i = 0
        while i < len(compressed):
            gap, bytes_read = self.decode_vbyte(compressed, i)
            current_doc += gap
            doc_ids.append(current_doc)
            i += bytes_read
        return doc_ids
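A round-trip check on the running example:
python
cp = CompressedPostings()
blob = cp.compress([1, 5, 23, 45, 67, 89])
print(len(blob))            # 6 bytes (one per gap) vs 24 as fixed 32-bit ints
print(cp.decompress(blob))  # [1, 5, 23, 45, 67, 89]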
Lucene Segment Architecture
Segment Structure
┌─────────────────────────────────────────────────────────────────┐
│ Lucene Index │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Segment 0 │ │ Segment 1 │ │ Segment 2 │ │
│ │ (immutable) │ │ (immutable) │ │ (immutable) │ ... │
│ │ │ │ │ │ │ │
│ │ • Term Dict │ │ • Term Dict │ │ • Term Dict │ │
│ │ • Postings │ │ • Postings │ │ • Postings │ │
│ │ • Stored Flds │ │ • Stored Flds │ │ • Stored Flds │ │
│ │ • Doc Values │ │ • Doc Values │ │ • Doc Values │ │
│ │ • Norms │ │ • Norms │ │ • Norms │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ Commits: List of segments to include in searches │
│ Deletes: Bitmap of deleted docs per segment │
└─────────────────────────────────────────────────────────────────┘
Write path:
1. Documents buffered in memory
2. Flush to new segment when buffer full
3. Segments are immutable once written
Delete path:
1. Mark document as deleted in bitmap
2. Document still exists in segment
3. Filtered out during search
4. Removed during segment merge
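A toy model of this write/delete flow (a sketch of the idea only, not Lucene's actual API; all names here are invented):
python
class Segment:
    """Immutable once flushed; deletes live in a side structure, not the segment."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id → text, never modified after flush
        self.deleted = set()     # stand-in for the per-segment delete bitmap

class ToyIndex:
    def __init__(self, buffer_limit=2):
        self.buffer = {}
        self.buffer_limit = buffer_limit
        self.segments = []

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_limit:      # flush when the buffer is full
            self.segments.append(Segment(self.buffer))
            self.buffer = {}

    def delete(self, doc_id):
        for seg in self.segments:                      # mark; never rewrite the segment
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self, term):
        # deleted docs are filtered at search time (buffered docs omitted for brevity)
        return [doc_id
                for seg in self.segments
                for doc_id, text in seg.docs.items()
                if doc_id not in seg.deleted and term in text.lower().split()]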
Segment Merging
Before merge:
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Seg 0 │ │ Seg 1 │ │ Seg 2 │ │ Seg 3 │ │ Seg 4 │
│ 100 docs│ │ 100 docs│ │ 100 docs│ │ 100 docs│ │ 100 docs│
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
A merge policy (e.g., Lucene's TieredMergePolicy) decides when to combine small segments into a larger one:
After merge:
┌───────────────────────────────────────────┐ ┌────────┐
│ Segment 5 │ │ Seg 4 │
│ 400 docs │ │ 100 docs│
│ (merged from Seg 0-3, deletes purged) │ │ │
└───────────────────────────────────────────┘ └────────┘
Benefits of merging:
- Reclaim space from deleted docs
- Reduce number of segments to search
- Improve compression
Costs:
- I/O intensive
- Temporary disk space
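Continuing the toy model above, a merge copies only the live documents of several segments into one new segment, which is the step that actually reclaims space from deletes:
python
def merge_segments(segments):
    """Toy merge: copy live docs into one new segment, purging deletes."""
    live = {doc_id: text
            for seg in segments
            for doc_id, text in seg.docs.items()
            if doc_id not in seg.deleted}
    return Segment(live)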
Distributed Indexing
Sharding Strategies
Document-based sharding (Elasticsearch default):
┌─────────────────────────────────────────────────────────────────┐
│ Index │
│ │
│ Shard 0 Shard 1 Shard 2 │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Doc1 │ │ Doc4 │ │ Doc7 │ │
│ │ Doc2 │ │ Doc5 │ │ Doc8 │ │
│ │ Doc3 │ │ Doc6 │ │ Doc9 │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Query goes to ALL shards, results merged │
└─────────────────────────────────────────────────────────────────┘
Term-based sharding (less common):
┌─────────────────────────────────────────────────────────────────┐
│ Index │
│ │
│ Shard 0 (A-H) Shard 1 (I-P) Shard 2 (Q-Z) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ apple → ... │ │ index → ... │ │ query → ... │ │
│ │ banana → ... │ │ java → ... │ │ search → ... │ │
│ │ cat → ... │ │ python → ... │ │ zebra → ... │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Query goes to specific shard(s) based on query terms │
└─────────────────────────────────────────────────────────────────┘
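With document-based sharding, the owning shard is derived from the document ID, so writers and readers agree without coordination; a sketch (Elasticsearch uses a murmur3 hash of the routing key modulo the primary-shard count; md5 stands in here):
python
import hashlib

def shard_for(doc_id, num_shards=3):
    """Route a document: hash its ID, then take modulo shard count."""
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# every writer and reader computes the same shard for the same ID
print(shard_for("Doc1"), shard_for("Doc2"), shard_for("Doc3"))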
Query Execution
Client Query: "search engine"
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Coordinating Node │
│ │
│ 1. Parse query │
│ 2. Determine which shards to query │
│ 3. Scatter request to shards │
└─────────────────────┬──────────────────────┬───────────────────┘
│ │
┌────────────┴──────────┐ ┌────────┴────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Shard 0 │ │ Shard 1 │ │ Shard 2 │
│ │ │ │ │ │
│ Local │ │ Local │ │ Local │
│ search │ │ search │ │ search │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
│ Top 10 results │ Top 10 results │
└──────────────────────┼──────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Coordinating Node │
│ │
│ 4. Gather results from all shards │
│ 5. Merge and re-rank (global top 10) │
│ 6. Fetch documents for top results │
│ 7. Return to client │
└─────────────────────────────────────────────────────────────────┘
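Step 5 (merge and re-rank) reduces to taking the global top-k over the shards' local top-k lists; a minimal sketch with invented scores:
python
import heapq

def merge_shard_results(per_shard_hits, k=10):
    """Coordinator-side gather: global top-k across all shards' local top-k."""
    return heapq.nlargest(k, (hit for hits in per_shard_hits for hit in hits))

shard0 = [(0.92, "doc3"), (0.85, "doc7")]   # (score, doc_id), locally sorted
shard1 = [(0.97, "doc12"), (0.60, "doc9")]
print(merge_shard_results([shard0, shard1], k=3))
# [(0.97, 'doc12'), (0.92, 'doc3'), (0.85, 'doc7')]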
Best Practices
Index Design
1. Choose appropriate analyzers
- Language-specific for text content
- Keyword for exact match fields (IDs, status)
2. Map field types correctly (see the mapping sketch after this list)
- text: Full-text searchable
- keyword: Exact match, aggregations
- numeric: Range queries, sorting
3. Denormalize for search
- Avoid joins at query time
- Store related data together
4. Use doc values for sorting/aggregations
- Column-oriented storage
- Much faster than fielddata
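As an illustration of point 2, this is the shape of an Elasticsearch field mapping (field names invented for the example):
python
mapping = {
    "properties": {
        "title":  {"type": "text", "analyzer": "english"},  # full-text searchable
        "status": {"type": "keyword"},                      # exact match, aggregations
        "price":  {"type": "float"},                        # range queries, sorting
    }
}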
Query Optimization
1. Use filters for non-scoring criteria
- Filters are cached
- Don't contribute to relevance score
2. Limit returned fields
- Only fetch what you need
- Use _source filtering
3. Paginate with search_after, not offset (see the sketch after this list)
- Offset requires computing and discarding all preceding results
- search_after resumes from the last hit's sort values, so cost does not grow with page depth
4. Use routing for related documents
- Colocate documents that are searched together
- Reduces fan-out to shards
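And for point 3, the shape of search_after pagination: sort deterministically, then feed the last hit's sort values into the next request (field names and values are invented):
python
page_1 = {
    "size": 10,
    "sort": [{"timestamp": "desc"}, {"doc_id": "asc"}],  # deterministic tiebreaker
    "query": {"match": {"title": "search engine"}},
}
# suppose the last hit on page 1 carried sort values [1684561804832, "doc-184"]
page_2 = {**page_1, "search_after": [1684561804832, "doc-184"]}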