Bigtable: A Distributed Storage System for Structured Data

Paper Overview

Authors: Fay Chang et al. (Google)
Published: OSDI 2006
Context: Google's internal storage for web indexing, Google Earth, Google Finance

TL;DR

Bigtable is a distributed storage system for managing structured data at petabyte scale. It provides a sparse, distributed, persistent multi-dimensional sorted map indexed by (row, column, timestamp). Key innovations include tablets for horizontal scaling, SSTable-based storage with in-memory buffering, single-row transactions without full ACID, and integration with GFS and Chubby for storage and coordination. Bigtable influenced virtually all modern NoSQL databases.

Problem Statement

Google needed storage for:

Web indexing: Crawled pages, links, anchor text (petabytes)
Google Earth: Satellite imagery at multiple resolutions
Google Finance: Time-series stock data
Personalization: User preferences across products

Requirements:

Petabyte scale, billions of rows
High throughput for batch processing
Low latency for interactive serving
Self-managing, automatic scaling
Integration with MapReduce

Data Model

┌─────────────────────────────────────────────────────────────────────────┐
│                    Bigtable Data Model                                   │
│                                                                          │
│   (row_key, column_key, timestamp) → value                              │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │   Example: Web Table                                              │  │
│   │                                                                   │  │
│   │   Row Key: Reversed URL (com.google.www)                         │  │
│   │                                                                   │  │
│   │   ┌────────────────┬──────────────────┬──────────────────────┐   │  │
│   │   │    Row Key     │   contents:html  │   anchor:cnn.com     │   │  │
│   │   │                │                  │   anchor:nyt.com     │   │  │
│   │   ├────────────────┼──────────────────┼──────────────────────┤   │  │
│   │   │                │ t=5: "<html>..." │ t=9: "CNN"           │   │  │
│   │   │ com.google.www │ t=3: "<html>..." │ t=8: "click here"    │   │  │
│   │   │                │ t=1: "<html>..." │                      │   │  │
│   │   ├────────────────┼──────────────────┼──────────────────────┤   │  │
│   │   │ com.cnn.www    │ t=6: "<html>..." │ t=5: "CNN Home"      │   │  │
│   │   └────────────────┴──────────────────┴──────────────────────┘   │  │
│   │                                                                   │  │
│   │   Key Properties:                                                │  │
│   │   - Rows sorted lexicographically                               │  │
│   │   - Column families: contents, anchor                           │  │
│   │   - Columns: contents:html, anchor:cnn.com                      │  │
│   │   - Multiple versions per cell (timestamped)                    │  │
│   │   - Sparse: no storage for empty cells                          │  │
│   └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Key Design Choices

┌─────────────────────────────────────────────────────────────────────────┐
│   Rows:                                                                 │
│   - Arbitrary string (up to 64KB)                                      │
│   - Single-row transactions (atomic read-modify-write)                 │
│   - Sorted by row key (locality for range scans)                       │
│                                                                          │
│   Column Families:                                                      │
│   - Groups of columns with common properties                           │
│   - Unit of access control                                             │
│   - Created at schema time (not dynamic)                               │
│   - Example: "anchor" family, "contents" family                        │
│                                                                          │
│   Columns:                                                              │
│   - Format: family:qualifier                                           │
│   - Qualifiers can be created dynamically                              │
│   - Example: anchor:cnn.com, anchor:nytimes.com                        │
│                                                                          │
│   Timestamps:                                                           │
│   - 64-bit integers (microseconds or user-defined)                     │
│   - Multiple versions per cell                                         │
│   - Garbage collection: keep last N versions or versions newer than T  │
└─────────────────────────────────────────────────────────────────────────┘

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                    Bigtable Architecture                                 │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                        Clients                                    │  │
│   │                          │                                        │  │
│   │        ┌─────────────────┴─────────────────┐                     │  │
│   │        ▼                                   ▼                     │  │
│   │   ┌─────────────┐                    ┌─────────────┐            │  │
│   │   │  Bigtable   │                    │  Bigtable   │            │  │
│   │   │  Library    │                    │  Library    │            │  │
│   │   └─────────────┘                    └─────────────┘            │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                          │                                              │
│         ┌────────────────┼────────────────┐                            │
│         ▼                ▼                ▼                            │
│   ┌──────────────────────────────────────────────────────────────────┐ │
│   │                    Tablet Servers                                 │ │
│   │                                                                   │ │
│   │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │ │
│   │   │   Tablet    │  │   Tablet    │  │   Tablet    │ ...         │ │
│   │   │   Server    │  │   Server    │  │   Server    │             │ │
│   │   └─────────────┘  └─────────────┘  └─────────────┘             │ │
│   │         │                │                │                       │ │
│   │         └────────────────┼────────────────┘                       │ │
│   │                          │                                        │ │
│   │   Each tablet server manages 10-1000 tablets                     │ │
│   │   Each tablet: 100-200 MB of data                                │ │
│   └──────────────────────────────────────────────────────────────────┘ │
│                          │                                              │
│         ┌────────────────┼────────────────┐                            │
│         ▼                ▼                ▼                            │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────────┐   │
│   │   Chubby    │  │   Master    │  │           GFS               │   │
│   │  (Locking)  │  │  (Tablet    │  │   (Persistent Storage)      │   │
│   │             │  │  Assignment)│  │                             │   │
│   └─────────────┘  └─────────────┘  └─────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

Tablets

┌─────────────────────────────────────────────────────────────────────────┐
│                    Tablet: Unit of Distribution                          │
│                                                                          │
│   A tablet is a contiguous range of rows                                │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │   Table: webtable                                                 │  │
│   │                                                                   │  │
│   │   ┌────────────────────────────────────────────────────────────┐ │  │
│   │   │ Tablet 1: rows [aaa...] to [com.facebook...]              │ │  │
│   │   │ Tablet 2: rows [com.google...] to [com.yahoo...]          │ │  │
│   │   │ Tablet 3: rows [com.youtube...] to [org.apache...]        │ │  │
│   │   │ Tablet 4: rows [org.wikipedia...] to [zzz...]             │ │  │
│   │   └────────────────────────────────────────────────────────────┘ │  │
│   │                                                                   │  │
│   │   As tablets grow, they split:                                   │  │
│   │   Tablet 2 → Tablet 2a [com.google...] to [com.twitter...]      │  │
│   │              Tablet 2b [com.uber...] to [com.yahoo...]          │  │
│   │                                                                   │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│   Benefits:                                                             │
│   - Fine-grained load balancing (move tablets between servers)         │
│   - Efficient recovery (only restore failed tablet, not whole table)   │
│   - Parallel processing (distribute tablets across cluster)            │
└─────────────────────────────────────────────────────────────────────────┘

Tablet Location

┌─────────────────────────────────────────────────────────────────────────┐
│                    Three-Level Tablet Location Hierarchy                 │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                                                                   │  │
│   │   Level 0: Chubby File                                           │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │ /bigtable/root_tablet_location → Tablet Server X        │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                          │                                        │  │
│   │                          ▼                                        │  │
│   │   Level 1: Root Tablet (part of METADATA table)                  │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │ Row: METADATA tablet key                                 │    │  │
│   │   │ Value: Location of METADATA tablet                       │    │  │
│   │   │                                                          │    │  │
│   │   │ Points to all other METADATA tablets                     │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                          │                                        │  │
│   │                          ▼                                        │  │
│   │   Level 2: Other METADATA Tablets                                │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │ Row: Table name + end row key                            │    │  │
│   │   │ Value: Tablet server location                            │    │  │
│   │   │                                                          │    │  │
│   │   │ Contains location of all user tablets                    │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                          │                                        │  │
│   │                          ▼                                        │  │
│   │   Level 3: User Tablets                                          │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │ Actual user data                                         │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│   Client caches tablet locations; cache invalidated on miss            │
│   3 network round trips for cold cache, 0 for warm cache               │
└─────────────────────────────────────────────────────────────────────────┘

SSTable Storage Format

┌─────────────────────────────────────────────────────────────────────────┐
│                    SSTable (Sorted String Table)                         │
│                                                                          │
│   Immutable, sorted file of key-value pairs                             │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                     SSTable File                                  │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                   Data Blocks                            │    │  │
│   │   │   ┌───────────┐ ┌───────────┐ ┌───────────┐             │    │  │
│   │   │   │  Block 0  │ │  Block 1  │ │  Block 2  │ ...         │    │  │
│   │   │   │  64KB     │ │  64KB     │ │  64KB     │             │    │  │
│   │   │   │ (sorted   │ │ (sorted   │ │ (sorted   │             │    │  │
│   │   │   │  k-v pairs)│ │  k-v pairs)│ │  k-v pairs)│            │    │  │
│   │   │   └───────────┘ └───────────┘ └───────────┘             │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                   Index Block                            │    │  │
│   │   │   Block 0: key "aaa" at offset 0                        │    │  │
│   │   │   Block 1: key "bbb" at offset 65536                    │    │  │
│   │   │   Block 2: key "ccc" at offset 131072                   │    │  │
│   │   │   ...                                                    │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                   Bloom Filter                           │    │  │
│   │   │   Fast "key might exist" / "key definitely doesn't"     │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                   Footer                                 │    │  │
│   │   │   Offset to index block, bloom filter                   │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│   Read path:                                                            │
│   1. Load index into memory (small)                                     │
│   2. Binary search index for block containing key                       │
│   3. Load and search that one block                                     │
└─────────────────────────────────────────────────────────────────────────┘

Tablet Serving

┌─────────────────────────────────────────────────────────────────────────┐
│                    Tablet Server Internals                               │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                     Write Path                                    │  │
│   │                                                                   │  │
│   │   1. Check authorization (tablet server handles)                 │  │
│   │   2. Write to commit log (append to GFS)                        │  │
│   │   3. Insert into memtable (sorted in-memory buffer)             │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                    Memtable                              │    │  │
│   │   │   - Red-black tree or skip list                         │    │  │
│   │   │   - Recently written data                                │    │  │
│   │   │   - When full, frozen and flushed to SSTable            │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                     Read Path                                     │  │
│   │                                                                   │  │
│   │   Merge view of:                                                 │  │
│   │   1. Memtable (newest data)                                     │  │
│   │   2. Immutable memtable (being flushed)                         │  │
│   │   3. SSTables on disk (older data)                              │  │
│   │                                                                   │  │
│   │   ┌─────────────────────────────────────────────────────────┐    │  │
│   │   │                                                          │    │  │
│   │   │    Memtable ──┐                                         │    │  │
│   │   │               │                                         │    │  │
│   │   │  Immutable ───┼───▶ Merge Iterator ───▶ Result         │    │  │
│   │   │  Memtable     │                                         │    │  │
│   │   │               │                                         │    │  │
│   │   │  SSTable 1 ───┤                                         │    │  │
│   │   │  SSTable 2 ───┤                                         │    │  │
│   │   │  SSTable 3 ───┘                                         │    │  │
│   │   │                                                          │    │  │
│   │   └─────────────────────────────────────────────────────────┘    │  │
│   │                                                                   │  │
│   └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Implementation

python

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Iterator, Tuple
import time
from sortedcontainers import SortedDict

@dataclass
class Cell:
    value: bytes
    timestamp: int

@dataclass
class ColumnFamily:
    name: str
    columns: Dict[str, List[Cell]] = field(default_factory=dict)
    max_versions: int = 3


class Memtable:
    """
    In-memory sorted buffer for recent writes.
    Implemented as sorted map.
    """
    
    def __init__(self, max_size_bytes: int = 64 * 1024 * 1024):
        self.data: SortedDict = SortedDict()  # row_key -> {cf -> {col -> [cells]}}
        self.size_bytes = 0
        self.max_size = max_size_bytes
        self.frozen = False
    
    def put(
        self,
        row_key: str,
        column_family: str,
        column: str,
        value: bytes,
        timestamp: Optional[int] = None
    ):
        """Insert or update a cell"""
        if self.frozen:
            raise Exception("Memtable is frozen")
        
        if timestamp is None:
            timestamp = int(time.time() * 1000000)
        
        if row_key not in self.data:
            self.data[row_key] = {}
        
        if column_family not in self.data[row_key]:
            self.data[row_key][column_family] = {}
        
        if column not in self.data[row_key][column_family]:
            self.data[row_key][column_family][column] = []
        
        cell = Cell(value=value, timestamp=timestamp)
        self.data[row_key][column_family][column].append(cell)
        
        # Sort by timestamp descending (newest first)
        self.data[row_key][column_family][column].sort(
            key=lambda c: c.timestamp, 
            reverse=True
        )
        
        self.size_bytes += len(row_key) + len(column_family) + len(column) + len(value) + 8
    
    def get(
        self,
        row_key: str,
        column_family: Optional[str] = None,
        column: Optional[str] = None,
        max_versions: int = 1
    ) -> Optional[Dict]:
        """Get cells for a row"""
        if row_key not in self.data:
            return None
        
        row_data = self.data[row_key]
        
        if column_family and column_family in row_data:
            if column and column in row_data[column_family]:
                return {column_family: {column: row_data[column_family][column][:max_versions]}}
            return {column_family: row_data[column_family]}
        
        return row_data
    
    def scan(
        self,
        start_row: str,
        end_row: str
    ) -> Iterator[Tuple[str, Dict]]:
        """Scan rows in range"""
        for row_key in self.data.irange(start_row, end_row, inclusive=(True, False)):
            yield row_key, self.data[row_key]
    
    def is_full(self) -> bool:
        return self.size_bytes >= self.max_size
    
    def freeze(self):
        """Freeze memtable for flushing"""
        self.frozen = True


class SSTableWriter:
    """Writes memtable to SSTable file format"""
    
    def __init__(self, file_path: str, block_size: int = 64 * 1024):
        self.file_path = file_path
        self.block_size = block_size
        self.index_entries = []
        self.bloom_filter = BloomFilter(expected_items=100000)
    
    def write(self, memtable: Memtable):
        """Write memtable to SSTable file"""
        with open(self.file_path, 'wb') as f:
            current_block = []
            current_block_size = 0
            block_offset = 0
            first_key_in_block = None
            
            for row_key, row_data in memtable.data.items():
                # Serialize row
                row_bytes = self._serialize_row(row_key, row_data)
                
                # Add to bloom filter
                self.bloom_filter.add(row_key)
                
                # Check if we need new block
                if current_block_size + len(row_bytes) > self.block_size and current_block:
                    # Write current block
                    block_data = b''.join(current_block)
                    f.write(block_data)
                    
                    # Record in index
                    self.index_entries.append((first_key_in_block, block_offset))
                    
                    block_offset += len(block_data)
                    current_block = []
                    current_block_size = 0
                    first_key_in_block = None
                
                if first_key_in_block is None:
                    first_key_in_block = row_key
                
                current_block.append(row_bytes)
                current_block_size += len(row_bytes)
            
            # Write last block
            if current_block:
                block_data = b''.join(current_block)
                f.write(block_data)
                self.index_entries.append((first_key_in_block, block_offset))
                block_offset += len(block_data)
            
            # Write index
            index_offset = block_offset
            f.write(self._serialize_index())
            
            # Write bloom filter
            bloom_offset = f.tell()
            f.write(self.bloom_filter.serialize())
            
            # Write footer
            f.write(self._serialize_footer(index_offset, bloom_offset))


class SSTableReader:
    """Reads from SSTable file"""
    
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.index = []
        self.bloom_filter = None
        self._load_metadata()
    
    def _load_metadata(self):
        """Load index and bloom filter into memory"""
        with open(self.file_path, 'rb') as f:
            # Read footer
            f.seek(-16, 2)  # Last 16 bytes
            footer = f.read(16)
            index_offset, bloom_offset = struct.unpack('QQ', footer)
            
            # Load index
            f.seek(index_offset)
            index_data = f.read(bloom_offset - index_offset)
            self.index = self._parse_index(index_data)
            
            # Load bloom filter
            f.seek(bloom_offset)
            bloom_data = f.read()
            bloom_data = bloom_data[:-16]  # Exclude footer
            self.bloom_filter = BloomFilter.deserialize(bloom_data)
    
    def get(self, row_key: str) -> Optional[Dict]:
        """Get a row from SSTable"""
        # Check bloom filter first
        if not self.bloom_filter.might_contain(row_key):
            return None
        
        # Binary search index for block
        block_idx = self._find_block(row_key)
        if block_idx < 0:
            return None
        
        # Read and search block
        block = self._read_block(block_idx)
        return self._search_block(block, row_key)
    
    def _find_block(self, row_key: str) -> int:
        """Binary search to find block containing key"""
        left, right = 0, len(self.index) - 1
        result = -1
        
        while left <= right:
            mid = (left + right) // 2
            if self.index[mid][0] <= row_key:
                result = mid
                left = mid + 1
            else:
                right = mid - 1
        
        return result


class TabletServer:
    """
    Manages multiple tablets.
    Handles reads, writes, and compaction.
    """
    
    def __init__(self, server_id: str, gfs_client, chubby_client):
        self.server_id = server_id
        self.gfs = gfs_client
        self.chubby = chubby_client
        self.tablets: Dict[str, Tablet] = {}
    
    def load_tablet(self, tablet_id: str, metadata: dict):
        """Load a tablet from GFS"""
        tablet = Tablet(
            tablet_id=tablet_id,
            start_row=metadata['start_row'],
            end_row=metadata['end_row'],
            sstable_files=metadata['sstables'],
            gfs_client=self.gfs
        )
        
        # Replay commit log
        tablet.replay_log(metadata['commit_log'])
        
        self.tablets[tablet_id] = tablet
    
    def write(
        self,
        tablet_id: str,
        row_key: str,
        mutations: List[dict]
    ):
        """Write to a tablet"""
        tablet = self.tablets[tablet_id]
        
        # Write to commit log first (durability)
        log_entry = self._create_log_entry(row_key, mutations)
        tablet.commit_log.append(log_entry)
        
        # Apply to memtable
        for mutation in mutations:
            tablet.memtable.put(
                row_key=row_key,
                column_family=mutation['cf'],
                column=mutation['column'],
                value=mutation['value'],
                timestamp=mutation.get('timestamp')
            )
        
        # Check if memtable needs flushing
        if tablet.memtable.is_full():
            self._flush_memtable(tablet)
    
    def read(
        self,
        tablet_id: str,
        row_key: str,
        columns: Optional[List[str]] = None
    ) -> Optional[Dict]:
        """Read from a tablet"""
        tablet = self.tablets[tablet_id]
        
        # Merge from all sources
        result = {}
        
        # Check memtable (newest)
        mem_result = tablet.memtable.get(row_key)
        if mem_result:
            result = self._merge_results(result, mem_result)
        
        # Check immutable memtable
        if tablet.immutable_memtable:
            imm_result = tablet.immutable_memtable.get(row_key)
            if imm_result:
                result = self._merge_results(result, imm_result)
        
        # Check SSTables (oldest to newest)
        for sstable in tablet.sstables:
            sst_result = sstable.get(row_key)
            if sst_result:
                result = self._merge_results(result, sst_result)
        
        return result if result else None
    
    def _flush_memtable(self, tablet: 'Tablet'):
        """Flush memtable to SSTable"""
        # Freeze current memtable
        tablet.memtable.freeze()
        tablet.immutable_memtable = tablet.memtable
        tablet.memtable = Memtable()
        
        # Write to SSTable in background
        sstable_path = f"{tablet.tablet_id}_{time.time()}.sst"
        writer = SSTableWriter(sstable_path)
        writer.write(tablet.immutable_memtable)
        
        # Add to tablet's SSTables
        tablet.sstables.append(SSTableReader(sstable_path))
        tablet.immutable_memtable = None
        
        # Clear commit log
        tablet.commit_log.clear()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326

Compaction

┌─────────────────────────────────────────────────────────────────────────┐
│                    Compaction Types                                      │
│                                                                          │
│   Minor Compaction:                                                     │
│   - Convert memtable to SSTable                                         │
│   - Frees memory                                                        │
│   - Reduces commit log                                                  │
│                                                                          │
│   Merging Compaction:                                                   │
│   - Merge multiple SSTables into one                                    │
│   - Reduces number of files to check on read                           │
│   - Reclaims space from deletions                                       │
│                                                                          │
│   Major Compaction:                                                     │
│   - Merge ALL SSTables into single SSTable                             │
│   - Removes all deleted data and old versions                          │
│   - Most expensive, done periodically                                   │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │   Before Merging:                                                 │  │
│   │   SST1: [(a,1), (b,1), (c,1)]                                    │  │
│   │   SST2: [(a,2), (d,1)]         a has newer version in SST2       │  │
│   │   SST3: [(b, DELETE), (e,1)]   b is deleted                      │  │
│   │                                                                   │  │
│   │   After Major Compaction:                                        │  │
│   │   SST: [(a,2), (c,1), (d,1), (e,1)]                              │  │
│   │                                                                   │  │
│   │   - Old version of 'a' removed                                   │  │
│   │   - 'b' deleted entirely                                         │  │
│   │   - Space reclaimed                                              │  │
│   └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Locality Groups

┌─────────────────────────────────────────────────────────────────────────┐
│                    Locality Groups                                       │
│                                                                          │
│   Column families can be grouped into locality groups                   │
│   Each locality group stored in separate SSTable                        │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │   Example: Web crawl table                                        │  │
│   │                                                                   │  │
│   │   Locality Group 1: "contents"                                   │  │
│   │   - Column family: contents                                      │  │
│   │   - Large, read infrequently                                     │  │
│   │   - Stored on disk                                               │  │
│   │                                                                   │  │
│   │   Locality Group 2: "metadata"                                   │  │
│   │   - Column families: language, checksum, links                   │  │
│   │   - Small, read frequently                                       │  │
│   │   - Cached in memory                                             │  │
│   │                                                                   │  │
│   │   Benefits:                                                      │  │
│   │   - Read language without loading contents                       │  │
│   │   - Better cache utilization                                     │  │
│   │   - Different compression for different types                    │  │
│   │                                                                   │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│   Options per locality group:                                           │
│   - In-memory (for small, frequently accessed data)                    │
│   - Compression algorithm (none, gzip, snappy)                         │
│   - Bloom filter settings                                              │
└─────────────────────────────────────────────────────────────────────────┘

Master Operations

python

class BigtableMaster:
    """
    Master server responsibilities:
    - Tablet assignment to servers
    - Detecting tablet server failures
    - Load balancing tablets
    - Schema changes
    """
    
    def __init__(self, chubby_client):
        self.chubby = chubby_client
        self.tablet_servers: Dict[str, TabletServerInfo] = {}
        self.tablet_assignments: Dict[str, str] = {}  # tablet_id -> server_id
        
        # Acquire master lock in Chubby
        self.master_lock = self.chubby.acquire("/bigtable/master")
    
    def assign_tablet(self, tablet_id: str):
        """Assign an unassigned tablet to a server"""
        # Find server with least load
        target_server = self._find_least_loaded_server()
        
        # Update METADATA table
        self._update_metadata(tablet_id, target_server)
        
        # Notify tablet server
        self._send_assignment(target_server, tablet_id)
        
        self.tablet_assignments[tablet_id] = target_server
    
    def monitor_tablet_servers(self):
        """Monitor tablet server health via Chubby"""
        # Each tablet server holds exclusive lock on file in Chubby
        # Lock directory: /bigtable/servers/{server_id}
        
        while True:
            for server_id, info in list(self.tablet_servers.items()):
                lock_path = f"/bigtable/servers/{server_id}"
                
                if not self.chubby.file_exists(lock_path):
                    # Server lost its lock - it's dead
                    self._handle_server_failure(server_id)
            
            time.sleep(1)
    
    def _handle_server_failure(self, server_id: str):
        """Reassign tablets from failed server"""
        # Find tablets assigned to failed server
        tablets_to_reassign = [
            tid for tid, sid in self.tablet_assignments.items()
            if sid == server_id
        ]
        
        # Remove server from active set
        del self.tablet_servers[server_id]
        
        # Reassign each tablet
        for tablet_id in tablets_to_reassign:
            del self.tablet_assignments[tablet_id]
            self.assign_tablet(tablet_id)
    
    def split_tablet(self, tablet_id: str, split_key: str):
        """Split a tablet that has grown too large"""
        current_server = self.tablet_assignments[tablet_id]
        
        # Create two new tablet entries in METADATA
        new_tablet_id_1 = f"{tablet_id}_a"
        new_tablet_id_2 = f"{tablet_id}_b"
        
        # Old tablet: [start, end) → [start, split_key) and [split_key, end)
        
        # Update METADATA
        self._create_tablet_metadata(new_tablet_id_1, start_row, split_key)
        self._create_tablet_metadata(new_tablet_id_2, split_key, end_row)
        
        # Assign tablets (may go to different servers for load balance)
        self.assign_tablet(new_tablet_id_1)
        self.assign_tablet(new_tablet_id_2)

Optimizations

Bloom Filters

Avoid disk reads for non-existent keys

┌──────────────────────────────────────────────────────────────────┐
│ Before:                                                          │
│   Read "xyz" → check every SSTable → not found                  │
│   Cost: Multiple disk reads                                      │
│                                                                  │
│ After:                                                           │
│   Read "xyz" → bloom filter says "definitely not" → skip file   │
│   Cost: In-memory bit check                                      │
│                                                                  │
│ False positive rate ~1% with 10 bits per key                    │
└──────────────────────────────────────────────────────────────────┘

Block Cache

Two-level caching:
1. Scan cache: For sequential reads (MapReduce)
2. Block cache: For random reads (serving)

Trade-off configurable per table based on workload

Commit Log Optimization

┌──────────────────────────────────────────────────────────────────┐
│ Problem:                                                         │
│   Separate log per tablet → many concurrent writes to GFS       │
│                                                                  │
│ Solution:                                                        │
│   Single log per tablet server, interleaved entries             │
│                                                                  │
│ Recovery complexity:                                             │
│   Sort log by (tablet, sequence) → parallel replay per tablet   │
└──────────────────────────────────────────────────────────────────┘

Real-World Usage at Google

Application	Data Size	Notes
Google Analytics	200 TB	Click stream data
Google Earth	70 TB	Satellite imagery
Personalized Search	4 PB	User indices
Crawl	800 TB	Web pages
Orkut	9 TB	Social graph

Influence and Legacy

Direct Descendants

Apache HBase - Open source Bigtable implementation
Apache Cassandra - Combined Bigtable data model with Dynamo distribution

Concepts Adopted

SSTable format (RocksDB, LevelDB)
LSM-tree storage (most modern databases)
Tablet-based sharding
Column families
Multi-version timestamps

Key Takeaways

Simple data model, complex implementation - Sparse map with timestamps is easy to understand, powerful in practice.
Single-row transactions sufficient - Most applications don't need cross-row ACID.
SSTables + memtables = LSM tree - Write-optimized storage with good read performance.
Locality groups optimize access patterns - Separate storage for different column families.
Tablets enable fine-grained scaling - Move, split, and replicate at tablet granularity.
Bloom filters essential for sparse data - Avoid disk reads for absent keys.
Build on reliable components - GFS for storage, Chubby for coordination.
Compression and caching are critical - Trade CPU for I/O at this scale.

Bigtable: A Distributed Storage System for Structured Data ​

Paper Overview ​

TL;DR ​

Problem Statement ​

Data Model ​

Key Design Choices ​

Architecture ​

Tablets ​

Tablet Location ​

SSTable Storage Format ​

Tablet Serving ​

Implementation ​

Compaction ​

Locality Groups ​

Master Operations ​

Optimizations ​

Bloom Filters ​

Block Cache ​

Commit Log Optimization ​

Real-World Usage at Google ​

Influence and Legacy ​

Direct Descendants ​

Concepts Adopted ​

Key Takeaways ​