SSTableとコンパクション

この記事は英語版から翻訳されました。最新版は英語版をご覧ください。

TL;DR

SSTable（Sorted String Table）は、LSM木のディスク層を構成するイミュータブルなソート済みファイルです。コンパクションはSSTableをマージして、空間を回収し、古いデータを削除し、読み取りパフォーマンスを維持します。適切なコンパクション戦略（サイズ階層型、レベル型、FIFO）の選択は、ワークロードの読み取り/書き込みバランスとレイテンシ要件に依存します。

SSTableの構造

ファイルレイアウト

┌─────────────────────────────────────────────────────────────┐
│                        SSTable File                        │
├─────────────────────────────────────────────────────────────┤
│  Data Block 1  │  Data Block 2  │  ...  │  Data Block N    │
├─────────────────────────────────────────────────────────────┤
│               Meta Block (Bloom Filter)                    │
├─────────────────────────────────────────────────────────────┤
│                    Index Block                              │
├─────────────────────────────────────────────────────────────┤
│                       Footer                                │
└─────────────────────────────────────────────────────────────┘

データブロック

Block structure:
┌────────────────────────────────────────┐
│  Entry 1  │  Entry 2  │  ...  │ Entry N │
├────────────────────────────────────────┤
│           Restart Points               │
├────────────────────────────────────────┤
│  Num Restarts (4 bytes)  │  CRC (4)    │
└────────────────────────────────────────┘

Entry format:
  [shared_prefix_len][unshared_len][value_len][unshared_key][value]

Prefix compression:
  Keys often share prefixes
  Only store the difference
  "user:1000" → "user:1001" stored as "01" suffix

インデックスブロック

Sparse index:
  Every N-th key → block offset

┌───────────────────────────────────┐
│  "aaa" → Block 0 @ offset 0       │
│  "abc" → Block 1 @ offset 4096    │
│  "def" → Block 2 @ offset 8192    │
│  ...                              │
└───────────────────────────────────┘

Lookup:
  1. Binary search index for >= key
  2. Read data block
  3. Binary search within block

フッタ

┌─────────────────────────────────────────┐
│  Metaindex Handle (offset, size)        │
│  Index Handle (offset, size)            │
│  Padding                                │
│  Magic Number (8 bytes)                 │
└─────────────────────────────────────────┘

Fixed size, always at end of file
Contains pointers to index and meta blocks

SSTableの操作

SSTableの作成（フラッシュ）

python

def create_sstable(memtable, filename):
    writer = SSTableWriter(filename)

    # Iterate memtable in sorted order
    for key, value in memtable.sorted_iterator():
        writer.add(key, value)

    # Finalize: write index, bloom filter, footer
    writer.finish()

    return SSTable(filename)

class SSTableWriter:
    def __init__(self, filename):
        self.file = open(filename, 'wb')
        self.index = []
        self.bloom_filter = BloomFilter()
        self.current_block = DataBlock()
        self.block_offset = 0

    def add(self, key, value):
        self.bloom_filter.add(key)

        if self.current_block.size() > BLOCK_SIZE:
            self.flush_block()

        self.current_block.add(key, value)

    def flush_block(self):
        # Write block
        data = self.current_block.serialize()
        self.file.write(data)

        # Record in index
        first_key = self.current_block.first_key
        self.index.append((first_key, self.block_offset))

        self.block_offset += len(data)
        self.current_block = DataBlock()

SSTableからの読み取り

python

class SSTable:
    def get(self, key):
        # Check bloom filter first
        if not self.bloom_filter.might_contain(key):
            return None  # Definitely not here

        # Binary search in index
        block_offset = self.find_block(key)

        # Read and search block
        block = self.read_block(block_offset)
        return block.search(key)

    def find_block(self, key):
        # Binary search for last index entry <= key
        lo, hi = 0, len(self.index) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.index[mid].key <= key:
                lo = mid
            else:
                hi = mid - 1
        return self.index[lo].offset

範囲スキャン

python

def scan(self, start_key, end_key):
    # Find starting block
    block_offset = self.find_block(start_key)

    results = []
    while block_offset < self.data_size:
        block = self.read_block(block_offset)

        for key, value in block.entries():
            if key > end_key:
                return results
            if key >= start_key:
                results.append((key, value))

        block_offset = self.next_block_offset(block_offset)

    return results

コンパクション戦略

サイズ階層型コンパクション（STCS）

Group SSTables by size, merge when count exceeds threshold.

Before:
  Tier 0 (small): [1MB] [1MB] [1MB] [1MB]  ← 4 files, trigger!
  Tier 1 (medium): [4MB]
  Tier 2 (large): [16MB]

After:
  Tier 0: []
  Tier 1: [4MB] [4MB]  ← merged output
  Tier 2: [16MB]

アルゴリズム：

python

def size_tiered_compaction():
    for tier in tiers:
        if tier.file_count >= MIN_THRESHOLD:
            # Pick similar-sized files
            files = tier.select_files(MIN_THRESHOLD)

            # Merge into new file
            merged = merge(files)

            # Add to appropriate tier
            next_tier = find_tier_for_size(merged.size)
            next_tier.add(merged)

            # Remove old files
            tier.remove(files)

特性：

側面	値
書き込み増幅	約3-5倍
空間増幅	最大2倍
読み取り増幅	O(tiers × files per tier)
最適な用途	書き込み中心のワークロード

レベル型コンパクション（LCS）

Organize files into levels with size limits.
Each level (except L0) has non-overlapping key ranges.

Level 0: 4 files max (may overlap)
Level 1: 10 MB total
Level 2: 100 MB total
Level 3: 1000 MB total
...

Compaction:
  When level exceeds limit, pick file and merge with overlapping files in next level.

Before:
  L0: [a-d] [c-f] [e-h]  ← overlapping
  L1: [a-c] [d-f] [g-i]  ← non-overlapping

L0 exceeds limit, compact [c-f]:
  Overlaps with L1's [d-f]
  Merge them

After:
  L0: [a-d] [e-h]
  L1: [a-c] [c-f'] [g-i]  ← f' is merged result

アルゴリズム：

python

def leveled_compaction():
    for level in range(MAX_LEVELS - 1):
        if level_size(level) > size_limit(level):
            # Pick file to compact (round-robin or by overlap)
            file = pick_file_to_compact(level)

            # Find overlapping files in next level
            overlapping = find_overlapping(level + 1, file.key_range)

            # Merge
            merged_files = merge(file, overlapping)

            # Add to next level
            add_to_level(level + 1, merged_files)

            # Remove old
            remove_files([file] + overlapping)

特性：

側面	値
書き込み増幅	約10-30倍
空間増幅	約1.1倍
読み取り増幅	O(levels)
最適な用途	読み取り中心、空間効率重視

FIFOコンパクション

Delete oldest files when total size exceeds limit.
No merging, just time-based deletion.

├── newest_file.sst
├── file_2.sst
├── file_3.sst
├── oldest_file.sst  ← Delete when over limit

ユースケース：

TTL付きの時系列データ
N日後に期限切れになるログ
固定保持期間のメトリクス

タイムウィンドウコンパクション（TWCS）

Combine STCS within time windows, FIFO across windows.

Time windows:
  [Today]  [Yesterday]  [2 days ago]  [3 days ago] ...

Within each window: Size-tiered compaction
When window expires: Drop entirely

Good for time-series with TTL
Avoids mixing old and new data

マージプロセス

N方向マージ

python

def merge(sstables):
    # Use min-heap for efficient N-way merge
    heap = MinHeap()
    iterators = [sst.iterator() for sst in sstables]

    # Initialize heap with first entry from each
    for i, it in enumerate(iterators):
        if it.valid():
            heap.push((it.key(), it.value(), i))
            it.next()

    writer = SSTableWriter()
    last_key = None

    while not heap.empty():
        key, value, sst_idx = heap.pop()

        # Keep only latest version of each key
        if key != last_key:
            if not is_tombstone(value) or not at_bottom_level:
                writer.add(key, value)
            last_key = key

        # Advance that SSTable's iterator
        it = iterators[sst_idx]
        if it.valid():
            heap.push((it.key(), it.value(), sst_idx))
            it.next()

    return writer.finish()

バージョンの処理

Same key in multiple SSTables:
  [key: "a", value: "v1", seq: 100]  ← older
  [key: "a", value: "v2", seq: 150]  ← newer

Merge keeps seq: 150 (latest)
But if snapshot active at seq: 120, must keep both!

トゥームストーンの処理

Tombstone (delete marker):
  [key: "a", TOMBSTONE, seq: 200]

Can only be garbage collected when:
  - At bottom level
  - All older versions are in this compaction
  - No active snapshots before tombstone seq

コンパクションのスケジューリング

いつコンパクションするか

Triggers:
  1. L0 file count exceeds threshold
  2. Level size exceeds limit
  3. Tombstone ratio too high
  4. Manual trigger

Priority:
  L0 compaction > Other levels
  (L0 blocks writes when too many files)

レート制限

Problem: Compaction uses disk I/O
  - Competes with foreground reads/writes
  - Can cause latency spikes

Solutions:
  - Rate limit compaction I/O (bytes/sec)
  - Priority I/O scheduling (ionice)
  - Adaptive rate based on workload

スレッドプール

Typical configuration:
  - 1-2 threads for L0→L1 (critical path)
  - Multiple threads for other levels
  - Separate thread pool for flushing

コンパクションのトレードオフ

書き込み増幅の詳細

Leveled with ratio R=10:

Key written to MemTable: 1 write
Flushed to L0: 1 write
Compacted L0→L1: ~1 write
Compacted L1→L2: ~10 writes (merged with 10 L1 files)
Compacted L2→L3: ~10 writes
...

Total: 1 + 1 + 1 + 10 + 10 + 10 + ... = O(10 * L)

For 4 levels: ~40x write amplification

空間増幅の詳細

Size-tiered worst case:
  4 files in tier, about to merge
  + space for merged output
  = 2x space temporarily

Leveled:
  Non-overlapping files
  Only small compaction in progress
  ~1.1x typical

読み取り増幅の詳細

Size-tiered:
  Check all tiers (O(T))
  Check all files in tier (O(F))
  Total: O(T * F)

Leveled:
  L0: Check all files (small count)
  L1+: One file per level (non-overlapping)
  Total: O(L0 files + Levels)

Bloom filters reduce actual disk reads significantly

コンパクションのモニタリング

主要メトリクス

Compaction pending bytes:
  How much work is queued
  Growing = compaction falling behind

Compaction I/O rate:
  MB/s written by compaction
  Compare to write rate

L0 file count:
  Should stay below stall threshold
  Spikes = write bursts or slow compaction

Write stalls:
  Time spent waiting for compaction
  Should be near zero normally

RocksDBの統計

db.getProperty("rocksdb.compaction-pending")
db.getProperty("rocksdb.num-files-at-level0")
db.getProperty("rocksdb.estimate-pending-compaction-bytes")
db.getProperty("rocksdb.compaction-reason")

最適化

トリビアルムーブ

If L(n) file doesn't overlap with L(n+1):
  Just move file pointer, don't rewrite!

Huge win for sequential insert patterns

サブコンパクション

Split large compaction into parallel sub-ranges:

File [a-z] overlaps with 10 files in next level
Split into:
  [a-j] + overlapping → thread 1
  [k-z] + overlapping → thread 2

Parallel merge, faster completion

動的レベルターゲット

RocksDB's dynamic leveling:
  Adjust level targets based on actual data size
  Avoids extra compaction when data is small
  Grows naturally as data grows

重要なポイント

SSTableはイミュータブル - 一度書き込み、何度も読み取り
コンパクションは核心的なトレードオフ - 書き込み vs 読み取り vs 空間
サイズ階層型は書き込みに有利 - ただし空間を浪費します
レベル型は読み取りに有利 - ただし書き込み増幅が大きくなります
FIFOは時系列データ向け - 古いデータが自然に期限切れになる場合
L0は重要 - ファイルが多すぎると書き込みがストールします
ブルームフィルタは不可欠 - ディスク読み取りを劇的に削減します
コンパクション負債を監視する - 遅れるとパフォーマンスに影響します

SSTableとコンパクション ​

TL;DR ​

SSTableの構造 ​

ファイルレイアウト ​

データブロック ​

インデックスブロック ​

フッタ ​

SSTableの操作 ​

SSTableの作成（フラッシュ） ​

SSTableからの読み取り ​

範囲スキャン ​

コンパクション戦略 ​

サイズ階層型コンパクション（STCS） ​

レベル型コンパクション（LCS） ​

FIFOコンパクション ​

タイムウィンドウコンパクション（TWCS） ​

マージプロセス ​

N方向マージ ​

バージョンの処理 ​

トゥームストーンの処理 ​

コンパクションのスケジューリング ​

いつコンパクションするか ​

レート制限 ​

スレッドプール ​

コンパクションのトレードオフ ​

書き込み増幅の詳細 ​

空間増幅の詳細 ​

読み取り増幅の詳細 ​

コンパクションのモニタリング ​

主要メトリクス ​

RocksDBの統計 ​

最適化 ​

トリビアルムーブ ​

サブコンパクション ​

動的レベルターゲット ​

重要なポイント ​

SSTableとコンパクション

TL;DR

SSTableの構造

ファイルレイアウト

データブロック

インデックスブロック

フッタ

SSTableの操作

SSTableの作成（フラッシュ）

SSTableからの読み取り

範囲スキャン

コンパクション戦略

サイズ階層型コンパクション（STCS）

レベル型コンパクション（LCS）

FIFOコンパクション

タイムウィンドウコンパクション（TWCS）

マージプロセス

N方向マージ

バージョンの処理

トゥームストーンの処理

コンパクションのスケジューリング

いつコンパクションするか

レート制限

スレッドプール

コンパクションのトレードオフ

書き込み増幅の詳細

空間増幅の詳細

読み取り増幅の詳細

コンパクションのモニタリング

主要メトリクス

RocksDBの統計

最適化

トリビアルムーブ

サブコンパクション

動的レベルターゲット

重要なポイント