Search Relevance Tuning
Note: This article is a Japanese translation of the English original at 14-search-systems/05-search-relevance-tuning.md.
TL;DR
Search relevance tuning is the process of iteratively improving search quality through query understanding, evaluation metrics, and experimentation. It involves understanding user intent, measuring quality with metrics such as NDCG and MRR, running A/B tests, and improving continuously based on feedback. Good relevance tuning requires both offline analysis and online experimentation.
Relevance Engineering Loop
The Continuous Improvement Cycle
┌─────────────────────────────────────────────────────────────────┐
│ Relevance Engineering Loop │
│ │
│ ┌──────────────┐ │
│ │ MEASURE │ │
│ │ │ │
│ │ • Metrics │ │
│ │ • User │ │
│ │ feedback │ │
│ └──────┬───────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ │ │ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ ANALYZE │ │ │ DEPLOY │ │
│ │ │ │ │ │ │
│ │ • Failure │ │ │ • A/B test │ │
│ │ analysis │ │ │ • Gradual │ │
│ │ • Query │ │ │ rollout │ │
│ │ segments │ │ │ │ │
│ └──────┬───────┘ │ └──────▲───────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ │ │ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ HYPOTHESIZE │───────┴──────►│ BUILD │ │
│ │ │ │ │ │
│ │ • Root cause │ │ • Feature │ │
│ │ • Solutions │ │ changes │ │
│ │ │ │ • Model │ │
│ │ │ │ updates │ │
│ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

The relevance engineering loop improves search quality continuously through the cycle measure → analyze → hypothesize → build → deploy.
Query Understanding
Query Classification
class QueryClassifier:
"""
Classify queries to apply different ranking strategies
"""
def classify_intent(self, query):
"""
Query intent types:
- Navigational: User wants specific site ("facebook login")
- Informational: User wants information ("how to cook pasta")
- Transactional: User wants to do something ("buy iphone")
- Local: User wants nearby results ("pizza near me")
"""
features = self.extract_features(query)
return self.intent_model.predict(features)
def classify_complexity(self, query):
"""
Query complexity:
- Head: High volume, well-understood ("weather")
- Torso: Medium volume, some ambiguity ("python")
- Tail: Low volume, specific ("python asyncio connection pool timeout")
"""
query_freq = self.get_query_frequency(query)
if query_freq > 10000: # Per day
return "head"
elif query_freq > 100:
return "torso"
else:
return "tail"
def detect_modifiers(self, query):
"""
Detect query modifiers that affect ranking
"""
modifiers = {
'freshness': self.has_freshness_intent(query), # "latest", "2024"
'location': self.extract_location(query), # "in tokyo"
'price': self.has_price_intent(query), # "cheap", "under $50"
'comparison': self.has_comparison(query), # "vs", "versus"
'review': self.wants_reviews(query), # "review", "rating"
}
return modifiers
# Example usage
classifier = QueryClassifier()
queries = [
"facebook", # Navigational, head
"how to learn python", # Informational, head
"buy macbook pro 2024", # Transactional, torso
"best italian restaurant downtown", # Local, transactional
"asyncio semaphore timeout error", # Informational, tail
]
for query in queries:
intent = classifier.classify_intent(query)
complexity = classifier.classify_complexity(query)
mods = classifier.detect_modifiers(query)
print(f"{query}: {intent}, {complexity}, {mods}")クエリはインテント(ナビゲーション、情報取得、トランザクション、ローカル)と複雑度(ヘッド、トルソー、テール)に分類されます。異なるランキング戦略を適用するために使用します。
Query Rewriting
class QueryRewriter:
"""
Improve queries for better matching
"""
def expand_synonyms(self, query):
"""
Add synonyms to improve recall
"cheap hotel" → "cheap OR budget OR affordable hotel OR accommodation"
"""
tokens = tokenize(query)
expanded = []
for token in tokens:
synonyms = self.synonym_dict.get(token, [])
if synonyms:
expanded.append(f"({token} OR {' OR '.join(synonyms)})")
else:
expanded.append(token)
return ' '.join(expanded)
def fix_spelling(self, query):
"""
Correct typos while preserving intent
"pythn tutrial" → "python tutorial"
"""
corrections = []
for token in tokenize(query):
if token not in self.vocabulary:
suggestion = self.spell_checker.correct(token)
if self.is_confident(token, suggestion):
corrections.append(suggestion)
else:
corrections.append(token)
else:
corrections.append(token)
return ' '.join(corrections)
def segment_query(self, query):
"""
Identify meaningful segments
"new york pizza" → ["new york", "pizza"] not ["new", "york", "pizza"]
"""
tokens = tokenize(query)
segments = []
i = 0
while i < len(tokens):
# Try longer segments first
for length in range(min(4, len(tokens) - i), 0, -1):
candidate = ' '.join(tokens[i:i+length])
if candidate in self.known_phrases or self.is_entity(candidate):
segments.append(candidate)
i += length
break
else:
segments.append(tokens[i])
i += 1
return segments
def remove_stopwords(self, query, aggressive=False):
"""
Remove stopwords, but carefully
"the who" should not become "who" (band name)
"to be or not to be" should stay intact (famous quote)
"""
if query.lower() in self.protected_phrases:
return query
tokens = tokenize(query)
if aggressive:
return ' '.join(t for t in tokens if t.lower() not in self.stopwords)
else:
return ' '.join(
t for t in tokens
if t.lower() not in self.stopwords or self.is_meaningful_stopword(t, tokens)
        )

Query rewriting includes synonym expansion (to improve recall), spell correction (fixing typos while preserving intent), query segmentation (identifying meaningful segments), and stopword removal (done carefully so the meaning is preserved).
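QueryRewriter assumes an external synonym_dict and tokenizer. As a minimal self-contained sketch of just the synonym-expansion step, with a hypothetical hand-built dictionary:

# Minimal synonym expansion; SYNONYMS is a hypothetical hand-built dictionary.
# Production systems mine synonyms from query logs or embeddings.
SYNONYMS = {
    "cheap": ["budget", "affordable"],
    "hotel": ["accommodation"],
}

def expand_synonyms_simple(query: str) -> str:
    expanded = []
    for token in query.split():
        syns = SYNONYMS.get(token.lower(), [])
        if syns:
            expanded.append("(" + " OR ".join([token] + syns) + ")")
        else:
            expanded.append(token)
    return " ".join(expanded)

print(expand_synonyms_simple("cheap hotel"))
# (cheap OR budget OR affordable) (hotel OR accommodation)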
Query Relaxation
def relax_query_if_no_results(original_query, search_func):
"""
Progressively relax query to find results
"""
    relaxation_strategies = [
        # Level 0: Try the exact query
        lambda q: q,
        # Level 1: Remove quotes (phrase → terms)
        lambda q: q.replace('"', ''),
        # Level 2: Remove filters
        lambda q: remove_filters(q),
        # Level 3: Keep only the most important terms
        lambda q: keep_important_terms(q, top_k=3),
        # Level 4: Spell correction
        lambda q: spell_correct(q),
        # Level 5: Synonym expansion
        lambda q: expand_synonyms(q),
        # Level 6: Semantic search (vector)
        lambda q: f"~semantic:{q}",
    ]
for i, strategy in enumerate(relaxation_strategies):
relaxed = strategy(original_query)
results = search_func(relaxed)
if results:
if i > 0:
log_relaxation(original_query, relaxed, i)
return results, relaxed, i
return [], original_query, len(relaxation_strategies)
# Example
query = '"exact phrase match" site:example.com filetype:pdf'
# Level 0 (exact query): No results
# Level 1 (quotes removed): exact phrase match site:example.com filetype:pdf → No results
# Level 2 (filters removed): "exact phrase match" → No results
# Level 3 (top-3 important terms): exact phrase match → Found 5 results

When a query returns no results, relax it progressively until something is found: exact match → remove quotes → remove filters → keep only the most important terms → spell correction → synonym expansion → semantic search. Note that each strategy is applied to the original query, not to the previous level's output. Minimal stand-ins for the helper functions follow.
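The relaxation sketch calls helpers the article never defines; hypothetical minimal stand-ins, just to make the example executable:

import re

# Hypothetical stand-ins for the undefined helpers used above.
def remove_filters(q):
    # Strip operator filters such as site: and filetype:
    return re.sub(r'\b\w+:\S+', '', q).strip()

def keep_important_terms(q, top_k=3):
    # Naive importance: keep the longest terms (a real system would use IDF)
    important = set(sorted(q.split(), key=len, reverse=True)[:top_k])
    return ' '.join(t for t in q.split() if t in important)

print(remove_filters('"exact phrase match" site:example.com filetype:pdf'))
# "exact phrase match"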
Offline Evaluation
Building a Test Collection
class RelevanceTestCollection:
"""
Curated set of queries with judged results
"""
def __init__(self):
self.queries = [] # List of test queries
self.judgments = {} # query → {doc_id: relevance_label}
def add_query(self, query, judged_docs):
"""
Add a query with relevance judgments
judgments: {doc_id: grade}
grades: 0=Bad, 1=Fair, 2=Good, 3=Excellent, 4=Perfect
"""
self.queries.append(query)
self.judgments[query] = judged_docs
def sample_queries_for_judging(self, query_log, n_samples=1000):
"""
Sample representative queries for human judging
"""
# Stratified sampling by query frequency
head_queries = sample_by_frequency(query_log, 'head', n_samples // 3)
torso_queries = sample_by_frequency(query_log, 'torso', n_samples // 3)
tail_queries = sample_by_frequency(query_log, 'tail', n_samples // 3)
return head_queries + torso_queries + tail_queries
def pool_documents_for_judging(self, query, systems, k=100):
"""
Pool top results from multiple systems for judging
Ensures we judge documents that any system might return
"""
pooled = set()
for system in systems:
results = system.search(query, top_k=k)
pooled.update(doc.id for doc in results)
return list(pooled)
# Judging guidelines
JUDGING_GUIDELINES = """
4 - Perfect: Exact answer to query, authoritative source
3 - Excellent: Highly relevant, comprehensive answer
2 - Good: Relevant, addresses query but not completely
1 - Fair: Marginally relevant, tangentially related
0 - Bad: Not relevant, spam, or broken
"""テストコレクションは、ヘッド、トルソー、テールクエリの層化サンプリングで代表的なクエリを収集し、複数のシステムの上位結果をプールして人手で判定します。
Computing Metrics
import numpy as np
from collections import defaultdict
class RelevanceMetrics:
@staticmethod
def precision_at_k(retrieved, relevant, k):
"""
Fraction of top-k that are relevant
"""
retrieved_k = retrieved[:k]
relevant_in_k = sum(1 for doc in retrieved_k if doc in relevant)
return relevant_in_k / k
    @staticmethod
    def recall_at_k(retrieved, relevant, k):
        """
        Fraction of relevant docs in top-k
        """
        if not relevant:  # guard against division by zero
            return 0.0
        retrieved_k = set(retrieved[:k])
        return len(retrieved_k & relevant) / len(relevant)
@staticmethod
def average_precision(retrieved, relevant):
"""
Average of precision@k for each relevant doc
"""
if not relevant:
return 0
precisions = []
relevant_found = 0
for i, doc in enumerate(retrieved, 1):
if doc in relevant:
relevant_found += 1
precisions.append(relevant_found / i)
return sum(precisions) / len(relevant)
@staticmethod
def ndcg_at_k(retrieved, judgments, k):
"""
Normalized Discounted Cumulative Gain
"""
def dcg(rels):
return sum(
(2 ** rel - 1) / np.log2(i + 2)
for i, rel in enumerate(rels)
)
# Actual relevance scores
actual_rels = [judgments.get(doc, 0) for doc in retrieved[:k]]
# Ideal relevance scores
ideal_rels = sorted(judgments.values(), reverse=True)[:k]
if dcg(ideal_rels) == 0:
return 0
return dcg(actual_rels) / dcg(ideal_rels)
@staticmethod
def err(retrieved, judgments, max_grade=4):
"""
Expected Reciprocal Rank
Models user stopping after finding satisfying result
"""
p_stop = 0
err_value = 0
for i, doc in enumerate(retrieved, 1):
grade = judgments.get(doc, 0)
p_satisfy = (2 ** grade - 1) / (2 ** max_grade)
err_value += (1 - p_stop) * p_satisfy / i
p_stop += (1 - p_stop) * p_satisfy
return err_value
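# Worked NDCG@3 example with made-up judgments, to make the formula concrete:
#   judgments = {'d1': 3, 'd2': 1, 'd3': 0, 'd4': 2}
#   retrieved = ['d3', 'd1', 'd4']
#   DCG  = (2^0-1)/log2(2) + (2^3-1)/log2(3) + (2^2-1)/log2(4)
#        = 0 + 7/1.585 + 3/2 ≈ 5.92
#   IDCG = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^1-1)/log2(4)
#        = 7 + 3/1.585 + 1/2 ≈ 9.39
#   NDCG@3 = 5.92 / 9.39 ≈ 0.63 (the ideal order d1, d4, d2 would score 1.0)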
def evaluate_system(system, test_collection, metrics=['ndcg@10', 'map', 'mrr']):
"""
Evaluate a search system on test collection
"""
results = defaultdict(list)
for query in test_collection.queries:
retrieved = system.search(query, top_k=100)
judgments = test_collection.judgments[query]
relevant = {doc for doc, grade in judgments.items() if grade >= 2}
if 'ndcg@10' in metrics:
results['ndcg@10'].append(
RelevanceMetrics.ndcg_at_k(retrieved, judgments, 10)
)
if 'map' in metrics:
results['map'].append(
RelevanceMetrics.average_precision(retrieved, relevant)
)
if 'mrr' in metrics:
rr = 0
for i, doc in enumerate(retrieved, 1):
if doc in relevant:
rr = 1 / i
break
results['mrr'].append(rr)
# Aggregate
    return {metric: np.mean(values) for metric, values in results.items()}

Failure Analysis
class FailureAnalyzer:
"""
Analyze where and why search fails
"""
def find_failures(self, system, test_collection, threshold=0.5):
"""
Find queries where system performs poorly
"""
failures = []
for query in test_collection.queries:
retrieved = system.search(query, top_k=10)
judgments = test_collection.judgments[query]
ndcg = RelevanceMetrics.ndcg_at_k(retrieved, judgments, 10)
if ndcg < threshold:
failures.append({
'query': query,
'ndcg': ndcg,
'retrieved': retrieved,
'judgments': judgments
})
return failures
def categorize_failure(self, failure):
"""
Categorize the type of failure
"""
query = failure['query']
retrieved = failure['retrieved']
judgments = failure['judgments']
relevant_docs = {d for d, g in judgments.items() if g >= 2}
perfect_docs = {d for d, g in judgments.items() if g >= 4}
        if not any(d in relevant_docs for d in retrieved):
return "RECALL_FAILURE"
if perfect_docs and not any(d in perfect_docs for d in retrieved[:3]):
return "RANKING_FAILURE"
if len(query.split()) > 5:
return "LONG_QUERY"
if self.has_ambiguous_intent(query):
return "AMBIGUOUS_QUERY"
if self.is_entity_query(query) and not self.entity_matched(query, retrieved):
return "ENTITY_RECOGNITION"
return "OTHER"
def suggest_fixes(self, category, examples):
"""
Suggest fixes for each failure category
"""
suggestions = {
'RECALL_FAILURE': [
"Check if relevant documents are indexed",
"Review tokenization and stemming",
"Add synonym expansion",
"Consider query relaxation"
],
'RANKING_FAILURE': [
"Review ranking features",
"Check if BM25 parameters are tuned",
"Add document quality signals",
"Consider learning-to-rank model"
],
'LONG_QUERY': [
"Implement query segmentation",
"Add phrase matching",
"Consider semantic similarity"
],
'AMBIGUOUS_QUERY': [
"Add query classification",
"Implement result diversification",
"Consider personalization"
],
'ENTITY_RECOGNITION': [
"Improve NER model",
"Add entity synonyms to index",
"Consider knowledge graph"
]
}
        return suggestions.get(category, ["Investigate further"])

Failures fall into these categories: RECALL_FAILURE (relevant documents are never retrieved), RANKING_FAILURE (perfect documents rank too low), LONG_QUERY (complex queries handled poorly), AMBIGUOUS_QUERY (queries with multiple interpretations), and ENTITY_RECOGNITION (entities not recognized).
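Putting the analyzer to work might look like this (a usage sketch, reusing the system and test_collection objects from the evaluation section):

analyzer = FailureAnalyzer()
for failure in analyzer.find_failures(system, test_collection, threshold=0.5):
    category = analyzer.categorize_failure(failure)
    print(failure['query'], category)
    for suggestion in analyzer.suggest_fixes(category, [failure]):
        print("  -", suggestion)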
Online Experimentation (A/B Testing)
Experiment Design
import hashlib
from collections import defaultdict

class SearchExperiment:
"""
A/B test for search changes
"""
def __init__(self, name, hypothesis, metric_targets):
self.name = name
self.hypothesis = hypothesis
self.metric_targets = metric_targets
self.variants = {}
def add_variant(self, name, config, allocation):
"""
Add experiment variant
allocation: Percentage of traffic (0-100)
"""
self.variants[name] = {
'config': config,
'allocation': allocation,
'users': set(),
'impressions': 0,
'clicks': 0,
'metrics': defaultdict(list)
}
def assign_user(self, user_id):
"""
Deterministically assign user to variant
        Use a stable hash for consistent assignment
        (Python's built-in hash() is salted per process, so use hashlib)
        """
        digest = hashlib.md5(f"{self.name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
cumulative = 0
for name, variant in self.variants.items():
cumulative += variant['allocation']
if hash_val < cumulative:
variant['users'].add(user_id)
return name
return 'control'
# Example experiment setup
experiment = SearchExperiment(
name="bm25_k1_tuning",
hypothesis="Increasing k1 from 1.2 to 1.5 will improve relevance for long documents",
metric_targets={'ndcg@10': 0.02, 'ctr': 0.01}
)
experiment.add_variant('control', {'bm25_k1': 1.2}, allocation=50)
experiment.add_variant('treatment', {'bm25_k1': 1.5}, allocation=50)

Statistical Analysis
import scipy.stats as stats
import numpy as np
class ExperimentAnalyzer:
def analyze_experiment(self, experiment, min_samples=1000):
"""
Analyze A/B test results
"""
control = experiment.variants['control']
treatment = experiment.variants['treatment']
if control['impressions'] < min_samples:
return {'status': 'INSUFFICIENT_DATA'}
results = {}
# CTR analysis
control_ctr = control['clicks'] / control['impressions']
treatment_ctr = treatment['clicks'] / treatment['impressions']
ctr_lift = (treatment_ctr - control_ctr) / control_ctr
ctr_pvalue = self.proportion_test(
control['clicks'], control['impressions'],
treatment['clicks'], treatment['impressions']
)
results['ctr'] = {
'control': control_ctr,
'treatment': treatment_ctr,
'lift': ctr_lift,
'p_value': ctr_pvalue,
'significant': ctr_pvalue < 0.05
}
return results
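    def proportion_test(self, clicks_a, n_a, clicks_b, n_b):
        # The article leaves this helper undefined; an assumed implementation
        # as a standard two-proportion z-test, returning a two-sided p-value.
        p_pool = (clicks_a + clicks_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (clicks_b / n_b - clicks_a / n_a) / se
        return 2 * (1 - stats.norm.cdf(abs(z)))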
def calculate_sample_size(self, baseline_rate, mde, alpha=0.05, power=0.8):
"""
Calculate required sample size for experiment
mde: Minimum Detectable Effect (relative)
"""
effect = baseline_rate * mde
z_alpha = stats.norm.ppf(1 - alpha/2)
z_beta = stats.norm.ppf(power)
p1 = baseline_rate
p2 = baseline_rate + effect
p_avg = (p1 + p2) / 2
n = (
(z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
) / (effect ** 2)
return int(np.ceil(n))
# Example
# Baseline CTR: 5%, want to detect 5% relative improvement
analyzer = ExperimentAnalyzer()
sample_size = analyzer.calculate_sample_size(
baseline_rate=0.05,
mde=0.05,
alpha=0.05,
power=0.8
)
print(f"Need {sample_size} samples per variant") # ~62,000インターリービング
import scipy.stats as stats

class InterleavedExperiment:
"""
Interleaving: More sensitive than A/B testing
Show results from both systems interleaved,
measure which system's results get more clicks
"""
def interleave_team_draft(self, results_a, results_b, k=10):
"""
Team Draft interleaving
Alternately pick from each team, avoid duplicates
"""
interleaved = []
team_assignments = {}
ptr_a, ptr_b = 0, 0
turn = 'A'
        while len(interleaved) < k:
            # Stop when both ranked lists are exhausted (avoids an infinite loop)
            if ptr_a >= len(results_a) and ptr_b >= len(results_b):
                break
if turn == 'A':
while ptr_a < len(results_a):
doc = results_a[ptr_a]
ptr_a += 1
if doc.id not in team_assignments:
interleaved.append(doc)
team_assignments[doc.id] = 'A'
break
turn = 'B'
else:
while ptr_b < len(results_b):
doc = results_b[ptr_b]
ptr_b += 1
if doc.id not in team_assignments:
interleaved.append(doc)
team_assignments[doc.id] = 'B'
break
turn = 'A'
return interleaved, team_assignments
def analyze_interleaving(self, impressions):
"""
Analyze interleaving results
"""
wins_a = 0
wins_b = 0
ties = 0
for assignments, clicks in impressions:
clicks_a = sum(1 for c in clicks if assignments.get(c) == 'A')
clicks_b = sum(1 for c in clicks if assignments.get(c) == 'B')
if clicks_a > clicks_b:
wins_a += 1
elif clicks_b > clicks_a:
wins_b += 1
else:
ties += 1
total = wins_a + wins_b + ties
decisive = wins_a + wins_b
if decisive > 0:
            p_value = stats.binomtest(wins_a, decisive, 0.5).pvalue
else:
p_value = 1.0
return {
'wins_a': wins_a,
'wins_b': wins_b,
'ties': ties,
'p_value': p_value,
'winner': 'A' if wins_a > wins_b else ('B' if wins_b > wins_a else 'TIE')
        }

Interleaving is more sensitive than A/B testing: results from both systems are shown interleaved, and we measure which system's results attract more clicks.
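A usage sketch, with a hypothetical namedtuple standing in for result objects:

from collections import namedtuple

Doc = namedtuple('Doc', ['id'])

exp = InterleavedExperiment()
ranking_a = [Doc('d1'), Doc('d2'), Doc('d3')]
ranking_b = [Doc('d2'), Doc('d4'), Doc('d5')]

interleaved, teams = exp.interleave_team_draft(ranking_a, ranking_b, k=4)
print([d.id for d in interleaved])  # ['d1', 'd2', 'd3', 'd4']
print(teams)  # {'d1': 'A', 'd2': 'B', 'd3': 'A', 'd4': 'B'}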
Tuning Techniques
BM25 Parameter Tuning
from sklearn.model_selection import ParameterGrid
import numpy as np
def tune_bm25_parameters(search_index, test_collection, param_grid=None):
"""
Grid search for BM25 parameters
k1: Controls term frequency saturation (1.0-2.0)
b: Controls document length normalization (0.0-1.0)
"""
if param_grid is None:
param_grid = {
'k1': [0.5, 0.75, 1.0, 1.2, 1.5, 2.0],
'b': [0.0, 0.25, 0.5, 0.75, 1.0]
}
best_params = None
best_ndcg = 0
results = []
for params in ParameterGrid(param_grid):
search_index.set_bm25_params(**params)
ndcgs = []
for query in test_collection.queries:
retrieved = search_index.search(query, top_k=10)
judgments = test_collection.judgments[query]
ndcg = RelevanceMetrics.ndcg_at_k(retrieved, judgments, 10)
ndcgs.append(ndcg)
avg_ndcg = np.mean(ndcgs)
results.append({'params': params, 'ndcg': avg_ndcg})
if avg_ndcg > best_ndcg:
best_ndcg = avg_ndcg
best_params = params
return best_params, results
# Typical findings:
# - k1=1.2-1.5 works well for most corpora
# - b=0.75 is a good default (standard BM25)
# - Lower b (0.3-0.5) for collections with high length variance
# - Higher k1 (1.5-2.0) when term frequency is important

Typical findings: k1 = 1.2-1.5 works well for most corpora, b = 0.75 is a good default (standard BM25), a lower b (0.3-0.5) suits collections with high length variance, and a higher k1 (1.5-2.0) helps when term frequency matters.
Field Boosting
# Example Elasticsearch query with field boosting
BOOSTED_QUERY = {
"query": {
"multi_match": {
"query": "python tutorial",
"fields": [
"title^3", # Title matches worth 3x
"body", # Body is baseline
"description^1.5",
"tags^2"
],
"type": "best_fields",
"tie_breaker": 0.3
}
}
}

Function Score Tuning
# Elasticsearch function_score example
FRESHNESS_BOOST_QUERY = {
"query": {
"function_score": {
"query": {"match": {"content": "python"}},
"functions": [
{
"exp": {
"date": {
"origin": "now",
"scale": "30d",
"decay": 0.5
}
},
"weight": 1.5
},
{
"field_value_factor": {
"field": "popularity",
"modifier": "log1p",
"factor": 0.1
}
}
],
"score_mode": "sum",
"boost_mode": "multiply"
}
}
}

Freshness boosts can use different decay functions: exponential, Gaussian, or linear. Popularity boosts use a field value factor. A sketch of how the three decay shapes compare follows.
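As a rough comparison of the three decay shapes, here they are computed the way Elasticsearch defines them (scale=30d, decay=0.5, matching the query above; the helper names are ours):

import math

def exp_decay(age_days, scale=30.0, decay=0.5):
    # Score reaches `decay` at `scale` days from origin
    return math.exp(math.log(decay) / scale * age_days)

def gauss_decay(age_days, scale=30.0, decay=0.5):
    sigma2 = -scale ** 2 / (2 * math.log(decay))
    return math.exp(-age_days ** 2 / (2 * sigma2))

def linear_decay(age_days, scale=30.0, decay=0.5):
    s = scale / (1 - decay)
    return max(0.0, (s - age_days) / s)

for age in (0, 30, 90):
    print(age, round(exp_decay(age), 3), round(gauss_decay(age), 3), round(linear_decay(age), 3))
# 0  1.0   1.0   1.0    -> all start at full score
# 30 0.5   0.5   0.5    -> all reach `decay` at `scale`
# 90 0.125 0.002 0.0    -> exponential has the gentlest tail; linear cuts off

Exponential keeps old documents competitive, Gaussian penalizes them sharply beyond the scale, and linear zeroes them out entirely.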
Monitoring and Alerting
Key Metrics to Monitor
import numpy as np

class SearchMonitor:
"""
Monitor search quality in production
"""
def __init__(self):
self.metrics_store = MetricsStore()
def track_search(self, query, results, latency, user_id=None):
"""
Track metrics for each search
"""
metrics = {
# Performance
'latency_ms': latency,
'result_count': len(results),
# Quality signals
'zero_results': len(results) == 0,
'top_score': results[0].score if results else 0,
'score_gap': self.score_gap(results),
# Query characteristics
'query_length': len(query.split()),
'query_type': classify_query(query),
}
self.metrics_store.record(metrics)
def get_dashboard_metrics(self, time_range='1h'):
"""
Aggregate metrics for dashboard
"""
        data = self.metrics_store.get_range(time_range)
        seconds = {'1h': 3600, '24h': 86400}[time_range]  # simple duration lookup
        return {
            # Performance
            'p50_latency': np.percentile(data['latency_ms'], 50),
            'p99_latency': np.percentile(data['latency_ms'], 99),
            'qps': len(data) / seconds,
# Quality
'zero_result_rate': np.mean(data['zero_results']),
'ctr': np.mean(data['clicked']),
'avg_click_position': np.mean(data['click_position']),
'satisfaction_rate': np.mean(data['satisfied']),
# Trends
'ctr_change': self.calculate_change('ctr', time_range),
'zero_result_change': self.calculate_change('zero_result_rate', time_range),
}
# Alert thresholds
ALERT_THRESHOLDS = {
'zero_result_rate': {'warning': 0.05, 'critical': 0.10},
'p99_latency': {'warning': 500, 'critical': 1000}, # ms
'ctr': {'warning_drop': 0.10, 'critical_drop': 0.20}, # Relative change
}

Automated Quality Checks
class QualityChecker:
"""
Automated checks for search quality regressions
"""
def __init__(self, golden_queries):
"""
golden_queries: List of (query, expected_top_results)
"""
self.golden_queries = golden_queries
def run_golden_query_check(self, search_func):
"""
Check if expected results still appear in top positions
"""
failures = []
for query, expected in self.golden_queries:
results = search_func(query, top_k=10)
result_ids = [r.id for r in results]
for expected_doc, expected_position in expected:
if expected_doc not in result_ids:
failures.append({
'query': query,
'expected_doc': expected_doc,
'error': 'MISSING'
})
else:
actual_pos = result_ids.index(expected_doc)
if actual_pos > expected_position + 2:
failures.append({
'query': query,
'expected_doc': expected_doc,
'expected_position': expected_position,
'actual_position': actual_pos,
'error': 'POSITION_DROP'
})
        return failures

Golden query checks verify that expected results still appear in the top positions. Regression tests compare the new system against the old one to catch quality regressions; a sketch of such a comparison follows.
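The regression comparison is not shown in the original code; a minimal sketch, assuming both systems expose the same search interface and reusing RelevanceMetrics from the evaluation section:

def compare_systems(old_system, new_system, test_collection, max_loss=0.02):
    """Flag queries where the new system is notably worse than the old one."""
    regressions = []
    for query in test_collection.queries:
        judgments = test_collection.judgments[query]
        old_ndcg = RelevanceMetrics.ndcg_at_k(
            old_system.search(query, top_k=10), judgments, 10)
        new_ndcg = RelevanceMetrics.ndcg_at_k(
            new_system.search(query, top_k=10), judgments, 10)
        if old_ndcg - new_ndcg > max_loss:
            regressions.append({'query': query, 'old': old_ndcg, 'new': new_ndcg})
    return regressions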
Best Practices
Query Understanding:
□ Implement spell correction with confidence thresholds
□ Use query classification for different ranking strategies
□ Segment queries to identify entities and phrases
□ Have a query relaxation strategy for zero results
Offline Evaluation:
□ Build representative test collection (head + torso + tail)
□ Use graded relevance judgments (not just binary)
□ Report multiple metrics (NDCG, MRR, MAP)
□ Segment analysis by query type and complexity
Online Experimentation:
□ Calculate required sample size before launching
□ Use proper randomization (user-level, not request-level)
□ Guard against novelty effects (run for 2+ weeks)
□ Consider interleaving for sensitivity
Monitoring:
□ Track leading indicators (zero results, latency)
□ Set up alerts with clear ownership
□ Run automated golden query checks
□ Regular manual search quality reviews
Iteration:
□ Prioritize fixes by impact × effort
□ Document all changes and their effects
□ Build regression test suite over time
□ Establish a regular tuning cadence

Query understanding: implement spell correction with confidence thresholds, use query classification to drive different ranking strategies, and keep a query relaxation strategy for zero-result queries.
Offline evaluation: build a representative test collection, use graded relevance judgments (not just binary), report multiple metrics, and segment the analysis by query type and complexity.
Online experimentation: calculate the required sample size before launching, randomize properly (user-level, not request-level), guard against novelty effects (run for two weeks or more), and consider interleaving for its sensitivity.
Monitoring: track leading indicators, set up alerts with clear ownership, run automated golden query checks, and review search quality manually on a regular basis.