Module 10: Metadata and Hybrid Search
Combining Semantic and Keyword Search
Introduction
Pure semantic search can miss exact terms, and pure keyword search can miss meaning. Hybrid search runs both and merges the results so that each covers the other's blind spots.
By the end of this module, you'll understand:
- How to design effective metadata schemas
- What hybrid search is and why it matters
- How to implement hybrid search
- When to use each approach
10.1 Metadata Schema Design
What is Metadata?
Metadata is structured information stored alongside vectors:
{
id: 'doc-123',
values: [0.1, -0.2, ...], // Vector
metadata: { // Metadata
title: 'Introduction to Vector Databases',
author: 'Jane Smith',
category: 'technology',
publishDate: '2024-01-15',
wordCount: 1500,
tags: ['databases', 'ai', 'search'],
source: 'blog'
}
}
Metadata Best Practices
1. Keep it Lean
// Bad: Too much data
metadata: {
fullContent: '10,000 words...', // Store separately
base64Image: 'data:image/...', // Don't do this
nestedObject: { deep: { nested: { data: '...' } } } // Avoid deep nesting
}
// Good: Just what you need for filtering
metadata: {
title: 'Document Title',
category: 'tech',
authorId: 'author-123', // Reference, not full data
publishDate: '2024-01-15'
}
2. Use Consistent Types
// Bad: Inconsistent types
metadata: { price: "29.99" } // String
metadata: { price: 29.99 } // Number
metadata: { price: "$29.99" } // Different format
// Good: Consistent types
metadata: { price: 29.99 } // Always number
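One way to keep types consistent is to normalize values before upserting. The sketch below uses a hypothetical `normalizePrice` helper (not part of any database client) and assumes prices use `.` as the decimal separator:

```typescript
// Hypothetical helper: coerce mixed price inputs to a number before upserting.
// Assumes "." is the decimal separator; "1.299,50" style input is not handled.
function normalizePrice(raw: string | number): number {
  if (typeof raw === 'number') {
    if (!Number.isFinite(raw)) throw new Error(`Invalid price: ${raw}`)
    return raw
  }
  // Strip currency symbols, thousands separators, and whitespace: "$1,299.50" -> "1299.50"
  const cleaned = raw.replace(/[^0-9.-]/g, '')
  const value = Number.parseFloat(cleaned)
  if (!Number.isFinite(value)) throw new Error(`Invalid price: ${raw}`)
  return value
}

// The three inconsistent inputs from the "Bad" example all become the number 29.99
const prices = ['29.99', 29.99, '$29.99'].map(p => normalizePrice(p))
```

Rejecting unparseable values at write time is a design choice here; silently storing `NaN` or a default would make range filters unreliable later.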
3. Flatten When Possible
// Bad: Nested structure
metadata: {
author: { name: 'Jane', id: '123' }
}
// Good: Flat structure
metadata: {
authorName: 'Jane',
authorId: '123'
}
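If source records arrive nested, they can be flattened before upserting. This is a minimal sketch with a hypothetical `flattenMetadata` helper; the camelCase key naming (`author.name` becomes `authorName`) is an assumption chosen to match the example above:

```typescript
type Primitive = string | number | boolean
type FlatMetadata = Record<string, Primitive | string[]>

// Hypothetical helper: flatten nested objects into camelCase keys.
// Arrays are kept as arrays of strings; null and undefined values are dropped.
function flattenMetadata(
  input: Record<string, unknown>,
  prefix = ''
): FlatMetadata {
  const flat: FlatMetadata = {}
  for (const [key, value] of Object.entries(input)) {
    const name = prefix
      ? prefix + key.charAt(0).toUpperCase() + key.slice(1)
      : key
    if (Array.isArray(value)) {
      flat[name] = value.map(String)
    } else if (value !== null && typeof value === 'object') {
      // Recurse into nested objects, carrying the parent key as the prefix
      Object.assign(flat, flattenMetadata(value as Record<string, unknown>, name))
    } else if (
      typeof value === 'string' ||
      typeof value === 'number' ||
      typeof value === 'boolean'
    ) {
      flat[name] = value
    }
  }
  return flat
}

const flat = flattenMetadata({
  author: { name: 'Jane', id: '123' },
  tags: ['ai', 'ml']
})
// flat is { authorName: 'Jane', authorId: '123', tags: ['ai', 'ml'] }
```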
4. Index-Friendly Fields
// Good for filtering
metadata: {
category: 'technology', // Exact match
price: 29.99, // Range queries
inStock: true, // Boolean filter
tags: ['ai', 'ml'] // Array contains
}
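To show how these field types map to filter operations, here is an in-memory sketch of a filter matcher. The `Filter` type and `matchesFilter` function are illustrative only; the `$eq`, `$gte`, `$lte`, and `$in` operator names follow the MongoDB-style syntax that several vector databases accept, but exact filter syntax varies by database:

```typescript
type Value = string | number | boolean | string[]
type Metadata = Record<string, Value>
type Condition = {
  $eq?: string | number | boolean
  $gte?: number
  $lte?: number
  $in?: string[]
}
type Filter = Record<string, Condition>

// Returns true only if every field condition in the filter holds
function matchesFilter(metadata: Metadata, filter: Filter): boolean {
  return Object.entries(filter).every(([field, cond]) => {
    const value = metadata[field]
    if (value === undefined) return false
    if (cond.$eq !== undefined && value !== cond.$eq) return false
    if (cond.$gte !== undefined && !(typeof value === 'number' && value >= cond.$gte)) return false
    if (cond.$lte !== undefined && !(typeof value === 'number' && value <= cond.$lte)) return false
    const allowed = cond.$in
    if (allowed !== undefined) {
      // Array fields match if any element is allowed; scalars match on the value itself
      const candidates = Array.isArray(value) ? value : [String(value)]
      if (!candidates.some(v => allowed.includes(v))) return false
    }
    return true
  })
}

const doc: Metadata = { category: 'technology', price: 29.99, inStock: true, tags: ['ai', 'ml'] }
const filter: Filter = {
  category: { $eq: 'technology' }, // Exact match
  price: { $lte: 50 },             // Range query
  inStock: { $eq: true },          // Boolean filter
  tags: { $in: ['ai'] }            // Array contains
}
const isMatch = matchesFilter(doc, filter) // true for this document
```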
Metadata Size Limits
| Database | Metadata Limit |
|---|---|
| Pinecone | 40KB per vector |
| Qdrant | No hard limit |
| Weaviate | No hard limit |
| Chroma | No hard limit |
| pgvector | PostgreSQL limits |
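A size check before upserting can catch oversized metadata early. The sketch below is an estimate only: it measures the UTF-8 size of the JSON serialization, which may differ from how a given database counts bytes, and it reads "40KB" conservatively as 40,000 bytes. The function names are hypothetical:

```typescript
// Assumed limit for illustration: 40 KB, read conservatively as 40,000 bytes
const MAX_METADATA_BYTES = 40000

// Size of the metadata once serialized as UTF-8 JSON
function metadataSizeBytes(metadata: Record<string, unknown>): number {
  return new TextEncoder().encode(JSON.stringify(metadata)).length
}

// Throws if the serialized metadata exceeds the limit
function assertMetadataFits(
  metadata: Record<string, unknown>,
  limit: number = MAX_METADATA_BYTES
): void {
  const size = metadataSizeBytes(metadata)
  if (size > limit) {
    throw new Error(`Metadata is ${size} bytes, over the ${limit}-byte limit`)
  }
}
```

Calling `assertMetadataFits(record.metadata)` just before an upsert turns a late database error into an early, descriptive one.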
10.2 The Semantic vs Keyword Problem
When Semantic Search Fails
// Query: "error code 12345"
// Semantic search might find:
// - "How to fix common errors" (conceptually related)
// - "Troubleshooting guide" (conceptually related)
// But user wanted:
// - "Error 12345: Database connection failed" (exact match)
Semantic search understands meaning but can miss:
- Exact matches (product codes, error numbers)
- Proper nouns
- Technical terms
- Specific phrases
When Keyword Search Fails
// Query: "laptop for video editing"
// Keyword search finds:
// - "Laptop stands" (contains "laptop")
// - "Video editing tutorial" (contains "video editing")
// But user wanted:
// - "MacBook Pro M3 for creative professionals"
// - "High-performance workstation"
Keyword search matches words but misses:
- Synonyms
- Conceptual relationships
- Natural language variations
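The keyword failure above can be reproduced with a toy term-overlap scorer. Real full-text engines use BM25 or similar ranking with stemming and stop-word handling; this sketch only counts shared tokens, and the `tokenize` and `keywordOverlap` names are made up for illustration:

```typescript
// Lowercase and split on anything that is not a letter or digit
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0)
}

// Toy scorer: fraction of query terms that appear verbatim in the document
function keywordOverlap(query: string, doc: string): number {
  const docTerms = new Set(tokenize(doc))
  const queryTerms = tokenize(query)
  if (queryTerms.length === 0) return 0
  const hits = queryTerms.filter(t => docTerms.has(t)).length
  return hits / queryTerms.length
}

const query = 'laptop for video editing'
const docs = [
  'Laptop stands',
  'Video editing tutorial',
  'MacBook Pro M3 for creative professionals'
]
const scored = docs.map(doc => ({ doc, score: keywordOverlap(query, doc) }))
// 'Video editing tutorial' scores 0.5 and 'Laptop stands' scores 0.25.
// The MacBook title also scores 0.25, but only because it shares the word "for".
```

The document the user actually wanted ties with an irrelevant one, and only through a filler word, which is the gap semantic search is meant to close.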
The Solution: Hybrid Search
Combine both approaches:
- Run semantic (vector) search
- Run keyword (full-text) search
- Merge and re-rank results
10.3 Implementing Hybrid Search
Basic Approach: Score Fusion
async function hybridSearch(
query: string,
alpha: number = 0.5 // Weight: 0 = all keyword, 1 = all semantic
): Promise<Result[]> {
// Run both searches in parallel
const [semanticResults, keywordResults] = await Promise.all([
semanticSearch(query, 50),
keywordSearch(query, 50)
])
// Normalize scores to 0-1 range
const normalizedSemantic = normalizeScores(semanticResults)
const normalizedKeyword = normalizeScores(keywordResults)
// Merge results
const merged = new Map<string, { semantic: number; keyword: number }>()
for (const r of normalizedSemantic) {
merged.set(r.id, { semantic: r.score, keyword: 0 })
}
for (const r of normalizedKeyword) {
const existing = merged.get(r.id)
if (existing) {
existing.keyword = r.score
} else {
merged.set(r.id, { semantic: 0, keyword: r.score })
}
}
// Calculate combined scores
const combined = Array.from(merged.entries()).map(([id, scores]) => ({
id,
score: alpha * scores.semantic + (1 - alpha) * scores.keyword
}))
return combined.sort((a, b) => b.score - a.score).slice(0, 10)
}
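function normalizeScores(results: Result[]): Result[] {
if (results.length === 0) return []
const max = Math.max(...results.map(r => r.score))
const min = Math.min(...results.map(r => r.score))
const range = max - min
return results.map(r => ({
...r,
// Min-max scaling. If every score is equal (e.g. a single result),
// score each as 1 rather than dividing by zero or zeroing it out.
score: range === 0 ? 1 : (r.score - min) / range
}))
}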
Reciprocal Rank Fusion (RRF)
A merging strategy that uses each document's rank in every result list instead of its raw score, so no score normalization is needed:
function reciprocalRankFusion(
resultSets: Result[][],
k: number = 60
): Result[] {
const scores = new Map<string, number>()
for (const results of resultSets) {
for (let rank = 0; rank < results.length; rank++) {
const doc = results[rank]
const rrf = 1 / (k + rank + 1)
scores.set(doc.id, (scores.get(doc.id) || 0) + rrf)
}
}
return Array.from(scores.entries())
.map(([id, score]) => ({ id, score }))
.sort((a, b) => b.score - a.score)
}
// Usage
const semantic = await semanticSearch(query, 50)
const keyword = await keywordSearch(query, 50)
const hybrid = reciprocalRankFusion([semantic, keyword])
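A worked example may make the arithmetic concrete. The sketch below is a compact, self-contained variant of the same formula (1 / (k + rank), with ranks starting at 1) over plain ID lists; the document IDs and the `rrfScores` name are made up for illustration:

```typescript
// Compact RRF over ranked lists of IDs, with k = 60 by default
function rrfScores(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      // index is 0-based, so rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1))
    })
  }
  return scores
}

const semanticIds = ['doc-a', 'doc-b', 'doc-c']
const keywordIds = ['doc-c', 'doc-a', 'doc-d']
const fused = rrfScores([semanticIds, keywordIds])
const ranking = Array.from(fused.entries())
  .sort((x, y) => y[1] - x[1])
  .map(([id]) => id)
// doc-a: 1/61 + 1/62 ≈ 0.03252 (ranks 1 and 2)
// doc-c: 1/63 + 1/61 ≈ 0.03227 (ranks 3 and 1)
// doc-b: 1/62 ≈ 0.01613 (semantic only)
// doc-d: 1/63 ≈ 0.01587 (keyword only)
// ranking: ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents that appear in both lists rise to the top, while documents found by only one search are still kept.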
Database-Native Hybrid Search
Some databases support hybrid search natively:
Pinecone (Sparse-Dense Vectors):
// Sparse vectors for keyword matching
const sparseEmbedding = await getSparseEmbedding(query) // BM25 or similar
const denseEmbedding = await getDenseEmbedding(query) // OpenAI, etc.
await index.upsert([{
id: 'doc-1',
values: denseEmbedding,
sparseValues: {
indices: sparseEmbedding.indices,
values: sparseEmbedding.values
},
metadata: { ... }
}])
const results = await index.query({
vector: denseEmbedding,
sparseVector: sparseEmbedding,
topK: 10
})
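The `getSparseEmbedding` call above is left undefined. To illustrate the shape of a sparse vector, here is a toy encoder that hashes each term to an index and uses raw term counts as values. This is not BM25 (there is no IDF weighting or length normalization), and `hashTerm` and `toySparseEmbedding` are hypothetical names; a production setup would likely use a dedicated sparse encoder:

```typescript
interface SparseVector {
  indices: number[]
  values: number[]
}

// FNV-1a-style hash over UTF-16 code units, mapped to an unsigned 32-bit integer
function hashTerm(term: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < term.length; i++) {
    hash ^= term.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

// Toy encoder: one index per distinct term (hash collisions are merged), value = term count
function toySparseEmbedding(text: string): SparseVector {
  const counts = new Map<number, number>()
  const terms = text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0)
  for (const term of terms) {
    const index = hashTerm(term)
    counts.set(index, (counts.get(index) ?? 0) + 1)
  }
  const indices = Array.from(counts.keys()).sort((a, b) => a - b)
  return { indices, values: indices.map(i => counts.get(i) ?? 0) }
}

const sparse = toySparseEmbedding('error error 12345')
// sparse.indices holds the hashed term positions; sparse.values holds their counts
```

Because the same term always hashes to the same index, a query containing "12345" overlaps with documents containing "12345", which is what gives sparse vectors their exact-match behavior.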
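Qdrant (vector search restricted by a full-text match filter):
await client.search('collection', {
vector: queryEmbedding,
limit: 10,
filter: { // The JS client uses `filter`; `query_filter` is the Python client's name
must: [{
key: 'content',
match: { text: 'specific phrase' } // Keyword match on the payload field
}]
}
})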
pgvector + Full-Text Search:
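-- Combine vector similarity with full-text search.
-- The WHERE clause keeps only rows that match the keyword query, so
-- semantic-only matches are dropped; section 10.6 shows a variant that keeps both.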
SELECT id, content,
(1 - (embedding <=> $1)) * 0.5 +
ts_rank(to_tsvector(content), plainto_tsquery($2)) * 0.5 as score
FROM documents
WHERE to_tsvector(content) @@ plainto_tsquery($2)
ORDER BY score DESC
LIMIT 10;
10.4 When to Use Hybrid Search
Use Hybrid When:
- Mixed Query Types
  - Some queries are conceptual ("how to fix errors")
  - Some are specific ("error code 12345")
- Technical Content
  - Code, APIs, product names
  - Specific terminology
- Multi-language Content
  - Keywords in one language, meaning in another
- High Precision Required
  - Can't afford to miss exact matches
  - Legal, medical, compliance
Stick with Pure Semantic When:
- Conversational Queries
  - Natural language questions
  - Concept-based search
- Recommendation Systems
  - "Similar to" queries
  - User preference matching
- Simple Use Cases
  - Less complexity to manage
  - Faster queries
10.5 Tuning Hybrid Search
Finding the Right Alpha
async function tuneAlpha(
testQueries: Array<{ query: string; relevantDocs: string[] }>
): Promise<number> {
const alphas = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
const results: Array<{ alpha: number; recall: number }> = []
for (const alpha of alphas) {
let totalRecall = 0
for (const { query, relevantDocs } of testQueries) {
const searchResults = await hybridSearch(query, alpha)
const foundDocs = searchResults.map(r => r.id)
const recall = relevantDocs.filter(d => foundDocs.includes(d)).length
/ relevantDocs.length
totalRecall += recall
}
results.push({
alpha,
recall: totalRecall / testQueries.length
})
}
const best = results.reduce((a, b) => a.recall > b.recall ? a : b)
console.log('Best alpha:', best.alpha, 'with recall:', best.recall)
return best.alpha
}
Query-Dependent Alpha
Adjust alpha based on query characteristics:
function determineAlpha(query: string): number {
// Detect query type
const hasNumbers = /\d+/.test(query)
const hasQuotes = /"[^"]+"/.test(query)
const isQuestion = query.trim().endsWith('?')
const isShort = query.trim().split(/\s+/).length < 4
// More keyword-heavy for specific queries
if (hasNumbers || hasQuotes) return 0.3
// More semantic for questions
if (isQuestion) return 0.8
// Balanced for short queries
if (isShort) return 0.5
// Default balanced
return 0.5
}
10.6 Complete Hybrid Search Example
import { Pool } from 'pg'
import OpenAI from 'openai'
const pool = new Pool({ connectionString: process.env.DATABASE_URL })
const openai = new OpenAI()
interface HybridResult {
id: string
title: string
content: string
semanticScore: number
keywordScore: number
combinedScore: number
}
async function hybridSearchPgvector(
query: string,
options: {
limit?: number
alpha?: number
category?: string
} = {}
): Promise<HybridResult[]> {
const { limit = 10, alpha = 0.5, category } = options
// Generate embedding
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query
})
const embedding = response.data[0].embedding
// Build category filter
const categoryFilter = category ? `AND category = $4` : ''
// Hybrid query combining vector similarity and full-text search
const sql = `
WITH semantic AS (
SELECT id, title, content, category,
1 - (embedding <=> $1::vector) as score
FROM documents
WHERE true ${categoryFilter}
ORDER BY embedding <=> $1::vector
LIMIT 50
),
keyword AS (
SELECT id, title, content, category,
ts_rank_cd(
to_tsvector('english', title || ' ' || content),
plainto_tsquery('english', $2)
) as score
FROM documents
WHERE to_tsvector('english', title || ' ' || content)
@@ plainto_tsquery('english', $2)
${categoryFilter}
ORDER BY score DESC
LIMIT 50
),
combined AS (
SELECT
COALESCE(s.id, k.id) as id,
COALESCE(s.title, k.title) as title,
COALESCE(s.content, k.content) as content,
COALESCE(s.score, 0) as semantic_score,
COALESCE(k.score, 0) as keyword_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
)
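-- ts_rank_cd scores are on a different scale than cosine similarity;
-- normalize them (or use RRF) if keyword scores barely affect the ranking.
-- The explicit casts keep the type of $3 unambiguous.
SELECT *,
($3::float8 * semantic_score + (1 - $3::float8) * keyword_score) as combined_score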
FROM combined
ORDER BY combined_score DESC
LIMIT $5
`
const params = category
? [JSON.stringify(embedding), query, alpha, category, limit]
: [JSON.stringify(embedding), query, alpha, limit]
// Adjust param indices if no category
const adjustedSql = category ? sql : sql.replace(/\$5/g, '$4')
const result = await pool.query(adjustedSql, params)
return result.rows.map(row => ({
id: row.id,
title: row.title,
content: row.content,
semanticScore: row.semantic_score,
keywordScore: row.keyword_score,
combinedScore: row.combined_score
}))
}
// Usage
async function main() {
// Semantic-heavy query
const results1 = await hybridSearchPgvector(
'How do I fix database connection issues?',
{ alpha: 0.8 }
)
// Keyword-heavy query
const results2 = await hybridSearchPgvector(
'error code ECONNREFUSED',
{ alpha: 0.3 }
)
// Balanced
const results3 = await hybridSearchPgvector(
'postgres connection timeout',
{ alpha: 0.5, category: 'troubleshooting' }
)
}
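The example above assumes a `documents` table already exists. A possible schema is sketched below as a list of statements plus a small runner; the column names are taken from the query, while the index names, the `NOT NULL` constraints, and the `createSchema` helper are assumptions. The 1536 dimension matches the default output size of `text-embedding-3-small`, and the HNSW index type requires pgvector 0.5.0 or later:

```typescript
// Assumed schema for the `documents` table queried above
const schemaStatements: string[] = [
  `CREATE EXTENSION IF NOT EXISTS vector`,
  `CREATE TABLE IF NOT EXISTS documents (
     id        bigserial PRIMARY KEY,
     title     text NOT NULL,
     content   text NOT NULL,
     category  text,
     embedding vector(1536)
   )`,
  // Approximate nearest-neighbor index for cosine distance (the <=> operator)
  `CREATE INDEX IF NOT EXISTS documents_embedding_idx
     ON documents USING hnsw (embedding vector_cosine_ops)`,
  // GIN index on the same expression the keyword CTE uses, so @@ can use it
  `CREATE INDEX IF NOT EXISTS documents_fts_idx
     ON documents USING gin (to_tsvector('english', title || ' ' || content))`
]

// Runs each statement in order; `query` is any function shaped like pg's pool.query
async function createSchema(
  query: (sql: string) => Promise<unknown>
): Promise<void> {
  for (const statement of schemaStatements) {
    await query(statement)
  }
}
```

With the `pool` from the example, this could be invoked as `await createSchema(sql => pool.query(sql))`. Indexing the exact `to_tsvector('english', title || ' ' || content)` expression matters, because PostgreSQL only uses an expression index when the query expression matches it.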
Key Takeaways
- Design metadata for filtering, not storage
- Hybrid search combines semantic and keyword matching
- RRF is a robust merging strategy
- Tune alpha based on your query patterns
- Some databases support hybrid natively
Exercise: Implement Hybrid Search
- Set up a PostgreSQL database with pgvector and full-text search
- Create a table with sample documents
- Implement hybrid search with adjustable alpha
- Test with queries that benefit from each approach
- Find the optimal alpha for your test set
Next up: Module 11 - Performance Optimization

