Module 10: Metadata and Hybrid Search

Combining Semantic and Keyword Search

Introduction

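Neither pure semantic search nor pure keyword search is sufficient on its own: semantic search can miss exact tokens such as error codes, and keyword search can miss synonyms and paraphrases. Hybrid search runs both and merges the results.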

By the end of this module, you'll understand:

How to design effective metadata schemas
What hybrid search is and why it matters
How to implement hybrid search
When to use each approach

10.1 Metadata Schema Design

What is Metadata?

Metadata is structured information stored alongside vectors:

{
  id: 'doc-123',
  values: [0.1, -0.2, ...],  // Vector
  metadata: {                 // Metadata
    title: 'Introduction to Vector Databases',
    author: 'Jane Smith',
    category: 'technology',
    publishDate: '2024-01-15',
    wordCount: 1500,
    tags: ['databases', 'ai', 'search'],
    source: 'blog'
  }
}

Metadata Best Practices

1. Keep it Lean

// Bad: Too much data
metadata: {
  fullContent: '10,000 words...',  // Store separately
  base64Image: 'data:image/...',   // Don't do this
  nestedObject: { deep: { nested: { data: '...' } } }  // Avoid deep nesting
}

// Good: Just what you need for filtering
metadata: {
  title: 'Document Title',
  category: 'tech',
  authorId: 'author-123',  // Reference, not full data
  publishDate: '2024-01-15'
}

2. Use Consistent Types

// Bad: Inconsistent types
metadata: { price: "29.99" }  // String
metadata: { price: 29.99 }    // Number
metadata: { price: "$29.99" } // Different format

// Good: Consistent types
metadata: { price: 29.99 }    // Always number
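
One way to keep types consistent is to normalize values before upserting. The sketch below assumes prices arrive either as numbers or as strings that use "." as the decimal separator, possibly with a currency symbol. `normalizePrice` is a hypothetical helper, not part of any database client.

```typescript
// Hypothetical helper: coerce mixed price inputs to a plain number.
// Assumption: "." is the decimal separator (e.g. "$1,299.00", not "1.299,00").
function normalizePrice(raw: string | number): number {
  if (typeof raw === 'number') {
    if (!Number.isFinite(raw)) throw new Error(`Invalid price: ${raw}`)
    return raw
  }
  // Strip everything except digits, the decimal point, and a minus sign
  const cleaned = raw.replace(/[^0-9.-]/g, '')
  const value = Number(cleaned)
  if (cleaned === '' || Number.isNaN(value)) {
    throw new Error(`Invalid price: ${raw}`)
  }
  return value
}

// The three inconsistent inputs from the "Bad" example all become 29.99
const prices = ['29.99', 29.99, '$29.99'].map(p => normalizePrice(p))
```

Throwing on unparseable input is a deliberate choice here: a record rejected at write time is easier to debug than a range filter that silently skips it.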

3. Flatten When Possible

// Bad: Nested structure
metadata: {
  author: { name: 'Jane', id: '123' }
}

// Good: Flat structure
metadata: {
  authorName: 'Jane',
  authorId: '123'
}
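
If your source data is already nested, flattening can be automated. The following is a minimal sketch, assuming values are primitives, string arrays, or nested objects; `flattenMetadata` and its camelCase key naming are illustrative choices, not a library API.

```typescript
type Primitive = string | number | boolean
type Nested = { [key: string]: Primitive | string[] | Nested }
type Flat = Record<string, Primitive | string[]>

// Hypothetical helper: flatten nested objects into camelCase keys,
// so { author: { name: 'Jane' } } becomes { authorName: 'Jane' }.
function flattenMetadata(input: Nested, prefix = ''): Flat {
  const out: Flat = {}
  for (const [key, value] of Object.entries(input)) {
    const name = prefix
      ? prefix + key.charAt(0).toUpperCase() + key.slice(1)
      : key
    if (typeof value === 'object' && !Array.isArray(value)) {
      // Recurse into nested objects, carrying the combined key as prefix
      Object.assign(out, flattenMetadata(value, name))
    } else {
      // Primitives and arrays are copied as-is
      out[name] = value
    }
  }
  return out
}

const flat = flattenMetadata({
  author: { name: 'Jane', id: '123' },
  tags: ['ai', 'ml']
})
// flat is { authorName: 'Jane', authorId: '123', tags: ['ai', 'ml'] }
```

Note that this sketch does not detect key collisions (for example, a top-level `authorName` next to `author.name`); the later value would overwrite the earlier one.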

4. Index-Friendly Fields

// Good for filtering
metadata: {
  category: 'technology',     // Exact match
  price: 29.99,               // Range queries
  inStock: true,              // Boolean filter
  tags: ['ai', 'ml']          // Array contains
}
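
To make the four filter kinds concrete, here is an in-memory sketch of how each one is evaluated. The `ProductFilter` shape is hypothetical and exists only for illustration; every database defines its own filter syntax.

```typescript
interface ProductMeta {
  category: string
  price: number
  inStock: boolean
  tags: string[]
}

// Hypothetical filter shape; not the syntax of any particular database
interface ProductFilter {
  category?: string   // exact match
  maxPrice?: number   // range query (upper bound)
  inStock?: boolean   // boolean filter
  tag?: string        // array contains
}

function matchesFilter(meta: ProductMeta, f: ProductFilter): boolean {
  if (f.category !== undefined && meta.category !== f.category) return false
  if (f.maxPrice !== undefined && meta.price > f.maxPrice) return false
  if (f.inStock !== undefined && meta.inStock !== f.inStock) return false
  if (f.tag !== undefined && !meta.tags.includes(f.tag)) return false
  return true
}

const items: ProductMeta[] = [
  { category: 'technology', price: 29.99, inStock: true, tags: ['ai', 'ml'] },
  { category: 'technology', price: 99.0, inStock: false, tags: ['ai'] },
  { category: 'cooking', price: 15.0, inStock: true, tags: ['recipes'] }
]

// Only the first item is in 'technology', costs at most 50, and has the 'ml' tag
const hits = items.filter(m =>
  matchesFilter(m, { category: 'technology', maxPrice: 50, tag: 'ml' })
)
```

Each check relies on the field having one consistent type, which is why the earlier practices matter: a price stored as the string "29.99" would never satisfy a numeric comparison reliably.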

Metadata Size Limits

Database	Metadata Limit
Pinecone	40KB per vector
Qdrant	No hard limit
Weaviate	No hard limit
Chroma	No hard limit
pgvector	PostgreSQL limits
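
Where a limit applies, it can help to check the size before upserting. The sketch below assumes the limit is measured against UTF-8 encoded JSON and treats 40KB as 40 * 1024 bytes; how a given database actually counts metadata bytes may differ, so treat the result as an approximation. `metadataBytes` and `assertMetadataFits` are hypothetical helpers.

```typescript
// Assumed limit: the 40KB Pinecone figure, interpreted as 40 * 1024 bytes
const MAX_METADATA_BYTES = 40 * 1024

// Hypothetical helper: approximate serialized metadata size in bytes
function metadataBytes(metadata: Record<string, unknown>): number {
  return new TextEncoder().encode(JSON.stringify(metadata)).length
}

// Throws if the serialized metadata exceeds the assumed limit
function assertMetadataFits(metadata: Record<string, unknown>): void {
  const size = metadataBytes(metadata)
  if (size > MAX_METADATA_BYTES) {
    throw new Error(
      `Metadata is ${size} bytes, over the ${MAX_METADATA_BYTES} byte limit`
    )
  }
}

// Lean metadata passes; a field holding full document content would not
assertMetadataFits({ title: 'Document Title', category: 'tech' })
```

Counting bytes rather than characters matters for non-ASCII text, where a single character can take two or more bytes.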

10.2 The Semantic vs Keyword Problem

When Semantic Search Fails

// Query: "error code 12345"
// Semantic search might find:
// - "How to fix common errors" (conceptually related)
// - "Troubleshooting guide" (conceptually related)

// But user wanted:
// - "Error 12345: Database connection failed" (exact match)

Semantic search understands meaning but can miss:

Exact matches (product codes, error numbers)
Proper nouns
Technical terms
Specific phrases

When Keyword Search Fails

// Query: "laptop for video editing"
// Keyword search finds:
// - "Laptop stands" (contains "laptop")
// - "Video editing tutorial" (contains "video editing")

// But user wanted:
// - "MacBook Pro M3 for creative professionals"
// - "High-performance workstation"

Keyword search matches words but misses:

Synonyms
Conceptual relationships
Natural language variations
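
The blind spot is easy to reproduce with a toy scorer. The sketch below is a deliberately naive keyword match (fraction of query terms present in the text), not BM25 or any real engine's ranking; `keywordOverlap` is a hypothetical helper.

```typescript
// Naive keyword score: fraction of query terms that appear in the text.
// Real full-text engines rank more carefully but still match on terms.
function keywordOverlap(query: string, text: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean)
  const textTerms = new Set(tokenize(text))
  const queryTerms = tokenize(query)
  if (queryTerms.length === 0) return 0
  const matched = queryTerms.filter(t => textTerms.has(t)).length
  return matched / queryTerms.length
}

const userQuery = 'laptop for video editing'
const titles = [
  'Laptop stands',                              // matches "laptop"
  'Video editing tutorial',                     // matches "video", "editing"
  'MacBook Pro M3 for creative professionals',  // matches only "for"
  'High-performance workstation'                // matches nothing
]

const scored = titles.map(title => ({
  title,
  score: keywordOverlap(userQuery, title)
}))
// Scores in order: 0.25, 0.5, 0.25, 0
```

The two documents the user actually wanted score lowest, and one of them registers at all only because of the stop word "for".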

The Solution: Hybrid Search

Combine both approaches:

1. Run semantic (vector) search
2. Run keyword (full-text) search
3. Merge and re-rank results

10.3 Implementing Hybrid Search

Basic Approach: Score Fusion

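// Shape shared by both searches. semanticSearch(query, limit) and
// keywordSearch(query, limit) are assumed to exist and return Result[].
interface Result {
  id: string
  score: number
}

async function hybridSearch(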
  query: string,
  alpha: number = 0.5  // Weight: 0 = all keyword, 1 = all semantic
): Promise<Result[]> {
  // Run both searches in parallel
  const [semanticResults, keywordResults] = await Promise.all([
    semanticSearch(query, 50),
    keywordSearch(query, 50)
  ])

  // Normalize scores to 0-1 range
  const normalizedSemantic = normalizeScores(semanticResults)
  const normalizedKeyword = normalizeScores(keywordResults)

  // Merge results
  const merged = new Map<string, { semantic: number; keyword: number }>()

  for (const r of normalizedSemantic) {
    merged.set(r.id, { semantic: r.score, keyword: 0 })
  }

  for (const r of normalizedKeyword) {
    const existing = merged.get(r.id)
    if (existing) {
      existing.keyword = r.score
    } else {
      merged.set(r.id, { semantic: 0, keyword: r.score })
    }
  }

  // Calculate combined scores
  const combined = Array.from(merged.entries()).map(([id, scores]) => ({
    id,
    score: alpha * scores.semantic + (1 - alpha) * scores.keyword
  }))

  return combined.sort((a, b) => b.score - a.score).slice(0, 10)
}

function normalizeScores(results: Result[]): Result[] {
  if (results.length === 0) return []
  const max = Math.max(...results.map(r => r.score))
  const min = Math.min(...results.map(r => r.score))
  const range = max - min

  return results.map(r => ({
    ...r,
    // If every score is equal (including a single result), treat all as
    // top-scoring instead of zeroing them out
    score: range === 0 ? 1 : (r.score - min) / range
  }))
}

Reciprocal Rank Fusion (RRF)

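A rank-based merging strategy that needs no score normalization. Each document earns 1 / (k + rank) from every result list it appears in, with rank starting at 1; k = 60 is a commonly used default that dampens the advantage of top ranks: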

function reciprocalRankFusion(
  resultSets: Result[][],
  k: number = 60
): Result[] {
  const scores = new Map<string, number>()

  for (const results of resultSets) {
    for (let rank = 0; rank < results.length; rank++) {
      const doc = results[rank]
      const rrf = 1 / (k + rank + 1)
      scores.set(doc.id, (scores.get(doc.id) || 0) + rrf)
    }
  }

  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// Usage
const semantic = await semanticSearch(query, 50)
const keyword = await keywordSearch(query, 50)
const hybrid = reciprocalRankFusion([semantic, keyword])
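
To see why RRF is insensitive to score scales, it helps to work through a small case by hand. The sketch below restates the fusion function compactly so it runs on its own; the two result lists and their scores are made up for illustration.

```typescript
interface Ranked {
  id: string
  score: number
}

// Compact restatement of reciprocalRankFusion so this snippet is self-contained
function rrf(resultSets: Ranked[][], k = 60): Ranked[] {
  const scores = new Map<string, number>()
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      // rank is 0-based here, so the top result contributes 1 / (k + 1)
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// Made-up lists; the raw scores are on different scales on purpose
const semanticHits: Ranked[] = [
  { id: 'A', score: 0.91 }, { id: 'B', score: 0.88 }, { id: 'C', score: 0.52 }
]
const keywordHits: Ranked[] = [
  { id: 'C', score: 14.2 }, { id: 'A', score: 9.7 }, { id: 'D', score: 3.1 }
]

const fused = rrf([semanticHits, keywordHits])
// A: 1/61 + 1/62   C: 1/63 + 1/61   B: 1/62   D: 1/63
// Fused order: A, C, B, D
```

Only the positions matter: the keyword scores could be multiplied by 1,000 and the fused order would not change. Documents found by both searches (A and C) rise above those found by only one.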

Database-Native Hybrid Search

Some databases support hybrid search natively:

Pinecone (Sparse-Dense Vectors):

// Sparse vectors for keyword matching
const sparseEmbedding = await getSparseEmbedding(query)  // BM25 or similar
const denseEmbedding = await getDenseEmbedding(query)     // OpenAI, etc.

await index.upsert([{
  id: 'doc-1',
  values: denseEmbedding,
  sparseValues: {
    indices: sparseEmbedding.indices,
    values: sparseEmbedding.values
  },
  metadata: { ... }
}])

const results = await index.query({
  vector: denseEmbedding,
  sparseVector: sparseEmbedding,
  topK: 10
})
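
The `getSparseEmbedding` call above is left undefined. To illustrate only the shape of a sparse vector (parallel `indices` and `values` arrays), here is a toy encoder that hashes each term to an index and counts occurrences. This is an assumption-laden stand-in, not BM25 and not a Pinecone API; a production system would use a proper sparse model.

```typescript
interface SparseVector {
  indices: number[]
  values: number[]
}

// Toy sparse encoder: term counts keyed by a hashed term index.
// NOT BM25; hash collisions are possible and simply merge counts.
function toySparseEmbedding(text: string, dims = 100000): SparseVector {
  const counts = new Map<number, number>()
  for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    // Simple deterministic string hash into the range [0, dims)
    let h = 0
    for (let i = 0; i < term.length; i++) {
      h = (h * 31 + term.charCodeAt(i)) % dims
    }
    counts.set(h, (counts.get(h) ?? 0) + 1)
  }
  // Emit indices in ascending order with their matching counts
  const indices = Array.from(counts.keys()).sort((a, b) => a - b)
  return { indices, values: indices.map(i => counts.get(i) ?? 0) }
}

// "error" appears twice, so one of the three entries has value 2
const sparse = toySparseEmbedding('error code 12345 error')
```

Because only non-zero dimensions are stored, an exact token such as "12345" gets its own dimension, which is what lets sparse vectors recover the exact matches that dense embeddings blur.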

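Qdrant (vector search restricted by a full-text filter):

// Assumes the JavaScript REST client, which takes the filter under `filter`.
// This narrows vector results to keyword matches; it does not fuse two scores.
await client.search('collection', {
  vector: queryEmbedding,
  limit: 10,
  filter: {
    must: [{
      key: 'content',
      match: { text: 'specific phrase' }  // Keyword match on the payload field
    }]
  }
})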

pgvector + Full-Text Search:

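-- Combine vector similarity with full-text search.
-- $1 is the query embedding, $2 is the raw query text.
-- Note: the WHERE clause keeps only rows that match the keywords, so
-- semantic-only matches are dropped; ts_rank is also not bounded to 0-1.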
SELECT id, content,
       (1 - (embedding <=> $1)) * 0.5 +
       ts_rank(to_tsvector(content), plainto_tsquery($2)) * 0.5 as score
FROM documents
WHERE to_tsvector(content) @@ plainto_tsquery($2)
ORDER BY score DESC
LIMIT 10;

10.4 When to Use Hybrid Search

Use Hybrid When:

Mixed Query Types
- Some queries are conceptual ("how to fix errors")
- Some are specific ("error code 12345")
Technical Content
- Code, APIs, product names
- Specific terminology
Multi-language Content
- Keywords in one language, meaning in another
High Precision Required
- Can't afford to miss exact matches
- Legal, medical, compliance

Stick with Pure Semantic When:

Conversational Queries
- Natural language questions
- Concept-based search
Recommendation Systems
- "Similar to" queries
- User preference matching
Simple Use Cases
- Less complexity to manage
- Faster queries

10.5 Tuning Hybrid Search

Finding the Right Alpha

async function tuneAlpha(
  testQueries: Array<{ query: string; relevantDocs: string[] }>
): Promise<number> {
  const alphas = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
  const results: Array<{ alpha: number; recall: number }> = []

  for (const alpha of alphas) {
    let totalRecall = 0

    for (const { query, relevantDocs } of testQueries) {
      const searchResults = await hybridSearch(query, alpha)
      const foundDocs = searchResults.map(r => r.id)
      const recall = relevantDocs.filter(d => foundDocs.includes(d)).length
                     / relevantDocs.length
      totalRecall += recall
    }

    results.push({
      alpha,
      recall: totalRecall / testQueries.length
    })
  }

  const best = results.reduce((a, b) => a.recall > b.recall ? a : b)
  console.log('Best alpha:', best.alpha, 'with recall:', best.recall)
  return best.alpha
}
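
Because `hybridSearch` returns its top 10 results, the recall computed above is effectively recall@10, and it ignores where in the list a relevant document appears. If ordering matters for your application, a rank-aware metric can be tracked alongside it. The helpers below are a sketch with hypothetical names; they also return 0 for an empty relevant set rather than dividing by zero.

```typescript
// Recall@k: share of relevant documents found in the top k results
function recallAtK(rankedIds: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0
  const topK = new Set(rankedIds.slice(0, k))
  return relevant.filter(id => topK.has(id)).length / relevant.length
}

// Reciprocal rank: 1 / position of the first relevant result, 0 if none
function reciprocalRank(rankedIds: string[], relevant: string[]): number {
  const index = rankedIds.findIndex(id => relevant.includes(id))
  return index === -1 ? 0 : 1 / (index + 1)
}

// Made-up example data
const ranked = ['doc-7', 'doc-2', 'doc-9', 'doc-4']
const relevantIds = ['doc-2', 'doc-4', 'doc-5']

// One of the three relevant docs (doc-2) is in the top 3
const recall = recallAtK(ranked, relevantIds, 3)
// The first relevant hit (doc-2) is at position 2
const rr = reciprocalRank(ranked, relevantIds)
```

Averaging the reciprocal rank over all test queries gives mean reciprocal rank (MRR), which could replace or complement recall when choosing the best alpha.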

Query-Dependent Alpha

Adjust alpha based on query characteristics:

function determineAlpha(query: string): number {
  const q = query.trim()

  // Detect query type
  const hasNumbers = /\d/.test(q)
  const hasQuotes = /"[^"]+"/.test(q)
  const isQuestion = q.endsWith('?')

  // Lean toward keyword matching for codes, IDs, and quoted phrases
  if (hasNumbers || hasQuotes) return 0.3
  // Lean toward semantic matching for natural-language questions
  if (isQuestion) return 0.8

  // Everything else, including short queries, stays balanced
  return 0.5
}

10.6 Complete Hybrid Search Example

import { Pool } from 'pg'
import OpenAI from 'openai'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })
const openai = new OpenAI()

interface HybridResult {
  id: string
  title: string
  content: string
  semanticScore: number
  keywordScore: number
  combinedScore: number
}

async function hybridSearchPgvector(
  query: string,
  options: {
    limit?: number
    alpha?: number
    category?: string
  } = {}
): Promise<HybridResult[]> {
  const { limit = 10, alpha = 0.5, category } = options

  // Generate embedding
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  })
  const embedding = response.data[0].embedding

  // Always bind five parameters; a NULL category ($4) disables the filter,
  // so the SQL text never has to be rewritten
  const categoryFilter = `AND ($4::text IS NULL OR category = $4)`

  // Hybrid query combining vector similarity and full-text search
  const sql = `
    WITH semantic AS (
      SELECT id, title, content, category,
             1 - (embedding <=> $1::vector) as score
      FROM documents
      WHERE true ${categoryFilter}
      ORDER BY embedding <=> $1::vector
      LIMIT 50
    ),
    keyword AS (
      SELECT id, title, content, category,
             ts_rank_cd(
               to_tsvector('english', title || ' ' || content),
               plainto_tsquery('english', $2)
             ) as score
      FROM documents
      WHERE to_tsvector('english', title || ' ' || content)
            @@ plainto_tsquery('english', $2)
        ${categoryFilter}
      ORDER BY score DESC
      LIMIT 50
    ),
    combined AS (
      SELECT
        COALESCE(s.id, k.id) as id,
        COALESCE(s.title, k.title) as title,
        COALESCE(s.content, k.content) as content,
        COALESCE(s.score, 0) as semantic_score,
        COALESCE(k.score, 0) as keyword_score
      FROM semantic s
      FULL OUTER JOIN keyword k ON s.id = k.id
    )
    -- The explicit casts give $3 one unambiguous type.
    -- ts_rank_cd scores are not normalized to 0-1, so alpha is only a rough weight.
    SELECT *,
           ($3::float8 * semantic_score
             + (1 - $3::float8) * keyword_score) as combined_score
    FROM combined
    ORDER BY combined_score DESC
    LIMIT $5
  `

  // pgvector accepts the '[0.1,0.2,...]' text form that JSON.stringify produces
  const params = [JSON.stringify(embedding), query, alpha, category ?? null, limit]

  const result = await pool.query(sql, params)

  return result.rows.map(row => ({
    id: row.id,
    title: row.title,
    content: row.content,
    semanticScore: row.semantic_score,
    keywordScore: row.keyword_score,
    combinedScore: row.combined_score
  }))
}

// Usage
async function main() {
  // Semantic-heavy query
  const results1 = await hybridSearchPgvector(
    'How do I fix database connection issues?',
    { alpha: 0.8 }
  )

  // Keyword-heavy query
  const results2 = await hybridSearchPgvector(
    'error code ECONNREFUSED',
    { alpha: 0.3 }
  )

  // Balanced
  const results3 = await hybridSearchPgvector(
    'postgres connection timeout',
    { alpha: 0.5, category: 'troubleshooting' }
  )
}

Key Takeaways

Design metadata for filtering, not storage
Hybrid search combines semantic and keyword matching
RRF is a robust merging strategy
Tune alpha based on your query patterns
Some databases support hybrid natively

Exercise: Implement Hybrid Search

1. Set up a PostgreSQL database with pgvector and full-text search
2. Create a table with sample documents
3. Implement hybrid search with adjustable alpha
4. Test with queries that benefit from each approach
5. Find the optimal alpha for your test set

Next up: Module 11 - Performance Optimization