Retrieval Science: Finding Context
Introduction
We've built a knowledge base and created search functionality. Now we enter the heart of RAG: the retrieval phase. This is where we take a user's question, find the most relevant information from our knowledge base, and prepare context for the LLM.
This lesson explores the science behind retrieval—how similarity search works mathematically, what "Top-K" retrieval means, and how the retrieval query flows through our Next.js API route.
Vector Similarity Search Deep Dive
The Mathematical Foundation
At its core, vector similarity search asks: "Which stored vectors are closest to my query vector?"
For two vectors A and B, cosine similarity is calculated as:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product: Σ(aᵢ × bᵢ)
- ||A|| is the magnitude: √(Σaᵢ²)
Example:
A = [0.5, 0.3, 0.8]
B = [0.4, 0.3, 0.9]
Dot product = (0.5×0.4) + (0.3×0.3) + (0.8×0.9) = 0.2 + 0.09 + 0.72 = 1.01
||A|| = √(0.25 + 0.09 + 0.64) = √0.98 ≈ 0.99
||B|| = √(0.16 + 0.09 + 0.81) = √1.06 ≈ 1.03
Similarity = 1.01 / (0.99 × 1.03) ≈ 0.99
These vectors are very similar (0.99 out of 1.0).
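The same arithmetic is easy to verify in code. A minimal TypeScript sketch, standalone and for illustration only:

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // Σ(aᵢ × bᵢ)
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

cosineSimilarity([0.5, 0.3, 0.8], [0.4, 0.3, 0.9]); // ≈ 0.99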
Why Cosine Similarity for Text?
Direction Over Magnitude:
Text embeddings capture meaning in their direction. Two documents about "machine learning" will point in similar directions, regardless of their length.
If we used Euclidean distance:
- A short document might be "closer" to a query than a long document
- Length would affect retrieval, even though meaning is the same
Cosine similarity normalizes for magnitude, focusing purely on semantic direction.
Bounded Range:
Cosine similarity always falls between -1 and 1; in practice, embeddings of natural-language text rarely produce similarities below 0:
- 1.0 = identical direction
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction (rare in practice)
This bounded range makes thresholds meaningful and comparable across queries.
Converting Distance to Similarity
pgvector's <=> operator returns cosine distance, not similarity:
cosine_distance = 1 - cosine_similarity
So:
- Distance 0 = Similarity 1 (identical)
- Distance 1 = Similarity 0 (perpendicular)
- Distance 2 = Similarity -1 (opposite)
We convert in our search function:
1 - (d.embedding <=> query_embedding) AS similarity
Top-K Retrieval
What is Top-K?
Top-K retrieval means retrieving the K most similar documents to the query. K is a hyperparameter you choose.
// Retrieve top 5 most similar documents
const results = await searchDocuments(queryEmbedding, { matchCount: 5 });
Choosing K
The optimal K depends on several factors:
Context Window Budget:
If you allocate 2000 tokens for context and each chunk is ~400 tokens, you can include ~5 chunks. More chunks mean more context, which can improve answers, but also:
- Higher cost
- Potentially slower generation
- Risk of including irrelevant content
Content Redundancy:
If your documents have overlapping content (e.g., multiple pages discussing the same topic), higher K ensures you capture all relevant angles.
Query Complexity:
Simple factual queries may need only 1-2 chunks:
- "What is the API rate limit?" → One chunk likely has the answer
Complex queries may need more context:
- "How does authentication differ between v1 and v2?" → Multiple sections relevant
K Selection Guidelines
| Query Type | Recommended K | Reasoning |
|---|---|---|
| Simple factual | 2-3 | One chunk likely sufficient |
| Procedural how-to | 3-5 | Steps may span chunks |
| Comparative | 5-8 | Need multiple sources |
| Open-ended exploration | 8-10 | Cast a wide net |
Practical Default: Start with K=5. Monitor retrieval quality and adjust based on:
- Are answers missing information? → Increase K
- Are answers including irrelevant info? → Decrease K or add similarity threshold
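If you classify queries up front, the guidelines above translate directly into a lookup. A hedged sketch, where the query types and K values come from the table above rather than any library API:

// Hypothetical mapping from query type to K, taken from the guidelines table
const K_BY_QUERY_TYPE: Record<string, number> = {
  factual: 3,
  procedural: 5,
  comparative: 8,
  exploratory: 10,
};

function chooseK(queryType: string): number {
  return K_BY_QUERY_TYPE[queryType] ?? 5; // fall back to the practical default K=5
}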
Similarity Threshold
Beyond K, you can filter by minimum similarity:
const results = await searchDocuments(queryEmbedding, {
matchCount: 10, // Fetch up to 10
similarityThreshold: 0.6 // But only if similarity >= 0.6
});
Why use a threshold?
If a query has no relevant documents, returning the "Top-5" would return the least irrelevant documents—which could still be completely off-topic.
A threshold prevents this:
- Query: "How do I bake a cake?"
- Knowledge base: Technical documentation
- Without threshold: Returns random tech docs
- With threshold: Returns nothing (correctly indicates no relevant info)
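Even when the database function enforces the threshold, a client-side filter is a cheap safety net. A sketch, assuming results carry the similarity field shown earlier:

// Keep only results at or above the threshold; an empty array means "no relevant info"
const relevant = results.filter((r: { similarity: number }) => r.similarity >= 0.6);

if (relevant.length === 0) {
  // No document cleared the bar; handle this explicitly (see "Handling No Results" below)
}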
Threshold Guidelines:
| Threshold | Effect |
|---|---|
| < 0.5 | Very permissive, includes loosely related content |
| 0.5 - 0.7 | Balanced, typical for most applications |
| 0.7 - 0.8 | Strict, only highly relevant content |
| > 0.8 | Very strict, may miss relevant content |
The Retrieval Flow in Next.js
The Complete Retrieval Query
Let's trace through the retrieval process in detail:
// app/api/search/route.ts
// Assumed local helpers: a configured Supabase client and the embedding
// wrapper from the ingestion lesson (paths are illustrative).
import { supabase } from '@/lib/supabase';
import { generateEmbedding } from '@/lib/embeddings';

export async function POST(request: Request) {
  // Step 1: Parse the incoming request
  const { query } = await request.json();

  if (!query || typeof query !== 'string') {
    return Response.json(
      { error: 'Query is required' },
      { status: 400 }
    );
  }

  // Step 2: Generate query embedding
  // Note: Use RETRIEVAL_QUERY task type
  const queryEmbedding = await generateEmbedding(query, 'RETRIEVAL_QUERY');

  // Step 3: Search the vector database
  const { data: results, error } = await supabase.rpc('search_docs', {
    query_embedding: queryEmbedding,
    match_count: 5,
    similarity_threshold: 0.6
  });

  if (error) {
    console.error('Search error:', error);
    return Response.json(
      { error: 'Search failed' },
      { status: 500 }
    );
  }

  // Step 4: Return results
  return Response.json({
    results: (results ?? []).map((r: { content: string; source: string; title: string; similarity: number }) => ({
      content: r.content,
      source: r.source,
      title: r.title,
      similarity: r.similarity
    })),
    query: query,
    embedding_dimensions: queryEmbedding.length
  });
}
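From the browser, calling this route is a single fetch. A hypothetical client-side call whose endpoint and response shape match the route above:

// Hypothetical client-side usage of the /api/search route
const res = await fetch('/api/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'How do I configure authentication?' }),
});
const { results } = await res.json();
// Each result: { content, source, title, similarity }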
Architecture Diagram
┌──────────────────────────────────────────────────────────────┐
│ USER'S BROWSER │
│ │
│ "How do I configure authentication?" │
└───────────────────────────┬──────────────────────────────────┘
│ POST /api/search
▼
┌──────────────────────────────────────────────────────────────┐
│ NEXT.JS API ROUTE │
│ │
│ 1. Parse query from request body │
│ 2. Validate input │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ GEMINI EMBEDDING API │ │
│ │ "How do I configure authentication?" │ │
│ │ ↓ │ │
│ │ [0.023, -0.145, 0.087, ..., 0.234] (768 dims) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SUPABASE RPC CALL │ │
│ │ search_docs(embedding, 5, 0.6) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
└────────────────────────────┼─────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ SUPABASE │
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ SELECT content, source, similarity │ │
│ │ FROM documents │ │
│ │ WHERE similarity >= 0.6 │ │
│ │ ORDER BY embedding <=> query_embedding │ │
│ │ LIMIT 5 │ │
│ └───────────────────────────────────────────────────┘ │
│ │
│ Vector Index (IVFFlat) → Fast approximate search │
│ │
│ Results: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. "Authentication Configuration" (0.89) │ │
│ │ 2. "Setting Up User Auth" (0.85) │ │
│ │ 3. "OAuth Integration Guide" (0.78) │ │
│ │ 4. "Security Best Practices" (0.72) │ │
│ │ 5. "API Authentication Tokens" (0.68) │ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ RESPONSE TO CLIENT │
│ │
│ { │
│ "results": [ │
│ { "content": "...", "source": "auth.md", sim: 0.89 }, │
│ { "content": "...", "source": "setup.md", sim: 0.85 }, │
│ ... │
│ ] │
│ } │
└──────────────────────────────────────────────────────────────┘
Timing Breakdown
Typical latencies for each step:
| Step | Typical Latency |
|---|---|
| Parse request | < 1ms |
| Generate embedding | 50-200ms |
| Vector search (indexed) | 10-50ms |
| Network overhead | 20-100ms |
| Total | 100-400ms |
Embedding generation is usually the slowest step. For interactive applications, this is acceptable but worth optimizing if needed (see Module 5).
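As one example of such an optimization, repeated queries can skip the embedding call entirely. A minimal sketch, assuming the generateEmbedding helper from the route above:

// In-memory cache for query embeddings; repeated queries skip the 50-200ms API call
const embeddingCache = new Map<string, number[]>();

async function cachedQueryEmbedding(query: string): Promise<number[]> {
  const key = query.trim().toLowerCase();
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await generateEmbedding(key, 'RETRIEVAL_QUERY');
  embeddingCache.set(key, embedding);
  return embedding;
}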
Context Extraction
From Search Results to Context
The search returns chunks with metadata. We need to extract and format this for the LLM:
interface SearchResult {
content: string;
source: string;
title: string;
similarity: number;
}
function buildContext(results: SearchResult[]): string {
if (results.length === 0) {
return 'No relevant information found.';
}
return results
.map((r, i) => {
return `[Document ${i + 1}]
Source: ${r.source}
Title: ${r.title}
Relevance: ${(r.similarity * 100).toFixed(0)}%
${r.content}`;
})
.join('\n\n---\n\n');
}
Example Output:
[Document 1]
Source: authentication.md
Title: Setting Up Authentication
Relevance: 89%
To configure authentication in your application, you need to first
create an auth provider in your Supabase dashboard...
---
[Document 2]
Source: security-guide.md
Title: Security Best Practices
Relevance: 85%
Always use environment variables for your API keys. Never commit
credentials to version control...
Context Structuring Strategies
Numbered Documents: Clearly delineate each source. This helps the LLM recognize that there are multiple sources and lets it reference them by number.
Relevance Indicators: Including similarity scores (as "Relevance") helps the LLM prioritize information from more relevant sources.
Source Attribution: Including source and title enables the LLM to cite sources in its response.
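These strategies interact with the context-window budget discussed under "Choosing K". A hedged sketch of a budget-aware variant of buildContext, using a rough 4-characters-per-token estimate (an approximation, not an exact token count):

function buildBudgetedContext(results: SearchResult[], maxTokens = 2000): string {
  const estimateTokens = (text: string) => Math.ceil(text.length / 4); // rough heuristic

  const parts: string[] = [];
  let used = 0;

  for (const [i, r] of results.entries()) {
    const block = `[Document ${i + 1}]\nSource: ${r.source}\nTitle: ${r.title}\n\n${r.content}`;
    const cost = estimateTokens(block);
    if (used + cost > maxTokens) break; // stop before exceeding the budget
    parts.push(block);
    used += cost;
  }

  return parts.join('\n\n---\n\n');
}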
Handling No Results
When retrieval returns nothing (all below threshold), handle gracefully:
// formatResult applies the per-document template from the earlier buildContext example
function formatResult(r: SearchResult, i: number): string {
  return `[Document ${i + 1}]
Source: ${r.source}
Title: ${r.title}
Relevance: ${(r.similarity * 100).toFixed(0)}%

${r.content}`;
}

function buildContext(results: SearchResult[]): {
  context: string;
  hasResults: boolean;
} {
  if (results.length === 0) {
    return {
      context: '',
      hasResults: false
    };
  }

  return {
    context: results.map(formatResult).join('\n\n---\n\n'),
    hasResults: true
  };
}
// In the API route
const { context, hasResults } = buildContext(searchResults);
if (!hasResults) {
// Option 1: Tell the user directly
return Response.json({
answer: "I couldn't find relevant information in the knowledge base to answer this question.",
sources: []
});
// Option 2: Let the LLM respond without context
// (may hallucinate, use with caution)
}
Summary
In this lesson, we explored the science and practice of retrieval:
Key Takeaways:
- Cosine similarity measures semantic direction: Perfect for comparing text embeddings regardless of length
- Top-K retrieval balances coverage and focus: Start with K=5 and adjust based on quality
- Similarity thresholds prevent irrelevant results: Use 0.5-0.7 for most applications
- The retrieval flow is straightforward: Query → Embed → Search → Format context
- Context extraction requires thoughtful formatting: Include metadata for attribution and structure for clarity
Next Steps
We have our context. Now we need to use it effectively. In the next lesson, we'll explore Prompt Engineering for Grounding—how to construct prompts that ensure the LLM answers based on retrieved context rather than its training knowledge.
"The right context makes all the difference between a helpful answer and a harmful hallucination." — Unknown

