Retrieval Science: Finding Context
Introduction
We've built a knowledge base and created search functionality. Now we enter the heart of RAG: the retrieval phase. This is where we take a user's question, find the most relevant information from our knowledge base, and prepare context for the LLM.
This lesson explores the science behind retrieval—how similarity search works mathematically, what "Top-K" retrieval means, and how the retrieval query flows through our Next.js API route.
Vector Similarity Search Deep Dive
The Mathematical Foundation
At its core, vector similarity search asks: "Which stored vectors are closest to my query vector?"
For two vectors A and B, cosine similarity is calculated as:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product: Σ(aᵢ × bᵢ)
- ||A|| is the magnitude: √(Σaᵢ²)
Example:
A = [0.5, 0.3, 0.8]
B = [0.4, 0.3, 0.9]
Dot product = (0.5×0.4) + (0.3×0.3) + (0.8×0.9) = 0.2 + 0.09 + 0.72 = 1.01
||A|| = √(0.25 + 0.09 + 0.64) = √0.98 ≈ 0.99
||B|| = √(0.16 + 0.09 + 0.81) = √1.06 ≈ 1.03
Similarity = 1.01 / (0.99 × 1.03) ≈ 0.99
These vectors are very similar (0.99 out of 1.0).
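The same arithmetic is easy to verify in code. A minimal TypeScript sketch, standalone and for illustration only:

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // Σ(aᵢ × bᵢ)
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

cosineSimilarity([0.5, 0.3, 0.8], [0.4, 0.3, 0.9]); // ≈ 0.99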
Why Cosine Similarity for Text?
Direction Over Magnitude:
Text embeddings capture meaning in their direction. Two documents about "machine learning" will point in similar directions, regardless of their length.
If we used Euclidean distance:
- A short document might be "closer" to a query than a long document
- Length would affect retrieval, even though meaning is the same
Cosine similarity normalizes for magnitude, focusing purely on semantic direction.
Bounded Range:
Cosine similarity always falls between -1 and 1; in practice, embeddings of natural-language text rarely produce similarities below 0:
- 1.0 = identical direction
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction (rare in practice)
This bounded range makes thresholds meaningful and comparable across queries.
Converting Distance to Similarity
pgvector's <=> operator returns cosine distance, not similarity:
cosine_distance = 1 - cosine_similarity
So:
- Distance 0 = Similarity 1 (identical)
- Distance 1 = Similarity 0 (perpendicular)
- Distance 2 = Similarity -1 (opposite)
We convert in our search function:
1 - (d.embedding <=> query_embedding) AS similarity
Top-K Retrieval
What is Top-K?
Top-K retrieval means retrieving the K most similar documents to the query. K is a hyperparameter you choose.
// Retrieve top 5 most similar documents
const results = await searchDocuments(queryEmbedding, { matchCount: 5 });
Choosing K
The optimal K depends on several factors:
Context Window Budget:
If you allocate 2000 tokens for context and each chunk is ~400 tokens, you can include ~5 chunks. More chunks mean more context, which can improve answers, but also:
- Higher cost
- Potentially slower generation
- Risk of including irrelevant content
Content Redundancy:
If your documents have overlapping content (e.g., multiple pages discussing the same topic), higher K ensures you capture all relevant angles.
Query Complexity:
Simple factual queries may need only 1-2 chunks:
- "What is the API rate limit?" → One chunk likely has the answer
Complex queries may need more context:
- "How does authentication differ between v1 and v2?" → Multiple sections relevant
K Selection Guidelines
| Query Type | Recommended K | Reasoning |
|---|---|---|
| Simple factual | 2-3 | One chunk likely sufficient |
| Procedural how-to | 3-5 | Steps may span chunks |
| Comparative | 5-8 | Need multiple sources |
| Open-ended exploration | 8-10 | Cast a wide net |
Practical Default: Start with K=5. Monitor retrieval quality and adjust based on:
- Are answers missing information? → Increase K
- Are answers including irrelevant info? → Decrease K or add similarity threshold
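If you classify queries up front, the guidelines above translate directly into a lookup. A hedged sketch, where the query types and K values come from the table above rather than any library API:

// Hypothetical mapping from query type to K, taken from the guidelines table
const K_BY_QUERY_TYPE: Record<string, number> = {
  factual: 3,
  procedural: 5,
  comparative: 8,
  exploratory: 10,
};

function chooseK(queryType: string): number {
  return K_BY_QUERY_TYPE[queryType] ?? 5; // fall back to the practical default K=5
}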
Similarity Threshold
Beyond K, you can filter by minimum similarity:
const results = await searchDocuments(queryEmbedding, {
matchCount: 10, // Fetch up to 10
similarityThreshold: 0.6 // But only if similarity >= 0.6
});
Why use a threshold?
If a query has no relevant documents, returning the "Top-5" would return the least irrelevant documents—which could still be completely off-topic.
A threshold prevents this:
- Query: "How do I bake a cake?"
- Knowledge base: Technical documentation
- Without threshold: Returns random tech docs
- With threshold: Returns nothing (correctly indicates no relevant info)
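Even when the database function enforces the threshold, a client-side filter is a cheap safety net. A sketch, assuming results carry the similarity field shown earlier:

// Keep only results at or above the threshold; an empty array means "no relevant info"
const relevant = results.filter((r: { similarity: number }) => r.similarity >= 0.6);

if (relevant.length === 0) {
  // No document cleared the bar; handle this explicitly (see "Handling No Results" below)
}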
Threshold Guidelines:
| Threshold | Effect |
|---|---|
| < 0.5 | Very permissive, includes loosely related content |
| 0.5 - 0.7 | Balanced, typical for most applications |
| 0.7 - 0.8 | Strict, only highly relevant content |
| > 0.8 | Very strict, may miss relevant content |
The Retrieval Flow in Next.js
The Complete Retrieval Query
Let's trace through the retrieval process in detail:
// app/api/search/route.ts
// Assumed local helpers: a configured Supabase client and the embedding
// wrapper from the ingestion lesson (paths are illustrative).
import { supabase } from '@/lib/supabase';
import { generateEmbedding } from '@/lib/embeddings';

export async function POST(request: Request) {
  // Step 1: Parse the incoming request
  const { query } = await request.json();

  if (!query || typeof query !== 'string') {
    return Response.json(
      { error: 'Query is required' },
      { status: 400 }
    );
  }

  // Step 2: Generate query embedding
  // Note: Use RETRIEVAL_QUERY task type
  const queryEmbedding = await generateEmbedding(query, 'RETRIEVAL_QUERY');

  // Step 3: Search the vector database
  const { data: results, error } = await supabase.rpc('search_docs', {
    query_embedding: queryEmbedding,
    match_count: 5,
    similarity_threshold: 0.6
  });

  if (error) {
    console.error('Search error:', error);
    return Response.json(
      { error: 'Search failed' },
      { status: 500 }
    );
  }

  // Step 4: Return results
  return Response.json({
    results: (results ?? []).map((r: { content: string; source: string; title: string; similarity: number }) => ({
      content: r.content,
      source: r.source,
      title: r.title,
      similarity: r.similarity
    })),
    query: query,
    embedding_dimensions: queryEmbedding.length
  });
}
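From the browser, calling this route is a single fetch. A hypothetical client-side call whose endpoint and response shape match the route above:

// Hypothetical client-side usage of the /api/search route
const res = await fetch('/api/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'How do I configure authentication?' }),
});
const { results } = await res.json();
// Each result: { content, source, title, similarity }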
Architecture Diagram
┌──────────────────────────────────────────────────────────────┐
│ USER'S BROWSER │
│ │
│ "How do I configure authentication?" │
└───────────────────────────┬──────────────────────────────────┘
│ POST /api/search
▼
┌──────────────────────────────────────────────────────────────┐
│ NEXT.JS API ROUTE │
│ │
│ 1. Parse query from request body │
│ 2. Validate input │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ GEMINI EMBEDDING API │ │
│ │ "How do I configure authentication?" │ │
│ │ ↓ │ │
│ │ [0.023, -0.145, 0.087, ..., 0.234] (768 dims) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ SUPABASE RPC CALL │ │
│ │ search_docs(embedding, 5, 0.6) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
└────────────────────────────┼─────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ SUPABASE │
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ SELECT content, source, similarity │ │
│ │ FROM documents │ │
│ │ WHERE similarity >= 0.6 │ │
│ │ ORDER BY embedding <=> query_embedding │ │
│ │ LIMIT 5 │ │
│ └───────────────────────────────────────────────────┘ │
│ │
│ Vector Index (IVFFlat) → Fast approximate search │
│ │
│ Results: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 1. "Authentication Configuration" (0.89) │ │
│ │ 2. "Setting Up User Auth" (0.85) │ │
│ │ 3. "OAuth Integration Guide" (0.78) │ │
│ │ 4. "Security Best Practices" (0.72) │ │
│ │ 5. "API Authentication Tokens" (0.68) │ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ RESPONSE TO CLIENT │
│ │
│ { │
│ "results": [ │
│ { "content": "...", "source": "auth.md", sim: 0.89 }, │
│ { "content": "...", "source": "setup.md", sim: 0.85 }, │
│ ... │
│ ] │
│ } │
└──────────────────────────────────────────────────────────────┘
Timing Breakdown
Typical latencies for each step:
| Step | Typical Latency |
|---|---|
| Parse request | < 1ms |
| Generate embedding | 50-200ms |
| Vector search (indexed) | 10-50ms |
| Network overhead | 20-100ms |
| Total | 100-400ms |
Embedding generation is usually the slowest step. For interactive applications, this is acceptable but worth optimizing if needed (see Module 5).
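As one example of such an optimization, repeated queries can skip the embedding call entirely. A minimal sketch, assuming the generateEmbedding helper from the route above:

// In-memory cache for query embeddings; repeated queries skip the 50-200ms API call
const embeddingCache = new Map<string, number[]>();

async function cachedQueryEmbedding(query: string): Promise<number[]> {
  const key = query.trim().toLowerCase();
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await generateEmbedding(key, 'RETRIEVAL_QUERY');
  embeddingCache.set(key, embedding);
  return embedding;
}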
Context Extraction
From Search Results to Context
The search returns chunks with metadata. We need to extract and format this for the LLM:
interface SearchResult {
content: string;
source: string;
title: string;
similarity: number;
}
function buildContext(results: SearchResult[]): string {
if (results.length === 0) {
return 'No relevant information found.';
}
return results
.map((r, i) => {
return `[Document ${i + 1}]
Source: ${r.source}
Title: ${r.title}
Relevance: ${(r.similarity * 100).toFixed(0)}%
${r.content}`;
})
.join('\n\n---\n\n');
}
Example Output:
[Document 1]
Source: authentication.md
Title: Setting Up Authentication
Relevance: 89%
To configure authentication in your application, you need to first
create an auth provider in your Supabase dashboard...
---
[Document 2]
Source: security-guide.md
Title: Security Best Practices
Relevance: 85%
Always use environment variables for your API keys. Never commit
credentials to version control...
Context Structuring Strategies
Numbered Documents: Clearly delineate each source. This helps the LLM recognize that there are multiple sources and lets it reference them by number.
Relevance Indicators: Including similarity scores (as "Relevance") helps the LLM prioritize information from more relevant sources.
Source Attribution: Including source and title enables the LLM to cite sources in its response.
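These strategies interact with the context-window budget discussed under "Choosing K". A hedged sketch of a budget-aware variant of buildContext, using a rough 4-characters-per-token estimate (an approximation, not an exact token count):

function buildBudgetedContext(results: SearchResult[], maxTokens = 2000): string {
  const estimateTokens = (text: string) => Math.ceil(text.length / 4); // rough heuristic

  const parts: string[] = [];
  let used = 0;

  for (const [i, r] of results.entries()) {
    const block = `[Document ${i + 1}]\nSource: ${r.source}\nTitle: ${r.title}\n\n${r.content}`;
    const cost = estimateTokens(block);
    if (used + cost > maxTokens) break; // stop before exceeding the budget
    parts.push(block);
    used += cost;
  }

  return parts.join('\n\n---\n\n');
}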
Handling No Results
When retrieval returns nothing (all below threshold), handle gracefully:
// formatResult applies the per-document template from the earlier buildContext example
function formatResult(r: SearchResult, i: number): string {
  return `[Document ${i + 1}]
Source: ${r.source}
Title: ${r.title}
Relevance: ${(r.similarity * 100).toFixed(0)}%

${r.content}`;
}

function buildContext(results: SearchResult[]): {
  context: string;
  hasResults: boolean;
} {
  if (results.length === 0) {
    return {
      context: '',
      hasResults: false
    };
  }

  return {
    context: results.map(formatResult).join('\n\n---\n\n'),
    hasResults: true
  };
}
// In the API route
const { context, hasResults } = buildContext(searchResults);
if (!hasResults) {
// Option 1: Tell the user directly
return Response.json({
answer: "I couldn't find relevant information in the knowledge base to answer this question.",
sources: []
});
// Option 2: Let the LLM respond without context
// (may hallucinate, use with caution)
}
Summary
In this lesson, we explored the science and practice of retrieval:
Key Takeaways:
- Cosine similarity measures semantic direction: Perfect for comparing text embeddings regardless of length
- Top-K retrieval balances coverage and focus: Start with K=5 and adjust based on quality
- Similarity thresholds prevent irrelevant results: Use 0.5-0.7 for most applications
- The retrieval flow is straightforward: Query → Embed → Search → Format context
- Context extraction requires thoughtful formatting: Include metadata for attribution and structure for clarity
Next Steps
We have our context. Now we need to use it effectively. In the next lesson, we'll explore Prompt Engineering for Grounding—how to construct prompts that ensure the LLM answers based on retrieved context rather than its training knowledge.
"The right context makes all the difference between a helpful answer and a harmful hallucination." — Unknown

