The Role of Vector Embeddings
Introduction
In the previous lesson, we learned that RAG systems retrieve relevant information before generating responses. But how do we find "relevant" information? If a user asks "How do I reset my password?", we need to find documentation about password resets—even if the exact phrase "reset my password" doesn't appear in our documents.
The answer lies in vector embeddings—numerical representations of text that capture semantic meaning. This lesson explores how embeddings work, why they're essential for RAG, and how we'll use Google's Gemini embedding model in our applications.
Understanding Embeddings
From Words to Numbers
Computers work with numbers, not words. To process text computationally, we need to convert it into numerical form. But not all numerical representations are equal.
Naive Approach: Character Codes
"cat" → [99, 97, 116] (ASCII codes)
"dog" → [100, 111, 103]
This tells us nothing about meaning. "Cat" and "dog" are semantically similar (both animals, pets), but their numerical representations are arbitrary.
Better Approach: One-Hot Encoding
Vocabulary: [cat, dog, car, house]
"cat" → [1, 0, 0, 0]
"dog" → [0, 1, 0, 0]
Still no semantic information. Every word is equally different from every other word.
Best Approach: Embeddings
"cat" → [0.23, -0.45, 0.12, ..., 0.78] (768 dimensions)
"dog" → [0.25, -0.42, 0.15, ..., 0.76] (768 dimensions)
"car" → [-0.67, 0.23, 0.89, ..., -0.34] (768 dimensions)
Now "cat" and "dog" have similar vectors because they're semantically similar, while "car" is quite different.
What Are Vector Embeddings?
A vector embedding is a list of numbers (typically hundreds or thousands) that represents the meaning of a piece of text. Each number in the list is a dimension, and together they define a point in high-dimensional space.
Key Properties:
- Semantic Similarity = Vector Similarity
  - Similar meanings produce similar vectors
  - "Laptop" and "computer" will be close together
  - "Laptop" and "banana" will be far apart
- Context Matters
  - The same word can have different embeddings in different contexts
  - "Bank" (financial) vs. "bank" (river) would embed differently
- Dense Representations
  - Unlike sparse representations (one-hot encoding), embeddings pack meaning into every dimension
  - More efficient storage and computation
Dimensionality: What the Numbers Mean
Embedding vectors typically have hundreds of dimensions. For example, Gemini's text-embedding-004 produces 768-dimensional vectors.
Each dimension captures some aspect of meaning, though not in a human-interpretable way. You might imagine dimensions for:
- Is this about technology?
- Is this formal or casual?
- Does this involve physical objects?
- Is there emotional content?
In reality, the dimensions emerge from training and don't map cleanly to human concepts. But they capture patterns that enable semantic comparison.
Visualizing High-Dimensional Space
Humans can visualize up to 3 dimensions. How do we think about 768 dimensions?
Mental Model: Semantic Neighborhoods
Think of embedding space as a city where:
- Related concepts are neighbors
- Documents about cooking cluster in one area
- Documents about programming cluster elsewhere
- Some areas (like "machine learning for cooking") bridge neighborhoods
When we search for "how to make pasta", we're finding documents that live in the same neighborhood—not because they contain the exact words, but because they're about the same topic.
Dimensionality Reduction (for visualization)
Tools like t-SNE and UMAP can project high-dimensional embeddings into 2D or 3D for visualization. While this loses information, it helps build intuition:
768D → 2D projection
[Programming topics cluster together]
[Cooking topics cluster separately]
[ML + cooking topics appear between clusters]
LLMs vs. Embedding Models
It's crucial to distinguish between generative models (like Gemini for text generation) and embedding models (like text-embedding-004).
Generative Models
Purpose: Generate text based on input
How they work:
- Take input text
- Predict the next token (word/subword)
- Repeat until the response is complete
Examples:
- Gemini (gemini-1.5-pro, gemini-1.5-flash)
- GPT-4, GPT-3.5
- Claude
Use in RAG: Generation phase—creating the final response
Embedding Models
Purpose: Convert text to vector representations
How they work:
- Take input text
- Process through transformer layers
- Output a fixed-size vector representing meaning
Examples:
- Gemini text-embedding-004
- OpenAI text-embedding-3-small
- Cohere embed-v3
Use in RAG: Indexing and retrieval phases—vectorizing documents and queries
Why Different Models?
You might wonder: if generative models understand text, why not use them for embeddings?
Specialized Training: Embedding models are trained specifically for the task of producing meaningful vector representations. They optimize for a different objective—making similar texts have similar vectors.
Efficiency: Embedding models are smaller and faster. Generating a 768-dimensional vector is much quicker than generating a full text response.
Cost: Embedding API calls are significantly cheaper than generation calls.
| Operation | Model | Approximate Cost |
|---|---|---|
| Embed 1000 tokens | text-embedding-004 | $0.001 |
| Generate 1000 tokens | gemini-1.5-pro | ~$0.01 |
For indexing millions of document chunks, this cost difference is substantial.
The Gemini Embedding Model
For this course, we'll use Google's text-embedding-004 model for creating embeddings.
Model Specifications
| Property | Value |
|---|---|
| Model Name | text-embedding-004 |
| Output Dimensions | 768 |
| Max Input Tokens | 2048 |
| Supported Task Types | RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION |
Task Types Explained
Gemini's embedding model supports different task types that optimize the embedding for specific use cases:
RETRIEVAL_DOCUMENT: Use when embedding documents that will be searched.
const embedding = await model.embedContent({
content: { parts: [{ text: documentChunk }] },
taskType: "RETRIEVAL_DOCUMENT"
});
RETRIEVAL_QUERY: Use when embedding user queries for search.
const embedding = await model.embedContent({
content: { parts: [{ text: userQuestion }] },
taskType: "RETRIEVAL_QUERY"
});
Why different task types?
Queries and documents are different in nature:
- Queries are short, informal questions
- Documents are longer, more formal explanations
The model adjusts its embedding strategy based on task type, improving retrieval quality. Always use RETRIEVAL_DOCUMENT for indexing and RETRIEVAL_QUERY for search queries.
Conceptual API Usage
Here's what interacting with the embedding API looks like conceptually:
import { GoogleGenerativeAI, TaskType } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Embedding a document chunk
async function embedDocument(text: string): Promise<number[]> {
  const result = await model.embedContent({
    content: { role: "user", parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_DOCUMENT,
  });
  return result.embedding.values; // [0.023, -0.145, ...]
}

// Embedding a user query
async function embedQuery(text: string): Promise<number[]> {
  const result = await model.embedContent({
    content: { role: "user", parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_QUERY,
  });
  return result.embedding.values;
}
The result is a 768-dimensional array of floating-point numbers.
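As a quick sanity check, calling one of these helpers returns a plain numeric array (the sample text and logged values below are illustrative):
// Illustrative usage: embed a chunk and inspect the vector's shape
const vector = await embedDocument("To reset your password, open Account Settings...");
console.log(vector.length);      // 768
console.log(vector.slice(0, 3)); // e.g. [0.023, -0.145, 0.067]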
Token Limits and Chunking
The model accepts up to 2048 tokens per request. Tokens are roughly word-pieces:
- "authentication" = 1-2 tokens
- "How do I authenticate?" = ~5 tokens
For most use cases, this limit is generous. But it reinforces why we chunk documents—large documents must be split to fit within the token limit.
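As a rough sketch, assuming the common heuristic of roughly four characters per token, a simple character-based splitter keeps each chunk comfortably under the limit (real chunkers split on paragraph or sentence boundaries rather than fixed offsets):
// Naive splitter: 2048 tokens ≈ 8000 characters, so 4000 characters leaves headroom.
function splitIntoChunks(text: string, maxChars = 4000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}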
Vector Similarity: How Search Works
Once we have embeddings, we need to compare them. This is where similarity metrics come in.
Cosine Similarity
The most common similarity metric for embeddings is cosine similarity. It measures the angle between two vectors, ignoring their magnitude.
Mathematical Definition:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product
- ||A|| is the magnitude (length) of vector A
Intuition:
- Identical vectors: similarity = 1
- Perpendicular vectors: similarity = 0
- Opposite vectors: similarity = -1
For embeddings, we typically see values between 0 and 1, where:
- 0.9+ = very similar (probably the same topic)
- 0.7-0.9 = related content
- 0.5-0.7 = loosely related
- Below 0.5 = probably unrelated
Why Cosine Similarity?
Direction matters, magnitude doesn't:
Two documents about "machine learning" should be similar regardless of their length. Cosine similarity compares direction (what the text is about) while ignoring magnitude (how much text there is).
Efficient computation:
Modern databases and vector stores can compute cosine similarity efficiently, enabling fast search over millions of vectors.
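The example in the next section assumes a cosineSimilarity helper. A minimal sketch implementing the formula above looks like this (in production the vector database computes this for you; an in-application version is mainly useful for tests and small datasets):
// Cosine similarity: (A · B) / (||A|| × ||B||)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}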
Example: Similarity in Practice
// Query embedding
const query = "How do I reset my password?";
const queryEmbedding = await embedQuery(query);
// Document embeddings (from database)
const documents = [
{ id: 1, text: "Password Reset Guide...", embedding: [...] },
{ id: 2, text: "Account Settings...", embedding: [...] },
{ id: 3, text: "API Authentication...", embedding: [...] }
];
// Calculate similarities
const similarities = documents.map(doc => ({
id: doc.id,
text: doc.text,
similarity: cosineSimilarity(queryEmbedding, doc.embedding)
}));
// Results:
// { id: 1, text: "Password Reset Guide...", similarity: 0.91 } ← Best match
// { id: 2, text: "Account Settings...", similarity: 0.76 }
// { id: 3, text: "API Authentication...", similarity: 0.45 }
Other Similarity Metrics
While cosine similarity is most common, others exist:
Euclidean Distance: Measures straight-line distance between vectors. Lower is more similar.
Dot Product: Raw dot product without normalization. Faster but affected by vector magnitude.
For RAG applications, cosine similarity is almost always the right choice. pgvector supports all three; we'll use cosine similarity, which pgvector exposes through its cosine distance operator <=> (cosine distance is 1 minus cosine similarity, so the nearest vectors are the most similar).
Embeddings in the RAG Pipeline
Let's connect embeddings to the full RAG system:
Indexing Phase
Documents → Chunks → Embeddings → Vector Database
1. Split "Getting Started Guide" into chunks
2. For each chunk:
- Call Gemini with taskType: "RETRIEVAL_DOCUMENT"
- Get 768-dimensional embedding
- Store chunk text + embedding in Supabase
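A sketch of this loop, reusing embedDocument and splitIntoChunks from earlier, might look like the following (the environment variable names, the documents table, and its content and embedding columns are assumptions; adjust them to your schema):
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function indexDocument(fullText: string) {
  const chunks = splitIntoChunks(fullText);
  for (const chunk of chunks) {
    const embedding = await embedDocument(chunk); // taskType: RETRIEVAL_DOCUMENT
    const { error } = await supabase.from("documents").insert({ content: chunk, embedding });
    if (error) throw error;
  }
}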
Retrieval Phase
User Query → Query Embedding → Vector Search → Top-K Results
1. User asks: "How do I install the SDK?"
2. Call Gemini with taskType: "RETRIEVAL_QUERY"
3. Search database for nearest embeddings (cosine similarity)
4. Return top 5 most similar chunks
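A corresponding retrieval sketch might look like this; match_documents is a hypothetical Postgres function you would define in Supabase to order rows by cosine distance and return the top k:
async function retrieveChunks(question: string, k = 5) {
  const queryEmbedding = await embedQuery(question); // taskType: RETRIEVAL_QUERY
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: k,
  });
  if (error) throw error;
  return data; // top-k chunks with similarity scores
}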
Why This Works
The magic is that semantically similar content produces similar embeddings, regardless of exact wording.
| Query | Retrieved Document | Why it matches |
|---|---|---|
| "reset password" | "Changing your password" | Same concept, different words |
| "API auth" | "Authentication guide" | Semantic understanding |
| "error 404" | "Handling not found responses" | Contextual connection |
Traditional keyword search would miss many of these matches. Embedding-based search finds them because it understands meaning, not just words.
Practical Considerations
Embedding Quality
Not all embedding models are equal. Quality factors include:
- Training data (more diverse = better generalization)
- Model architecture (newer models often perform better)
- Dimensionality (more dimensions can capture more nuance)
Gemini's text-embedding-004 is a strong choice that balances quality, cost, and ease of use.
Batch Processing
When indexing many documents, batch your embedding requests:
// Instead of:
for (const chunk of chunks) {
const embedding = await embedDocument(chunk); // One request per chunk
}
// Better:
const embeddings = await embedDocuments(chunks); // Single batched request
This reduces API overhead and round trips; depending on the provider, batched requests may also qualify for discounted batch pricing.
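The embedDocuments helper above isn't part of the SDK; one way to implement it is with the SDK's batchEmbedContents method, which bundles many embedding requests into a single call:
// A possible embedDocuments implementation built on batchEmbedContents.
// Each entry in "requests" mirrors a single embedContent call.
async function embedDocuments(texts: string[]): Promise<number[][]> {
  const result = await model.batchEmbedContents({
    requests: texts.map((text) => ({
      content: { role: "user", parts: [{ text }] },
      taskType: TaskType.RETRIEVAL_DOCUMENT,
    })),
  });
  return result.embeddings.map((e) => e.values);
}
Note that batched calls still have a per-request size limit, so very large corpora need to be processed in slices.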
Caching Considerations
Document embeddings are deterministic—the same text always produces the same embedding. This means:
- You only need to embed documents once
- Store embeddings in the database alongside text
- Re-embed only when document content changes
Query embeddings, however, happen at request time and typically aren't cached (each query is unique).
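A common way to honor "re-embed only when document content changes" is to store a hash of each chunk's text alongside its embedding. This is a sketch, assuming you persist and look up that hash yourself:
import { createHash } from "node:crypto";

// Hash the chunk text; if the stored hash matches, the existing embedding is still valid.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function embedIfChanged(chunk: string, storedHash?: string) {
  const hash = contentHash(chunk);
  if (hash === storedHash) return null;         // unchanged: keep the existing embedding
  const embedding = await embedDocument(chunk); // new or changed: re-embed
  return { embedding, hash };
}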
Summary
In this lesson, we've explored vector embeddings in depth:
Key Takeaways:
- Embeddings capture meaning: They convert text into numbers while preserving semantic relationships
- Embedding models differ from generative models: Specialized for creating vector representations, faster and cheaper
- Gemini's text-embedding-004: Our embedding model, producing 768-dimensional vectors with task-type optimization
- Cosine similarity enables search: Measures semantic similarity between embeddings
- Embeddings bridge query and document: Different words, same meaning still match
Next Steps
In the next lesson, we'll see how all these pieces fit together in our architectural overview. You'll understand exactly how Next.js, Supabase, and Gemini work together to create a complete RAG system, and where each component's responsibilities lie.
"A good representation is worth a thousand algorithms." — Unknown

