The Role of Vector Embeddings
Introduction
In the previous lesson, we learned that RAG systems retrieve relevant information before generating responses. But how do we find "relevant" information? If a user asks "How do I reset my password?", we need to find documentation about password resets—even if the exact phrase "reset my password" doesn't appear in our documents.
The answer lies in vector embeddings—numerical representations of text that capture semantic meaning. This lesson explores how embeddings work, why they're essential for RAG, and how we'll use Google's Gemini embedding model in our applications.
Understanding Embeddings
From Words to Numbers
Computers work with numbers, not words. To process text computationally, we need to convert it into numerical form. But not all numerical representations are equal.
Naive Approach: Character Codes
"cat" → [99, 97, 116] (ASCII codes)
"dog" → [100, 111, 103]
This tells us nothing about meaning. "Cat" and "dog" are semantically similar (both animals, pets), but their numerical representations are arbitrary.
Better Approach: One-Hot Encoding
Vocabulary: [cat, dog, car, house]
"cat" → [1, 0, 0, 0]
"dog" → [0, 1, 0, 0]
Still no semantic information. Every word is equally different from every other word.
Best Approach: Embeddings
"cat" → [0.23, -0.45, 0.12, ..., 0.78] (768 dimensions)
"dog" → [0.25, -0.42, 0.15, ..., 0.76] (768 dimensions)
"car" → [-0.67, 0.23, 0.89, ..., -0.34] (768 dimensions)
Now "cat" and "dog" have similar vectors because they're semantically similar, while "car" is quite different.
What Are Vector Embeddings?
A vector embedding is a list of numbers (typically hundreds or thousands) that represents the meaning of a piece of text. Each number in the list is a dimension, and together they define a point in high-dimensional space.
Key Properties:
- Semantic Similarity = Vector Similarity
  - Similar meanings produce similar vectors
  - "Laptop" and "computer" will be close together
  - "Laptop" and "banana" will be far apart
- Context Matters
  - The same word can have different embeddings in different contexts
  - "Bank" (financial) vs. "bank" (river) would embed differently
- Dense Representations
  - Unlike sparse representations (one-hot encoding), embeddings pack meaning into every dimension
  - More efficient storage and computation
Dimensionality: What the Numbers Mean
Embedding vectors typically have hundreds of dimensions. For example, Gemini's text-embedding-004 produces 768-dimensional vectors.
Each dimension captures some aspect of meaning, though not in a human-interpretable way. You might imagine dimensions for:
- Is this about technology?
- Is this formal or casual?
- Does this involve physical objects?
- Is there emotional content?
In reality, the dimensions emerge from training and don't map cleanly to human concepts. But they capture patterns that enable semantic comparison.
Visualizing High-Dimensional Space
Humans can visualize up to 3 dimensions. How do we think about 768 dimensions?
Mental Model: Semantic Neighborhoods
Think of embedding space as a city where:
- Related concepts are neighbors
- Documents about cooking cluster in one area
- Documents about programming cluster elsewhere
- Some areas (like "machine learning for cooking") bridge neighborhoods
When we search for "how to make pasta", we're finding documents that live in the same neighborhood—not because they contain the exact words, but because they're about the same topic.
Dimensionality Reduction (for visualization)
Tools like t-SNE and UMAP can project high-dimensional embeddings into 2D or 3D for visualization. While this loses information, it helps build intuition:
768D → 2D projection
[Programming topics cluster together]
[Cooking topics cluster separately]
[ML + cooking topics appear between clusters]
LLMs vs. Embedding Models
It's crucial to distinguish between generative models (like Gemini for text generation) and embedding models (like text-embedding-004).
Generative Models
Purpose: Generate text based on input
How they work:
- Take input text
- Predict the next token (word/subword)
- Repeat until the response is complete
Examples:
- Gemini (gemini-1.5-pro, gemini-1.5-flash)
- GPT-4, GPT-3.5
- Claude
Use in RAG: Generation phase—creating the final response
Embedding Models
Purpose: Convert text to vector representations
How they work:
- Take input text
- Process through transformer layers
- Output a fixed-size vector representing meaning
Examples:
- Gemini text-embedding-004
- OpenAI text-embedding-3-small
- Cohere embed-v3
Use in RAG: Indexing and retrieval phases—vectorizing documents and queries
Why Different Models?
You might wonder: if generative models understand text, why not use them for embeddings?
Specialized Training: Embedding models are trained specifically for the task of producing meaningful vector representations. They optimize for a different objective—making similar texts have similar vectors.
Efficiency: Embedding models are smaller and faster. Generating a 768-dimensional vector is much quicker than generating a full text response.
Cost: Embedding API calls are significantly cheaper than generation calls.
| Operation | Model | Approximate Cost |
|---|---|---|
| Embed 1000 tokens | text-embedding-004 | $0.001 |
| Generate 1000 tokens | gemini-1.5-pro | ~$0.01 |
For indexing millions of document chunks, this cost difference is substantial.
The Gemini Embedding Model
For this course, we'll use Google's text-embedding-004 model for creating embeddings.
Model Specifications
| Property | Value |
|---|---|
| Model Name | text-embedding-004 |
| Output Dimensions | 768 |
| Max Input Tokens | 2048 |
| Supported Task Types | RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION |
Task Types Explained
Gemini's embedding model supports different task types that optimize the embedding for specific use cases:
RETRIEVAL_DOCUMENT: Use when embedding documents that will be searched.
const embedding = await model.embedContent({
content: { parts: [{ text: documentChunk }] },
taskType: "RETRIEVAL_DOCUMENT"
});
RETRIEVAL_QUERY: Use when embedding user queries for search.
const embedding = await model.embedContent({
content: { parts: [{ text: userQuestion }] },
taskType: "RETRIEVAL_QUERY"
});
Why different task types?
Queries and documents are different in nature:
- Queries are short, informal questions
- Documents are longer, more formal explanations
The model adjusts its embedding strategy based on task type, improving retrieval quality. Always use RETRIEVAL_DOCUMENT for indexing and RETRIEVAL_QUERY for search queries.
Conceptual API Usage
Here's what interacting with the embedding API looks like conceptually:
import { GoogleGenerativeAI, TaskType } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Embedding a document chunk
async function embedDocument(text: string): Promise<number[]> {
  const result = await model.embedContent({
    content: { role: "user", parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_DOCUMENT,
  });
  return result.embedding.values; // [0.023, -0.145, ...]
}

// Embedding a user query
async function embedQuery(text: string): Promise<number[]> {
  const result = await model.embedContent({
    content: { role: "user", parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_QUERY,
  });
  return result.embedding.values;
}
The result is a 768-dimensional array of floating-point numbers.
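As a quick sanity check, calling one of these helpers returns a plain numeric array (the sample text and logged values below are illustrative):
// Illustrative usage: embed a chunk and inspect the vector's shape
const vector = await embedDocument("To reset your password, open Account Settings...");
console.log(vector.length);      // 768
console.log(vector.slice(0, 3)); // e.g. [0.023, -0.145, 0.067]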
Token Limits and Chunking
The model accepts up to 2048 tokens per request. Tokens are roughly word-pieces:
- "authentication" = 1-2 tokens
- "How do I authenticate?" = ~5 tokens
For most use cases, this limit is generous. But it reinforces why we chunk documents—large documents must be split to fit within the token limit.
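As a rough sketch, assuming the common heuristic of roughly four characters per token, a simple character-based splitter keeps each chunk comfortably under the limit (real chunkers split on paragraph or sentence boundaries rather than fixed offsets):
// Naive splitter: 2048 tokens ≈ 8000 characters, so 4000 characters leaves headroom.
function splitIntoChunks(text: string, maxChars = 4000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}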
Vector Similarity: How Search Works
Once we have embeddings, we need to compare them. This is where similarity metrics come in.
Cosine Similarity
The most common similarity metric for embeddings is cosine similarity. It measures the angle between two vectors, ignoring their magnitude.
Mathematical Definition:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product
- ||A|| is the magnitude (length) of vector A
Intuition:
- Identical vectors: similarity = 1
- Perpendicular vectors: similarity = 0
- Opposite vectors: similarity = -1
For embeddings, we typically see values between 0 and 1, where:
- 0.9+ = very similar (probably the same topic)
- 0.7-0.9 = related content
- 0.5-0.7 = loosely related
- Below 0.5 = probably unrelated
Why Cosine Similarity?
Direction matters, magnitude doesn't:
Two documents about "machine learning" should be similar regardless of their length. Cosine similarity compares direction (what the text is about) while ignoring magnitude (how much text there is).
Efficient computation:
Modern databases and vector stores can compute cosine similarity efficiently, enabling fast search over millions of vectors.
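The example in the next section assumes a cosineSimilarity helper. A minimal sketch implementing the formula above looks like this (in production the vector database computes this for you; an in-application version is mainly useful for tests and small datasets):
// Cosine similarity: (A · B) / (||A|| × ||B||)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}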
Example: Similarity in Practice
// Query embedding
const query = "How do I reset my password?";
const queryEmbedding = await embedQuery(query);
// Document embeddings (from database)
const documents = [
{ id: 1, text: "Password Reset Guide...", embedding: [...] },
{ id: 2, text: "Account Settings...", embedding: [...] },
{ id: 3, text: "API Authentication...", embedding: [...] }
];
// Calculate similarities
const similarities = documents.map(doc => ({
id: doc.id,
text: doc.text,
similarity: cosineSimilarity(queryEmbedding, doc.embedding)
}));
// Results:
// { id: 1, text: "Password Reset Guide...", similarity: 0.91 } ← Best match
// { id: 2, text: "Account Settings...", similarity: 0.76 }
// { id: 3, text: "API Authentication...", similarity: 0.45 }
Other Similarity Metrics
While cosine similarity is most common, others exist:
Euclidean Distance: Measures straight-line distance between vectors. Lower is more similar.
Dot Product: Raw dot product without normalization. Faster but affected by vector magnitude.
For RAG applications, cosine similarity is almost always the right choice. pgvector supports all three; we'll use cosine similarity, which pgvector exposes through its cosine distance operator <=> (cosine distance is 1 minus cosine similarity, so the nearest vectors are the most similar).
Embeddings in the RAG Pipeline
Let's connect embeddings to the full RAG system:
Indexing Phase
Documents → Chunks → Embeddings → Vector Database
1. Split "Getting Started Guide" into chunks
2. For each chunk:
- Call Gemini with taskType: "RETRIEVAL_DOCUMENT"
- Get 768-dimensional embedding
- Store chunk text + embedding in Supabase
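A sketch of this loop, reusing embedDocument and splitIntoChunks from earlier, might look like the following (the environment variable names, the documents table, and its content and embedding columns are assumptions; adjust them to your schema):
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function indexDocument(fullText: string) {
  const chunks = splitIntoChunks(fullText);
  for (const chunk of chunks) {
    const embedding = await embedDocument(chunk); // taskType: RETRIEVAL_DOCUMENT
    const { error } = await supabase.from("documents").insert({ content: chunk, embedding });
    if (error) throw error;
  }
}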
Retrieval Phase
User Query → Query Embedding → Vector Search → Top-K Results
1. User asks: "How do I install the SDK?"
2. Call Gemini with taskType: "RETRIEVAL_QUERY"
3. Search database for nearest embeddings (cosine similarity)
4. Return top 5 most similar chunks
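A corresponding retrieval sketch might look like this; match_documents is a hypothetical Postgres function you would define in Supabase to order rows by cosine distance and return the top k:
async function retrieveChunks(question: string, k = 5) {
  const queryEmbedding = await embedQuery(question); // taskType: RETRIEVAL_QUERY
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: k,
  });
  if (error) throw error;
  return data; // top-k chunks with similarity scores
}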
Why This Works
The magic is that semantically similar content produces similar embeddings, regardless of exact wording.
| Query | Retrieved Document | Why it matches |
|---|---|---|
| "reset password" | "Changing your password" | Same concept, different words |
| "API auth" | "Authentication guide" | Semantic understanding |
| "error 404" | "Handling not found responses" | Contextual connection |
Traditional keyword search would miss many of these matches. Embedding-based search finds them because it understands meaning, not just words.
Practical Considerations
Embedding Quality
Not all embedding models are equal. Quality factors include:
- Training data (more diverse = better generalization)
- Model architecture (newer models often perform better)
- Dimensionality (more dimensions can capture more nuance)
Gemini's text-embedding-004 is a strong choice that balances quality, cost, and ease of use.
Batch Processing
When indexing many documents, batch your embedding requests:
// Instead of:
for (const chunk of chunks) {
const embedding = await embedDocument(chunk); // One request per chunk
}
// Better:
const embeddings = await embedDocuments(chunks); // Single batched request
This reduces API overhead and round trips; depending on the provider, batched requests may also qualify for discounted batch pricing.
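The embedDocuments helper above isn't part of the SDK; one way to implement it is with the SDK's batchEmbedContents method, which bundles many embedding requests into a single call:
// A possible embedDocuments implementation built on batchEmbedContents.
// Each entry in "requests" mirrors a single embedContent call.
async function embedDocuments(texts: string[]): Promise<number[][]> {
  const result = await model.batchEmbedContents({
    requests: texts.map((text) => ({
      content: { role: "user", parts: [{ text }] },
      taskType: TaskType.RETRIEVAL_DOCUMENT,
    })),
  });
  return result.embeddings.map((e) => e.values);
}
Note that batched calls still have a per-request size limit, so very large corpora need to be processed in slices.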
Caching Considerations
Document embeddings are deterministic—the same text always produces the same embedding. This means:
- You only need to embed documents once
- Store embeddings in the database alongside text
- Re-embed only when document content changes
Query embeddings, however, happen at request time and typically aren't cached (each query is unique).
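A common way to honor "re-embed only when document content changes" is to store a hash of each chunk's text alongside its embedding. This is a sketch, assuming you persist and look up that hash yourself:
import { createHash } from "node:crypto";

// Hash the chunk text; if the stored hash matches, the existing embedding is still valid.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function embedIfChanged(chunk: string, storedHash?: string) {
  const hash = contentHash(chunk);
  if (hash === storedHash) return null;         // unchanged: keep the existing embedding
  const embedding = await embedDocument(chunk); // new or changed: re-embed
  return { embedding, hash };
}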
Summary
In this lesson, we've explored vector embeddings in depth:
Key Takeaways:
- Embeddings capture meaning: They convert text into numbers while preserving semantic relationships
- Embedding models differ from generative models: Specialized for creating vector representations, faster and cheaper
- Gemini's text-embedding-004: Our embedding model, producing 768-dimensional vectors with task-type optimization
- Cosine similarity enables search: Measures semantic similarity between embeddings
- Embeddings bridge query and document: Different words, same meaning still match
Next Steps
In the next lesson, we'll see how all these pieces fit together in our architectural overview. You'll understand exactly how Next.js, Supabase, and Gemini work together to create a complete RAG system, and where each component's responsibilities lie.
"A good representation is worth a thousand algorithms." — Unknown

