Module 2: Embeddings - Turning Text Into Numbers
The Bridge Between Human Language and Machine Understanding
Introduction
Embeddings are the foundation of everything we do with vector databases. They're the technology that converts human concepts—text, images, audio—into the numerical representations that machines can compare and search.
By the end of this module, you'll understand:
- How embedding models work (at a practical level)
- How to generate embeddings using popular APIs
- How to choose the right embedding model
- Common pitfalls and best practices
2.1 What Are Embeddings?
From Text to Numbers
An embedding model takes text and converts it into a fixed-size array of floating-point numbers:
const text = "The quick brown fox jumps over the lazy dog"
const embedding = await embed(text)
// [0.023, -0.156, 0.892, 0.045, -0.234, ...] (1536 numbers)
Key properties:
- Fixed size: Every input produces the same number of dimensions
- Dense: Most values are non-zero (unlike sparse keyword vectors)
- Semantic: Similar meanings produce similar vectors
- Learned: The model learns what makes content similar during training
Why Embeddings Work
Embedding models are trained on massive amounts of text to learn relationships:
"king" - "man" + "woman" ≈ "queen"
This famous example, which comes from word-level embedding models such as word2vec, shows that embeddings capture semantic relationships, not just word presence. Modern sentence embedding models extend the same idea to whole sentences and passages.
In practice:
- "How do I reset my password?" ≈ "I forgot my login credentials"
- "Best restaurants in Paris" ≈ "Where to eat in the French capital"
- "Python programming tutorial" ≈ "Learn to code with Python"
2.2 Generating Embeddings
Using OpenAI Embeddings
One of the most common choices for production applications:
import OpenAI from 'openai'
const openai = new OpenAI()
async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  })
  return response.data[0].embedding
}
// Usage
const embedding = await getEmbedding("What is machine learning?")
console.log(`Dimensions: ${embedding.length}`) // 1536
Batch Embeddings
For efficiency, embed multiple texts at once:
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  })
  return response.data.map(d => d.embedding)
}
// Usage
const documents = [
  "Introduction to vector databases",
  "How embeddings work",
  "Similarity search explained"
]
const embeddings = await getEmbeddings(documents)
Using Cohere Embeddings
Another popular option with strong multilingual support:
import { CohereClient } from 'cohere-ai'
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY })
async function getEmbedding(text: string): Promise<number[]> {
  const response = await cohere.embed({
    texts: [text],
    model: 'embed-english-v3.0',
    inputType: 'search_document' // or 'search_query'
  })
  return response.embeddings[0]
}
Note: Cohere distinguishes between document and query embeddings—use search_document when indexing and search_query when searching.
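For example, the query side might look like this. This is a sketch using the same client as above; embedQuery is just an illustrative name:
async function embedQuery(query: string): Promise<number[]> {
  const response = await cohere.embed({
    texts: [query],
    model: 'embed-english-v3.0',
    inputType: 'search_query' // query-side counterpart to 'search_document'
  })
  return response.embeddings[0]
}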
Using Local Models
For privacy or cost reasons, you might want local embeddings:
# Python example using sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()
# Usage
embedding = get_embedding("What is machine learning?")
print(f"Dimensions: {len(embedding)}") # 384
2.3 Choosing an Embedding Model
Model Comparison
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1,536 | Fast | Good | $ |
| text-embedding-3-large (OpenAI) | 3,072 | Medium | Best | $$ |
| embed-english-v3.0 (Cohere) | 1,024 | Fast | Good | $ |
| embed-multilingual-v3.0 (Cohere) | 1,024 | Fast | Good | $ |
| all-MiniLM-L6-v2 (Open Source) | 384 | Very Fast | Fair | Free |
| all-mpnet-base-v2 (Open Source) | 768 | Fast | Good | Free |
Decision Factors
1. Quality vs. Cost
- For production with budget: text-embedding-3-small
- For maximum quality: text-embedding-3-large
- For development/testing: local models
2. Multilingual Requirements
- English only: Most models work well
- Multilingual: Cohere multilingual or OpenAI (good for many languages)
3. Privacy & Latency
- Sensitive data: Local models
- Low latency requirements: Local or edge deployment
4. Dimension Considerations
- Higher dimensions generally mean better quality, but also more storage and slower search
- OpenAI's text-embedding-3 models accept a dimensions parameter: use 256-512 for faster search
// OpenAI dimension reduction
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: text,
  dimensions: 512 // Reduce from 1536 to 512
})
2.4 Embedding Best Practices
1. Consistency is Key
Always use the same model for indexing and querying.
// WRONG: Using different models
const docEmbedding = await embedWithOpenAI(doc) // 1536 dims
const queryEmbedding = await embedWithCohere(query) // 1024 dims
// These can't be compared!
// RIGHT: Same model for everything
const docEmbedding = await embedWithOpenAI(doc)
const queryEmbedding = await embedWithOpenAI(query)
// These are comparable
2. Chunking Long Documents
Embedding models have token limits (typically 8,192 tokens). Long documents need chunking:
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(' ')
  const chunks: string[] = []
  // Slide a window of chunkSize words, stepping by (chunkSize - overlap)
  // so consecutive chunks share `overlap` words of context
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '))
  }
  return chunks
}
// Embed each chunk separately
const chunks = chunkText(longDocument)
const embeddings = await Promise.all(chunks.map(getEmbedding))
Chunking strategies:
- Fixed size (simple, might split sentences)
- Sentence-based (respects boundaries; see the sketch after this list)
- Semantic (split on topic changes)
- Recursive (try large, then smaller)
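For example, a simple sentence-based chunker might group whole sentences until a rough size budget is reached. A minimal sketch; the sentence split is intentionally naive:
function chunkBySentence(text: string, maxChars = 1000): string[] {
  // Naive split on ., !, ? followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/)
  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    // Start a new chunk when the next sentence would exceed the budget
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim())
      current = ''
    }
    current += sentence + ' '
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}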
3. Preprocessing Text
Clean your text before embedding:
function preprocessText(text: string): string {
  return text
    .toLowerCase()        // Optional: normalize case
    .replace(/\s+/g, ' ') // Normalize whitespace
    .trim()
}
But don't over-process: aggressive cleaning can strip nuance (casing, punctuation, phrasing) that the embedding model would otherwise use.
4. Handling Different Content Types
Different content types might need different approaches:
// Code: Keep structure
const codeEmbedding = await embed(`
  function add(a, b) {
    return a + b;
  }
`)
// FAQ: Combine question and answer
const faqEmbedding = await embed(
  `Question: ${question}\nAnswer: ${answer}`
)
// Product: Include key attributes
const productEmbedding = await embed(
  `${product.name}. ${product.description}.
   Category: ${product.category}.
   Features: ${product.features.join(', ')}`
)
2.5 Common Pitfalls
1. Embedding Mismatch
// Don't mix models!
// Your index: OpenAI embeddings
// Your query: Cohere embeddings
// Result: Meaningless similarity scores
2. Ignoring Token Limits
// This will fail or truncate
const hugeText = "...".repeat(100000)
const embedding = await embed(hugeText) // Token limit exceeded!
// Solution: Chunk first
const chunks = chunkText(hugeText)
3. Not Batching
// Slow: One API call per document
for (const doc of documents) {
  const embedding = await embed(doc) // 1,000 documents = 1,000 API calls!
}
// Fast: Batch API calls (like getEmbeddings above, or the pipeline in Section 2.6)
const embeddings = await embedBatch(documents) // ~10 batched calls instead of 1,000
4. Forgetting About Updates
When you change your embedding model, you need to re-embed everything:
// Migration plan:
// 1. Create new index with new model embeddings
// 2. Backfill all documents
// 3. Switch queries to new index
// 4. Delete old index
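If you want a starting point for step 2, here is a minimal backfill sketch. It reuses the batched getEmbeddings helper from Section 2.2; VectorIndex is a hypothetical stand-in for whatever vector database client you use, not a real API:
// Hypothetical interface standing in for your vector database client
interface VectorIndex {
  upsert(items: { id: string; embedding: number[] }[]): Promise<void>
}

async function backfill(
  docs: { id: string; content: string }[],
  newIndex: VectorIndex
): Promise<void> {
  // Re-embed all documents with the new model, then write them to the new index
  const embeddings = await getEmbeddings(docs.map(d => d.content))
  await newIndex.upsert(
    docs.map((d, i) => ({ id: d.id, embedding: embeddings[i] }))
  )
}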
2.6 Practical Example: Building an Embedding Pipeline
Here's a complete example of an embedding pipeline:
import OpenAI from 'openai'
const openai = new OpenAI()
interface Document {
  id: string
  content: string
  metadata: Record<string, any>
}

interface EmbeddedDocument {
  id: string
  embedding: number[]
  metadata: Record<string, any>
}

const BATCH_SIZE = 100
const MODEL = 'text-embedding-3-small'

async function embedDocuments(
  documents: Document[]
): Promise<EmbeddedDocument[]> {
  const results: EmbeddedDocument[] = []

  // Process in batches
  for (let i = 0; i < documents.length; i += BATCH_SIZE) {
    const batch = documents.slice(i, i + BATCH_SIZE)

    // Get embeddings for batch
    const response = await openai.embeddings.create({
      model: MODEL,
      input: batch.map(d => d.content),
    })

    // Combine with metadata
    for (let j = 0; j < batch.length; j++) {
      results.push({
        id: batch[j].id,
        embedding: response.data[j].embedding,
        metadata: batch[j].metadata,
      })
    }

    console.log(`Processed ${Math.min(i + BATCH_SIZE, documents.length)}/${documents.length}`)
  }

  return results
}
// Usage
const documents: Document[] = [
  { id: '1', content: 'Introduction to ML', metadata: { category: 'ai' } },
  { id: '2', content: 'Web development basics', metadata: { category: 'web' } },
  // ...
]
const embedded = await embedDocuments(documents)
// Now ready to insert into vector database
Key Takeaways
- Embeddings convert meaning into numbers that machines can compare
- Use the same model for both indexing and querying
- Chunk long documents to stay within token limits
- Batch API calls for efficiency
- Choose your model based on quality, cost, and requirements
Exercise: Generate Your First Embeddings
- Sign up for an OpenAI API key (if you don't have one)
- Generate embeddings for 5 similar sentences and 5 different sentences
- Calculate cosine similarity between pairs
- Verify that similar sentences have higher similarity scores
// Starter code
import OpenAI from 'openai'
const openai = new OpenAI()
const similar = [
  "I love programming",
  "Coding is my passion",
  "I enjoy writing software",
]
const different = [
  "The weather is nice today",
  "I need to buy groceries",
]
// Your task: Embed and compare
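One possible way to finish the comparison. A sketch that embeds everything in a single batched call and reuses the cosineSimilarity helper from Section 2.1:
const all = [...similar, ...different]
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: all,
})
const vectors = response.data.map(d => d.embedding)

// Compare a paraphrase pair against an unrelated pair
console.log('similar pair:', cosineSimilarity(vectors[0], vectors[1]))
console.log('unrelated pair:', cosineSimilarity(vectors[0], vectors[3]))
// Expect the similar pair to score noticeably higher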
Next up: Module 3 - Similarity Search Explained

