Module 2: Embeddings - Turning Text Into Numbers
The Bridge Between Human Language and Machine Understanding
Introduction
Embeddings are the foundation of everything we do with vector databases. They're the technology that converts human concepts—text, images, audio—into the numerical representations that machines can compare and search.
By the end of this module, you'll understand:
- How embedding models work (at a practical level)
- How to generate embeddings using popular APIs
- How to choose the right embedding model
- Common pitfalls and best practices
2.1 What Are Embeddings?
From Text to Numbers
An embedding model takes text and converts it into a fixed-size array of floating-point numbers:
const text = "The quick brown fox jumps over the lazy dog"
const embedding = await embed(text)
// [0.023, -0.156, 0.892, 0.045, -0.234, ...] (1536 numbers)
Key properties:
- Fixed size: Every input produces the same number of dimensions
- Dense: Most values are non-zero (unlike sparse keyword vectors)
- Semantic: Similar meanings produce similar vectors
- Learned: The model learns what makes content similar during training
Why Embeddings Work
Embedding models are trained on massive amounts of text to learn relationships:
"king" - "man" + "woman" ≈ "queen"
This famous example, which comes from word-level embedding models such as word2vec, shows that embeddings capture semantic relationships, not just word presence. Modern sentence embedding models extend the same idea to whole sentences and passages.
In practice:
- "How do I reset my password?" ≈ "I forgot my login credentials"
- "Best restaurants in Paris" ≈ "Where to eat in the French capital"
- "Python programming tutorial" ≈ "Learn to code with Python"
2.2 Generating Embeddings
Using OpenAI Embeddings
One of the most common choices for production applications:
import OpenAI from 'openai'
const openai = new OpenAI()
async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  })
  return response.data[0].embedding
}
// Usage
const embedding = await getEmbedding("What is machine learning?")
console.log(`Dimensions: ${embedding.length}`) // 1536
Batch Embeddings
For efficiency, embed multiple texts at once:
async function getEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  })
  return response.data.map(d => d.embedding)
}
// Usage
const documents = [
  "Introduction to vector databases",
  "How embeddings work",
  "Similarity search explained"
]
const embeddings = await getEmbeddings(documents)
Using Cohere Embeddings
Another popular option with strong multilingual support:
import { CohereClient } from 'cohere-ai'
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY })
async function getEmbedding(text: string): Promise<number[]> {
  const response = await cohere.embed({
    texts: [text],
    model: 'embed-english-v3.0',
    inputType: 'search_document' // or 'search_query'
  })
  return response.embeddings[0]
}
Note: Cohere distinguishes between document and query embeddings—use search_document when indexing and search_query when searching.
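For example, the query side might look like this. This is a sketch using the same client as above; embedQuery is just an illustrative name:
async function embedQuery(query: string): Promise<number[]> {
  const response = await cohere.embed({
    texts: [query],
    model: 'embed-english-v3.0',
    inputType: 'search_query' // query-side counterpart to 'search_document'
  })
  return response.embeddings[0]
}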
Using Local Models
For privacy or cost reasons, you might want local embeddings:
# Python example using sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()
# Usage
embedding = get_embedding("What is machine learning?")
print(f"Dimensions: {len(embedding)}") # 384
2.3 Choosing an Embedding Model
Model Comparison
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1,536 | Fast | Good | $ |
| text-embedding-3-large (OpenAI) | 3,072 | Medium | Best | $$ |
| embed-english-v3.0 (Cohere) | 1,024 | Fast | Good | $ |
| embed-multilingual-v3.0 (Cohere) | 1,024 | Fast | Good | $ |
| all-MiniLM-L6-v2 (Open Source) | 384 | Very Fast | Fair | Free |
| all-mpnet-base-v2 (Open Source) | 768 | Fast | Good | Free |
Decision Factors
1. Quality vs. Cost
- For production with budget: text-embedding-3-small
- For maximum quality: text-embedding-3-large
- For development/testing: local models
2. Multilingual Requirements
- English only: Most models work well
- Multilingual: Cohere multilingual or OpenAI (good for many languages)
3. Privacy & Latency
- Sensitive data: Local models
- Low latency requirements: Local or edge deployment
4. Dimension Considerations
- Higher dimensions generally mean better quality, but also more storage and slower search
- OpenAI's text-embedding-3 models accept a dimensions parameter: use 256-512 for faster search
// OpenAI dimension reduction
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: text,
  dimensions: 512 // Reduce from 1536 to 512
})
2.4 Embedding Best Practices
1. Consistency is Key
Always use the same model for indexing and querying.
// WRONG: Using different models
const docEmbedding = await embedWithOpenAI(doc) // 1536 dims
const queryEmbedding = await embedWithCohere(query) // 1024 dims
// These can't be compared!
// RIGHT: Same model for everything
const docEmbedding = await embedWithOpenAI(doc)
const queryEmbedding = await embedWithOpenAI(query)
// These are comparable
2. Chunking Long Documents
Embedding models have token limits (typically 8,192 tokens). Long documents need chunking:
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(' ')
  const chunks: string[] = []
  // Slide a window of chunkSize words, stepping by (chunkSize - overlap)
  // so consecutive chunks share `overlap` words of context
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '))
  }
  return chunks
}
// Embed each chunk separately
const chunks = chunkText(longDocument)
const embeddings = await Promise.all(chunks.map(getEmbedding))
Chunking strategies:
- Fixed size (simple, might split sentences)
- Sentence-based (respects boundaries; see the sketch after this list)
- Semantic (split on topic changes)
- Recursive (try large, then smaller)
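For example, a simple sentence-based chunker might group whole sentences until a rough size budget is reached. A minimal sketch; the sentence split is intentionally naive:
function chunkBySentence(text: string, maxChars = 1000): string[] {
  // Naive split on ., !, ? followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/)
  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    // Start a new chunk when the next sentence would exceed the budget
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim())
      current = ''
    }
    current += sentence + ' '
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}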
3. Preprocessing Text
Clean your text before embedding:
function preprocessText(text: string): string {
  return text
    .toLowerCase()        // Optional: normalize case
    .replace(/\s+/g, ' ') // Normalize whitespace
    .trim()
}
But don't over-process: aggressive cleaning can strip nuance (casing, punctuation, phrasing) that the embedding model would otherwise use.
4. Handling Different Content Types
Different content types might need different approaches:
// Code: Keep structure
const codeEmbedding = await embed(`
  function add(a, b) {
    return a + b;
  }
`)
// FAQ: Combine question and answer
const faqEmbedding = await embed(
  `Question: ${question}\nAnswer: ${answer}`
)
// Product: Include key attributes
const productEmbedding = await embed(
  `${product.name}. ${product.description}.
   Category: ${product.category}.
   Features: ${product.features.join(', ')}`
)
2.5 Common Pitfalls
1. Embedding Mismatch
// Don't mix models!
// Your index: OpenAI embeddings
// Your query: Cohere embeddings
// Result: Meaningless similarity scores
2. Ignoring Token Limits
// This will fail or truncate
const hugeText = "...".repeat(100000)
const embedding = await embed(hugeText) // Token limit exceeded!
// Solution: Chunk first
const chunks = chunkText(hugeText)
3. Not Batching
// Slow: One API call per document
for (const doc of documents) {
  const embedding = await embed(doc) // 1,000 documents = 1,000 API calls!
}
// Fast: Batch API calls (like getEmbeddings above, or the pipeline in Section 2.6)
const embeddings = await embedBatch(documents) // ~10 batched calls instead of 1,000
4. Forgetting About Updates
When you change your embedding model, you need to re-embed everything:
// Migration plan:
// 1. Create new index with new model embeddings
// 2. Backfill all documents
// 3. Switch queries to new index
// 4. Delete old index
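If you want a starting point for step 2, here is a minimal backfill sketch. It reuses the batched getEmbeddings helper from Section 2.2; VectorIndex is a hypothetical stand-in for whatever vector database client you use, not a real API:
// Hypothetical interface standing in for your vector database client
interface VectorIndex {
  upsert(items: { id: string; embedding: number[] }[]): Promise<void>
}

async function backfill(
  docs: { id: string; content: string }[],
  newIndex: VectorIndex
): Promise<void> {
  // Re-embed all documents with the new model, then write them to the new index
  const embeddings = await getEmbeddings(docs.map(d => d.content))
  await newIndex.upsert(
    docs.map((d, i) => ({ id: d.id, embedding: embeddings[i] }))
  )
}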
2.6 Practical Example: Building an Embedding Pipeline
Here's a complete example of an embedding pipeline:
import OpenAI from 'openai'
const openai = new OpenAI()
interface Document {
  id: string
  content: string
  metadata: Record<string, any>
}

interface EmbeddedDocument {
  id: string
  embedding: number[]
  metadata: Record<string, any>
}

const BATCH_SIZE = 100
const MODEL = 'text-embedding-3-small'

async function embedDocuments(
  documents: Document[]
): Promise<EmbeddedDocument[]> {
  const results: EmbeddedDocument[] = []

  // Process in batches
  for (let i = 0; i < documents.length; i += BATCH_SIZE) {
    const batch = documents.slice(i, i + BATCH_SIZE)

    // Get embeddings for batch
    const response = await openai.embeddings.create({
      model: MODEL,
      input: batch.map(d => d.content),
    })

    // Combine with metadata
    for (let j = 0; j < batch.length; j++) {
      results.push({
        id: batch[j].id,
        embedding: response.data[j].embedding,
        metadata: batch[j].metadata,
      })
    }

    console.log(`Processed ${Math.min(i + BATCH_SIZE, documents.length)}/${documents.length}`)
  }

  return results
}
// Usage
const documents: Document[] = [
  { id: '1', content: 'Introduction to ML', metadata: { category: 'ai' } },
  { id: '2', content: 'Web development basics', metadata: { category: 'web' } },
  // ...
]
const embedded = await embedDocuments(documents)
// Now ready to insert into vector database
Key Takeaways
- Embeddings convert meaning into numbers that machines can compare
- Use the same model for both indexing and querying
- Chunk long documents to stay within token limits
- Batch API calls for efficiency
- Choose your model based on quality, cost, and requirements
Exercise: Generate Your First Embeddings
- Sign up for an OpenAI API key (if you don't have one)
- Generate embeddings for 5 similar sentences and 5 different sentences
- Calculate cosine similarity between pairs
- Verify that similar sentences have higher similarity scores
// Starter code
import OpenAI from 'openai'
const openai = new OpenAI()
const similar = [
  "I love programming",
  "Coding is my passion",
  "I enjoy writing software",
]
const different = [
  "The weather is nice today",
  "I need to buy groceries",
]
// Your task: Embed and compare
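One possible way to finish the comparison. A sketch that embeds everything in a single batched call and reuses the cosineSimilarity helper from Section 2.1:
const all = [...similar, ...different]
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: all,
})
const vectors = response.data.map(d => d.embedding)

// Compare a paraphrase pair against an unrelated pair
console.log('similar pair:', cosineSimilarity(vectors[0], vectors[1]))
console.log('unrelated pair:', cosineSimilarity(vectors[0], vectors[3]))
// Expect the similar pair to score noticeably higher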
Next up: Module 3 - Similarity Search Explained

