# Vectorization and Storage

## Introduction
With our documents chunked and prepared, the next step is vectorization—converting text chunks into the numerical embeddings that enable semantic search. Once generated, these embeddings need to be stored efficiently alongside their source text.
This lesson covers the complete process of turning text into vectors and storing them in Supabase with pgvector. You'll understand the technical details of embedding generation, database schema design, and the considerations for building a scalable indexing pipeline.
## Generating Embeddings

### The Embedding Process

Converting text to vectors is conceptually simple:

```
Text Chunk → Embedding Model → 768-Dimensional Vector
```

But there are important details to consider for production systems.
### Using Gemini's Embedding API

Here's the conceptual flow for generating embeddings with Gemini:

```typescript
import { GoogleGenerativeAI, TaskType } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

async function generateEmbedding(text: string): Promise<number[]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.embedContent({
    content: { role: 'user', parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_DOCUMENT, // Important for indexing
  });
  return result.embedding.values;
}
```
### Critical: Task Type Selection

The `taskType` parameter optimizes the embedding for specific use cases:

| Task Type | When to Use |
|---|---|
| `RETRIEVAL_DOCUMENT` | When embedding documents/chunks for indexing |
| `RETRIEVAL_QUERY` | When embedding user questions for search |
| `SEMANTIC_SIMILARITY` | When comparing two texts directly |
| `CLASSIFICATION` | When using embeddings for classification tasks |
Always use `RETRIEVAL_DOCUMENT` for indexing and `RETRIEVAL_QUERY` for search queries. This asymmetric approach improves retrieval quality.
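At query time, the only change is the task type. A minimal sketch, reusing the `genAI` client from above:

```typescript
// Query-side counterpart to generateEmbedding: same model, different task type.
async function embedQuery(query: string): Promise<number[]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.embedContent({
    content: { role: 'user', parts: [{ text: query }] },
    taskType: TaskType.RETRIEVAL_QUERY, // Optimized for search queries
  });
  return result.embedding.values;
}
```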
### Handling Token Limits

Gemini's embedding model accepts up to 2048 tokens per request. Most properly chunked text fits comfortably, but you should handle edge cases:

```typescript
async function safeEmbed(text: string): Promise<number[]> {
  // Rough token estimation (4 chars ≈ 1 token)
  const estimatedTokens = text.length / 4;

  if (estimatedTokens > 2000) {
    console.warn(`Text may exceed token limit: ${estimatedTokens} estimated tokens`);
    // Option 1: Truncate
    text = text.slice(0, 8000); // ~2000 tokens
    // Option 2: Further chunk (but this complicates your pipeline)
  }

  return generateEmbedding(text);
}
```
### Batch Processing for Efficiency

When indexing many chunks, batch your API calls:

```typescript
// Small helper used throughout for pacing and backoff.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function embedBatch(texts: string[]): Promise<number[][]> {
  // Process in smaller batches to manage memory and rate limits
  const BATCH_SIZE = 100;
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);

    // Process batch concurrently
    const batchEmbeddings = await Promise.all(
      batch.map(text => generateEmbedding(text))
    );
    embeddings.push(...batchEmbeddings);

    // Progress logging
    console.log(`Processed ${Math.min(i + BATCH_SIZE, texts.length)}/${texts.length} chunks`);

    // Rate limiting (if needed)
    if (i + BATCH_SIZE < texts.length) {
      await sleep(100); // Small delay between batches
    }
  }

  return embeddings;
}
```
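The SDK also exposes a dedicated batch endpoint, `batchEmbedContents`, which embeds many chunks in a single HTTP request. A minimal sketch (the exact request shape may vary by SDK version):

```typescript
// Sketch: one round trip for many chunks via the SDK's batch endpoint.
async function embedBatchNative(texts: string[]): Promise<number[][]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.batchEmbedContents({
    requests: texts.map(text => ({
      content: { role: 'user', parts: [{ text }] },
      taskType: TaskType.RETRIEVAL_DOCUMENT,
    })),
  });
  return result.embeddings.map(e => e.values);
}
```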
### Error Handling and Retries

API calls can fail. Implement robust error handling:

```typescript
async function embedWithRetry(
  text: string,
  maxRetries: number = 3
): Promise<number[]> {
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await generateEmbedding(text);
    } catch (error) {
      lastError = error as Error;
      if (attempt < maxRetries) {
        // Exponential backoff
        const delay = Math.pow(2, attempt) * 1000;
        console.warn(`Embedding attempt ${attempt} failed, retrying in ${delay}ms`);
        await sleep(delay);
      }
    }
  }

  throw new Error(`Failed to embed after ${maxRetries} attempts: ${lastError?.message}`);
}
```
## The Documents Table Schema

### Core Schema Design

The documents table stores both the text content and its vector embedding:

```sql
-- Enable the pgvector extension (run once)
CREATE EXTENSION IF NOT EXISTS vector;

-- Main documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- The actual content
  content TEXT NOT NULL,

  -- Vector embedding (768 dimensions for text-embedding-004)
  embedding VECTOR(768),

  -- Source attribution
  source TEXT NOT NULL,
  title TEXT,

  -- Position in original document
  chunk_index INTEGER DEFAULT 0,

  -- Timestamps
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast vector search
CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
### Understanding Each Field

- **id** (`UUID`): Unique identifier for each chunk. UUIDs are ideal because they're globally unique without coordination.
- **content** (`TEXT`): The actual text chunk. This is what gets shown to the user and included in prompts.
- **embedding** (`VECTOR(768)`): The 768-dimensional vector from Gemini. The `VECTOR` type comes from pgvector.
- **source** (`TEXT`): Reference to the original document (filename, URL, etc.). Essential for attribution.
- **title** (`TEXT`): Human-readable title, often extracted from headings.
- **chunk_index** (`INTEGER`): Position within the original document. Enables retrieving surrounding chunks, as sketched below.
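For example, `chunk_index` makes it cheap to pull a chunk's neighbors for extra context. A sketch with supabase-js, assuming a `supabase` client like the one created in the indexer later in this lesson:

```typescript
// Fetch a chunk plus its immediate neighbors from the same source document.
async function getChunkWithNeighbors(source: string, idx: number) {
  const { data, error } = await supabase
    .from('documents')
    .select('content, chunk_index')
    .eq('source', source)
    .gte('chunk_index', idx - 1)
    .lte('chunk_index', idx + 1)
    .order('chunk_index', { ascending: true });

  if (error) throw error;
  return data;
}
```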
### Vector Index Options

pgvector supports different index types for vector search:

**IVFFlat (Inverted File Flat):**

```sql
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

- Faster for large datasets
- Approximate nearest neighbor
- `lists` parameter: more lists = better accuracy, slower build
- Rule of thumb: `lists = sqrt(num_records)`

**HNSW (Hierarchical Navigable Small World):**

```sql
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

- Better recall than IVFFlat
- Faster queries on moderate datasets
- More memory intensive
**Recommendation:**

| Dataset Size | Recommended Index |
|---|---|
| < 100K vectors | HNSW |
| 100K - 1M vectors | IVFFlat or HNSW |
| > 1M vectors | IVFFlat |

For most applications starting out, IVFFlat with `lists = 100` is a solid default: it builds quickly and uses less memory. Switch to HNSW when recall matters more than build time.
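If you'd rather derive `lists` from your actual row count than hard-code 100, a tiny hypothetical helper applying the rule of thumb above:

```typescript
// Hypothetical helper: derive an ivfflat `lists` value from the row count
// via the lists ≈ sqrt(num_records) rule of thumb, with a small floor.
function ivfflatLists(numRecords: number): number {
  return Math.max(10, Math.round(Math.sqrt(numRecords)));
}

// e.g. 100,000 rows → lists ≈ 316
const indexDdl = `CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = ${ivfflatLists(100_000)});`;
```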
### Multi-Tenant Schema

For applications where different users have different knowledge bases:

```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES auth.users(id), -- Owner
  content TEXT NOT NULL,
  embedding VECTOR(768),
  source TEXT NOT NULL,
  title TEXT,
  chunk_index INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- B-tree index for user-scoped filtering
-- (vectors are too large for a b-tree INCLUDE clause, so index user_id alone)
CREATE INDEX documents_user_id_idx ON documents (user_id);

-- Vector index
CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Row Level Security
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users can read own documents"
  ON documents FOR SELECT
  USING (auth.uid() = user_id);

CREATE POLICY "Users can insert own documents"
  ON documents FOR INSERT
  WITH CHECK (auth.uid() = user_id);

CREATE POLICY "Users can delete own documents"
  ON documents FOR DELETE
  USING (auth.uid() = user_id);
```
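With these policies active, a supabase-js client authenticated as a user can only see that user's rows. A minimal sketch (`userJwt` is a placeholder for a real access token from Supabase Auth):

```typescript
import { createClient } from '@supabase/supabase-js';

const userJwt = 'user-access-token'; // placeholder: obtain via supabase.auth

// An anon-key client acting on behalf of a signed-in user;
// RLS scopes every query to auth.uid() automatically.
const userClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!,
  { global: { headers: { Authorization: `Bearer ${userJwt}` } } }
);

// Returns only this user's rows; no explicit user_id filter needed.
const { data, error } = await userClient.from('documents').select('id, source, title');
```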
## Batch Processing: The Indexer Architecture

### Pipeline Overview

A production indexing pipeline typically looks like:

```
┌─────────────────────────────────────────────────────────────┐
│                     INDEXING PIPELINE                        │
│                                                              │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │   Source    │   │   Process   │   │    Store    │        │
│  │   Files     │──▶│   & Embed   │──▶│   Results   │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
│                                                              │
│  • Read files      • Chunk text      • Insert rows          │
│  • Parse formats   • Generate        • Update index         │
│  • Extract text      embeddings      • Track progress       │
└─────────────────────────────────────────────────────────────┘
```
Conceptual Indexer Implementation
// scripts/ingest.ts
interface Document {
source: string;
title: string;
content: string;
}
interface Chunk {
content: string;
source: string;
title: string;
chunkIndex: number;
}
interface IndexedChunk extends Chunk {
embedding: number[];
}
class DocumentIndexer {
private supabase: SupabaseClient;
constructor() {
this.supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_KEY! // Service key for admin operations
);
}
async indexDocuments(documents: Document[]): Promise<void> {
console.log(`Starting indexing of ${documents.length} documents`);
for (const doc of documents) {
await this.indexDocument(doc);
}
console.log('Indexing complete');
}
private async indexDocument(doc: Document): Promise<void> {
console.log(`Processing: ${doc.source}`);
// 1. Chunk the document
const chunks = this.chunkDocument(doc);
console.log(` Created ${chunks.length} chunks`);
// 2. Generate embeddings
const embeddings = await embedBatch(chunks.map(c => c.content));
console.log(` Generated embeddings`);
// 3. Store in database
const records = chunks.map((chunk, i) => ({
content: chunk.content,
embedding: embeddings[i],
source: chunk.source,
title: chunk.title,
chunk_index: chunk.chunkIndex
}));
const { error } = await this.supabase
.from('documents')
.insert(records);
if (error) {
throw new Error(`Failed to insert chunks: ${error.message}`);
}
console.log(` Stored ${records.length} chunks`);
}
private chunkDocument(doc: Document): Chunk[] {
const chunks = recursiveSplit(doc.content, 800);
return chunks.map((content, index) => ({
content,
source: doc.source,
title: doc.title,
chunkIndex: index
}));
}
}
// Usage
async function main() {
const indexer = new DocumentIndexer();
// Load your documents
const documents = await loadDocuments('./docs');
// Index them
await indexer.indexDocuments(documents);
}
### Handling Updates

When documents change, you need to re-index. Here's a strategy:

```typescript
// extractTitle: a title-extraction helper (e.g., first heading), assumed to exist.
async function updateDocument(source: string, newContent: string): Promise<void> {
  // 1. Delete existing chunks for this source
  await supabase
    .from('documents')
    .delete()
    .eq('source', source);

  // 2. Re-index the new content
  await indexDocument({
    source,
    title: extractTitle(newContent),
    content: newContent
  });
}
```
For more sophisticated systems, consider:

- Content hashing to detect changes (sketched below)
- Incremental updates (only re-embed changed sections)
- Version tracking
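A minimal content-hashing sketch. It assumes a `content_hash` column added to the documents table (not part of the schema above) and populated at index time:

```typescript
import { createHash } from 'crypto';

function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// Re-index only when the stored hash differs from the new content's hash.
// Assumes a content_hash TEXT column on the documents table.
async function needsReindex(source: string, newContent: string): Promise<boolean> {
  const { data } = await supabase
    .from('documents')
    .select('content_hash')
    .eq('source', source)
    .limit(1)
    .maybeSingle();

  return !data || data.content_hash !== contentHash(newContent);
}
```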
### Progress Tracking and Resumability

For large indexing jobs, track progress:

```typescript
import { promises as fs } from 'fs';

interface IndexingProgress {
  totalDocuments: number;
  processedDocuments: number;
  failedDocuments: string[];
  startTime: Date;
}

// Extends DocumentIndexer to reuse its protected indexDocument method.
class ResumableIndexer extends DocumentIndexer {
  private progress!: IndexingProgress;
  private progressFile: string = './indexing-progress.json';

  async loadProgress(): Promise<void> {
    try {
      const data = await fs.readFile(this.progressFile, 'utf-8');
      this.progress = JSON.parse(data);
    } catch {
      this.progress = {
        totalDocuments: 0,
        processedDocuments: 0,
        failedDocuments: [],
        startTime: new Date()
      };
    }
  }

  async saveProgress(): Promise<void> {
    await fs.writeFile(
      this.progressFile,
      JSON.stringify(this.progress, null, 2)
    );
  }

  async indexWithResume(documents: Document[]): Promise<void> {
    await this.loadProgress();

    // Skip already processed documents
    const remaining = documents.slice(this.progress.processedDocuments);

    for (const doc of remaining) {
      try {
        await this.indexDocument(doc);
        this.progress.processedDocuments++;
      } catch (error) {
        this.progress.failedDocuments.push(doc.source);
        console.error(`Failed to index ${doc.source}:`, error);
      }

      // Save progress periodically
      if (this.progress.processedDocuments % 10 === 0) {
        await this.saveProgress();
      }
    }

    await this.saveProgress();
  }
}
```
## Cost Considerations

### Embedding Costs

Gemini's embedding API is relatively inexpensive, but costs add up at scale:
| Scenario | Estimated Chunks | Estimated Cost |
|---|---|---|
| Small docs (100 pages) | ~500 chunks | < $0.01 |
| Medium docs (1000 pages) | ~5,000 chunks | ~$0.05 |
| Large docs (10,000 pages) | ~50,000 chunks | ~$0.50 |
**Cost optimization tips:**

- Chunk efficiently: Larger chunks mean fewer embeddings
- Avoid re-indexing unchanged content (see the content-hashing sketch above)
- Use batch requests where possible (e.g., `batchEmbedContents`)
- Cache embeddings: If the same text appears multiple times, reuse the embedding (see the sketch below)
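A minimal in-memory cache keyed by content hash, reusing `contentHash` from the update-handling section (swap the `Map` for Redis or a database table to persist across runs):

```typescript
const embeddingCache = new Map<string, number[]>();

// Return a cached embedding when the exact same text was embedded before.
async function cachedEmbed(text: string): Promise<number[]> {
  const key = contentHash(text);
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await embedWithRetry(text);
  embeddingCache.set(key, embedding);
  return embedding;
}
```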
### Storage Costs

Vector storage in Supabase:

- Each vector (768 floats) ≈ 3KB
- 100,000 vectors ≈ 300MB

Supabase's free tier includes 500MB database storage, sufficient for most starting applications.
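A quick back-of-envelope check (pgvector stores each dimension as a 4-byte float, plus a small per-vector header):

```typescript
// 768 dims × 4 bytes + ~8-byte header ≈ 3 KB per vector
const bytesPerVector = 768 * 4 + 8;

// 100,000 vectors of raw vector data, before indexes and row overhead
const totalMB = (bytesPerVector * 100_000) / (1024 * 1024);
console.log(`${totalMB.toFixed(0)} MB`); // ≈ 294 MB
```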
## Summary

In this lesson, we covered the complete vectorization and storage process.

**Key Takeaways:**

- **Task type matters**: Use `RETRIEVAL_DOCUMENT` for indexing, `RETRIEVAL_QUERY` for search
- **Batch for efficiency**: Process chunks in batches with proper error handling
- **Schema design affects performance**: Include proper indexes and consider multi-tenancy from the start
- **Choose the right index**: IVFFlat for large datasets, HNSW for smaller ones with higher accuracy needs
- **Plan for updates**: Documents change; design your pipeline to handle re-indexing
- **Track progress**: Large indexing jobs should be resumable
### Next Steps

With our documents chunked, embedded, and stored, we need a way to search them. In the next lesson, we'll build the Supabase search functionality—the RPC function that performs vector similarity search and returns relevant context for our RAG pipeline.

> "Data is the new oil, but like oil, it's only valuable when refined." — Clive Humby