# Vectorization and Storage

## Introduction
With our documents chunked and prepared, the next step is vectorization—converting text chunks into the numerical embeddings that enable semantic search. Once generated, these embeddings need to be stored efficiently alongside their source text.
This lesson covers the complete process of turning text into vectors and storing them in Supabase with pgvector. You'll understand the technical details of embedding generation, database schema design, and the considerations for building a scalable indexing pipeline.
## Generating Embeddings

### The Embedding Process

Converting text to vectors is conceptually simple:

```
Text Chunk → Embedding Model → 768-Dimensional Vector
```

But there are important details to consider for production systems.
### Using Gemini's Embedding API

Here's the conceptual flow for generating embeddings with Gemini:

```typescript
import { GoogleGenerativeAI, TaskType } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

async function generateEmbedding(text: string): Promise<number[]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.embedContent({
    content: { role: 'user', parts: [{ text }] },
    taskType: TaskType.RETRIEVAL_DOCUMENT, // Important for indexing
  });
  return result.embedding.values;
}
```
### Critical: Task Type Selection

The `taskType` parameter optimizes the embedding for specific use cases:

| Task Type | When to Use |
|---|---|
| `RETRIEVAL_DOCUMENT` | When embedding documents/chunks for indexing |
| `RETRIEVAL_QUERY` | When embedding user questions for search |
| `SEMANTIC_SIMILARITY` | When comparing two texts directly |
| `CLASSIFICATION` | When using embeddings for classification tasks |
Always use `RETRIEVAL_DOCUMENT` for indexing and `RETRIEVAL_QUERY` for search queries. This asymmetric approach improves retrieval quality.
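At query time, the only change is the task type. A minimal sketch, reusing the `genAI` client from above:

```typescript
// Query-side counterpart to generateEmbedding: same model, different task type.
async function embedQuery(query: string): Promise<number[]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.embedContent({
    content: { role: 'user', parts: [{ text: query }] },
    taskType: TaskType.RETRIEVAL_QUERY, // Optimized for search queries
  });
  return result.embedding.values;
}
```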
### Handling Token Limits

Gemini's embedding model accepts up to 2048 tokens per request. Most properly chunked text fits comfortably, but you should handle edge cases:

```typescript
async function safeEmbed(text: string): Promise<number[]> {
  // Rough token estimation (4 chars ≈ 1 token)
  const estimatedTokens = text.length / 4;

  if (estimatedTokens > 2000) {
    console.warn(`Text may exceed token limit: ${estimatedTokens} estimated tokens`);
    // Option 1: Truncate
    text = text.slice(0, 8000); // ~2000 tokens
    // Option 2: Further chunk (but this complicates your pipeline)
  }

  return generateEmbedding(text);
}
```
### Batch Processing for Efficiency

When indexing many chunks, batch your API calls:

```typescript
// Small helper used throughout for pacing and backoff.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function embedBatch(texts: string[]): Promise<number[][]> {
  // Process in smaller batches to manage memory and rate limits
  const BATCH_SIZE = 100;
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);

    // Process batch concurrently
    const batchEmbeddings = await Promise.all(
      batch.map(text => generateEmbedding(text))
    );
    embeddings.push(...batchEmbeddings);

    // Progress logging
    console.log(`Processed ${Math.min(i + BATCH_SIZE, texts.length)}/${texts.length} chunks`);

    // Rate limiting (if needed)
    if (i + BATCH_SIZE < texts.length) {
      await sleep(100); // Small delay between batches
    }
  }

  return embeddings;
}
```
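The SDK also exposes a dedicated batch endpoint, `batchEmbedContents`, which embeds many chunks in a single HTTP request. A minimal sketch (the exact request shape may vary by SDK version):

```typescript
// Sketch: one round trip for many chunks via the SDK's batch endpoint.
async function embedBatchNative(texts: string[]): Promise<number[][]> {
  const model = genAI.getGenerativeModel({ model: 'text-embedding-004' });
  const result = await model.batchEmbedContents({
    requests: texts.map(text => ({
      content: { role: 'user', parts: [{ text }] },
      taskType: TaskType.RETRIEVAL_DOCUMENT,
    })),
  });
  return result.embeddings.map(e => e.values);
}
```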
### Error Handling and Retries

API calls can fail. Implement robust error handling:

```typescript
async function embedWithRetry(
  text: string,
  maxRetries: number = 3
): Promise<number[]> {
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await generateEmbedding(text);
    } catch (error) {
      lastError = error as Error;
      if (attempt < maxRetries) {
        // Exponential backoff
        const delay = Math.pow(2, attempt) * 1000;
        console.warn(`Embedding attempt ${attempt} failed, retrying in ${delay}ms`);
        await sleep(delay);
      }
    }
  }

  throw new Error(`Failed to embed after ${maxRetries} attempts: ${lastError?.message}`);
}
```
## The Documents Table Schema

### Core Schema Design

The documents table stores both the text content and its vector embedding:

```sql
-- Enable the pgvector extension (run once)
CREATE EXTENSION IF NOT EXISTS vector;

-- Main documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- The actual content
  content TEXT NOT NULL,

  -- Vector embedding (768 dimensions for text-embedding-004)
  embedding VECTOR(768),

  -- Source attribution
  source TEXT NOT NULL,
  title TEXT,

  -- Position in original document
  chunk_index INTEGER DEFAULT 0,

  -- Timestamps
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast vector search
CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
### Understanding Each Field

- **id** (`UUID`): Unique identifier for each chunk. UUIDs are ideal because they're globally unique without coordination.
- **content** (`TEXT`): The actual text chunk. This is what gets shown to the user and included in prompts.
- **embedding** (`VECTOR(768)`): The 768-dimensional vector from Gemini. The `VECTOR` type comes from pgvector.
- **source** (`TEXT`): Reference to the original document (filename, URL, etc.). Essential for attribution.
- **title** (`TEXT`): Human-readable title, often extracted from headings.
- **chunk_index** (`INTEGER`): Position within the original document. Enables retrieving surrounding chunks, as sketched below.
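For example, `chunk_index` makes it cheap to pull a chunk's neighbors for extra context. A sketch with supabase-js, assuming a `supabase` client like the one created in the indexer later in this lesson:

```typescript
// Fetch a chunk plus its immediate neighbors from the same source document.
async function getChunkWithNeighbors(source: string, idx: number) {
  const { data, error } = await supabase
    .from('documents')
    .select('content, chunk_index')
    .eq('source', source)
    .gte('chunk_index', idx - 1)
    .lte('chunk_index', idx + 1)
    .order('chunk_index', { ascending: true });

  if (error) throw error;
  return data;
}
```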
### Vector Index Options

pgvector supports different index types for vector search:

**IVFFlat (Inverted File Flat):**

```sql
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

- Faster for large datasets
- Approximate nearest neighbor
- `lists` parameter: more lists = better accuracy, slower build
- Rule of thumb: `lists = sqrt(num_records)`

**HNSW (Hierarchical Navigable Small World):**

```sql
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

- Better recall than IVFFlat
- Faster queries on moderate datasets
- More memory intensive
**Recommendation:**

| Dataset Size | Recommended Index |
|---|---|
| < 100K vectors | HNSW |
| 100K - 1M vectors | IVFFlat or HNSW |
| > 1M vectors | IVFFlat |

For most applications starting out, IVFFlat with `lists = 100` is a solid default: it builds quickly and uses less memory. Switch to HNSW when recall matters more than build time.
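If you'd rather derive `lists` from your actual row count than hard-code 100, a tiny hypothetical helper applying the rule of thumb above:

```typescript
// Hypothetical helper: derive an ivfflat `lists` value from the row count
// via the lists ≈ sqrt(num_records) rule of thumb, with a small floor.
function ivfflatLists(numRecords: number): number {
  return Math.max(10, Math.round(Math.sqrt(numRecords)));
}

// e.g. 100,000 rows → lists ≈ 316
const indexDdl = `CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = ${ivfflatLists(100_000)});`;
```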
### Multi-Tenant Schema

For applications where different users have different knowledge bases:

```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES auth.users(id), -- Owner
  content TEXT NOT NULL,
  embedding VECTOR(768),
  source TEXT NOT NULL,
  title TEXT,
  chunk_index INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- B-tree index for user-scoped filtering
-- (vectors are too large for a b-tree INCLUDE clause, so index user_id alone)
CREATE INDEX documents_user_id_idx ON documents (user_id);

-- Vector index
CREATE INDEX documents_embedding_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Row Level Security
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users can read own documents"
  ON documents FOR SELECT
  USING (auth.uid() = user_id);

CREATE POLICY "Users can insert own documents"
  ON documents FOR INSERT
  WITH CHECK (auth.uid() = user_id);

CREATE POLICY "Users can delete own documents"
  ON documents FOR DELETE
  USING (auth.uid() = user_id);
```
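With these policies active, a supabase-js client authenticated as a user can only see that user's rows. A minimal sketch (`userJwt` is a placeholder for a real access token from Supabase Auth):

```typescript
import { createClient } from '@supabase/supabase-js';

const userJwt = 'user-access-token'; // placeholder: obtain via supabase.auth

// An anon-key client acting on behalf of a signed-in user;
// RLS scopes every query to auth.uid() automatically.
const userClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!,
  { global: { headers: { Authorization: `Bearer ${userJwt}` } } }
);

// Returns only this user's rows; no explicit user_id filter needed.
const { data, error } = await userClient.from('documents').select('id, source, title');
```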
## Batch Processing: The Indexer Architecture

### Pipeline Overview

A production indexing pipeline typically looks like:

```
┌─────────────────────────────────────────────────────────────┐
│                     INDEXING PIPELINE                        │
│                                                              │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │   Source    │   │   Process   │   │    Store    │        │
│  │   Files     │──▶│   & Embed   │──▶│   Results   │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
│                                                              │
│  • Read files      • Chunk text      • Insert rows          │
│  • Parse formats   • Generate        • Update index         │
│  • Extract text      embeddings      • Track progress       │
└─────────────────────────────────────────────────────────────┘
```
Conceptual Indexer Implementation
// scripts/ingest.ts
interface Document {
source: string;
title: string;
content: string;
}
interface Chunk {
content: string;
source: string;
title: string;
chunkIndex: number;
}
interface IndexedChunk extends Chunk {
embedding: number[];
}
class DocumentIndexer {
private supabase: SupabaseClient;
constructor() {
this.supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_KEY! // Service key for admin operations
);
}
async indexDocuments(documents: Document[]): Promise<void> {
console.log(`Starting indexing of ${documents.length} documents`);
for (const doc of documents) {
await this.indexDocument(doc);
}
console.log('Indexing complete');
}
private async indexDocument(doc: Document): Promise<void> {
console.log(`Processing: ${doc.source}`);
// 1. Chunk the document
const chunks = this.chunkDocument(doc);
console.log(` Created ${chunks.length} chunks`);
// 2. Generate embeddings
const embeddings = await embedBatch(chunks.map(c => c.content));
console.log(` Generated embeddings`);
// 3. Store in database
const records = chunks.map((chunk, i) => ({
content: chunk.content,
embedding: embeddings[i],
source: chunk.source,
title: chunk.title,
chunk_index: chunk.chunkIndex
}));
const { error } = await this.supabase
.from('documents')
.insert(records);
if (error) {
throw new Error(`Failed to insert chunks: ${error.message}`);
}
console.log(` Stored ${records.length} chunks`);
}
private chunkDocument(doc: Document): Chunk[] {
const chunks = recursiveSplit(doc.content, 800);
return chunks.map((content, index) => ({
content,
source: doc.source,
title: doc.title,
chunkIndex: index
}));
}
}
// Usage
async function main() {
const indexer = new DocumentIndexer();
// Load your documents
const documents = await loadDocuments('./docs');
// Index them
await indexer.indexDocuments(documents);
}
### Handling Updates

When documents change, you need to re-index. Here's a strategy:

```typescript
// extractTitle: a title-extraction helper (e.g., first heading), assumed to exist.
async function updateDocument(source: string, newContent: string): Promise<void> {
  // 1. Delete existing chunks for this source
  await supabase
    .from('documents')
    .delete()
    .eq('source', source);

  // 2. Re-index the new content
  await indexDocument({
    source,
    title: extractTitle(newContent),
    content: newContent
  });
}
```
For more sophisticated systems, consider:

- Content hashing to detect changes (sketched below)
- Incremental updates (only re-embed changed sections)
- Version tracking
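A minimal content-hashing sketch. It assumes a `content_hash` column added to the documents table (not part of the schema above) and populated at index time:

```typescript
import { createHash } from 'crypto';

function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// Re-index only when the stored hash differs from the new content's hash.
// Assumes a content_hash TEXT column on the documents table.
async function needsReindex(source: string, newContent: string): Promise<boolean> {
  const { data } = await supabase
    .from('documents')
    .select('content_hash')
    .eq('source', source)
    .limit(1)
    .maybeSingle();

  return !data || data.content_hash !== contentHash(newContent);
}
```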
### Progress Tracking and Resumability

For large indexing jobs, track progress:

```typescript
import { promises as fs } from 'fs';

interface IndexingProgress {
  totalDocuments: number;
  processedDocuments: number;
  failedDocuments: string[];
  startTime: Date;
}

// Extends DocumentIndexer to reuse its protected indexDocument method.
class ResumableIndexer extends DocumentIndexer {
  private progress!: IndexingProgress;
  private progressFile: string = './indexing-progress.json';

  async loadProgress(): Promise<void> {
    try {
      const data = await fs.readFile(this.progressFile, 'utf-8');
      this.progress = JSON.parse(data);
    } catch {
      this.progress = {
        totalDocuments: 0,
        processedDocuments: 0,
        failedDocuments: [],
        startTime: new Date()
      };
    }
  }

  async saveProgress(): Promise<void> {
    await fs.writeFile(
      this.progressFile,
      JSON.stringify(this.progress, null, 2)
    );
  }

  async indexWithResume(documents: Document[]): Promise<void> {
    await this.loadProgress();

    // Skip already processed documents
    const remaining = documents.slice(this.progress.processedDocuments);

    for (const doc of remaining) {
      try {
        await this.indexDocument(doc);
        this.progress.processedDocuments++;
      } catch (error) {
        this.progress.failedDocuments.push(doc.source);
        console.error(`Failed to index ${doc.source}:`, error);
      }

      // Save progress periodically
      if (this.progress.processedDocuments % 10 === 0) {
        await this.saveProgress();
      }
    }

    await this.saveProgress();
  }
}
```
## Cost Considerations

### Embedding Costs

Gemini's embedding API is relatively inexpensive, but costs add up at scale:
| Scenario | Estimated Chunks | Estimated Cost |
|---|---|---|
| Small docs (100 pages) | ~500 chunks | < $0.01 |
| Medium docs (1000 pages) | ~5,000 chunks | ~$0.05 |
| Large docs (10,000 pages) | ~50,000 chunks | ~$0.50 |
**Cost optimization tips:**

- Chunk efficiently: Larger chunks mean fewer embeddings
- Avoid re-indexing unchanged content (see the content-hashing sketch above)
- Use batch requests where possible (e.g., `batchEmbedContents`)
- Cache embeddings: If the same text appears multiple times, reuse the embedding (see the sketch below)
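A minimal in-memory cache keyed by content hash, reusing `contentHash` from the update-handling section (swap the `Map` for Redis or a database table to persist across runs):

```typescript
const embeddingCache = new Map<string, number[]>();

// Return a cached embedding when the exact same text was embedded before.
async function cachedEmbed(text: string): Promise<number[]> {
  const key = contentHash(text);
  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await embedWithRetry(text);
  embeddingCache.set(key, embedding);
  return embedding;
}
```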
### Storage Costs

Vector storage in Supabase:

- Each vector (768 floats) ≈ 3KB
- 100,000 vectors ≈ 300MB

Supabase's free tier includes 500MB database storage, sufficient for most starting applications.
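A quick back-of-envelope check (pgvector stores each dimension as a 4-byte float, plus a small per-vector header):

```typescript
// 768 dims × 4 bytes + ~8-byte header ≈ 3 KB per vector
const bytesPerVector = 768 * 4 + 8;

// 100,000 vectors of raw vector data, before indexes and row overhead
const totalMB = (bytesPerVector * 100_000) / (1024 * 1024);
console.log(`${totalMB.toFixed(0)} MB`); // ≈ 294 MB
```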
## Summary

In this lesson, we covered the complete vectorization and storage process.

**Key Takeaways:**

- **Task type matters**: Use `RETRIEVAL_DOCUMENT` for indexing, `RETRIEVAL_QUERY` for search
- **Batch for efficiency**: Process chunks in batches with proper error handling
- **Schema design affects performance**: Include proper indexes and consider multi-tenancy from the start
- **Choose the right index**: IVFFlat for large datasets, HNSW for smaller ones with higher accuracy needs
- **Plan for updates**: Documents change; design your pipeline to handle re-indexing
- **Track progress**: Large indexing jobs should be resumable
### Next Steps

With our documents chunked, embedded, and stored, we need a way to search them. In the next lesson, we'll build the Supabase search functionality—the RPC function that performs vector similarity search and returns relevant context for our RAG pipeline.

> "Data is the new oil, but like oil, it's only valuable when refined." — Clive Humby