Module 11: Performance Optimization
Making Your Vector Database Fast
Introduction
Performance matters. Users expect instant results, and slow queries can make AI applications feel broken.
By the end of this module, you'll know how to:
- Measure and monitor performance
- Optimize query latency
- Improve throughput
- Handle common bottlenecks
11.1 Performance Metrics
Key Metrics to Track
Latency:
- p50 (median): Typical user experience
- p95: 95% of requests complete faster than this; captures slower users' experience
- p99: Tail latency; the slowest 1% of requests
Throughput:
- Queries per second (QPS)
- Insertions per second
Resource Usage (sampled in the sketch after this list):
- CPU utilization
- Memory usage
- Disk I/O
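Node's built-in counters are enough for a first look at resource usage. The sketch below samples RSS/heap memory and CPU time on a fixed interval; the 10-second interval and console logging are assumptions, and in practice you would ship these values to your metrics system.
// Minimal resource sampler using Node's built-in process counters.
// Interval and logging destination are assumptions; adapt to your monitoring stack.
let lastCpu = process.cpuUsage()
setInterval(() => {
  const mem = process.memoryUsage()
  const cpu = process.cpuUsage(lastCpu) // delta since the previous sample, in microseconds
  lastCpu = process.cpuUsage()
  console.log({
    rssMB: (mem.rss / 1024 / 1024).toFixed(1),
    heapUsedMB: (mem.heapUsed / 1024 / 1024).toFixed(1),
    cpuUserMs: (cpu.user / 1000).toFixed(1),
    cpuSystemMs: (cpu.system / 1000).toFixed(1)
  })
}, 10_000)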
Setting Performance Goals
| Use Case | p50 Target | p99 Target | QPS Target |
|---|---|---|---|
| Real-time chat | < 50ms | < 200ms | 100+ |
| Search API | < 100ms | < 500ms | 50+ |
| Batch processing | < 1s | < 5s | 10+ |
| Background jobs | < 5s | < 30s | 1+ |
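One way to keep these targets honest is to encode them as a budget and compare measured numbers against it in CI or monitoring. The sketch below is illustrative only; the PerformanceBudget shape and checkBudget helper are assumptions, not part of any library.
// Illustrative budget check; types and helper are assumptions, not a library API
interface PerformanceBudget {
  p50Ms: number
  p99Ms: number
  minQps: number
}

interface MeasuredStats {
  p50: number
  p99: number
  qps: number
}

function checkBudget(stats: MeasuredStats, budget: PerformanceBudget): string[] {
  const violations: string[] = []
  if (stats.p50 > budget.p50Ms) violations.push(`p50 ${stats.p50.toFixed(1)}ms exceeds ${budget.p50Ms}ms`)
  if (stats.p99 > budget.p99Ms) violations.push(`p99 ${stats.p99.toFixed(1)}ms exceeds ${budget.p99Ms}ms`)
  if (stats.qps < budget.minQps) violations.push(`QPS ${stats.qps.toFixed(1)} below ${budget.minQps}`)
  return violations
}

// Example: the "Search API" row from the table above
const searchApiBudget: PerformanceBudget = { p50Ms: 100, p99Ms: 500, minQps: 50 }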
Measuring Performance
async function measureQueryLatency(
queryFn: () => Promise<any>,
iterations: number = 100
): Promise<{ p50: number; p95: number; p99: number }> {
const latencies: number[] = []
for (let i = 0; i < iterations; i++) {
const start = performance.now()
await queryFn()
latencies.push(performance.now() - start)
}
latencies.sort((a, b) => a - b)
return {
p50: latencies[Math.floor(iterations * 0.5)],
p95: latencies[Math.floor(iterations * 0.95)],
p99: latencies[Math.floor(iterations * 0.99)]
}
}
// Usage
const stats = await measureQueryLatency(async () => {
await index.query({ vector: testVector, topK: 10 })
})
console.log(`p50: ${stats.p50.toFixed(2)}ms, p99: ${stats.p99.toFixed(2)}ms`)
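Latency tells you how fast a single query is, not how much traffic the index can absorb. A rough throughput check is sketched below: a fixed number of concurrent workers fire queries for a set duration, and the completed count is converted to QPS. The worker count and duration are assumptions to tune for your workload.
// Rough QPS measurement with a fixed number of concurrent workers.
// Concurrency and duration are assumptions; tune them to your expected traffic.
async function measureThroughput(
  queryFn: () => Promise<any>,
  concurrency: number = 10,
  durationMs: number = 10_000
): Promise<{ qps: number; completed: number }> {
  let completed = 0
  const deadline = performance.now() + durationMs
  async function worker() {
    while (performance.now() < deadline) {
      await queryFn()
      completed++
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()))
  return { qps: completed / (durationMs / 1000), completed }
}
// Usage (same query as the latency example above)
const throughput = await measureThroughput(async () => {
  await index.query({ vector: testVector, topK: 10 })
})
console.log(`~${throughput.qps.toFixed(1)} QPS (${throughput.completed} queries)`)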
11.2 Query Optimization
1. Reduce TopK
Only fetch what you need:
// Bad: Fetching too many
const results = await index.query({ vector, topK: 100 })
const used = results.matches.slice(0, 5) // Only use 5!
// Good: Fetch only what you need
const results = await index.query({ vector, topK: 5 })
2. Use Appropriate Search Parameters
Trade accuracy for speed when acceptable:
// pgvector: lower ef_search for speed (the default is 40)
await pool.query('SET hnsw.ef_search = 20') // Lower = faster, less accurate
// Pinecone: search parameters aren't tunable per query;
// pod type and index configuration drive overall performance
// Qdrant: Tune search parameters
await client.search('collection', {
vector: queryVector,
limit: 10,
params: {
hnsw_ef: 50, // Lower = faster, less accurate
exact: false
}
})
3. Reduce Vector Dimensions
Lower dimensions = faster search:
// At indexing time (OpenAI)
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: 512 // Instead of 1536
})
// Trade-off: Slightly lower quality, much faster search
4. Filter Early
Apply metadata filters before vector search:
// Efficient: Filter first, then search smaller set
const results = await index.query({
vector,
topK: 10,
filter: { category: 'electronics', inStock: true }
})
// Databases use pre-filtering when possible
5. Cache Common Queries
import crypto from 'crypto'
import { LRUCache } from 'lru-cache'
const queryCache = new LRUCache<string, SearchResult[]>({
max: 1000,
ttl: 1000 * 60 * 5 // 5 minutes
})
async function cachedSearch(query: string): Promise<SearchResult[]> {
const cacheKey = hashQuery(query)
const cached = queryCache.get(cacheKey)
if (cached) {
metrics.cacheHit()
return cached
}
const results = await vectorSearch(query)
queryCache.set(cacheKey, results)
metrics.cacheMiss()
return results
}
function hashQuery(query: string): string {
// Simple hash for cache key
return crypto.createHash('md5').update(query).digest('hex')
}
11.3 Indexing Optimization
Index Tuning
-- pgvector: Balance between build time and search speed
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);
-- For read-heavy workloads, higher m and ef_construction
-- For write-heavy workloads, lower values
Batch Inserts
// Bad: One at a time
for (const doc of documents) {
await index.upsert([doc]) // Many round trips
}
// Good: Batch operations
const batchSize = 100
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize)
await index.upsert(batch)
}
Parallel Indexing
// Parallel batch processing
async function parallelUpsert(
documents: Document[],
batchSize: number = 100,
concurrency: number = 5
): Promise<void> {
const batches: Document[][] = []
for (let i = 0; i < documents.length; i += batchSize) {
batches.push(documents.slice(i, i + batchSize))
}
// Process batches with controlled concurrency
for (let i = 0; i < batches.length; i += concurrency) {
const chunk = batches.slice(i, i + concurrency)
await Promise.all(chunk.map(batch => index.upsert(batch)))
console.log(`Processed ${Math.min((i + concurrency) * batchSize, documents.length)}/${documents.length}`)
}
}
11.4 Infrastructure Optimization
Memory Management
Vector databases are memory-intensive:
Memory per vector ≈ 4 bytes × dimensions + metadata overhead
Example: 1 million vectors × 1536 dimensions
  Raw vectors: 1M × 1536 × 4 bytes ≈ 6 GB
  Index overhead (~2-3x raw): ≈ 12-18 GB
  Metadata: varies
  Total: roughly 15-20 GB for 1M vectors
Recommendations:
- Size your instances appropriately
- Monitor memory usage
- Consider quantization for large datasets
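To put rough numbers on a planned deployment, the back-of-the-envelope formula above can be wrapped in a small helper. The overhead multiplier and per-vector metadata size below are assumptions; measure your real footprint once data is loaded.
// Back-of-the-envelope memory estimate for capacity planning.
// indexOverhead and metadataBytesPerVector are assumptions, not measured values.
function estimateMemoryGB(
  vectorCount: number,
  dimensions: number,
  indexOverhead: number = 2.5,          // HNSW-style indexes often need ~2-3x raw size
  metadataBytesPerVector: number = 500
): number {
  const rawBytes = vectorCount * dimensions * 4 // float32
  const totalBytes = rawBytes * indexOverhead + vectorCount * metadataBytesPerVector
  return totalBytes / 1024 ** 3
}

// Example: 1M vectors at 1536 dimensions
console.log(`~${estimateMemoryGB(1_000_000, 1536).toFixed(1)} GB`) // roughly 15 GB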
Connection Pooling
import { Pool, Client } from 'pg'
// Good: Connection pool
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 20, // Maximum connections
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000
})
// Bad: New connection per query
async function query() {
const client = new Client() // Don't do this!
await client.connect()
// ...
await client.end()
}
Regional Deployment
Deploy close to users:
// Example: a Pinecone index lives in one region, chosen when the index
// is created, so create it in the region closest to your users
const index = pinecone.index('my-index')
// For global apps, use multiple indexes or edge caching
11.5 Monitoring and Debugging
Essential Metrics Dashboard
class VectorDBMetrics {
private latencies: number[] = []
private errors = 0
private queries = 0
recordQuery(latencyMs: number) {
this.queries++
this.latencies.push(latencyMs)
if (this.latencies.length > 10000) {
this.latencies = this.latencies.slice(-10000)
}
}
recordError() {
this.errors++
}
getStats() {
const sorted = [...this.latencies].sort((a, b) => a - b)
return {
totalQueries: this.queries,
errors: this.errors,
errorRate: this.queries > 0 ? this.errors / this.queries : 0,
p50: sorted[Math.floor(sorted.length * 0.5)] || 0,
p95: sorted[Math.floor(sorted.length * 0.95)] || 0,
p99: sorted[Math.floor(sorted.length * 0.99)] || 0
}
}
}
const metrics = new VectorDBMetrics()
// Wrap your queries
async function trackedQuery(vector: number[], topK: number) {
const start = performance.now()
try {
const result = await index.query({ vector, topK })
metrics.recordQuery(performance.now() - start)
return result
} catch (error) {
metrics.recordError()
throw error
}
}
Slow Query Detection
const SLOW_QUERY_THRESHOLD = 500 // ms
async function queryWithSlowDetection(params: QueryParams) {
const start = performance.now()
const result = await index.query(params)
const duration = performance.now() - start
if (duration > SLOW_QUERY_THRESHOLD) {
console.warn('Slow query detected:', {
duration: `${duration.toFixed(2)}ms`,
topK: params.topK,
hasFilter: !!params.filter,
filterComplexity: params.filter ? JSON.stringify(params.filter).length : 0
})
}
return result
}
Health Checks
async function healthCheck(): Promise<{
status: 'healthy' | 'degraded' | 'unhealthy'
latency: number
details: string
}> {
try {
const testVector = new Array(1536).fill(0).map(() => Math.random())
const start = performance.now()
await index.query({
vector: testVector,
topK: 1
})
const latency = performance.now() - start
if (latency > 1000) {
return {
status: 'degraded',
latency,
details: 'Query latency above threshold'
}
}
return {
status: 'healthy',
latency,
details: 'All systems operational'
}
} catch (error) {
return {
status: 'unhealthy',
latency: -1,
details: error instanceof Error ? error.message : 'Unknown error'
}
}
}
11.6 Common Bottlenecks and Solutions
1. Embedding Generation
Problem: Embedding API is slow
Solutions:
// Batch embeddings
const embeddings = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts // Array, not single string
})
// Use caching for repeated texts
const embeddingCache = new Map<string, number[]>()
async function getEmbeddingCached(text: string): Promise<number[]> {
const cached = embeddingCache.get(text)
if (cached) return cached
const embedding = await getEmbedding(text)
embeddingCache.set(text, embedding)
return embedding
}
// Use smaller/faster models for non-critical paths
// text-embedding-3-small vs text-embedding-3-large
2. Network Latency
Problem: The database is far from the application, so every query pays a network round trip
Solutions (a quick diagnostic sketch follows this list):
- Deploy in same region
- Use connection pooling
- Implement edge caching
- Consider embedded databases (Chroma) for local-first
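A quick way to see how much of your latency is network rather than search work is to time the cheapest possible query and compare it with a realistic one. This is only a rough diagnostic; it assumes the same index object and Pinecone-style query options used earlier.
// Rough diagnostic: a topK=1 query with no metadata is dominated by the
// network round trip, so comparing it with a realistic query shows how
// much latency is network vs. actual search work.
async function estimateNetworkFloor(vector: number[]): Promise<void> {
  const timeIt = async (fn: () => Promise<any>): Promise<number> => {
    const start = performance.now()
    await fn()
    return performance.now() - start
  }
  const minimal = await timeIt(() =>
    index.query({ vector, topK: 1, includeMetadata: false })
  )
  const realistic = await timeIt(() =>
    index.query({ vector, topK: 10, includeMetadata: true })
  )
  console.log(`Minimal query: ${minimal.toFixed(1)}ms (mostly network)`)
  console.log(`Realistic query: ${realistic.toFixed(1)}ms`)
}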
3. Cold Starts
Problem: First queries are slow
Solutions:
// Warm up on startup
async function warmUp() {
const testVector = new Array(1536).fill(0)
for (let i = 0; i < 10; i++) {
await index.query({ vector: testVector, topK: 1 })
}
console.log('Vector database warmed up')
}
// Keep connections alive with a periodic ping
setInterval(async () => {
  const pingVector = new Array(1536).fill(0)
  await index.query({ vector: pingVector, topK: 1 })
}, 60000) // Ping every minute
4. Large Result Sets
Problem: Fetching too much data
Solutions:
// Fetch IDs only, load full data separately
const results = await index.query({
vector: queryVector,
topK: 100,
includeMetadata: false // IDs only
})
// Fetch full data only for what you need
const needed = results.matches.slice(0, 10)
const fullData = await fetchDocuments(needed.map(r => r.id))
Key Takeaways
- Measure first—know your baseline before optimizing
- Reduce topK to only what you need
- Cache aggressively for repeated queries
- Batch operations for indexing
- Monitor continuously to catch regressions
Exercise: Performance Audit
- Set up monitoring for your vector database
- Run a load test with realistic query patterns
- Identify the slowest queries
- Apply optimizations from this module
- Measure the improvement
Document your findings:
- Before/after latency metrics
- Bottlenecks identified
- Optimizations applied
- Remaining issues
Next up: Module 12 - Scaling Considerations

