Module 11: Performance Optimization
Making Your Vector Database Fast
Introduction
Performance matters. Users expect instant results, and slow queries can make AI applications feel broken.
By the end of this module, you'll know how to:
- Measure and monitor performance
- Optimize query latency
- Improve throughput
- Handle common bottlenecks
11.1 Performance Metrics
Key Metrics to Track
Latency:
- p50 (median): Typical user experience
- p95: 95% of requests complete faster than this; captures slower users' experience
- p99: Tail latency; the slowest 1% of requests
Throughput:
- Queries per second (QPS)
- Insertions per second
Resource Usage (sampled in the sketch after this list):
- CPU utilization
- Memory usage
- Disk I/O
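Node's built-in counters are enough for a first look at resource usage. The sketch below samples RSS/heap memory and CPU time on a fixed interval; the 10-second interval and console logging are assumptions, and in practice you would ship these values to your metrics system.
// Minimal resource sampler using Node's built-in process counters.
// Interval and logging destination are assumptions; adapt to your monitoring stack.
let lastCpu = process.cpuUsage()
setInterval(() => {
  const mem = process.memoryUsage()
  const cpu = process.cpuUsage(lastCpu) // delta since the previous sample, in microseconds
  lastCpu = process.cpuUsage()
  console.log({
    rssMB: (mem.rss / 1024 / 1024).toFixed(1),
    heapUsedMB: (mem.heapUsed / 1024 / 1024).toFixed(1),
    cpuUserMs: (cpu.user / 1000).toFixed(1),
    cpuSystemMs: (cpu.system / 1000).toFixed(1)
  })
}, 10_000)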
Setting Performance Goals
| Use Case | p50 Target | p99 Target | QPS Target |
|---|---|---|---|
| Real-time chat | < 50ms | < 200ms | 100+ |
| Search API | < 100ms | < 500ms | 50+ |
| Batch processing | < 1s | < 5s | 10+ |
| Background jobs | < 5s | < 30s | 1+ |
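One way to keep these targets honest is to encode them as a budget and compare measured numbers against it in CI or monitoring. The sketch below is illustrative only; the PerformanceBudget shape and checkBudget helper are assumptions, not part of any library.
// Illustrative budget check; types and helper are assumptions, not a library API
interface PerformanceBudget {
  p50Ms: number
  p99Ms: number
  minQps: number
}

interface MeasuredStats {
  p50: number
  p99: number
  qps: number
}

function checkBudget(stats: MeasuredStats, budget: PerformanceBudget): string[] {
  const violations: string[] = []
  if (stats.p50 > budget.p50Ms) violations.push(`p50 ${stats.p50.toFixed(1)}ms exceeds ${budget.p50Ms}ms`)
  if (stats.p99 > budget.p99Ms) violations.push(`p99 ${stats.p99.toFixed(1)}ms exceeds ${budget.p99Ms}ms`)
  if (stats.qps < budget.minQps) violations.push(`QPS ${stats.qps.toFixed(1)} below ${budget.minQps}`)
  return violations
}

// Example: the "Search API" row from the table above
const searchApiBudget: PerformanceBudget = { p50Ms: 100, p99Ms: 500, minQps: 50 }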
Measuring Performance
async function measureQueryLatency(
queryFn: () => Promise<any>,
iterations: number = 100
): Promise<{ p50: number; p95: number; p99: number }> {
const latencies: number[] = []
for (let i = 0; i < iterations; i++) {
const start = performance.now()
await queryFn()
latencies.push(performance.now() - start)
}
latencies.sort((a, b) => a - b)
return {
p50: latencies[Math.floor(iterations * 0.5)],
p95: latencies[Math.floor(iterations * 0.95)],
p99: latencies[Math.floor(iterations * 0.99)]
}
}
// Usage
const stats = await measureQueryLatency(async () => {
await index.query({ vector: testVector, topK: 10 })
})
console.log(`p50: ${stats.p50.toFixed(2)}ms, p99: ${stats.p99.toFixed(2)}ms`)
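Latency tells you how fast a single query is, not how much traffic the index can absorb. A rough throughput check is sketched below: a fixed number of concurrent workers fire queries for a set duration, and the completed count is converted to QPS. The worker count and duration are assumptions to tune for your workload.
// Rough QPS measurement with a fixed number of concurrent workers.
// Concurrency and duration are assumptions; tune them to your expected traffic.
async function measureThroughput(
  queryFn: () => Promise<any>,
  concurrency: number = 10,
  durationMs: number = 10_000
): Promise<{ qps: number; completed: number }> {
  let completed = 0
  const deadline = performance.now() + durationMs
  async function worker() {
    while (performance.now() < deadline) {
      await queryFn()
      completed++
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()))
  return { qps: completed / (durationMs / 1000), completed }
}
// Usage (same query as the latency example above)
const throughput = await measureThroughput(async () => {
  await index.query({ vector: testVector, topK: 10 })
})
console.log(`~${throughput.qps.toFixed(1)} QPS (${throughput.completed} queries)`)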
11.2 Query Optimization
1. Reduce TopK
Only fetch what you need:
// Bad: Fetching too many
const results = await index.query({ vector, topK: 100 })
const used = results.matches.slice(0, 5) // Only use 5!
// Good: Fetch only what you need
const results = await index.query({ vector, topK: 5 })
2. Use Appropriate Search Parameters
Trade accuracy for speed when acceptable:
// pgvector: lower ef_search for speed (the default is 40)
await pool.query('SET hnsw.ef_search = 20') // Lower = faster, less accurate
// Pinecone: search parameters aren't tunable per query;
// pod type and index configuration drive overall performance
// Qdrant: Tune search parameters
await client.search('collection', {
vector: queryVector,
limit: 10,
params: {
hnsw_ef: 50, // Lower = faster, less accurate
exact: false
}
})
3. Reduce Vector Dimensions
Lower dimensions = faster search:
// At indexing time (OpenAI)
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: 512 // Instead of 1536
})
// Trade-off: Slightly lower quality, much faster search
4. Filter Early
Apply metadata filters before vector search:
// Efficient: Filter first, then search smaller set
const results = await index.query({
vector,
topK: 10,
filter: { category: 'electronics', inStock: true }
})
// Databases use pre-filtering when possible
5. Cache Common Queries
import crypto from 'crypto'
import { LRUCache } from 'lru-cache'
const queryCache = new LRUCache<string, SearchResult[]>({
max: 1000,
ttl: 1000 * 60 * 5 // 5 minutes
})
async function cachedSearch(query: string): Promise<SearchResult[]> {
const cacheKey = hashQuery(query)
const cached = queryCache.get(cacheKey)
if (cached) {
metrics.cacheHit()
return cached
}
const results = await vectorSearch(query)
queryCache.set(cacheKey, results)
metrics.cacheMiss()
return results
}
function hashQuery(query: string): string {
// Simple hash for cache key
return crypto.createHash('md5').update(query).digest('hex')
}
11.3 Indexing Optimization
Index Tuning
-- pgvector: Balance between build time and search speed
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);
-- For read-heavy workloads, higher m and ef_construction
-- For write-heavy workloads, lower values
Batch Inserts
// Bad: One at a time
for (const doc of documents) {
await index.upsert([doc]) // Many round trips
}
// Good: Batch operations
const batchSize = 100
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize)
await index.upsert(batch)
}
Parallel Indexing
// Parallel batch processing
async function parallelUpsert(
documents: Document[],
batchSize: number = 100,
concurrency: number = 5
): Promise<void> {
const batches: Document[][] = []
for (let i = 0; i < documents.length; i += batchSize) {
batches.push(documents.slice(i, i + batchSize))
}
// Process batches with controlled concurrency
for (let i = 0; i < batches.length; i += concurrency) {
const chunk = batches.slice(i, i + concurrency)
await Promise.all(chunk.map(batch => index.upsert(batch)))
console.log(`Processed ${Math.min((i + concurrency) * batchSize, documents.length)}/${documents.length}`)
}
}
11.4 Infrastructure Optimization
Memory Management
Vector databases are memory-intensive:
Memory per vector ≈ 4 bytes × dimensions + metadata overhead
Example: 1 million vectors × 1536 dimensions
  Raw vectors: 1M × 1536 × 4 bytes ≈ 6 GB
  Index overhead (~2-3x raw): ≈ 12-18 GB
  Metadata: varies
  Total: roughly 15-20 GB for 1M vectors
Recommendations:
- Size your instances appropriately
- Monitor memory usage
- Consider quantization for large datasets
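To put rough numbers on a planned deployment, the back-of-the-envelope formula above can be wrapped in a small helper. The overhead multiplier and per-vector metadata size below are assumptions; measure your real footprint once data is loaded.
// Back-of-the-envelope memory estimate for capacity planning.
// indexOverhead and metadataBytesPerVector are assumptions, not measured values.
function estimateMemoryGB(
  vectorCount: number,
  dimensions: number,
  indexOverhead: number = 2.5,          // HNSW-style indexes often need ~2-3x raw size
  metadataBytesPerVector: number = 500
): number {
  const rawBytes = vectorCount * dimensions * 4 // float32
  const totalBytes = rawBytes * indexOverhead + vectorCount * metadataBytesPerVector
  return totalBytes / 1024 ** 3
}

// Example: 1M vectors at 1536 dimensions
console.log(`~${estimateMemoryGB(1_000_000, 1536).toFixed(1)} GB`) // roughly 15 GB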
Connection Pooling
import { Pool, Client } from 'pg'
// Good: Connection pool
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 20, // Maximum connections
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000
})
// Bad: New connection per query
async function query() {
const client = new Client() // Don't do this!
await client.connect()
// ...
await client.end()
}
Regional Deployment
Deploy close to users:
// Example: a Pinecone index lives in one region, chosen when the index
// is created, so create it in the region closest to your users
const index = pinecone.index('my-index')
// For global apps, use multiple indexes or edge caching
11.5 Monitoring and Debugging
Essential Metrics Dashboard
class VectorDBMetrics {
private latencies: number[] = []
private errors = 0
private queries = 0
recordQuery(latencyMs: number) {
this.queries++
this.latencies.push(latencyMs)
if (this.latencies.length > 10000) {
this.latencies = this.latencies.slice(-10000)
}
}
recordError() {
this.errors++
}
getStats() {
const sorted = [...this.latencies].sort((a, b) => a - b)
return {
totalQueries: this.queries,
errors: this.errors,
errorRate: this.queries > 0 ? this.errors / this.queries : 0,
p50: sorted[Math.floor(sorted.length * 0.5)] || 0,
p95: sorted[Math.floor(sorted.length * 0.95)] || 0,
p99: sorted[Math.floor(sorted.length * 0.99)] || 0
}
}
}
const metrics = new VectorDBMetrics()
// Wrap your queries
async function trackedQuery(vector: number[], topK: number) {
const start = performance.now()
try {
const result = await index.query({ vector, topK })
metrics.recordQuery(performance.now() - start)
return result
} catch (error) {
metrics.recordError()
throw error
}
}
Slow Query Detection
const SLOW_QUERY_THRESHOLD = 500 // ms
async function queryWithSlowDetection(params: QueryParams) {
const start = performance.now()
const result = await index.query(params)
const duration = performance.now() - start
if (duration > SLOW_QUERY_THRESHOLD) {
console.warn('Slow query detected:', {
duration: `${duration.toFixed(2)}ms`,
topK: params.topK,
hasFilter: !!params.filter,
filterComplexity: params.filter ? JSON.stringify(params.filter).length : 0
})
}
return result
}
Health Checks
async function healthCheck(): Promise<{
status: 'healthy' | 'degraded' | 'unhealthy'
latency: number
details: string
}> {
try {
const testVector = new Array(1536).fill(0).map(() => Math.random())
const start = performance.now()
await index.query({
vector: testVector,
topK: 1
})
const latency = performance.now() - start
if (latency > 1000) {
return {
status: 'degraded',
latency,
details: 'Query latency above threshold'
}
}
return {
status: 'healthy',
latency,
details: 'All systems operational'
}
} catch (error) {
return {
status: 'unhealthy',
latency: -1,
details: error instanceof Error ? error.message : 'Unknown error'
}
}
}
11.6 Common Bottlenecks and Solutions
1. Embedding Generation
Problem: Embedding API is slow
Solutions:
// Batch embeddings
const embeddings = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts // Array, not single string
})
// Use caching for repeated texts
const embeddingCache = new Map<string, number[]>()
async function getEmbeddingCached(text: string): Promise<number[]> {
const cached = embeddingCache.get(text)
if (cached) return cached
const embedding = await getEmbedding(text)
embeddingCache.set(text, embedding)
return embedding
}
// Use smaller/faster models for non-critical paths
// text-embedding-3-small vs text-embedding-3-large
2. Network Latency
Problem: The database is far from the application, so every query pays a network round trip
Solutions (a quick diagnostic sketch follows this list):
- Deploy in same region
- Use connection pooling
- Implement edge caching
- Consider embedded databases (Chroma) for local-first
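A quick way to see how much of your latency is network rather than search work is to time the cheapest possible query and compare it with a realistic one. This is only a rough diagnostic; it assumes the same index object and Pinecone-style query options used earlier.
// Rough diagnostic: a topK=1 query with no metadata is dominated by the
// network round trip, so comparing it with a realistic query shows how
// much latency is network vs. actual search work.
async function estimateNetworkFloor(vector: number[]): Promise<void> {
  const timeIt = async (fn: () => Promise<any>): Promise<number> => {
    const start = performance.now()
    await fn()
    return performance.now() - start
  }
  const minimal = await timeIt(() =>
    index.query({ vector, topK: 1, includeMetadata: false })
  )
  const realistic = await timeIt(() =>
    index.query({ vector, topK: 10, includeMetadata: true })
  )
  console.log(`Minimal query: ${minimal.toFixed(1)}ms (mostly network)`)
  console.log(`Realistic query: ${realistic.toFixed(1)}ms`)
}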
3. Cold Starts
Problem: First queries are slow
Solutions:
// Warm up on startup
async function warmUp() {
const testVector = new Array(1536).fill(0)
for (let i = 0; i < 10; i++) {
await index.query({ vector: testVector, topK: 1 })
}
console.log('Vector database warmed up')
}
// Keep connections alive with a periodic ping
setInterval(async () => {
  const pingVector = new Array(1536).fill(0)
  await index.query({ vector: pingVector, topK: 1 })
}, 60000) // Ping every minute
4. Large Result Sets
Problem: Fetching too much data
Solutions:
// Fetch IDs only, load full data separately
const results = await index.query({
vector: queryVector,
topK: 100,
includeMetadata: false // IDs only
})
// Fetch full data only for what you need
const needed = results.matches.slice(0, 10)
const fullData = await fetchDocuments(needed.map(r => r.id))
Key Takeaways
- Measure first—know your baseline before optimizing
- Reduce topK to only what you need
- Cache aggressively for repeated queries
- Batch operations for indexing
- Monitor continuously to catch regressions
Exercise: Performance Audit
- Set up monitoring for your vector database
- Run a load test with realistic query patterns
- Identify the slowest queries
- Apply optimizations from this module
- Measure the improvement
Document your findings:
- Before/after latency metrics
- Bottlenecks identified
- Optimizations applied
- Remaining issues
Next up: Module 12 - Scaling Considerations

