Module 12: Scaling Considerations
From Prototype to Production
Introduction
Your prototype works great with 10,000 vectors. But what happens when you have 10 million? Or 100 million?
By the end of this module, you'll understand:
- Vertical vs. horizontal scaling
- Sharding strategies
- Replication for availability
- When to scale and when to optimize
12.1 Understanding Scale
Scale Dimensions
Data Scale:
- Number of vectors
- Vector dimensions
- Metadata size
Query Scale:
- Queries per second (QPS)
- Concurrent users
- Query complexity (filters, hybrid search)
Update Scale:
- Insertions per second
- Update frequency
- Deletion patterns
Rough Scaling Tiers
| Vectors | Description | Typical Setup |
|---|---|---|
| < 100K | Small | Single node, any database |
| 100K - 1M | Medium | Single node, optimized |
| 1M - 10M | Large | Powerful single node or start sharding |
| 10M - 100M | Very Large | Sharding required |
| > 100M | Massive | Distributed cluster |
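If you want to encode these tiers in application logic (for example, to warn when a collection is approaching the next tier), a minimal sketch is below; the boundaries are the rough guidelines from the table, not hard limits.

// Rough tier classification based on the table above.
// Boundaries are approximate guidelines, not hard limits.
type ScalingTier = 'small' | 'medium' | 'large' | 'very-large' | 'massive'

function getScalingTier(vectorCount: number): ScalingTier {
  if (vectorCount < 100_000) return 'small'
  if (vectorCount < 1_000_000) return 'medium'
  if (vectorCount < 10_000_000) return 'large'
  if (vectorCount < 100_000_000) return 'very-large'
  return 'massive'
}

getScalingTier(5_000_000) // 'large' — powerful single node or start sharding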
12.2 Vertical Scaling
What is Vertical Scaling?
Adding more resources to a single machine:
- More CPU cores
- More RAM
- Faster storage (NVMe SSD)
When Vertical Scaling Works
Vectors × Dimensions × Bytes per Float + Index Overhead + Metadata
= Total Memory Required
Example:
5M vectors × 1536 dimensions × 4 bytes ≈ 30 GB of raw vectors
+ index overhead (~2× raw size) ≈ 60 GB
+ metadata ≈ 10 GB
= ~100 GB total
This fits on a single machine!
Vertical Scaling Limits
| Component | Practical Limit | Notes |
|---|---|---|
| RAM | 1-2 TB | High-end instances |
| CPU | 64-128 cores | Diminishing returns |
| Storage | Effectively unlimited | But I/O becomes the bottleneck |
When to Scale Vertically
- Dataset fits in memory
- Simpler operations
- Lower operational complexity
- Haven't hit performance limits
// Check whether the dataset could still fit on a single machine
function canScaleVertically(
  vectorCount: number,
  dimensions: number,
  replicationFactor: number = 1
): boolean {
  const memoryPerVector = dimensions * 4 // 4 bytes per float32
  const indexOverhead = 2.5              // conservative HNSW overhead multiplier
  const metadataEstimate = 1000          // ~1 KB of metadata per vector

  const totalBytes = vectorCount *
    (memoryPerVector * indexOverhead + metadataEstimate) *
    replicationFactor

  const maxMemoryBytes = 1024 * 1024 * 1024 * 1024 // 1 TB practical ceiling

  console.log(`Estimated memory: ${(totalBytes / 1e9).toFixed(2)} GB`)
  return totalBytes < maxMemoryBytes
}

canScaleVertically(10_000_000, 1536)
// Estimated memory: 163.60 GB → returns true, vertical scaling is still an option
12.3 Horizontal Scaling
What is Horizontal Scaling?
Adding more machines (nodes) to distribute the load.
- Sharding: split data across nodes
- Replication: copy data to multiple nodes
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│ (A-H) │ │ (I-P) │ │ (Q-Z) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Replica │ │ Replica │ │ Replica │
└─────────────┘ └─────────────┘ └─────────────┘
Sharding Strategies
1. Hash-Based Sharding
import crypto from 'crypto'

function getShard(vectorId: string, numShards: number): number {
  // Hash the ID so vectors are spread evenly across shards
  const hash = crypto.createHash('md5').update(vectorId).digest()
  return hash.readUInt32BE(0) % numShards
}
// Insert
const shardId = getShard(document.id, 3)
await shards[shardId].upsert([document])
// Query (must query all shards)
async function queryAllShards(vector: number[], topK: number) {
const results = await Promise.all(
shards.map(shard => shard.query({ vector, topK }))
)
// Merge and re-rank
return mergeResults(results, topK)
}
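The `mergeResults` helper above isn't defined yet; a minimal sketch, assuming each shard returns `{ matches: { id, score }[] }` and that a higher score means more similar, could be:

// Minimal merge: flatten per-shard matches, sort by score, keep top K.
// Assumes higher score = more similar (e.g. cosine similarity).
interface Match { id: string; score: number }

function mergeResults(results: { matches: Match[] }[], topK: number): Match[] {
  return results
    .flatMap(r => r.matches)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}

The same merge-and-re-rank step appears again in the `distributedQuery` example later in this section.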
2. Range-Based Sharding
// Shard by date range
function getShardByDate(date: Date): number {
const year = date.getFullYear()
if (year < 2023) return 0 // Archive
if (year === 2023) return 1
if (year === 2024) return 2
return 3 // Current
}
// Shard by category
const categoryShards = {
'technology': 0,
'science': 1,
'business': 2,
'other': 3
}
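A benefit of range-based sharding is that a query with a matching filter only needs to touch the relevant shards. A minimal sketch, assuming the four date-based shards defined above (older dates map to lower shard IDs):

// Route a query to only the shards whose date range overlaps the filter.
// Without a date filter, fall back to querying every shard.
function getShardsForDateRange(from?: Date, to?: Date): number[] {
  if (!from && !to) return [0, 1, 2, 3] // no filter → all shards
  const start = from ? getShardByDate(from) : 0
  const end = to ? getShardByDate(to) : 3
  const shardIds: number[] = []
  for (let i = start; i <= end; i++) shardIds.push(i)
  return shardIds
}

getShardsForDateRange(new Date('2024-02-01')) // [2, 3] — skips the archive shards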
3. Managed Sharding (Pinecone)
// Pinecone handles sharding automatically
await pinecone.createIndex({
name: 'my-index',
dimension: 1536,
metric: 'cosine',
spec: {
pod: {
environment: 'us-east1-gcp',
podType: 'p2.x1',
pods: 4, // Auto-sharded across 4 pods
shards: 4
}
}
})
Query Routing
For user queries, you may need to query all shards:
async function distributedQuery(
shards: VectorIndex[],
vector: number[],
topK: number,
filter?: object
): Promise<Result[]> {
// Query all shards in parallel
const shardResults = await Promise.all(
shards.map(shard => shard.query({ vector, topK, filter }))
)
// Merge results from all shards
const allResults = shardResults.flatMap(r => r.matches)
// Re-sort by score and take top K
return allResults
.sort((a, b) => b.score - a.score)
.slice(0, topK)
}
12.4 Replication
Why Replicate?
- High Availability: Survive node failures
- Read Scaling: Distribute read load
- Geographic Distribution: Lower latency for global users
Replication Patterns
Leader-Follower:
┌─────────────┐
│ Leader │ ← Writes
│ (Primary) │
└──────┬──────┘
│
┌───────┴───────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Follower 1 │ │ Follower 2 │ ← Reads
│ (Replica) │ │ (Replica) │
└─────────────┘ └─────────────┘
Multi-Leader:
┌─────────────┐ ┌─────────────┐
│ Leader 1 │◄───►│ Leader 2 │
│ (US East) │ │ (EU West) │
└─────────────┘ └─────────────┘
↕ ↕
Local Local
Reads Reads
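In a leader-follower setup, one simple way to use replicas for read scaling is to round-robin queries across followers while sending all writes to the leader. A minimal sketch; the `VectorIndex` interface here is a placeholder for whatever client you actually use:

// Round-robin reads across replicas; all writes go to the leader.
// VectorIndex is a placeholder interface, not a real client type.
interface VectorIndex {
  query(req: { vector: number[]; topK: number }): Promise<unknown>
  upsert(docs: { id: string; values: number[] }[]): Promise<void>
}

class ReplicatedIndex {
  private next = 0
  constructor(private leader: VectorIndex, private replicas: VectorIndex[]) {}

  // Writes must go through the leader so followers stay consistent
  upsert(docs: { id: string; values: number[] }[]) {
    return this.leader.upsert(docs)
  }

  // Reads rotate across replicas to spread the load
  query(vector: number[], topK: number) {
    const replica = this.replicas[this.next % this.replicas.length]
    this.next++
    return replica.query({ vector, topK })
  }
}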
Database-Specific Replication
Pinecone:
// Add replicas for read throughput
await pinecone.createIndex({
  name: 'my-index',
  dimension: 1536,
  metric: 'cosine',
  spec: {
    pod: {
      environment: 'us-east1-gcp',
      podType: 'p2.x1',
      pods: 2,
      replicas: 2 // 2 replicas per pod
    }
  }
})
Qdrant:
// Create collection with replication
await client.createCollection('my-collection', {
vectors: { size: 1536, distance: 'Cosine' },
replication_factor: 2, // 2 copies of data
write_consistency_factor: 1
})
pgvector (PostgreSQL):
-- Use PostgreSQL streaming replication
-- Or managed services like Supabase/Neon with built-in replication
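With streaming replication, the application typically splits traffic itself: writes go to the primary, similarity searches go to a read replica. A minimal sketch using the `pg` driver, assuming a `documents` table with a pgvector `embedding` column; the connection strings are placeholders:

import { Pool } from 'pg'

// Placeholders — point these at your primary and your read replica
const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL })
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL })

// Writes go to the primary
async function insertEmbedding(id: string, embedding: number[]) {
  await primary.query(
    'INSERT INTO documents (id, embedding) VALUES ($1, $2)',
    [id, JSON.stringify(embedding)]
  )
}

// Similarity-search reads can hit the replica (<=> is cosine distance)
async function search(embedding: number[], topK: number) {
  const { rows } = await replica.query(
    'SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT $2',
    [JSON.stringify(embedding), topK]
  )
  return rows
}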
12.5 Scaling Decision Framework
When to Scale
Is query latency acceptable?
├─ No → Check index tuning first
│ └─ Still slow? → Scale
└─ Yes
↓
Is throughput sufficient?
├─ No → Add replicas (read scaling)
│ └─ Still insufficient? → Shard data
└─ Yes
↓
Is update latency acceptable?
├─ No → Scale write capacity (more pods/shards)
└─ Yes
↓
Is storage sufficient?
├─ No → Add storage or shard data
└─ Yes → Current setup is fine
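The decision tree above can also be expressed as a small helper that turns observed metrics into a recommended action. The thresholds here are illustrative placeholders; substitute your own SLO targets:

// Encode the decision tree above. Thresholds are placeholders, not recommendations.
interface ScalingMetrics {
  p95QueryLatencyMs: number
  currentQps: number
  requiredQps: number
  p95UpdateLatencyMs: number
  storageUsedPct: number
}

function recommendScalingAction(m: ScalingMetrics): string {
  if (m.p95QueryLatencyMs > 200) {
    return 'Check index tuning first; if still slow, scale'
  }
  if (m.currentQps < m.requiredQps) {
    return 'Add replicas for read scaling; if still insufficient, shard data'
  }
  if (m.p95UpdateLatencyMs > 500) {
    return 'Scale write capacity (more pods/shards)'
  }
  if (m.storageUsedPct > 80) {
    return 'Add storage or shard data'
  }
  return 'Current setup is fine'
}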
Scaling Checklist
Before scaling, verify you've:
- Optimized indexes
  - Appropriate HNSW parameters
  - Right index type for workload
- Optimized queries
  - Minimal topK
  - Efficient filters
  - Caching where possible
- Optimized data
  - Removed unused vectors
  - Reduced dimensions if possible
  - Cleaned up metadata
- Profiled bottlenecks
  - Know what's actually slow
  - Measured, not guessed
12.6 Managed vs. Self-Hosted Scaling
Managed Services
Pinecone:
- Auto-scaling available
- Simple pod/replica configuration
- Limited control but minimal ops
Qdrant Cloud:
- Managed clusters
- Auto-scaling options
- More control than Pinecone
Self-Hosted Scaling
Kubernetes + Qdrant:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
            limits:
              memory: "64Gi"
              cpu: "16"
PostgreSQL + pgvector:
- Use read replicas for query scaling
- Consider Citus for horizontal scaling
- Managed options: Supabase, Neon, RDS
12.7 Cost Considerations at Scale
Cost Components
| Component | Scales With |
|---|---|
| Storage | Vector count, dimensions |
| Compute | Query volume, complexity |
| Network | Data transfer, cross-region |
| Ops | Complexity of setup |
Rough Cost Estimation
function estimateMonthlyCost(
  vectors: number,
  dimensions: number,
  qps: number,
  provider: 'pinecone' | 'qdrant-cloud' | 'self-hosted'
): number {
  // Very rough, illustrative numbers in USD/month — always check current pricing
  const gbStorage = (vectors * dimensions * 4) / 1e9

  switch (provider) {
    case 'pinecone':
      // Serverless pricing (approximate)
      return gbStorage * 0.1 + qps * 0.001
    case 'qdrant-cloud':
      // Instance-based (approximate)
      return 100 + gbStorage * 0.05
    case 'self-hosted': {
      // EC2/GCE costs (approximate), one instance per ~100 GB of vectors
      const instanceCost = Math.ceil(gbStorage / 100) * 200
      return instanceCost
    }
  }
}
console.log('1M vectors, 1536 dims, 100 QPS:')
console.log('Pinecone:', estimateMonthlyCost(1_000_000, 1536, 100, 'pinecone'))
console.log('Qdrant:', estimateMonthlyCost(1_000_000, 1536, 100, 'qdrant-cloud'))
console.log('Self-hosted:', estimateMonthlyCost(1_000_000, 1536, 100, 'self-hosted'))
Key Takeaways
- Scale vertically first until you hit limits
- Sharding distributes data, replication provides redundancy
- Query all shards for vector search (unlike key-value stores)
- Managed services simplify scaling but cost more
- Optimize before scaling—often cheaper and faster
Exercise: Scaling Plan
For a hypothetical application with:
- 50 million documents
- 1536-dimension embeddings
- 1,000 queries per second peak
- 99.9% availability requirement
- $10,000/month budget
Create a scaling plan:
- Calculate storage requirements
- Choose sharding strategy
- Determine replication factor
- Select infrastructure (managed vs. self-hosted)
- Estimate costs
Next up: Module 13 - Cost Comparison

