Module 12: Scaling Considerations
From Prototype to Production
Introduction
Your prototype works great with 10,000 vectors. But what happens when you have 10 million? Or 100 million?
By the end of this module, you'll understand:
- Vertical vs. horizontal scaling
- Sharding strategies
- Replication for availability
- When to scale and when to optimize
12.1 Understanding Scale
Scale Dimensions
Data Scale:
- Number of vectors
- Vector dimensions
- Metadata size
Query Scale:
- Queries per second (QPS)
- Concurrent users
- Query complexity (filters, hybrid search)
Update Scale:
- Insertions per second
- Update frequency
- Deletion patterns
Rough Scaling Tiers
| Vectors | Description | Typical Setup |
|---|---|---|
| < 100K | Small | Single node, any database |
| 100K - 1M | Medium | Single node, optimized |
| 1M - 10M | Large | Powerful single node or start sharding |
| 10M - 100M | Very Large | Sharding required |
| > 100M | Massive | Distributed cluster |
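If you want to encode these tiers in application logic (for example, to warn when a collection is approaching the next tier), a minimal sketch is below; the boundaries are the rough guidelines from the table, not hard limits.

// Rough tier classification based on the table above.
// Boundaries are approximate guidelines, not hard limits.
type ScalingTier = 'small' | 'medium' | 'large' | 'very-large' | 'massive'

function getScalingTier(vectorCount: number): ScalingTier {
  if (vectorCount < 100_000) return 'small'
  if (vectorCount < 1_000_000) return 'medium'
  if (vectorCount < 10_000_000) return 'large'
  if (vectorCount < 100_000_000) return 'very-large'
  return 'massive'
}

getScalingTier(5_000_000) // 'large' — powerful single node or start sharding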
12.2 Vertical Scaling
What is Vertical Scaling?
Adding more resources to a single machine:
- More CPU cores
- More RAM
- Faster storage (NVMe SSD)
When Vertical Scaling Works
Vectors × Dimensions × Bytes per Float + Index Overhead + Metadata
= Total Memory Required
Example:
5M vectors × 1536 dimensions × 4 bytes ≈ 30 GB of raw vectors
+ index overhead (~2× raw size) ≈ 60 GB
+ metadata ≈ 10 GB
= ~100 GB total
This fits on a single machine!
Vertical Scaling Limits
| Component | Practical Limit | Notes |
|---|---|---|
| RAM | 1-2 TB | High-end instances |
| CPU | 64-128 cores | Diminishing returns |
| Storage | Effectively unlimited | But I/O becomes the bottleneck |
When to Scale Vertically
- Dataset fits in memory
- Simpler operations
- Lower operational complexity
- Haven't hit performance limits
// Check whether the dataset could still fit on a single machine
function canScaleVertically(
  vectorCount: number,
  dimensions: number,
  replicationFactor: number = 1
): boolean {
  const memoryPerVector = dimensions * 4 // 4 bytes per float32
  const indexOverhead = 2.5              // conservative HNSW overhead multiplier
  const metadataEstimate = 1000          // ~1 KB of metadata per vector

  const totalBytes = vectorCount *
    (memoryPerVector * indexOverhead + metadataEstimate) *
    replicationFactor

  const maxMemoryBytes = 1024 * 1024 * 1024 * 1024 // 1 TB practical ceiling

  console.log(`Estimated memory: ${(totalBytes / 1e9).toFixed(2)} GB`)
  return totalBytes < maxMemoryBytes
}

canScaleVertically(10_000_000, 1536)
// Estimated memory: 163.60 GB → returns true, vertical scaling is still an option
12.3 Horizontal Scaling
What is Horizontal Scaling?
Adding more machines (nodes) to distribute the load.
- Sharding: split data across nodes
- Replication: copy data to multiple nodes
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│ (A-H) │ │ (I-P) │ │ (Q-Z) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Replica │ │ Replica │ │ Replica │
└─────────────┘ └─────────────┘ └─────────────┘
Sharding Strategies
1. Hash-Based Sharding
import crypto from 'crypto'

function getShard(vectorId: string, numShards: number): number {
  // Hash the ID so vectors are spread evenly across shards
  const hash = crypto.createHash('md5').update(vectorId).digest()
  return hash.readUInt32BE(0) % numShards
}
// Insert
const shardId = getShard(document.id, 3)
await shards[shardId].upsert([document])
// Query (must query all shards)
async function queryAllShards(vector: number[], topK: number) {
const results = await Promise.all(
shards.map(shard => shard.query({ vector, topK }))
)
// Merge and re-rank
return mergeResults(results, topK)
}
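The `mergeResults` helper above isn't defined yet; a minimal sketch, assuming each shard returns `{ matches: { id, score }[] }` and that a higher score means more similar, could be:

// Minimal merge: flatten per-shard matches, sort by score, keep top K.
// Assumes higher score = more similar (e.g. cosine similarity).
interface Match { id: string; score: number }

function mergeResults(results: { matches: Match[] }[], topK: number): Match[] {
  return results
    .flatMap(r => r.matches)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}

The same merge-and-re-rank step appears again in the `distributedQuery` example later in this section.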
2. Range-Based Sharding
// Shard by date range
function getShardByDate(date: Date): number {
const year = date.getFullYear()
if (year < 2023) return 0 // Archive
if (year === 2023) return 1
if (year === 2024) return 2
return 3 // Current
}
// Shard by category
const categoryShards = {
'technology': 0,
'science': 1,
'business': 2,
'other': 3
}
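A benefit of range-based sharding is that a query with a matching filter only needs to touch the relevant shards. A minimal sketch, assuming the four date-based shards defined above (older dates map to lower shard IDs):

// Route a query to only the shards whose date range overlaps the filter.
// Without a date filter, fall back to querying every shard.
function getShardsForDateRange(from?: Date, to?: Date): number[] {
  if (!from && !to) return [0, 1, 2, 3] // no filter → all shards
  const start = from ? getShardByDate(from) : 0
  const end = to ? getShardByDate(to) : 3
  const shardIds: number[] = []
  for (let i = start; i <= end; i++) shardIds.push(i)
  return shardIds
}

getShardsForDateRange(new Date('2024-02-01')) // [2, 3] — skips the archive shards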
3. Managed Sharding (Pinecone)
// Pinecone handles sharding automatically
await pinecone.createIndex({
name: 'my-index',
dimension: 1536,
metric: 'cosine',
spec: {
pod: {
environment: 'us-east1-gcp',
podType: 'p2.x1',
pods: 4, // Auto-sharded across 4 pods
shards: 4
}
}
})
Query Routing
For user queries, you may need to query all shards:
async function distributedQuery(
shards: VectorIndex[],
vector: number[],
topK: number,
filter?: object
): Promise<Result[]> {
// Query all shards in parallel
const shardResults = await Promise.all(
shards.map(shard => shard.query({ vector, topK, filter }))
)
// Merge results from all shards
const allResults = shardResults.flatMap(r => r.matches)
// Re-sort by score and take top K
return allResults
.sort((a, b) => b.score - a.score)
.slice(0, topK)
}
12.4 Replication
Why Replicate?
- High Availability: Survive node failures
- Read Scaling: Distribute read load
- Geographic Distribution: Lower latency for global users
Replication Patterns
Leader-Follower:
┌─────────────┐
│ Leader │ ← Writes
│ (Primary) │
└──────┬──────┘
│
┌───────┴───────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Follower 1 │ │ Follower 2 │ ← Reads
│ (Replica) │ │ (Replica) │
└─────────────┘ └─────────────┘
Multi-Leader:
┌─────────────┐ ┌─────────────┐
│ Leader 1 │◄───►│ Leader 2 │
│ (US East) │ │ (EU West) │
└─────────────┘ └─────────────┘
↕ ↕
Local Local
Reads Reads
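In a leader-follower setup, one simple way to use replicas for read scaling is to round-robin queries across followers while sending all writes to the leader. A minimal sketch; the `VectorIndex` interface here is a placeholder for whatever client you actually use:

// Round-robin reads across replicas; all writes go to the leader.
// VectorIndex is a placeholder interface, not a real client type.
interface VectorIndex {
  query(req: { vector: number[]; topK: number }): Promise<unknown>
  upsert(docs: { id: string; values: number[] }[]): Promise<void>
}

class ReplicatedIndex {
  private next = 0
  constructor(private leader: VectorIndex, private replicas: VectorIndex[]) {}

  // Writes must go through the leader so followers stay consistent
  upsert(docs: { id: string; values: number[] }[]) {
    return this.leader.upsert(docs)
  }

  // Reads rotate across replicas to spread the load
  query(vector: number[], topK: number) {
    const replica = this.replicas[this.next % this.replicas.length]
    this.next++
    return replica.query({ vector, topK })
  }
}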
Database-Specific Replication
Pinecone:
// Add replicas for read throughput
await pinecone.createIndex({
  name: 'my-index',
  dimension: 1536,
  metric: 'cosine',
  spec: {
    pod: {
      environment: 'us-east1-gcp',
      podType: 'p2.x1',
      pods: 2,
      replicas: 2 // 2 replicas per pod
    }
  }
})
Qdrant:
// Create collection with replication
await client.createCollection('my-collection', {
vectors: { size: 1536, distance: 'Cosine' },
replication_factor: 2, // 2 copies of data
write_consistency_factor: 1
})
pgvector (PostgreSQL):
-- Use PostgreSQL streaming replication
-- Or managed services like Supabase/Neon with built-in replication
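With streaming replication, the application typically splits traffic itself: writes go to the primary, similarity searches go to a read replica. A minimal sketch using the `pg` driver, assuming a `documents` table with a pgvector `embedding` column; the connection strings are placeholders:

import { Pool } from 'pg'

// Placeholders — point these at your primary and your read replica
const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL })
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL })

// Writes go to the primary
async function insertEmbedding(id: string, embedding: number[]) {
  await primary.query(
    'INSERT INTO documents (id, embedding) VALUES ($1, $2)',
    [id, JSON.stringify(embedding)]
  )
}

// Similarity-search reads can hit the replica (<=> is cosine distance)
async function search(embedding: number[], topK: number) {
  const { rows } = await replica.query(
    'SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT $2',
    [JSON.stringify(embedding), topK]
  )
  return rows
}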
12.5 Scaling Decision Framework
When to Scale
Is query latency acceptable?
├─ No → Check index tuning first
│ └─ Still slow? → Scale
└─ Yes
↓
Is throughput sufficient?
├─ No → Add replicas (read scaling)
│ └─ Still insufficient? → Shard data
└─ Yes
↓
Is update latency acceptable?
├─ No → Scale write capacity (more pods/shards)
└─ Yes
↓
Is storage sufficient?
├─ No → Add storage or shard data
└─ Yes → Current setup is fine
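The decision tree above can also be expressed as a small helper that turns observed metrics into a recommended action. The thresholds here are illustrative placeholders; substitute your own SLO targets:

// Encode the decision tree above. Thresholds are placeholders, not recommendations.
interface ScalingMetrics {
  p95QueryLatencyMs: number
  currentQps: number
  requiredQps: number
  p95UpdateLatencyMs: number
  storageUsedPct: number
}

function recommendScalingAction(m: ScalingMetrics): string {
  if (m.p95QueryLatencyMs > 200) {
    return 'Check index tuning first; if still slow, scale'
  }
  if (m.currentQps < m.requiredQps) {
    return 'Add replicas for read scaling; if still insufficient, shard data'
  }
  if (m.p95UpdateLatencyMs > 500) {
    return 'Scale write capacity (more pods/shards)'
  }
  if (m.storageUsedPct > 80) {
    return 'Add storage or shard data'
  }
  return 'Current setup is fine'
}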
Scaling Checklist
Before scaling, verify you've:
- Optimized indexes
  - Appropriate HNSW parameters
  - Right index type for workload
- Optimized queries
  - Minimal topK
  - Efficient filters
  - Caching where possible
- Optimized data
  - Removed unused vectors
  - Reduced dimensions if possible
  - Cleaned up metadata
- Profiled bottlenecks
  - Know what's actually slow
  - Measured, not guessed
12.6 Managed vs. Self-Hosted Scaling
Managed Services
Pinecone:
- Auto-scaling available
- Simple pod/replica configuration
- Limited control but minimal ops
Qdrant Cloud:
- Managed clusters
- Auto-scaling options
- More control than Pinecone
Self-Hosted Scaling
Kubernetes + Qdrant:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
            limits:
              memory: "64Gi"
              cpu: "16"
PostgreSQL + pgvector:
- Use read replicas for query scaling
- Consider Citus for horizontal scaling
- Managed options: Supabase, Neon, RDS
12.7 Cost Considerations at Scale
Cost Components
| Component | Scales With |
|---|---|
| Storage | Vector count, dimensions |
| Compute | Query volume, complexity |
| Network | Data transfer, cross-region |
| Ops | Complexity of setup |
Rough Cost Estimation
function estimateMonthlyCost(
  vectors: number,
  dimensions: number,
  qps: number,
  provider: 'pinecone' | 'qdrant-cloud' | 'self-hosted'
): number {
  // Very rough, illustrative numbers in USD/month — always check current pricing
  const gbStorage = (vectors * dimensions * 4) / 1e9

  switch (provider) {
    case 'pinecone':
      // Serverless pricing (approximate)
      return gbStorage * 0.1 + qps * 0.001
    case 'qdrant-cloud':
      // Instance-based (approximate)
      return 100 + gbStorage * 0.05
    case 'self-hosted': {
      // EC2/GCE costs (approximate), one instance per ~100 GB of vectors
      const instanceCost = Math.ceil(gbStorage / 100) * 200
      return instanceCost
    }
  }
}
console.log('1M vectors, 1536 dims, 100 QPS:')
console.log('Pinecone:', estimateMonthlyCost(1_000_000, 1536, 100, 'pinecone'))
console.log('Qdrant:', estimateMonthlyCost(1_000_000, 1536, 100, 'qdrant-cloud'))
console.log('Self-hosted:', estimateMonthlyCost(1_000_000, 1536, 100, 'self-hosted'))
Key Takeaways
- Scale vertically first until you hit limits
- Sharding distributes data, replication provides redundancy
- Query all shards for vector search (unlike key-value stores)
- Managed services simplify scaling but cost more
- Optimize before scaling—often cheaper and faster
Exercise: Scaling Plan
For a hypothetical application with:
- 50 million documents
- 1536-dimension embeddings
- 1,000 queries per second peak
- 99.9% availability requirement
- $10,000/month budget
Create a scaling plan:
- Calculate storage requirements
- Choose sharding strategy
- Determine replication factor
- Select infrastructure (managed vs. self-hosted)
- Estimate costs
Next up: Module 13 - Cost Comparison

