Performance and Cost Optimization
Introduction
A RAG system that's accurate but slow or expensive won't succeed in production. This final lesson addresses the practical concerns of building scalable RAG applications: reducing latency, managing costs, and implementing effective monitoring.
The techniques here can mean the difference between a viable product and an unsustainable prototype.
Understanding Costs
Cost Components
A typical RAG request involves multiple paid operations:
| Component | Cost Driver | Typical Cost |
|---|---|---|
| Query embedding | Input tokens | ~$0.00001 per query |
| Vector search | Database compute | Included in Supabase plan |
| LLM generation | Input + output tokens | ~$0.001-0.01 per query |
| Storage | Vectors + text | ~$0.10/GB/month |
LLM generation dominates costs—often 90%+ of API expenses.
Calculating Per-Query Cost
function estimateQueryCost(
contextTokens: number,
outputTokens: number
): { embedding: number; generation: number; total: number } {
// Gemini 1.5 Flash approximate pricing (check current rates)
  const EMBED_COST_PER_QUERY = 0.00001;
  const GEN_INPUT_COST_PER_1K = 0.000075;
  const GEN_OUTPUT_COST_PER_1K = 0.0003;
  const embeddingCost = EMBED_COST_PER_QUERY; // Flat per-query embedding estimate
const generationCost =
(contextTokens / 1000) * GEN_INPUT_COST_PER_1K +
(outputTokens / 1000) * GEN_OUTPUT_COST_PER_1K;
return {
embedding: embeddingCost,
generation: generationCost,
total: embeddingCost + generationCost
};
}
// Example: 2000 context tokens, 500 output tokens
// ≈ $0.0003 per query
// 10,000 queries/day ≈ $3/day ≈ $90/month
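A quick way to sanity-check these numbers in code (the 10,000 queries/day figure is just an illustration):
// Project monthly spend from the per-query estimate above
const perQuery = estimateQueryCost(2000, 500);
const queriesPerDay = 10_000;
const monthlyCost = perQuery.total * queriesPerDay * 30;
console.log(`≈ $${monthlyCost.toFixed(2)}/month`); // roughly $90-95/month at these rates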
Caching Strategies
What to Cache
| Item | Cache? | Why |
|---|---|---|
| Document embeddings | ✅ Store in DB | Computed once at indexing |
| Query embeddings | ⚠️ Sometimes | Same queries return same embedding |
| LLM responses | ⚠️ Carefully | Same query + context = same response |
| Search results | ⚠️ Short TTL | Documents may update |
Query Embedding Cache
For repeated queries (common in support scenarios):
import { LRUCache } from 'lru-cache';
const embeddingCache = new LRUCache<string, number[]>({
max: 1000, // Store up to 1000 embeddings
ttl: 1000 * 60 * 60, // 1 hour TTL
});
async function getCachedEmbedding(query: string): Promise<number[]> {
const cacheKey = query.toLowerCase().trim();
const cached = embeddingCache.get(cacheKey);
if (cached) {
console.log('Embedding cache hit');
return cached;
}
const embedding = await embedQuery(query);
embeddingCache.set(cacheKey, embedding);
return embedding;
}
Response Caching
Cache full responses for identical requests:
interface CachedResponse {
answer: string;
sources: string[];
timestamp: Date;
}
const responseCache = new LRUCache<string, CachedResponse>({
max: 500,
ttl: 1000 * 60 * 30, // 30 minute TTL
});
function getCacheKey(query: string, topDocIds: string[]): string {
return `${query.toLowerCase().trim()}:${topDocIds.sort().join(',')}`;
}
async function getResponseWithCache(
query: string,
context: string,
docIds: string[]
): Promise<CachedResponse> {
const cacheKey = getCacheKey(query, docIds);
const cached = responseCache.get(cacheKey);
if (cached) {
console.log('Response cache hit');
return cached;
}
const answer = await generateResponse(context, query);
const response = {
answer,
sources: docIds,
timestamp: new Date()
};
responseCache.set(cacheKey, response);
return response;
}
Warning: Response caching can serve stale answers if documents update. Use short TTLs or invalidate on document changes.
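One way to invalidate on document changes is to sweep the cache for keys that reference the changed document. A minimal sketch, relying on getCacheKey embedding doc IDs in the key (invalidateResponsesForDoc is illustrative, not part of the lesson's core code):
// Drop cached responses whose key references a changed document
function invalidateResponsesForDoc(docId: string) {
  for (const key of [...responseCache.keys()]) {
    if (key.includes(docId)) {
      responseCache.delete(key);
    }
  }
}
// Call this after re-indexing a document so stale answers are evicted immediately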
Latency Optimization
Typical Latency Breakdown
| Stage | Typical Latency | Optimization Potential |
|---|---|---|
| Parse request | < 5ms | Minimal |
| Embedding generation | 50-200ms | Caching |
| Vector search | 10-50ms | Indexing, connection pooling |
| Context building | < 5ms | Minimal |
| LLM generation | 500-3000ms | Model choice, streaming |
| Total | 600-3000ms | Dominated by LLM generation |
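The table shows that LLM generation dwarfs everything else, which is why streaming matters: it doesn't shrink total generation time, but the first tokens reach the user within a few hundred milliseconds instead of seconds. A minimal sketch, assuming the @google/generative-ai SDK:
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
// Yields text chunks as the model produces them instead of waiting for the full answer
async function* streamAnswer(prompt: string) {
  const result = await model.generateContentStream(prompt);
  for await (const chunk of result.stream) {
    yield chunk.text();
  }
}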
Parallel Operations
Run independent operations concurrently:
export async function POST(request: Request) {
const { message, conversationId } = await request.json();
// Run in parallel: embedding + history loading
const [queryEmbedding, history] = await Promise.all([
getCachedEmbedding(message),
conversationId ? getConversationHistory(conversationId) : Promise.resolve([])
]);
// Search (depends on embedding)
const { data: docs } = await supabase.rpc('search_docs', {
query_embedding: queryEmbedding,
match_count: 5
});
// Continue with generation...
}
Connection Pooling
Reuse a single Supabase client across requests instead of constructing one per call. In serverless deployments the module-level singleton survives warm invocations, avoiding repeated client setup:
// lib/supabase.ts
import { createClient, SupabaseClient } from '@supabase/supabase-js';
let supabaseInstance: SupabaseClient | null = null;
export function getSupabase(): SupabaseClient {
if (!supabaseInstance) {
supabaseInstance = createClient(
process.env.NEXT_PUBLIC_SUPABASE_URL!,
process.env.SUPABASE_SERVICE_KEY!,
{
db: {
schema: 'public',
},
auth: {
persistSession: false,
}
}
);
}
return supabaseInstance;
}
Model Selection for Speed
Choose models based on requirements:
type ModelTier = 'fast' | 'balanced' | 'accurate';
function selectModel(tier: ModelTier): string {
switch (tier) {
case 'fast':
return 'gemini-1.5-flash'; // Fastest, cheapest
case 'balanced':
return 'gemini-1.5-flash'; // Good balance
case 'accurate':
return 'gemini-1.5-pro'; // Best quality
}
}
// Use fast model for simple queries, accurate for complex ones
const complexity = assessQueryComplexity(query);
const model = selectModel(complexity > 0.7 ? 'accurate' : 'fast');
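The snippet above assumes an assessQueryComplexity helper returning a score between 0 and 1. The lesson doesn't prescribe one; a crude heuristic based on query length and structure could look like this (purely illustrative, replace with a classifier if routing accuracy matters):
// Rough heuristic: longer, multi-part, "why/how/compare" queries score higher
function assessQueryComplexity(query: string): number {
  const words = query.trim().split(/\s+/).length;
  const lengthScore = Math.min(words / 40, 1);
  const multiPart = /\band\b|,|;/.test(query) ? 0.2 : 0;
  const reasoning = /\b(why|how|compare|difference|explain)\b/i.test(query) ? 0.2 : 0;
  return Math.min(lengthScore * 0.6 + multiPart + reasoning, 1);
}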
Cost Reduction Techniques
Context Window Management
Less context = lower cost + faster response:
function optimizeContext(
docs: SearchResult[],
maxTokens: number = 2000
): string {
const contexts: string[] = [];
let totalTokens = 0;
for (const doc of docs) {
const docTokens = estimateTokens(doc.content);
if (totalTokens + docTokens > maxTokens) {
// Truncate this doc to fit
const remainingTokens = maxTokens - totalTokens;
if (remainingTokens > 100) {
contexts.push(truncateToTokens(doc.content, remainingTokens));
}
break;
}
contexts.push(doc.content);
totalTokens += docTokens;
}
return contexts.join('\n\n---\n\n');
}
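optimizeContext depends on estimateTokens and truncateToTokens. If you don't already have these from earlier lessons, a character-based approximation (roughly 4 characters per token for English prose) is a workable stand-in:
// Approximation: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  // Prefer cutting at a sentence boundary when one exists in the back half
  const slice = text.slice(0, maxChars);
  const lastPeriod = slice.lastIndexOf('.');
  return lastPeriod > maxChars * 0.5 ? slice.slice(0, lastPeriod + 1) : slice;
}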
Response Length Limits
Encourage concise responses:
const systemInstruction = `...
RESPONSE LENGTH:
- Keep responses concise (2-4 paragraphs for most questions)
- Use bullet points for lists
- Only elaborate when the user asks for more detail
`;
const generationConfig = {
maxOutputTokens: 500, // Limit response length
temperature: 0.1
};
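If you're using the @google/generative-ai SDK, both pieces are passed when the model is created (a sketch; adapt to whichever client the rest of your app uses):
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
  model: 'gemini-1.5-flash',
  systemInstruction,   // the prompt shown above
  generationConfig,    // { maxOutputTokens: 500, temperature: 0.1 }
});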
Tiered Service Levels
Offer different quality/cost tiers:
interface ServiceTier {
maxContextDocs: number;
maxOutputTokens: number;
model: string;
cacheEnabled: boolean;
}
const tiers: Record<string, ServiceTier> = {
free: {
maxContextDocs: 3,
maxOutputTokens: 300,
model: 'gemini-1.5-flash',
cacheEnabled: true
},
pro: {
maxContextDocs: 5,
maxOutputTokens: 1000,
model: 'gemini-1.5-flash',
cacheEnabled: true
},
enterprise: {
maxContextDocs: 10,
maxOutputTokens: 2000,
model: 'gemini-1.5-pro',
cacheEnabled: false // Always fresh
}
};
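Applying a tier is then a matter of threading its limits through the pipeline. A sketch, with getUserPlan as a hypothetical lookup and generateResponse assumed to accept model options (adjust to your actual helper's signature):
// Resolve the user's plan, then apply its limits end-to-end
async function answerWithTier(userId: string, query: string) {
  const plan = await getUserPlan(userId); // hypothetical: returns 'free' | 'pro' | 'enterprise'
  const tier = tiers[plan] ?? tiers.free;
  const queryEmbedding = tier.cacheEnabled
    ? await getCachedEmbedding(query)
    : await embedQuery(query);
  const { data: docs } = await supabase.rpc('search_docs', {
    query_embedding: queryEmbedding,
    match_count: tier.maxContextDocs
  });
  const context = optimizeContext(docs ?? []);
  return generateResponse(context, query, {
    model: tier.model,
    maxOutputTokens: tier.maxOutputTokens
  });
}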
Monitoring and Observability
Key Metrics to Track
interface RAGMetrics {
// Latency
embeddingLatencyMs: number;
searchLatencyMs: number;
generationLatencyMs: number;
totalLatencyMs: number;
// Quality
topResultSimilarity: number;
resultsAboveThreshold: number;
// Cost
inputTokens: number;
outputTokens: number;
estimatedCost: number;
// Usage
cacheHit: boolean;
userId?: string;
conversationId?: string;
}
async function trackRAGRequest<T>(
  fn: () => Promise<T>,
  metadata: Partial<RAGMetrics>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    const metrics = {
      ...metadata,
      totalLatencyMs: Date.now() - start,
    } as RAGMetrics;
// Send to monitoring service
await logMetrics(metrics);
return result;
} catch (error) {
await logError(error, metadata);
throw error;
}
}
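Usage is just wrapping the expensive part of the handler (runRagPipeline is a placeholder for your existing search-plus-generate flow):
// Wrap the pipeline so latency and errors are always recorded
const result = await trackRAGRequest(
  () => runRagPipeline(message, conversationId), // placeholder for search + generate
  { userId, cacheHit: false }
);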
Logging Best Practices
// Structured logging
function logRAGRequest(metrics: RAGMetrics) {
console.log(JSON.stringify({
event: 'rag_request',
timestamp: new Date().toISOString(),
...metrics
}));
}
// Sample log output
// {"event":"rag_request","timestamp":"2024-01-15T10:30:00Z","totalLatencyMs":1250,"inputTokens":2500,"outputTokens":350,"cacheHit":false,"topResultSimilarity":0.85}
Alerting Thresholds
Set up alerts for anomalies. The latency check below runs per request; error rate, cost, and cache hit rate are best evaluated over an aggregated window (see the sketch after this snippet):
const ALERT_THRESHOLDS = {
latency: 5000, // Alert if > 5 seconds
errorRate: 0.05, // Alert if > 5% errors
costPerHour: 10, // Alert if > $10/hour
cacheHitRate: 0.2 // Alert if cache hit rate < 20%
};
function checkAlerts(metrics: RAGMetrics) {
if (metrics.totalLatencyMs > ALERT_THRESHOLDS.latency) {
sendAlert('High latency detected', metrics);
}
// ... more checks
}
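A sketch of that aggregate check (checkAggregateAlerts and its inputs are illustrative; the window of metrics would come from your log store):
// Evaluate window-level thresholds, e.g. over the last hour of requests
function checkAggregateAlerts(window: RAGMetrics[], errorCount: number) {
  if (window.length === 0) return;
  const errorRate = errorCount / (window.length + errorCount);
  const hourlyCost = window.reduce((sum, m) => sum + m.estimatedCost, 0);
  const cacheHitRate = window.filter((m) => m.cacheHit).length / window.length;
  if (errorRate > ALERT_THRESHOLDS.errorRate) {
    sendAlert(`Error rate ${(errorRate * 100).toFixed(1)}% over threshold`, { errorRate });
  }
  if (hourlyCost > ALERT_THRESHOLDS.costPerHour) {
    sendAlert(`Hourly cost $${hourlyCost.toFixed(2)} over threshold`, { hourlyCost });
  }
  if (cacheHitRate < ALERT_THRESHOLDS.cacheHitRate) {
    sendAlert(`Cache hit rate ${(cacheHitRate * 100).toFixed(1)}% below threshold`, { cacheHitRate });
  }
}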
Cost Monitoring Dashboard
Essential Dashboard Components
Real-time metrics:
- Requests per minute
- Average latency
- Error rate
- Cache hit rate
Cost tracking:
- Hourly/daily/monthly spend
- Cost per user
- Cost per conversation
Quality indicators:
- Average similarity scores
- User feedback ratings
- Retrieval success rate
Simple Cost Tracking
// Track costs in database (startOfDay/endOfDay below come from date-fns)
import { startOfDay, endOfDay } from 'date-fns';
interface UsageRecord {
id: string;
userId: string;
timestamp: Date;
inputTokens: number;
outputTokens: number;
estimatedCost: number;
}
async function recordUsage(record: UsageRecord) {
await supabase.from('usage_records').insert(record);
}
// Query daily costs
async function getDailyCost(date: Date): Promise<number> {
const { data } = await supabase
.from('usage_records')
.select('estimatedCost')
    .gte('timestamp', startOfDay(date).toISOString())
    .lt('timestamp', endOfDay(date).toISOString());
return data?.reduce((sum, r) => sum + r.estimatedCost, 0) || 0;
}
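Cost per user, listed on the dashboard above, can come from the same table. A sketch that aggregates in application code (for larger volumes you'd push this into a SQL view or RPC instead):
// Aggregate spend per user for a given day
async function getDailyCostPerUser(date: Date): Promise<Map<string, number>> {
  const { data } = await supabase
    .from('usage_records')
    .select('userId, estimatedCost')
    .gte('timestamp', startOfDay(date).toISOString())
    .lt('timestamp', endOfDay(date).toISOString());
  const perUser = new Map<string, number>();
  for (const row of data ?? []) {
    perUser.set(row.userId, (perUser.get(row.userId) ?? 0) + row.estimatedCost);
  }
  return perUser;
}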
Summary
In this lesson, we covered production optimization:
Key Takeaways:
- Understand cost components: LLM generation dominates; optimize there first
- Cache strategically: Query embeddings and responses for repeated patterns
- Reduce latency with parallelism: Run independent operations concurrently
- Manage context wisely: Less context = lower cost + faster response
- Monitor everything: Track latency, cost, and quality metrics
- Set up alerts: Catch anomalies before they become problems
Module 5 Complete
Congratulations! You've completed Module 5: Optimization and Advanced RAG Techniques. You now understand:
- Hybrid search and retrieval improvement techniques
- Conversational RAG with context management
- Performance and cost optimization strategies
"Premature optimization is the root of all evil, but mature optimization is the key to production." — Adapted from Donald Knuth

