Latency and Performance Optimization
Chain latency compounds with each step. This lesson covers strategies for minimizing response times while maintaining quality.
Understanding Chain Latency
Total Latency = Σ(API Call Time) + Network Overhead + Processing Time
Sequential chains accumulate latency: a 5-step chain at roughly 2 seconds per step takes 10 seconds or more end to end.
Latency Components
Breaking Down Response Time
// Typical latency breakdown for a single API call
const latencyComponents = {
  networkRoundTrip: '50-200ms',
  tokenGeneration: '20-50ms per token',
  promptProcessing: '100-500ms (varies with length)',
  modelLoading: '0ms (warm) to 2000ms (cold start)'
};

// For a chain
const chainLatency = {
  step1: 1500, // ms
  step2: 2000,
  step3: 1800,
  step4: 2200,
  overhead: 200,
  total: 7700  // 7.7 seconds end-to-end
};
Optimization Strategies
1. Parallel Execution
Implementing Parallel Steps
async function parallelOptimizedChain(document) {
  // Phase 1: independent steps run in parallel
  const [metadata, entities, docType, summary, tags] = await Promise.all([
    extractMetadata(document),   // 1.5s
    identifyEntities(document),  // 2.0s
    classifyDocument(document),  // 1.0s
    summarizeContent(document),  // 2.5s
    generateTags(document)       // 1.0s
  ]);
  // Phase 1 total: 2.5s (the longest step)

  // Phase 2: dependent step
  const report = await createReport({
    metadata, entities, docType, summary, tags
  });
  // Phase 2: 2.0s

  // Total: 4.5s instead of 10s sequential (55% reduction)
  return report;
}
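One caveat: Promise.all fails fast, so a single failed step rejects the whole phase. If partial results are acceptable, Promise.allSettled lets the report proceed with whatever finished. A sketch, reusing the same hypothetical step functions as above:

async function parallelWithPartialResults(document) {
  const steps = {
    metadata: extractMetadata(document),
    entities: identifyEntities(document),
    docType: classifyDocument(document),
    summary: summarizeContent(document),
    tags: generateTags(document)
  };

  // allSettled waits for every step and records failures instead of throwing
  const settled = await Promise.allSettled(Object.values(steps));
  const keys = Object.keys(steps);

  const partial = {};
  settled.forEach((outcome, i) => {
    partial[keys[i]] = outcome.status === 'fulfilled' ? outcome.value : null;
  });

  return createReport(partial);
}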
2. Streaming Responses
async function streamingChain(input) {
  const results = [];

  // Stream step 1 and start step 2 as soon as we have enough context
  const stream1 = await llm.stream({ content: step1Prompt(input) });

  let partialResult = '';
  let step2Started = false;
  let step2Promise;

  for await (const chunk of stream1) {
    partialResult += chunk;
    results.push({ step: 1, chunk });

    // Start step 2 early if we have enough context
    if (!step2Started && hasEnoughContext(partialResult)) {
      step2Started = true;
      step2Promise = processStep2(partialResult);
    }
  }

  // Wait for step 2 if it was started early, otherwise start it now
  const step2Result = step2Promise
    ? await step2Promise
    : await processStep2(partialResult);

  return { step1: partialResult, step2: step2Result };
}
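The hasEnoughContext check above is left undefined because "enough" is application-specific. A minimal sketch, assuming a simple length-plus-sentence-boundary heuristic:

function hasEnoughContext(partialResult, minChars = 400) {
  // Require a minimum amount of text and a complete sentence
  // so step 2 doesn't start from a mid-sentence fragment
  const text = partialResult.trim();
  return text.length >= minChars && /[.!?]$/.test(text);
}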
3. Model Selection for Speed
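Smaller, faster models are usually good enough for simple steps like classification, tagging, or extraction, while the heavyweight model is reserved for synthesis. The sketch below assumes the same hypothetical llm.chat client used elsewhere in this lesson accepts a model option; the tier names and task mapping are illustrative, not benchmarks:

// Illustrative tiers; substitute the model names your provider actually offers
const MODEL_TIERS = {
  fast: 'small-model',      // classification, tagging, extraction
  balanced: 'medium-model', // summarization
  powerful: 'large-model'   // final report synthesis
};

function selectModel(task) {
  switch (task) {
    case 'classify':
    case 'tag':
    case 'extract':
      return MODEL_TIERS.fast;
    case 'summarize':
      return MODEL_TIERS.balanced;
    default:
      return MODEL_TIERS.powerful;
  }
}

async function runStep(task, prompt) {
  // Route each step to the cheapest model that can handle it
  return llm.chat({ model: selectModel(task), content: prompt });
}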
4. Request Batching
class BatchProcessor {
  constructor(batchSize = 10, maxWaitMs = 100) {
    this.batchSize = batchSize;
    this.maxWaitMs = maxWaitMs;
    this.queue = [];
    this.timer = null;
  }

  async add(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });

      if (this.queue.length >= this.batchSize) {
        this.flush();
      } else if (!this.timer) {
        this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
      }
    });
  }

  async flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }

    const batch = this.queue.splice(0, this.batchSize);
    if (batch.length === 0) return;

    try {
      // Process the entire batch in one API call
      const results = await this.processBatch(batch.map(b => b.input));
      batch.forEach((item, i) => item.resolve(results[i]));
    } catch (error) {
      batch.forEach(item => item.reject(error));
    }
  }

  async processBatch(inputs) {
    const prompt = `Process these ${inputs.length} items:\n${
      inputs.map((input, i) => `${i + 1}. ${input}`).join('\n')
    }\n\nReturn results as JSON array.`;

    const response = await llm.chat({ content: prompt });
    return JSON.parse(response);
  }
}
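Usage is transparent to callers: each add() returns a promise, and the batcher decides when to fire the combined request. A sketch, assuming processBatch returns one result per input in order:

async function classifySentiments(reviews) {
  const batcher = new BatchProcessor(10, 100);
  // Ten add() calls collapse into a single API request (or fewer, on timeout)
  return Promise.all(reviews.map(review => batcher.add(review)));
}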
5. Caching for Speed
class LRUCache {
  constructor(maxSize = 1000) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  get(key) {
    if (!this.cache.has(key)) return null;
    // Move to end (most recently used)
    const value = this.cache.get(key);
    this.cache.delete(key);
    this.cache.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.cache.has(key)) {
      this.cache.delete(key);
    } else if (this.cache.size >= this.maxSize) {
      // Evict the least recently used entry (first item in the Map)
      const firstKey = this.cache.keys().next().value;
      this.cache.delete(firstKey);
    }
    this.cache.set(key, value);
  }
}

// With a cache hit, the step returns almost instantly
async function cachedStep(input, cache, computeFn) {
  const key = hashInput(input);
  const cached = cache.get(key);
  if (cached !== null) {
    return cached; // near-zero latency, no API call
  }
  const result = await computeFn(input);
  cache.set(key, result);
  return result;
}
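hashInput is another assumed helper; any stable key derivation works. A minimal sketch using Node's built-in crypto module, plus an example of wiring the cache into a chain step:

const crypto = require('crypto');

function hashInput(input) {
  // Stringify objects and hash so long prompts don't become huge Map keys
  const text = typeof input === 'string' ? input : JSON.stringify(input);
  return crypto.createHash('sha256').update(text).digest('hex');
}

// Example: cache the classification step from earlier in this lesson
const classifyCache = new LRUCache(500);
const classifyWithCache = (doc) => cachedStep(doc, classifyCache, classifyDocument);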
Latency Monitoring
Measuring Chain Performance
class LatencyMonitor {
  constructor() {
    this.metrics = [];
  }

  async measure(name, fn) {
    const start = performance.now();
    try {
      const result = await fn();
      const duration = performance.now() - start;
      this.metrics.push({
        name,
        duration,
        success: true,
        timestamp: Date.now()
      });
      return result;
    } catch (error) {
      const duration = performance.now() - start;
      this.metrics.push({
        name,
        duration,
        success: false,
        error: error.message,
        timestamp: Date.now()
      });
      throw error;
    }
  }

  getStats(name) {
    const relevant = this.metrics.filter(m => m.name === name);
    if (relevant.length === 0) return null; // nothing measured for this step yet
    const durations = relevant.map(m => m.duration);
    return {
      count: relevant.length,
      avg: durations.reduce((a, b) => a + b, 0) / durations.length,
      min: Math.min(...durations),
      max: Math.max(...durations),
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99)
    };
  }
}
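getStats relies on a percentile helper that isn't shown; a minimal nearest-rank implementation:

function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, index))];
}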
// Usage
const monitor = new LatencyMonitor();

async function monitoredChain(input) {
  const step1 = await monitor.measure('step1', () => runStep1(input));
  const step2 = await monitor.measure('step2', () => runStep2(step1));
  const step3 = await monitor.measure('step3', () => runStep3(step2));
  return step3;
}
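After a few runs, the per-step stats show where the time actually goes:

// Inspect step-level latency after handling some traffic
for (const step of ['step1', 'step2', 'step3']) {
  const stats = monitor.getStats(step);
  if (stats) {
    console.log(`${step}: avg ${stats.avg.toFixed(0)}ms, p95 ${stats.p95.toFixed(0)}ms`);
  }
}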
Exercise: Optimize Chain Latency
Apply the techniques above to the 7.7-second sequential chain from the start of this lesson: parallelize the independent steps, cache repeated work, and route simpler steps to faster models, then measure how much end-to-end latency you can remove.
Key Takeaways
- Chain latency compounds with each sequential step
- Identify and parallelize independent steps
- Use streaming to reduce perceived latency
- Select faster models for simpler tasks
- Batch requests when possible
- Cache frequently repeated operations
- Monitor latency metrics continuously
- Set SLAs and design chains to meet them
Next, we'll cover monitoring and debugging chains.

