Latency and Performance Optimization
Chain latency compounds with each step. This lesson covers strategies for minimizing response times while maintaining quality.
Understanding Chain Latency
Total Latency = Σ(API Call Time) + Network Overhead + Processing Time
Sequential chains accumulate latency: a 5-step chain at roughly 2 seconds per step takes 10 seconds or more end to end.
Latency Components
Breaking Down Response Time
// Typical latency breakdown for a single API call
const latencyComponents = {
  networkRoundTrip: '50-200ms',
  tokenGeneration: '20-50ms per token',
  promptProcessing: '100-500ms (varies with length)',
  modelLoading: '0ms (warm) to 2000ms (cold start)'
};

// For a chain
const chainLatency = {
  step1: 1500, // ms
  step2: 2000,
  step3: 1800,
  step4: 2200,
  overhead: 200,
  total: 7700  // 7.7 seconds end-to-end
};
Optimization Strategies
1. Parallel Execution
Implementing Parallel Steps
async function parallelOptimizedChain(document) {
  // Phase 1: independent steps run in parallel
  const [metadata, entities, docType, summary, tags] = await Promise.all([
    extractMetadata(document),   // 1.5s
    identifyEntities(document),  // 2.0s
    classifyDocument(document),  // 1.0s
    summarizeContent(document),  // 2.5s
    generateTags(document)       // 1.0s
  ]);
  // Phase 1 total: 2.5s (the longest step)

  // Phase 2: dependent step
  const report = await createReport({
    metadata, entities, docType, summary, tags
  });
  // Phase 2: 2.0s

  // Total: 4.5s instead of 10s sequential (55% reduction)
  return report;
}
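One caveat: Promise.all fails fast, so a single failed step rejects the whole phase. If partial results are acceptable, Promise.allSettled lets the report proceed with whatever finished. A sketch, reusing the same hypothetical step functions as above:

async function parallelWithPartialResults(document) {
  const steps = {
    metadata: extractMetadata(document),
    entities: identifyEntities(document),
    docType: classifyDocument(document),
    summary: summarizeContent(document),
    tags: generateTags(document)
  };

  // allSettled waits for every step and records failures instead of throwing
  const settled = await Promise.allSettled(Object.values(steps));
  const keys = Object.keys(steps);

  const partial = {};
  settled.forEach((outcome, i) => {
    partial[keys[i]] = outcome.status === 'fulfilled' ? outcome.value : null;
  });

  return createReport(partial);
}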
2. Streaming Responses
async function streamingChain(input) {
  const results = [];

  // Stream step 1 and start step 2 as soon as we have enough context
  const stream1 = await llm.stream({ content: step1Prompt(input) });

  let partialResult = '';
  let step2Started = false;
  let step2Promise;

  for await (const chunk of stream1) {
    partialResult += chunk;
    results.push({ step: 1, chunk });

    // Start step 2 early if we have enough context
    if (!step2Started && hasEnoughContext(partialResult)) {
      step2Started = true;
      step2Promise = processStep2(partialResult);
    }
  }

  // Wait for step 2 if it was started early, otherwise start it now
  const step2Result = step2Promise
    ? await step2Promise
    : await processStep2(partialResult);

  return { step1: partialResult, step2: step2Result };
}
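The hasEnoughContext check above is left undefined because "enough" is application-specific. A minimal sketch, assuming a simple length-plus-sentence-boundary heuristic:

function hasEnoughContext(partialResult, minChars = 400) {
  // Require a minimum amount of text and a complete sentence
  // so step 2 doesn't start from a mid-sentence fragment
  const text = partialResult.trim();
  return text.length >= minChars && /[.!?]$/.test(text);
}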
3. Model Selection for Speed
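Smaller, faster models are usually good enough for simple steps like classification, tagging, or extraction, while the heavyweight model is reserved for synthesis. The sketch below assumes the same hypothetical llm.chat client used elsewhere in this lesson accepts a model option; the tier names and task mapping are illustrative, not benchmarks:

// Illustrative tiers; substitute the model names your provider actually offers
const MODEL_TIERS = {
  fast: 'small-model',      // classification, tagging, extraction
  balanced: 'medium-model', // summarization
  powerful: 'large-model'   // final report synthesis
};

function selectModel(task) {
  switch (task) {
    case 'classify':
    case 'tag':
    case 'extract':
      return MODEL_TIERS.fast;
    case 'summarize':
      return MODEL_TIERS.balanced;
    default:
      return MODEL_TIERS.powerful;
  }
}

async function runStep(task, prompt) {
  // Route each step to the cheapest model that can handle it
  return llm.chat({ model: selectModel(task), content: prompt });
}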
4. Request Batching
class BatchProcessor {
  constructor(batchSize = 10, maxWaitMs = 100) {
    this.batchSize = batchSize;
    this.maxWaitMs = maxWaitMs;
    this.queue = [];
    this.timer = null;
  }

  async add(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });

      if (this.queue.length >= this.batchSize) {
        this.flush();
      } else if (!this.timer) {
        this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
      }
    });
  }

  async flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }

    const batch = this.queue.splice(0, this.batchSize);
    if (batch.length === 0) return;

    try {
      // Process the entire batch in one API call
      const results = await this.processBatch(batch.map(b => b.input));
      batch.forEach((item, i) => item.resolve(results[i]));
    } catch (error) {
      batch.forEach(item => item.reject(error));
    }
  }

  async processBatch(inputs) {
    const prompt = `Process these ${inputs.length} items:\n${
      inputs.map((input, i) => `${i + 1}. ${input}`).join('\n')
    }\n\nReturn results as JSON array.`;

    const response = await llm.chat({ content: prompt });
    return JSON.parse(response);
  }
}
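Usage is transparent to callers: each add() returns a promise, and the batcher decides when to fire the combined request. A sketch, assuming processBatch returns one result per input in order:

async function classifySentiments(reviews) {
  const batcher = new BatchProcessor(10, 100);
  // Ten add() calls collapse into a single API request (or fewer, on timeout)
  return Promise.all(reviews.map(review => batcher.add(review)));
}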
5. Caching for Speed
class LRUCache {
  constructor(maxSize = 1000) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  get(key) {
    if (!this.cache.has(key)) return null;
    // Move to end (most recently used)
    const value = this.cache.get(key);
    this.cache.delete(key);
    this.cache.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.cache.has(key)) {
      this.cache.delete(key);
    } else if (this.cache.size >= this.maxSize) {
      // Evict the least recently used entry (first item in the Map)
      const firstKey = this.cache.keys().next().value;
      this.cache.delete(firstKey);
    }
    this.cache.set(key, value);
  }
}

// With a cache hit, the step returns almost instantly
async function cachedStep(input, cache, computeFn) {
  const key = hashInput(input);
  const cached = cache.get(key);
  if (cached !== null) {
    return cached; // near-zero latency, no API call
  }
  const result = await computeFn(input);
  cache.set(key, result);
  return result;
}
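hashInput is another assumed helper; any stable key derivation works. A minimal sketch using Node's built-in crypto module, plus an example of wiring the cache into a chain step:

const crypto = require('crypto');

function hashInput(input) {
  // Stringify objects and hash so long prompts don't become huge Map keys
  const text = typeof input === 'string' ? input : JSON.stringify(input);
  return crypto.createHash('sha256').update(text).digest('hex');
}

// Example: cache the classification step from earlier in this lesson
const classifyCache = new LRUCache(500);
const classifyWithCache = (doc) => cachedStep(doc, classifyCache, classifyDocument);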
Latency Monitoring
Measuring Chain Performance
class LatencyMonitor {
  constructor() {
    this.metrics = [];
  }

  async measure(name, fn) {
    const start = performance.now();
    try {
      const result = await fn();
      const duration = performance.now() - start;
      this.metrics.push({
        name,
        duration,
        success: true,
        timestamp: Date.now()
      });
      return result;
    } catch (error) {
      const duration = performance.now() - start;
      this.metrics.push({
        name,
        duration,
        success: false,
        error: error.message,
        timestamp: Date.now()
      });
      throw error;
    }
  }

  getStats(name) {
    const relevant = this.metrics.filter(m => m.name === name);
    if (relevant.length === 0) return null; // nothing measured for this step yet
    const durations = relevant.map(m => m.duration);
    return {
      count: relevant.length,
      avg: durations.reduce((a, b) => a + b, 0) / durations.length,
      min: Math.min(...durations),
      max: Math.max(...durations),
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99)
    };
  }
}
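getStats relies on a percentile helper that isn't shown; a minimal nearest-rank implementation:

function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, index))];
}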
// Usage
const monitor = new LatencyMonitor();

async function monitoredChain(input) {
  const step1 = await monitor.measure('step1', () => runStep1(input));
  const step2 = await monitor.measure('step2', () => runStep2(step1));
  const step3 = await monitor.measure('step3', () => runStep3(step2));
  return step3;
}
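After a few runs, the per-step stats show where the time actually goes:

// Inspect step-level latency after handling some traffic
for (const step of ['step1', 'step2', 'step3']) {
  const stats = monitor.getStats(step);
  if (stats) {
    console.log(`${step}: avg ${stats.avg.toFixed(0)}ms, p95 ${stats.p95.toFixed(0)}ms`);
  }
}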
Exercise: Optimize Chain Latency
Apply the techniques above to the 7.7-second sequential chain from the start of this lesson: parallelize the independent steps, cache repeated work, and route simpler steps to faster models, then measure how much end-to-end latency you can remove.
Key Takeaways
- Chain latency compounds with each sequential step
- Identify and parallelize independent steps
- Use streaming to reduce perceived latency
- Select faster models for simpler tasks
- Batch requests when possible
- Cache frequently repeated operations
- Monitor latency metrics continuously
- Set SLAs and design chains to meet them
Next, we'll cover monitoring and debugging chains.

