The Generation Phase: Calling Gemini
Introduction
We've retrieved context and crafted our prompt. Now we call the LLM to generate a response. This lesson covers the generation phase—making API calls to Gemini, understanding the response structure, and implementing streaming for real-time token delivery.
The generation phase is where RAG comes together: grounded context meets intelligent synthesis, producing answers that are both accurate and well-articulated.
Gemini API Fundamentals
Model Selection
Google offers several Gemini models:
| Model | Best For | Context Window | Speed |
|---|---|---|---|
| gemini-1.5-flash | Fast responses, cost-effective | 1M tokens | Fastest |
| gemini-1.5-pro | Complex reasoning, accuracy | 2M tokens | Moderate |
| gemini-1.0-pro | Legacy applications | 32K tokens | Fast |
For RAG applications, gemini-1.5-flash is often the best choice—it's fast, cost-effective, and capable enough for most documentation Q&A.
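If you want the choice to stay configurable, a small helper can map a per-request flag to a model ID. This is a sketch with hypothetical names (MODELS, pickModel), not part of the SDK:
// Hypothetical helper: choose a model ID per request
const MODELS = {
  fast: 'gemini-1.5-flash',    // default for documentation Q&A
  reasoning: 'gemini-1.5-pro'  // reserve for complex, multi-step questions
} as const;

function pickModel(needsDeepReasoning = false): string {
  return needsDeepReasoning ? MODELS.reasoning : MODELS.fast;
}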
API Structure
The Gemini API uses a conversational structure:
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
// Simple generation
const result = await model.generateContent({
contents: [
{
role: 'user',
parts: [{ text: prompt }]
}
],
generationConfig: {
temperature: 0.1,
maxOutputTokens: 1024
}
});
const response = result.response.text();
System Instructions
Gemini supports system instructions as a separate parameter:
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: `You are a documentation assistant. Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have information about that."`
});
const result = await model.generateContent({
contents: [
{
role: 'user',
parts: [{ text: `CONTEXT:\n${context}\n\nQUESTION: ${query}` }]
}
]
});
Benefits of a separate system instruction:
- Clearer API structure: behavioral rules stay out of the user message, so they can't be confused with retrieved context
- The instruction applies to every turn of a chat session without being repeated in each message
- Easier to manage and version in code
Building the Generation Request
Complete Request Structure
interface GenerationRequest {
model: string;
systemInstruction: string;
context: string;
query: string;
config: {
temperature: number;
maxOutputTokens: number;
topK: number;
topP: number;
};
}
async function generateResponse(req: GenerationRequest): Promise<string> {
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: req.model,
systemInstruction: req.systemInstruction
});
const prompt = `CONTEXT:
${req.context}
---
USER QUESTION: ${req.query}`;
const result = await model.generateContent({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: {
temperature: req.config.temperature,
maxOutputTokens: req.config.maxOutputTokens,
topK: req.config.topK,
topP: req.config.topP
}
});
return result.response.text();
}
Default Configuration
const defaultConfig = {
temperature: 0.1, // Low for factual accuracy
maxOutputTokens: 1024, // Reasonable response length
topK: 40, // Standard diversity
topP: 0.95 // Nucleus sampling threshold
};
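Putting the pieces together, a call might look like the sketch below; systemPrompt, context, and userQuestion are assumed to come from the earlier retrieval and prompt-engineering steps:
const answer = await generateResponse({
  model: 'gemini-1.5-flash',
  systemInstruction: systemPrompt, // grounding rules from the previous lesson
  context,                         // assembled from the retrieved chunks
  query: userQuestion,
  config: defaultConfig
});
console.log(answer);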
The Power of Streaming
Why Streaming Matters
Non-streaming (batch) generation waits for the entire response before returning:
User sends query ──────────────────────────────────►
[LLM generating...]
[still generating...]
[done!]
◄────────────────────────────────── Complete response
Total wait: 2-5 seconds before user sees anything.
Streaming returns tokens as they're generated:
User sends query ─────►
◄─ "To"
◄─ " configure"
◄─ " authentication"
◄─ ","
◄─ " you"
◄─ " need"
...
◄─ "[end]"
The user typically sees the first token within a few hundred milliseconds, which feels dramatically more responsive.
Implementing Streaming with Gemini
async function generateStreamingResponse(
prompt: string,
systemInstruction: string
): Promise<ReadableStream> {
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction
});
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: {
temperature: 0.1,
maxOutputTokens: 1024
}
});
// Convert to Web Streams API
const encoder = new TextEncoder();
return new ReadableStream({
async start(controller) {
try {
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(encoder.encode(text));
}
}
controller.close();
} catch (error) {
controller.error(error);
}
}
});
}
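In a route handler, the returned stream can be handed straight to a Response. A minimal sketch, assuming prompt and systemInstruction were built as shown earlier:
const stream = await generateStreamingResponse(prompt, systemInstruction);
return new Response(stream, {
  headers: { 'Content-Type': 'text/plain; charset=utf-8' }
});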
Next.js Streaming Response
// app/api/chat/route.ts
import { GoogleGenerativeAI } from '@google/generative-ai';
export async function POST(request: Request) {
const { message } = await request.json();
  // 1. Get query embedding (embedQuery and the supabase client come from earlier lessons)
  const queryEmbedding = await embedQuery(message);
// 2. Search for context
const { data: docs } = await supabase.rpc('search_docs', {
query_embedding: queryEmbedding,
match_count: 5
});
// 3. Build context
const context = docs
.map((d: any) => `[${d.source}]\n${d.content}`)
.join('\n\n---\n\n');
// 4. Generate streaming response
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: 'You are a documentation assistant. Answer using ONLY the provided context.'
});
const prompt = `CONTEXT:\n${context}\n\n---\n\nQUESTION: ${message}`;
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: { temperature: 0.1 }
});
// 5. Create streaming response
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(encoder.encode(text));
}
}
controller.close();
}
});
  return new Response(stream, {
    // Chunked transfer is handled by the runtime; only the content type needs to be set
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  });
}
Client-Side Stream Processing
// React component
async function sendMessage(message: string) {
setIsLoading(true);
setResponse('');
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message })
});
if (!response.body) {
throw new Error('No response body');
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true });
setResponse(prev => prev + chunk);
}
setIsLoading(false);
}
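The handler above relies on two pieces of component state; a minimal setup is sketched below (ChatPanel is a hypothetical name, and only the state the handler uses is shown):
'use client';
import { useState } from 'react';

export function ChatPanel() {
  const [response, setResponse] = useState('');
  const [isLoading, setIsLoading] = useState(false);

  // Define sendMessage (as above) inside the component so it closes over these setters,
  // then render the response text and a loading indicator here.
  return null;
}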
Server-Sent Events (SSE) Pattern
For more structured streaming, use Server-Sent Events:
Server Implementation
// app/api/chat/route.ts
export async function POST(request: Request) {
const { message } = await request.json();
// ... retrieval and context building ...
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }]
});
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
// Send initial metadata
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'start', sources: docs.map(d => d.source) })}\n\n`)
);
// Stream content
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'content', text })}\n\n`)
);
}
}
// Send completion signal
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'done' })}\n\n`)
);
controller.close();
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
}
Client Implementation
async function sendMessageSSE(message: string) {
setIsLoading(true);
setResponse('');
setSources([]);
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message })
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
  // SSE events can be split across reads, so buffer until a full "\n\n"-delimited event arrives
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split('\n\n');
    buffer = events.pop() ?? ''; // keep any incomplete event for the next read
    for (const event of events) {
      if (!event.startsWith('data: ')) continue;
      const data = JSON.parse(event.slice(6));
      switch (data.type) {
        case 'start':
          setSources(data.sources);
          break;
        case 'content':
          setResponse(prev => prev + data.text);
          break;
        case 'done':
          setIsLoading(false);
          break;
      }
    }
  }
}
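Since both sides now exchange structured JSON, it can help to share one event type between the route and the component. A sketch (the module path is hypothetical) whose shape mirrors the events emitted above:
// lib/sse-events.ts (hypothetical location)
export type SSEEvent =
  | { type: 'start'; sources: string[] }
  | { type: 'content'; text: string }
  | { type: 'done' };
Casting the parsed JSON to SSEEvent (const data = JSON.parse(event.slice(6)) as SSEEvent) lets TypeScript verify that the switch handles every event type.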
Error Handling
Common Errors
Rate Limiting:
// Wrap the call so rate-limit errors are retried with backoff
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

async function generateWithRetry(request: GenerateContentRequest, retries = 3) {
  try {
    return await model.generateContent(request);
  } catch (error) {
    const message = error instanceof Error ? error.message : '';
    if (retries > 0 && (message.includes('429') || message.includes('quota'))) {
      // Rate limited - back off before retrying with one fewer attempt remaining
      await sleep(1000 * (4 - retries));
      return generateWithRetry(request, retries - 1);
    }
    throw error;
  }
}
Content Filtering:
const result = await model.generateContent(request);
// Check if response was blocked
if (result.response.promptFeedback?.blockReason) {
return {
error: 'Content was filtered',
reason: result.response.promptFeedback.blockReason
};
}
Token Limit Exceeded:
// Before calling: estimateTokens and truncateToTokenLimit are your own helpers
// (a rough rule of thumb is ~4 characters per token)
if (estimateTokens(prompt) > 128000) {
  // Truncate the context or summarize it first, leaving headroom for the answer
  context = truncateToTokenLimit(context, 100000);
}
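If a rough character-based estimate feels too coarse, the SDK's countTokens method can measure the prompt before sending it. A sketch, where truncateToTokenLimit is the same hypothetical helper as above:
const { totalTokens } = await model.countTokens(prompt);
if (totalTokens > 100_000) {
  // Leave headroom below the model's input limit for the generated answer
  context = truncateToTokenLimit(context, 100_000);
}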
Robust Generation Function
interface GenerationResult {
success: boolean;
text?: string;
error?: string;
sources?: string[];
}
async function safeGenerateResponse(
context: string,
query: string,
sources: string[]
): Promise<GenerationResult> {
try {
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: systemPrompt
});
const result = await model.generateContent({
contents: [{ role: 'user', parts: [{ text: `CONTEXT:\n${context}\n\nQUESTION: ${query}` }] }],
generationConfig: { temperature: 0.1, maxOutputTokens: 1024 }
});
// Check for content filtering
if (result.response.promptFeedback?.blockReason) {
return {
success: false,
error: `Response blocked: ${result.response.promptFeedback.blockReason}`
};
}
const text = result.response.text();
if (!text) {
return {
success: false,
error: 'Empty response from model'
};
}
return {
success: true,
text,
sources
};
} catch (error) {
console.error('Generation error:', error);
// Categorize error
if (error instanceof Error) {
if (error.message.includes('429')) {
return { success: false, error: 'Rate limit exceeded. Please try again later.' };
}
if (error.message.includes('quota')) {
return { success: false, error: 'API quota exceeded.' };
}
}
return { success: false, error: 'An unexpected error occurred.' };
}
}
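Inside a route handler, the result maps cleanly onto an HTTP response. A sketch that assumes message, docs, and context were built the same way as in the streaming route earlier:
const result = await safeGenerateResponse(
  context,
  message,
  docs.map((d: any) => d.source)
);

if (!result.success) {
  return Response.json({ error: result.error }, { status: 500 });
}

return Response.json({ answer: result.text, sources: result.sources });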
Summary
In this lesson, we covered the generation phase of RAG:
Key Takeaways:
- Choose the right model: gemini-1.5-flash for speed and cost, gemini-1.5-pro for complex reasoning
- Streaming transforms the user experience: the first token arrives in well under a second instead of after a 2-5 second wait
- System instructions separate concerns: cleaner code, better organization
- Low temperature for RAG: 0.1-0.3 keeps responses factual and consistent
- Handle errors gracefully: rate limits, content filtering, and token limits all need handling
Module 3 Complete
Congratulations! You've completed Module 3: The RAG Core. You now understand:
- The science of vector similarity search
- How to engineer prompts for grounded responses
- How to call Gemini and implement streaming
In Module 4, we'll build the Production-Ready Chat Architecture—frontend-backend communication, security considerations, and implementing attribution systems.
"The best interfaces disappear—they let the conversation flow naturally." — Unknown

