The Generation Phase: Calling Gemini
Introduction
We've retrieved context and crafted our prompt. Now we call the LLM to generate a response. This lesson covers the generation phase—making API calls to Gemini, understanding the response structure, and implementing streaming for real-time token delivery.
The generation phase is where RAG comes together: grounded context meets intelligent synthesis, producing answers that are both accurate and well-articulated.
Gemini API Fundamentals
Model Selection
Google offers several Gemini models:
| Model | Best For | Context Window | Speed |
|---|---|---|---|
| gemini-1.5-flash | Fast responses, cost-effective | 1M tokens | Fastest |
| gemini-1.5-pro | Complex reasoning, accuracy | 2M tokens | Moderate |
| gemini-1.0-pro | Legacy applications | 32K tokens | Fast |
For RAG applications, gemini-1.5-flash is often the best choice—it's fast, cost-effective, and capable enough for most documentation Q&A.
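If you want the choice to stay configurable, a small helper can map a per-request flag to a model ID. This is a sketch with hypothetical names (MODELS, pickModel), not part of the SDK:
// Hypothetical helper: choose a model ID per request
const MODELS = {
  fast: 'gemini-1.5-flash',    // default for documentation Q&A
  reasoning: 'gemini-1.5-pro'  // reserve for complex, multi-step questions
} as const;

function pickModel(needsDeepReasoning = false): string {
  return needsDeepReasoning ? MODELS.reasoning : MODELS.fast;
}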
API Structure
The Gemini API uses a conversational structure:
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
// Simple generation
const result = await model.generateContent({
contents: [
{
role: 'user',
parts: [{ text: prompt }]
}
],
generationConfig: {
temperature: 0.1,
maxOutputTokens: 1024
}
});
const response = result.response.text();
System Instructions
Gemini supports system instructions as a separate parameter:
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: `You are a documentation assistant. Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have information about that."`
});
const result = await model.generateContent({
contents: [
{
role: 'user',
parts: [{ text: `CONTEXT:\n${context}\n\nQUESTION: ${query}` }]
}
]
});
Benefits of a separate system instruction:
- Clearer API structure: behavioral rules stay out of the user message, so they can't be confused with retrieved context
- The instruction applies to every turn of a chat session without being repeated in each message
- Easier to manage and version in code
Building the Generation Request
Complete Request Structure
interface GenerationRequest {
model: string;
systemInstruction: string;
context: string;
query: string;
config: {
temperature: number;
maxOutputTokens: number;
topK: number;
topP: number;
};
}
async function generateResponse(req: GenerationRequest): Promise<string> {
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: req.model,
systemInstruction: req.systemInstruction
});
const prompt = `CONTEXT:
${req.context}
---
USER QUESTION: ${req.query}`;
const result = await model.generateContent({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: {
temperature: req.config.temperature,
maxOutputTokens: req.config.maxOutputTokens,
topK: req.config.topK,
topP: req.config.topP
}
});
return result.response.text();
}
Default Configuration
const defaultConfig = {
temperature: 0.1, // Low for factual accuracy
maxOutputTokens: 1024, // Reasonable response length
topK: 40, // Standard diversity
topP: 0.95 // Nucleus sampling threshold
};
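Putting the pieces together, a call might look like the sketch below; systemPrompt, context, and userQuestion are assumed to come from the earlier retrieval and prompt-engineering steps:
const answer = await generateResponse({
  model: 'gemini-1.5-flash',
  systemInstruction: systemPrompt, // grounding rules from the previous lesson
  context,                         // assembled from the retrieved chunks
  query: userQuestion,
  config: defaultConfig
});
console.log(answer);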
The Power of Streaming
Why Streaming Matters
Non-streaming (batch) generation waits for the entire response before returning:
User sends query ──────────────────────────────────►
[LLM generating...]
[still generating...]
[done!]
◄────────────────────────────────── Complete response
Total wait: 2-5 seconds before user sees anything.
Streaming returns tokens as they're generated:
User sends query ─────►
◄─ "To"
◄─ " configure"
◄─ " authentication"
◄─ ","
◄─ " you"
◄─ " need"
...
◄─ "[end]"
The user typically sees the first token within a few hundred milliseconds, which feels dramatically more responsive.
Implementing Streaming with Gemini
async function generateStreamingResponse(
prompt: string,
systemInstruction: string
): Promise<ReadableStream> {
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction
});
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: {
temperature: 0.1,
maxOutputTokens: 1024
}
});
// Convert to Web Streams API
const encoder = new TextEncoder();
return new ReadableStream({
async start(controller) {
try {
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(encoder.encode(text));
}
}
controller.close();
} catch (error) {
controller.error(error);
}
}
});
}
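In a route handler, the returned stream can be handed straight to a Response. A minimal sketch, assuming prompt and systemInstruction were built as shown earlier:
const stream = await generateStreamingResponse(prompt, systemInstruction);
return new Response(stream, {
  headers: { 'Content-Type': 'text/plain; charset=utf-8' }
});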
Next.js Streaming Response
// app/api/chat/route.ts
import { GoogleGenerativeAI } from '@google/generative-ai';
export async function POST(request: Request) {
const { message } = await request.json();
  // 1. Get query embedding (embedQuery and the supabase client come from earlier lessons)
  const queryEmbedding = await embedQuery(message);
// 2. Search for context
const { data: docs } = await supabase.rpc('search_docs', {
query_embedding: queryEmbedding,
match_count: 5
});
// 3. Build context
const context = docs
.map((d: any) => `[${d.source}]\n${d.content}`)
.join('\n\n---\n\n');
// 4. Generate streaming response
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: 'You are a documentation assistant. Answer using ONLY the provided context.'
});
const prompt = `CONTEXT:\n${context}\n\n---\n\nQUESTION: ${message}`;
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: { temperature: 0.1 }
});
// 5. Create streaming response
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(encoder.encode(text));
}
}
controller.close();
}
});
  return new Response(stream, {
    // Chunked transfer is handled by the runtime; only the content type needs to be set
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  });
}
Client-Side Stream Processing
// React component
async function sendMessage(message: string) {
setIsLoading(true);
setResponse('');
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message })
});
if (!response.body) {
throw new Error('No response body');
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true });
setResponse(prev => prev + chunk);
}
setIsLoading(false);
}
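The handler above relies on two pieces of component state; a minimal setup is sketched below (ChatPanel is a hypothetical name, and only the state the handler uses is shown):
'use client';
import { useState } from 'react';

export function ChatPanel() {
  const [response, setResponse] = useState('');
  const [isLoading, setIsLoading] = useState(false);

  // Define sendMessage (as above) inside the component so it closes over these setters,
  // then render the response text and a loading indicator here.
  return null;
}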
Server-Sent Events (SSE) Pattern
For more structured streaming, use Server-Sent Events:
Server Implementation
// app/api/chat/route.ts
export async function POST(request: Request) {
const { message } = await request.json();
// ... retrieval and context building ...
const result = await model.generateContentStream({
contents: [{ role: 'user', parts: [{ text: prompt }] }]
});
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
// Send initial metadata
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'start', sources: docs.map(d => d.source) })}\n\n`)
);
// Stream content
for await (const chunk of result.stream) {
const text = chunk.text();
if (text) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'content', text })}\n\n`)
);
}
}
// Send completion signal
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: 'done' })}\n\n`)
);
controller.close();
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
}
Client Implementation
async function sendMessageSSE(message: string) {
setIsLoading(true);
setResponse('');
setSources([]);
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message })
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
  // SSE events can be split across reads, so buffer until a full "\n\n"-delimited event arrives
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split('\n\n');
    buffer = events.pop() ?? ''; // keep any incomplete event for the next read
    for (const event of events) {
      if (!event.startsWith('data: ')) continue;
      const data = JSON.parse(event.slice(6));
      switch (data.type) {
        case 'start':
          setSources(data.sources);
          break;
        case 'content':
          setResponse(prev => prev + data.text);
          break;
        case 'done':
          setIsLoading(false);
          break;
      }
    }
  }
}
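Since both sides now exchange structured JSON, it can help to share one event type between the route and the component. A sketch (the module path is hypothetical) whose shape mirrors the events emitted above:
// lib/sse-events.ts (hypothetical location)
export type SSEEvent =
  | { type: 'start'; sources: string[] }
  | { type: 'content'; text: string }
  | { type: 'done' };
Casting the parsed JSON to SSEEvent (const data = JSON.parse(event.slice(6)) as SSEEvent) lets TypeScript verify that the switch handles every event type.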
Error Handling
Common Errors
Rate Limiting:
// Wrap the call so rate-limit errors are retried with backoff
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

async function generateWithRetry(request: GenerateContentRequest, retries = 3) {
  try {
    return await model.generateContent(request);
  } catch (error) {
    const message = error instanceof Error ? error.message : '';
    if (retries > 0 && (message.includes('429') || message.includes('quota'))) {
      // Rate limited - back off before retrying with one fewer attempt remaining
      await sleep(1000 * (4 - retries));
      return generateWithRetry(request, retries - 1);
    }
    throw error;
  }
}
Content Filtering:
const result = await model.generateContent(request);
// Check if response was blocked
if (result.response.promptFeedback?.blockReason) {
return {
error: 'Content was filtered',
reason: result.response.promptFeedback.blockReason
};
}
Token Limit Exceeded:
// Before calling: estimateTokens and truncateToTokenLimit are your own helpers
// (a rough rule of thumb is ~4 characters per token)
if (estimateTokens(prompt) > 128000) {
  // Truncate the context or summarize it first, leaving headroom for the answer
  context = truncateToTokenLimit(context, 100000);
}
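If a rough character-based estimate feels too coarse, the SDK's countTokens method can measure the prompt before sending it. A sketch, where truncateToTokenLimit is the same hypothetical helper as above:
const { totalTokens } = await model.countTokens(prompt);
if (totalTokens > 100_000) {
  // Leave headroom below the model's input limit for the generated answer
  context = truncateToTokenLimit(context, 100_000);
}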
Robust Generation Function
interface GenerationResult {
success: boolean;
text?: string;
error?: string;
sources?: string[];
}
async function safeGenerateResponse(
context: string,
query: string,
sources: string[]
): Promise<GenerationResult> {
try {
const model = genAI.getGenerativeModel({
model: 'gemini-1.5-flash',
systemInstruction: systemPrompt
});
const result = await model.generateContent({
contents: [{ role: 'user', parts: [{ text: `CONTEXT:\n${context}\n\nQUESTION: ${query}` }] }],
generationConfig: { temperature: 0.1, maxOutputTokens: 1024 }
});
// Check for content filtering
if (result.response.promptFeedback?.blockReason) {
return {
success: false,
error: `Response blocked: ${result.response.promptFeedback.blockReason}`
};
}
const text = result.response.text();
if (!text) {
return {
success: false,
error: 'Empty response from model'
};
}
return {
success: true,
text,
sources
};
} catch (error) {
console.error('Generation error:', error);
// Categorize error
if (error instanceof Error) {
if (error.message.includes('429')) {
return { success: false, error: 'Rate limit exceeded. Please try again later.' };
}
if (error.message.includes('quota')) {
return { success: false, error: 'API quota exceeded.' };
}
}
return { success: false, error: 'An unexpected error occurred.' };
}
}
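Inside a route handler, the result maps cleanly onto an HTTP response. A sketch that assumes message, docs, and context were built the same way as in the streaming route earlier:
const result = await safeGenerateResponse(
  context,
  message,
  docs.map((d: any) => d.source)
);

if (!result.success) {
  return Response.json({ error: result.error }, { status: 500 });
}

return Response.json({ answer: result.text, sources: result.sources });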
Summary
In this lesson, we covered the generation phase of RAG:
Key Takeaways:
- Choose the right model: gemini-1.5-flash for speed and cost, gemini-1.5-pro for complex reasoning
- Streaming transforms the user experience: the first token arrives in well under a second instead of after a 2-5 second wait
- System instructions separate concerns: cleaner code, better organization
- Low temperature for RAG: 0.1-0.3 keeps responses factual and consistent
- Handle errors gracefully: rate limits, content filtering, and token limits all need handling
Module 3 Complete
Congratulations! You've completed Module 3: The RAG Core. You now understand:
- The science of vector similarity search
- How to engineer prompts for grounded responses
- How to call Gemini and implement streaming
In Module 4, we'll build the Production-Ready Chat Architecture—frontend-backend communication, security considerations, and implementing attribution systems.
"The best interfaces disappear—they let the conversation flow naturally." — Unknown

