Understanding Retrieval-Augmented Generation (RAG)
Introduction
Large Language Models like GPT-4, Claude, and Gemini have transformed what's possible with AI. They can write code, summarize documents, answer questions, and hold conversations that feel remarkably human. But they have a fundamental limitation: they only know what they learned during training.
This lesson explores Retrieval-Augmented Generation (RAG)—the architectural pattern that solves this limitation by giving LLMs access to external knowledge. By the end of this lesson, you'll understand not just what RAG is, but why it's become the standard approach for building AI applications that need accurate, up-to-date, domain-specific knowledge.
The Problem: Why LLMs Need External Memory
The Knowledge Cutoff Problem
Every LLM has a knowledge cutoff date—the point in time when its training data ended. For example, if a model was trained on data through January 2024, it has no knowledge of events after that date. It can't tell you about new product releases, recent news, or updated regulations.
Example: Ask an LLM "What were Apple's Q3 2025 earnings?" and it will either:
- Admit it doesn't know (best case)
- Confidently make up numbers that sound plausible (worst case)
This isn't a bug—it's a fundamental characteristic of how these models work. They're trained once, then deployed. They don't learn from new information after training.
The Hallucination Problem
When LLMs don't have information, they don't always say "I don't know." Instead, they often generate responses that sound authoritative but are factually incorrect. This phenomenon is called hallucination.
Hallucinations are particularly dangerous because:
- They're delivered with the same confidence as accurate information
- They can include specific details that seem too precise to be made up
- Users often can't distinguish hallucinated content from truth
Why do hallucinations occur?
LLMs are essentially sophisticated pattern completion systems. Given an input, they predict the most likely continuation based on patterns learned during training. When asked about something they don't know, they generate what sounds like a reasonable answer based on similar patterns they've seen—even if that answer is completely fabricated.
The Domain Knowledge Problem
Even when an LLM was trained on relevant information, it may not have deep knowledge of your specific domain. A general-purpose LLM might know basic facts about medicine, law, or your company's products, but it won't have the detailed, nuanced knowledge that a specialist would have.
Consider these scenarios:
- A customer support chatbot needs to know your company's specific policies
- A legal research tool needs to understand jurisdiction-specific regulations
- A technical documentation assistant needs to know your software's API details
No general-purpose LLM, no matter how large, can have this specific knowledge built in.
What is RAG?
Retrieval-Augmented Generation (RAG) is an architectural pattern that addresses these limitations by combining two key capabilities:
- Retrieval: Finding relevant information from an external knowledge base
- Generation: Using an LLM to generate a response based on that retrieved information
The core idea is simple: instead of asking the LLM to answer from its training knowledge alone, we first retrieve relevant documents from our own data source, then ask the LLM to answer based on those specific documents.
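To make that idea concrete, here is a minimal sketch of the retrieve-then-generate flow in TypeScript (the language of the project stack used later in this lesson). The retrieveChunks and generateAnswer stubs are placeholders for the vector search and LLM calls we'll flesh out below; nothing here is the course's exact code.

```ts
type Chunk = { content: string; source: string };

// Placeholder stubs; the real retrieval and generation steps are
// sketched later in this lesson.
async function retrieveChunks(question: string): Promise<Chunk[]> {
  return []; // e.g. a vector search against your knowledge base
}
async function generateAnswer(prompt: string): Promise<string> {
  return ""; // e.g. a call to an LLM API
}

async function answerQuestion(question: string): Promise<string> {
  // 1. Retrieval: find chunks relevant to the question.
  const chunks = await retrieveChunks(question);

  // 2. Generation: ask the LLM to answer using ONLY that context.
  const context = chunks.map((c) => c.content).join("\n---\n");
  return generateAnswer(
    `Answer using ONLY the context below. If it doesn't contain the answer, say so.\n\nCONTEXT:\n${context}\n\nQUESTION: ${question}`
  );
}
```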
The RAG Pipeline
Every RAG system follows a three-phase pipeline:
Phase 1: Indexing (Offline)
Before users can ask questions, we need to prepare our knowledge base:
- Collect documents (PDFs, web pages, markdown files, etc.)
- Split documents into smaller chunks
- Convert chunks into vector embeddings
- Store embeddings in a vector database
Phase 2: Retrieval (At Query Time)
When a user asks a question:
- Convert the question into a vector embedding
- Search the vector database for similar embeddings
- Return the most relevant document chunks
Phase 3: Generation (At Query Time)
With relevant context in hand:
- Construct a prompt that includes the retrieved context
- Send the prompt to the LLM
- Generate a response grounded in the provided context
A Concrete Example
Let's trace through a practical example. Imagine you're building a chatbot for a software company's documentation.
Indexing Phase (done once, offline):
1. Load documentation files: getting-started.md, api-reference.md, troubleshooting.md
2. Split into chunks: ~50 chunks of 500-1000 characters each
3. Generate embedding for each chunk using Gemini's embedding model
4. Store in Supabase: chunks + embeddings + metadata (source file, title)
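A rough sketch of what this indexing step might look like in TypeScript, assuming the @google/generative-ai and @supabase/supabase-js SDKs and a documents table with content, embedding, and source columns (the table and column names are illustrative assumptions, not the course's exact schema):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";
import { createClient } from "@supabase/supabase-js";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Naive fixed-size chunker: split text into ~800-character pieces.
// Real chunkers usually respect paragraph and heading boundaries.
function chunkText(text: string, size = 800): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Embed each chunk and store it alongside its metadata.
async function indexDocument(sourceFile: string, text: string) {
  for (const content of chunkText(text)) {
    const { embedding } = await embedder.embedContent(content);
    await supabase.from("documents").insert({
      content,
      embedding: embedding.values, // number[] stored in a pgvector column
      source: sourceFile,
    });
  }
}
```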
User Query: "How do I authenticate API requests?"
Retrieval Phase:
1. Convert query to embedding: [0.023, -0.145, 0.087, ...]
2. Search Supabase for similar embeddings
3. Top 3 results:
- "Authentication uses Bearer tokens..." (similarity: 0.89)
- "To generate an API key, go to..." (similarity: 0.85)
- "All requests must include the Authorization header..." (similarity: 0.82)
Generation Phase:
Prompt to Gemini:
"You are a helpful documentation assistant. Answer the user's question
using ONLY the following context. If the context doesn't contain the
answer, say so.
CONTEXT:
[chunk 1: Authentication uses Bearer tokens...]
[chunk 2: To generate an API key, go to...]
[chunk 3: All requests must include the Authorization header...]
USER QUESTION: How do I authenticate API requests?"
Response: "To authenticate API requests, you need to use Bearer token
authentication. First, generate an API key from your dashboard settings.
Then, include the Authorization header in all requests with your token..."
The response is grounded—it's based on actual documentation, not the LLM's general training knowledge.
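Under the hood, this generation step is a single LLM call with the retrieved chunks spliced into the prompt. A minimal sketch, assuming the @google/generative-ai SDK (the exact Gemini model name is an assumption):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// Build a grounded prompt from retrieved chunks and call the LLM.
async function generateAnswer(
  question: string,
  chunks: { content: string }[]
): Promise<string> {
  const context = chunks
    .map((c, i) => `[chunk ${i + 1}]: ${c.content}`)
    .join("\n");

  const prompt =
    `You are a helpful documentation assistant. Answer the user's question ` +
    `using ONLY the following context. If the context doesn't contain the answer, say so.\n\n` +
    `CONTEXT:\n${context}\n\nUSER QUESTION: ${question}`;

  const result = await model.generateContent(prompt);
  return result.response.text();
}
```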
Grounded Generation: Why RAG Works
The key insight behind RAG is grounded generation. Instead of asking the LLM to recall information (which it may hallucinate), we provide the information explicitly and ask the LLM to synthesize and explain it.
The Power of Context
LLMs are excellent at:
- Understanding natural language questions
- Synthesizing information from multiple sources
- Generating clear, well-structured explanations
- Adapting tone and detail level to the audience
They're less reliable at:
- Accurately recalling specific facts from training
- Distinguishing what they know from what they're guessing
- Providing up-to-date information
RAG plays to the LLM's strengths while compensating for its weaknesses. We use the LLM's language understanding and generation capabilities while supplying the factual information ourselves.
Verifiable and Attributable Answers
A well-designed RAG system provides attribution—linking each part of the response back to the source documents. This has several benefits:
For Users:
- They can verify information by checking the source
- They build appropriate trust in the system
- They can explore related information
For Developers:
- Easier to debug incorrect responses
- Clear audit trail for compliance requirements
- Ability to identify gaps in the knowledge base
For Organizations:
- Reduced liability from AI-generated misinformation
- Compliance with regulatory requirements
- Quality control over AI outputs
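One simple way to make attribution concrete is to return the retrieved chunks' metadata alongside the generated answer so the UI can render citations. A minimal sketch of such a response shape (the field names are illustrative, not a fixed API):

```ts
// Shape the API response so the UI can show citations next to the answer.
type SourceCitation = { source: string; similarity: number };

type RagResponse = {
  answer: string;            // the LLM's grounded answer
  sources: SourceCitation[]; // where that answer came from
};

function buildResponse(
  answer: string,
  chunks: { source: string; similarity: number }[]
): RagResponse {
  return {
    answer,
    sources: chunks.map((c) => ({ source: c.source, similarity: c.similarity })),
  };
}
```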
RAG vs. Fine-Tuning vs. Prompt Engineering
RAG isn't the only way to give LLMs domain knowledge. Let's compare the alternatives:
Prompt Engineering
Approach: Include domain knowledge directly in the prompt.
Example:
"You are a customer service agent for Acme Corp. Our return policy is:
- 30 days for most items
- 90 days for electronics
- No returns on final sale items
User question: Can I return this item?"
Pros:
- Simple to implement
- No infrastructure required
- Good for small, static knowledge bases
Cons:
- Limited by context window size
- Inefficient for large knowledge bases
- Can't scale to thousands of documents
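In code, the prompt-engineering approach amounts to hard-coding the knowledge directly into the prompt string, as in this minimal sketch (the policy text comes from the example above; the helper name is illustrative):

```ts
// Prompt engineering: the knowledge lives directly in the prompt string.
const RETURN_POLICY = `
- 30 days for most items
- 90 days for electronics
- No returns on final sale items`;

function buildSupportPrompt(userQuestion: string): string {
  return (
    `You are a customer service agent for Acme Corp. Our return policy is:` +
    `${RETURN_POLICY}\n\nUser question: ${userQuestion}`
  );
}
```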
Fine-Tuning
Approach: Train a custom model on your specific data.
Pros:
- Model "learns" your domain deeply
- Faster inference (no retrieval step)
- Can capture subtle patterns and style
Cons:
- Expensive and time-consuming
- Requires ML expertise
- Model needs retraining when knowledge changes
- Risk of catastrophic forgetting
RAG
Approach: Retrieve relevant documents at query time and include them as context in the prompt.
Pros:
- Scales to large knowledge bases
- Easy to update (just add/modify documents)
- Provides attribution
- Works with any LLM
- Cost-effective
Cons:
- Adds latency (retrieval step)
- Requires vector database infrastructure
- Retrieval quality affects output quality
When to Use Each
Use Prompt Engineering when:
- Knowledge base is small (fits in a prompt)
- Information rarely changes
- You need a quick solution
Use Fine-Tuning when:
- You need the model to learn a specific style or format
- Response latency is critical
- You have consistent, stable training data
Use RAG when:
- Knowledge base is large or frequently updated
- Attribution is important
- You need to combine multiple knowledge sources
- You want flexibility to change LLM providers
In practice, RAG is the right choice for most production applications because it offers the best balance of capability, flexibility, and maintainability.
The Anatomy of a RAG Application
Let's map the RAG pipeline to our specific technology stack:
Indexing Layer
Components:
- Document Loaders: Scripts that read your source documents
- Text Splitters: Logic that chunks documents appropriately
- Embedding Model: Gemini's text-embedding-004 for vectorization
- Vector Store: Supabase with pgvector extension
Key Files:
/scripts/
ingest.ts # Main ingestion script
chunkers.ts # Document chunking logic
/lib/
embeddings.ts # Gemini embedding client
supabase.ts # Database operations
Retrieval Layer
Components:
- Query Processor: Converts user questions to embeddings
- Vector Search: Postgres function using pgvector operators
- Result Ranker: Filters and orders results by relevance
Key Implementation:
-- Supabase RPC function for vector search
CREATE FUNCTION search_docs(query_embedding vector, match_count int)
RETURNS TABLE(id uuid, content text, similarity float)
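On the application side, the Result Ranker can be as simple as filtering the rows returned by that RPC against a minimum similarity and keeping the best few. A sketch under those assumptions (the 0.75 threshold is an arbitrary illustrative value, not a recommendation from the course code):

```ts
type SearchResult = { id: string; content: string; similarity: number };

// Keep only reasonably similar chunks, best matches first.
function rankResults(
  results: SearchResult[],
  minSimilarity = 0.75,
  limit = 3
): SearchResult[] {
  return results
    .filter((r) => r.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```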
Generation Layer
Components:
- Context Builder: Assembles retrieved chunks into a prompt
- LLM Client: Gemini API for text generation
- Response Streamer: Handles token-by-token output
Key Files:
/app/api/
chat/route.ts # Main chat endpoint
/lib/
gemini.ts # Gemini client configuration
prompts.ts # System prompts and templates
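As a rough illustration of how these pieces might meet in chat/route.ts, here is a hedged sketch of a Next.js route handler that retrieves context and streams Gemini's answer back to the client. It assumes the @google/generative-ai SDK and a retrieveChunks helper like the one sketched earlier; it is not the course's exact implementation.

```ts
// app/api/chat/route.ts (illustrative sketch)
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// Placeholder: a real implementation would embed the question and
// query Supabase, as sketched in the retrieval section above.
async function retrieveChunks(question: string): Promise<{ content: string }[]> {
  return [];
}

export async function POST(req: Request) {
  const { question } = await req.json();

  const chunks = await retrieveChunks(question);
  const context = chunks.map((c) => c.content).join("\n---\n");
  const prompt = `Answer using ONLY this context:\n${context}\n\nQUESTION: ${question}`;

  // Stream tokens to the client as Gemini produces them.
  const result = await model.generateContentStream(prompt);
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of result.stream) {
        controller.enqueue(encoder.encode(chunk.text()));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```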
User Interface Layer
Components:
- Chat Interface: React components for user interaction
- Loading States: Visual feedback during processing
- Citation Display: Shows source documents
Summary
In this lesson, we've established the fundamental concepts behind RAG:
Key Takeaways:
- LLMs have fundamental limitations: knowledge cutoff, hallucination, and lack of domain-specific knowledge
- RAG solves these limitations by combining retrieval (finding relevant information) with generation (synthesizing responses)
- The RAG pipeline has three phases: Indexing (preparing the knowledge base), Retrieval (finding relevant context), and Generation (creating grounded responses)
- Grounded generation is the key insight: providing information explicitly rather than asking the LLM to recall it
- RAG is usually the right choice for production applications because it's flexible, updatable, and provides attribution
Next Steps
In the next lesson, we'll dive deep into vector embeddings—the technology that makes semantic search possible. You'll understand how text is converted into numerical representations and why this enables finding "similar" content rather than just exact matches.
"The key to artificial intelligence has always been the representation." — Jeff Hawkins

