Understanding Retrieval-Augmented Generation (RAG)
Introduction
Large Language Models like GPT-4, Claude, and Gemini have transformed what's possible with AI. They can write code, summarize documents, answer questions, and hold conversations that feel remarkably human. But they have a fundamental limitation: they only know what they learned during training.
This lesson explores Retrieval-Augmented Generation (RAG)—the architectural pattern that solves this limitation by giving LLMs access to external knowledge. By the end of this lesson, you'll understand not just what RAG is, but why it's become the standard approach for building AI applications that need accurate, up-to-date, domain-specific knowledge.
The Problem: Why LLMs Need External Memory
The Knowledge Cutoff Problem
Every LLM has a knowledge cutoff date—the point in time when its training data ended. For example, if a model was trained on data through January 2024, it has no knowledge of events after that date. It can't tell you about new product releases, recent news, or updated regulations.
Example: Ask an LLM "What were Apple's Q3 2025 earnings?" and it will either:
- Admit it doesn't know (best case)
- Confidently make up numbers that sound plausible (worst case)
This isn't a bug—it's a fundamental characteristic of how these models work. They're trained once, then deployed. They don't learn from new information after training.
The Hallucination Problem
When LLMs don't have information, they don't always say "I don't know." Instead, they often generate responses that sound authoritative but are factually incorrect. This phenomenon is called hallucination.
Hallucinations are particularly dangerous because:
- They're delivered with the same confidence as accurate information
- They can include specific details that seem too precise to be made up
- Users often can't distinguish hallucinated content from truth
Why do hallucinations occur?
LLMs are essentially sophisticated pattern completion systems. Given an input, they predict the most likely continuation based on patterns learned during training. When asked about something they don't know, they generate what sounds like a reasonable answer based on similar patterns they've seen—even if that answer is completely fabricated.
The Domain Knowledge Problem
Even when an LLM was trained on relevant information, it may not have deep knowledge of your specific domain. A general-purpose LLM might know basic facts about medicine, law, or your company's products, but it won't have the detailed, nuanced knowledge that a specialist would have.
Consider these scenarios:
- A customer support chatbot needs to know your company's specific policies
- A legal research tool needs to understand jurisdiction-specific regulations
- A technical documentation assistant needs to know your software's API details
No general-purpose LLM, no matter how large, can have this specific knowledge built in.
What is RAG?
Retrieval-Augmented Generation (RAG) is an architectural pattern that addresses these limitations by combining two key capabilities:
- Retrieval: Finding relevant information from an external knowledge base
- Generation: Using an LLM to generate a response based on that retrieved information
The core idea is simple: instead of asking the LLM to answer from its training knowledge alone, we first retrieve relevant documents from our own data source, then ask the LLM to answer based on those specific documents.
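To make that idea concrete, here is a minimal sketch of the retrieve-then-generate flow in TypeScript (the language of the project stack used later in this lesson). The retrieveChunks and generateAnswer stubs are placeholders for the vector search and LLM calls we'll flesh out below; nothing here is the course's exact code.

```ts
type Chunk = { content: string; source: string };

// Placeholder stubs; the real retrieval and generation steps are
// sketched later in this lesson.
async function retrieveChunks(question: string): Promise<Chunk[]> {
  return []; // e.g. a vector search against your knowledge base
}
async function generateAnswer(prompt: string): Promise<string> {
  return ""; // e.g. a call to an LLM API
}

async function answerQuestion(question: string): Promise<string> {
  // 1. Retrieval: find chunks relevant to the question.
  const chunks = await retrieveChunks(question);

  // 2. Generation: ask the LLM to answer using ONLY that context.
  const context = chunks.map((c) => c.content).join("\n---\n");
  return generateAnswer(
    `Answer using ONLY the context below. If it doesn't contain the answer, say so.\n\nCONTEXT:\n${context}\n\nQUESTION: ${question}`
  );
}
```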
The RAG Pipeline
Every RAG system follows a three-phase pipeline:
Phase 1: Indexing (Offline)
Before users can ask questions, we need to prepare our knowledge base:
- Collect documents (PDFs, web pages, markdown files, etc.)
- Split documents into smaller chunks
- Convert chunks into vector embeddings
- Store embeddings in a vector database
Phase 2: Retrieval (At Query Time)
When a user asks a question:
- Convert the question into a vector embedding
- Search the vector database for similar embeddings
- Return the most relevant document chunks
Phase 3: Generation (At Query Time)
With relevant context in hand:
- Construct a prompt that includes the retrieved context
- Send the prompt to the LLM
- Generate a response grounded in the provided context
A Concrete Example
Let's trace through a practical example. Imagine you're building a chatbot for a software company's documentation.
Indexing Phase (done once, offline):
1. Load documentation files: getting-started.md, api-reference.md, troubleshooting.md
2. Split into chunks: ~50 chunks of 500-1000 characters each
3. Generate embedding for each chunk using Gemini's embedding model
4. Store in Supabase: chunks + embeddings + metadata (source file, title)
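A rough sketch of what this indexing step might look like in TypeScript, assuming the @google/generative-ai and @supabase/supabase-js SDKs and a documents table with content, embedding, and source columns (the table and column names are illustrative assumptions, not the course's exact schema):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";
import { createClient } from "@supabase/supabase-js";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Naive fixed-size chunker: split text into ~800-character pieces.
// Real chunkers usually respect paragraph and heading boundaries.
function chunkText(text: string, size = 800): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Embed each chunk and store it alongside its metadata.
async function indexDocument(sourceFile: string, text: string) {
  for (const content of chunkText(text)) {
    const { embedding } = await embedder.embedContent(content);
    await supabase.from("documents").insert({
      content,
      embedding: embedding.values, // number[] stored in a pgvector column
      source: sourceFile,
    });
  }
}
```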
User Query: "How do I authenticate API requests?"
Retrieval Phase:
1. Convert query to embedding: [0.023, -0.145, 0.087, ...]
2. Search Supabase for similar embeddings
3. Top 3 results:
- "Authentication uses Bearer tokens..." (similarity: 0.89)
- "To generate an API key, go to..." (similarity: 0.85)
- "All requests must include the Authorization header..." (similarity: 0.82)
Generation Phase:
Prompt to Gemini:
"You are a helpful documentation assistant. Answer the user's question
using ONLY the following context. If the context doesn't contain the
answer, say so.
CONTEXT:
[chunk 1: Authentication uses Bearer tokens...]
[chunk 2: To generate an API key, go to...]
[chunk 3: All requests must include the Authorization header...]
USER QUESTION: How do I authenticate API requests?"
Response: "To authenticate API requests, you need to use Bearer token
authentication. First, generate an API key from your dashboard settings.
Then, include the Authorization header in all requests with your token..."
The response is grounded—it's based on actual documentation, not the LLM's general training knowledge.
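Under the hood, this generation step is a single LLM call with the retrieved chunks spliced into the prompt. A minimal sketch, assuming the @google/generative-ai SDK (the exact Gemini model name is an assumption):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// Build a grounded prompt from retrieved chunks and call the LLM.
async function generateAnswer(
  question: string,
  chunks: { content: string }[]
): Promise<string> {
  const context = chunks
    .map((c, i) => `[chunk ${i + 1}]: ${c.content}`)
    .join("\n");

  const prompt =
    `You are a helpful documentation assistant. Answer the user's question ` +
    `using ONLY the following context. If the context doesn't contain the answer, say so.\n\n` +
    `CONTEXT:\n${context}\n\nUSER QUESTION: ${question}`;

  const result = await model.generateContent(prompt);
  return result.response.text();
}
```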
Grounded Generation: Why RAG Works
The key insight behind RAG is grounded generation. Instead of asking the LLM to recall information (which it may hallucinate), we provide the information explicitly and ask the LLM to synthesize and explain it.
The Power of Context
LLMs are excellent at:
- Understanding natural language questions
- Synthesizing information from multiple sources
- Generating clear, well-structured explanations
- Adapting tone and detail level to the audience
They're less reliable at:
- Accurately recalling specific facts from training
- Distinguishing what they know from what they're guessing
- Providing up-to-date information
RAG plays to the LLM's strengths while compensating for its weaknesses. We use the LLM's language understanding and generation capabilities while supplying the factual information ourselves.
Verifiable and Attributable Answers
A well-designed RAG system provides attribution—linking each part of the response back to the source documents. This has several benefits:
For Users:
- They can verify information by checking the source
- They build appropriate trust in the system
- They can explore related information
For Developers:
- Easier to debug incorrect responses
- Clear audit trail for compliance requirements
- Ability to identify gaps in the knowledge base
For Organizations:
- Reduced liability from AI-generated misinformation
- Compliance with regulatory requirements
- Quality control over AI outputs
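One simple way to make attribution concrete is to return the retrieved chunks' metadata alongside the generated answer so the UI can render citations. A minimal sketch of such a response shape (the field names are illustrative, not a fixed API):

```ts
// Shape the API response so the UI can show citations next to the answer.
type SourceCitation = { source: string; similarity: number };

type RagResponse = {
  answer: string;            // the LLM's grounded answer
  sources: SourceCitation[]; // where that answer came from
};

function buildResponse(
  answer: string,
  chunks: { source: string; similarity: number }[]
): RagResponse {
  return {
    answer,
    sources: chunks.map((c) => ({ source: c.source, similarity: c.similarity })),
  };
}
```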
RAG vs. Fine-Tuning vs. Prompt Engineering
RAG isn't the only way to give LLMs domain knowledge. Let's compare the alternatives:
Prompt Engineering
Approach: Include domain knowledge directly in the prompt.
Example:
"You are a customer service agent for Acme Corp. Our return policy is:
- 30 days for most items
- 90 days for electronics
- No returns on final sale items
User question: Can I return this item?"
Pros:
- Simple to implement
- No infrastructure required
- Good for small, static knowledge bases
Cons:
- Limited by context window size
- Inefficient for large knowledge bases
- Can't scale to thousands of documents
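In code, the prompt-engineering approach amounts to hard-coding the knowledge directly into the prompt string, as in this minimal sketch (the policy text comes from the example above; the helper name is illustrative):

```ts
// Prompt engineering: the knowledge lives directly in the prompt string.
const RETURN_POLICY = `
- 30 days for most items
- 90 days for electronics
- No returns on final sale items`;

function buildSupportPrompt(userQuestion: string): string {
  return (
    `You are a customer service agent for Acme Corp. Our return policy is:` +
    `${RETURN_POLICY}\n\nUser question: ${userQuestion}`
  );
}
```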
Fine-Tuning
Approach: Train a custom model on your specific data.
Pros:
- Model "learns" your domain deeply
- Faster inference (no retrieval step)
- Can capture subtle patterns and style
Cons:
- Expensive and time-consuming
- Requires ML expertise
- Model needs retraining when knowledge changes
- Risk of catastrophic forgetting
RAG
Approach: Retrieve relevant documents at query time and include them as context in the prompt.
Pros:
- Scales to large knowledge bases
- Easy to update (just add/modify documents)
- Provides attribution
- Works with any LLM
- Cost-effective
Cons:
- Adds latency (retrieval step)
- Requires vector database infrastructure
- Retrieval quality affects output quality
When to Use Each
Use Prompt Engineering when:
- Knowledge base is small (fits in a prompt)
- Information rarely changes
- You need a quick solution
Use Fine-Tuning when:
- You need the model to learn a specific style or format
- Response latency is critical
- You have consistent, stable training data
Use RAG when:
- Knowledge base is large or frequently updated
- Attribution is important
- You need to combine multiple knowledge sources
- You want flexibility to change LLM providers
In practice, RAG is the right choice for most production applications because it offers the best balance of capability, flexibility, and maintainability.
The Anatomy of a RAG Application
Let's map the RAG pipeline to our specific technology stack:
Indexing Layer
Components:
- Document Loaders: Scripts that read your source documents
- Text Splitters: Logic that chunks documents appropriately
- Embedding Model: Gemini's text-embedding-004 for vectorization
- Vector Store: Supabase with pgvector extension
Key Files:
/scripts/
ingest.ts # Main ingestion script
chunkers.ts # Document chunking logic
/lib/
embeddings.ts # Gemini embedding client
supabase.ts # Database operations
Retrieval Layer
Components:
- Query Processor: Converts user questions to embeddings
- Vector Search: Postgres function using pgvector operators
- Result Ranker: Filters and orders results by relevance
Key Implementation:
-- Supabase RPC function for vector search
CREATE FUNCTION search_docs(query_embedding vector, match_count int)
RETURNS TABLE(id uuid, content text, similarity float)
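On the application side, the Result Ranker can be as simple as filtering the rows returned by that RPC against a minimum similarity and keeping the best few. A sketch under those assumptions (the 0.75 threshold is an arbitrary illustrative value, not a recommendation from the course code):

```ts
type SearchResult = { id: string; content: string; similarity: number };

// Keep only reasonably similar chunks, best matches first.
function rankResults(
  results: SearchResult[],
  minSimilarity = 0.75,
  limit = 3
): SearchResult[] {
  return results
    .filter((r) => r.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```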
Generation Layer
Components:
- Context Builder: Assembles retrieved chunks into a prompt
- LLM Client: Gemini API for text generation
- Response Streamer: Handles token-by-token output
Key Files:
/app/api/
chat/route.ts # Main chat endpoint
/lib/
gemini.ts # Gemini client configuration
prompts.ts # System prompts and templates
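As a rough illustration of how these pieces might meet in chat/route.ts, here is a hedged sketch of a Next.js route handler that retrieves context and streams Gemini's answer back to the client. It assumes the @google/generative-ai SDK and a retrieveChunks helper like the one sketched earlier; it is not the course's exact implementation.

```ts
// app/api/chat/route.ts (illustrative sketch)
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// Placeholder: a real implementation would embed the question and
// query Supabase, as sketched in the retrieval section above.
async function retrieveChunks(question: string): Promise<{ content: string }[]> {
  return [];
}

export async function POST(req: Request) {
  const { question } = await req.json();

  const chunks = await retrieveChunks(question);
  const context = chunks.map((c) => c.content).join("\n---\n");
  const prompt = `Answer using ONLY this context:\n${context}\n\nQUESTION: ${question}`;

  // Stream tokens to the client as Gemini produces them.
  const result = await model.generateContentStream(prompt);
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of result.stream) {
        controller.enqueue(encoder.encode(chunk.text()));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```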
User Interface Layer
Components:
- Chat Interface: React components for user interaction
- Loading States: Visual feedback during processing
- Citation Display: Shows source documents
Summary
In this lesson, we've established the fundamental concepts behind RAG:
Key Takeaways:
- LLMs have fundamental limitations: knowledge cutoff, hallucination, and lack of domain-specific knowledge
- RAG solves these limitations by combining retrieval (finding relevant information) with generation (synthesizing responses)
- The RAG pipeline has three phases: Indexing (preparing the knowledge base), Retrieval (finding relevant context), and Generation (creating grounded responses)
- Grounded generation is the key insight: providing information explicitly rather than asking the LLM to recall it
- RAG is usually the right choice for production applications because it's flexible, updatable, and provides attribution
Next Steps
In the next lesson, we'll dive deep into vector embeddings—the technology that makes semantic search possible. You'll understand how text is converted into numerical representations and why this enables finding "similar" content rather than just exact matches.
"The key to artificial intelligence has always been the representation." — Jeff Hawkins

