Module 4: Memory & Context (RAG)
Making it Smart
Introduction: Beyond Conversation History
So far, our agents have been limited to:
- Their training data (frozen in time)
- The current conversation
- Real-time tool calls
But what if you need your agent to:
- Remember user preferences across sessions
- Answer questions about your company's documentation
- Recall previous interactions from weeks ago
- Search through thousands of documents instantly
This is where memory and Retrieval-Augmented Generation (RAG) come in.
4.1 Short-term vs. Long-term Memory
Short-term Memory: Conversation History
This is what we've been using:
const messages: CoreMessage[] = [
{ role: 'system', content: 'You are a helpful assistant' },
{ role: 'user', content: 'What is TypeScript?' },
{ role: 'assistant', content: 'TypeScript is a typed superset of JavaScript...' },
{ role: 'user', content: 'How does it help with errors?' }
// The LLM can see all previous messages
]
Pros:
- Simple to implement
- No external storage needed
- Perfect for single-session tasks
Cons:
- Limited by context window (typically 4K-128K tokens)
- Expensive (pay for all tokens every request)
- Lost when session ends
Long-term Memory: Vector Storage
Store information persistently and retrieve relevant pieces on demand.
Use cases:
- Chat with your company's documentation
- Remember user preferences
- Build a personal knowledge base
- Search historical conversations
How it works:
1. EMBED: Convert text to vector (array of numbers)
"TypeScript is a typed superset" → [0.23, -0.45, 0.67, ...]
2. STORE: Save vectors in a database
Supabase pgvector, Pinecone, Weaviate, etc.
3. RETRIEVE: Find similar vectors when needed
Query: "What is TypeScript?" → Find relevant docs
4. AUGMENT: Add retrieved docs to LLM context
Generate answer using both the query and docs
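To make the retrieval step concrete: similarity between vectors is usually measured with cosine similarity. Here is a minimal sketch using made-up 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions):
// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
// Made-up 4-dimensional embeddings for illustration
const queryVector = [0.23, -0.45, 0.67, 0.1]
const docVector = [0.25, -0.4, 0.62, 0.08]
console.log(cosineSimilarity(queryVector, docVector)) // ~0.99 → highly similar
Vector databases run this kind of comparison (or an approximation of it) across millions of stored embeddings at once.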
Managing CoreMessage[] Arrays
Best practices for conversation history:
interface ConversationManager {
messages: CoreMessage[]
maxMessages: number
maxTokens: number
}
function trimMessages(messages: CoreMessage[], maxTokens: number): CoreMessage[] {
// Always keep system message
const system = messages.find(m => m.role === 'system')
const conversationMessages = messages.filter(m => m.role !== 'system')
  // Estimate tokens (rough: 1 token ≈ 4 characters); content can be a string or an array of parts
  const estimateTokens = (content: CoreMessage['content']) =>
    Math.ceil((typeof content === 'string' ? content : JSON.stringify(content)).length / 4)
  let totalTokens = system ? estimateTokens(system.content) : 0
const kept: CoreMessage[] = []
// Keep most recent messages that fit in token budget
for (let i = conversationMessages.length - 1; i >= 0; i--) {
const msg = conversationMessages[i]
const tokens = estimateTokens(msg.content)
if (totalTokens + tokens <= maxTokens) {
kept.unshift(msg)
totalTokens += tokens
} else {
break
}
}
return system ? [system, ...kept] : kept
}
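A quick usage sketch (the 3,000-token budget is an arbitrary example):
// Cap the history at roughly 3,000 tokens before sending it to the model
const trimmed = trimMessages(messages, 3000)
console.log(`Kept ${trimmed.length} of ${messages.length} messages`)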
Summarization for Long Conversations
import { generateText, type CoreMessage } from 'ai'
import { openai } from '@ai-sdk/openai'

async function summarizeConversation(messages: CoreMessage[]): Promise<CoreMessage> {
  const { text } = await generateText({
    model: openai('gpt-4-turbo'),
    messages: [
      {
        role: 'user',
        content: `Summarize this conversation in 2-3 sentences:\n\n${
          messages.map(m => `${m.role}: ${m.content}`).join('\n')
        }`
      }
    ]
  })
  return {
    role: 'system',
    content: `Previous conversation summary: ${text}`
  }
}
// Use it
if (messages.length > 20) {
const summary = await summarizeConversation(messages.slice(0, -10))
messages = [summary, ...messages.slice(-10)]
}
4.2 The RAG Pipeline
What is RAG?
Retrieval-Augmented Generation combines:
- Retrieval: Finding relevant information from a knowledge base
- Generation: Using an LLM to answer based on that information
Example:
User: "What is our company's vacation policy?"
Without RAG:
AI: "I don't have access to your specific company policies."
With RAG:
1. Retrieve: Find "vacation policy" from company handbook vector DB
2. Generate: "According to the handbook, employees receive 15 days
of paid vacation per year, accruing at 1.25 days per month..."
The RAG Architecture
┌─────────────┐
│ User Query │
└──────┬──────┘
│
┌──────▼──────┐
│ Embed │ Convert query to vector
└──────┬──────┘
│
┌──────▼──────┐
│ Search │ Find similar vectors in DB
│ Vector DB │
└──────┬──────┘
│
┌──────▼──────┐
│ Retrieve │ Get top N relevant documents
│ Documents │
└──────┬──────┘
│
┌──────▼──────┐
│ LLM │ Generate answer using docs + query
└──────┬──────┘
│
┌──────▼──────┐
│ Answer │
└─────────────┘
Setting Up Supabase for RAG
1. Install dependencies:
npm install @supabase/supabase-js openai
2. Create a Supabase table:
-- Enable vector extension
create extension if not exists vector;
-- Create table
create table documents (
id bigserial primary key,
content text,
metadata jsonb,
embedding vector(1536)
);
-- Create index for fast similarity search
create index on documents using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
3. Embed and store documents:
import { createClient } from '@supabase/supabase-js'
import OpenAI from 'openai'
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_KEY!
)
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
async function addDocument(content: string, metadata: any = {}) {
// Generate embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: content
})
const embedding = embeddingResponse.data[0].embedding
// Store in database
const { data, error } = await supabase
.from('documents')
.insert({
content,
metadata,
embedding
})
if (error) throw error
return data
}
// Add documents
await addDocument(
'Our company offers 15 days of paid vacation per year.',
{ source: 'handbook', section: 'benefits' }
)
await addDocument(
'TypeScript is a typed superset of JavaScript that compiles to plain JavaScript.',
{ source: 'tech-docs', category: 'languages' }
)
4. Retrieve relevant documents:
async function searchDocuments(query: string, limit: number = 5) {
// Embed the query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query
})
const queryEmbedding = embeddingResponse.data[0].embedding
// Search for similar vectors
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: limit,
match_threshold: 0.7 // Similarity threshold (0-1)
})
if (error) throw error
return data
}
5. Create the RPC function in Supabase:
create or replace function match_documents (
query_embedding vector(1536),
match_count int default 5,
match_threshold float default 0.7
)
returns table (
id bigint,
content text,
metadata jsonb,
similarity float
)
language plpgsql
as $$
begin
return query
select
documents.id,
documents.content,
documents.metadata,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > match_threshold
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;
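One practical note before wiring everything together: the examples above embed each document whole, but real documents are usually split into smaller chunks first so each piece fits comfortably in an embedding and retrieval stays precise. Here is a minimal sketch that reuses addDocument from step 3 (the chunk size, overlap, and chunkIndex metadata field are arbitrary choices for illustration):
// Split long text into overlapping chunks so each piece fits in a single embedding
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize))
  }
  return chunks
}

// Embed and store each chunk, recording its position for later citation
async function addLongDocument(text: string, metadata: Record<string, unknown> = {}) {
  for (const [index, chunk] of chunkText(text).entries()) {
    await addDocument(chunk, { ...metadata, chunkIndex: index })
  }
}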
Complete RAG Implementation
import { generateText } from 'ai'
import { openai as openaiProvider } from '@ai-sdk/openai'
async function answerWithRAG(question: string) {
// 1. Retrieve relevant documents
const docs = await searchDocuments(question)
// 2. Build context from retrieved docs
const context = docs
.map((doc, i) => `Document ${i + 1}:\n${doc.content}`)
.join('\n\n')
// 3. Generate answer with context
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based on the provided context.
If the answer isn't in the context, say so.`
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}
]
})
return text
}
// Use it
const answer = await answerWithRAG('What is our vacation policy?')
console.log(answer)
// "According to our company handbook, employees receive 15 days of paid vacation per year."
4.3 Web Browsing: The Research Tool
Why Web Scraping for Agents?
LLMs have a knowledge cutoff date. To get current information, agents need to:
- Search the web
- Scrape and parse web pages
- Extract relevant information
Using Firecrawl
Firecrawl is a web scraping API designed for LLMs—it returns clean, markdown-formatted content.
Install:
npm install @mendable/firecrawl-js
Basic usage:
import FirecrawlApp from '@mendable/firecrawl-js'
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY })
async function scrapeWebsite(url: string) {
const result = await firecrawl.scrapeUrl(url, {
formats: ['markdown']
})
return result.markdown
}
// Use it
const content = await scrapeWebsite('https://example.com/article')
console.log(content)
Using Tavily (AI Search)
Tavily is a search API optimized for AI agents—it returns structured, relevant results.
Install:
npm install @tavily/core
Usage:
import { tavily } from '@tavily/core'
const client = tavily({ apiKey: process.env.TAVILY_API_KEY })
async function searchWeb(query: string) {
const response = await client.search(query, {
    searchDepth: 'advanced',
    maxResults: 5
})
return response.results.map(r => ({
title: r.title,
url: r.url,
content: r.content
}))
}
// Use it
const results = await searchWeb('Tesla Q4 2024 earnings')
console.log(results)
Building a Research Tool
Combine search + scraping:
import { tool } from 'ai'
import { z } from 'zod'
const researchTool = tool({
description: 'Research a topic by searching the web and reading relevant pages',
parameters: z.object({
query: z.string().describe('The research query')
}),
execute: async ({ query }) => {
// 1. Search the web
const searchResults = await searchWeb(query)
// 2. Scrape top 3 results
const scrapedContent = await Promise.all(
searchResults.slice(0, 3).map(async (result) => {
try {
const content = await scrapeWebsite(result.url)
return {
url: result.url,
title: result.title,
content: content.slice(0, 2000) // Limit content
}
} catch (error) {
return {
url: result.url,
title: result.title,
content: 'Could not scrape this page'
}
}
})
)
return {
query,
results: scrapedContent
}
}
})
Complete Research Agent
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'system',
content: 'You are a research assistant. Use the research tool to find current information.'
},
{
role: 'user',
content: 'What are the latest developments in AI agents?'
}
],
tools: { researchTool },
maxSteps: 5
})
console.log(text)
What happens:
1. Agent: "I need current info, let me search the web"
2. Calls researchTool({ query: 'latest AI agent developments 2024' })
3. Tool searches, scrapes top 3 articles
4. Agent reads the content
5. Agent: "Based on recent articles, the latest developments include..."
Combining RAG + Web Research
For the best of both worlds:
async function smartResearch(question: string) {
// 1. Check internal knowledge base first
const internalDocs = await searchDocuments(question)
if (internalDocs.length > 0 && internalDocs[0].similarity > 0.85) {
// High-confidence answer from internal docs
return answerWithRAG(question)
}
// 2. If not found internally, search the web
const webResults = await searchWeb(question)
const scrapedContent = await scrapeWebsite(webResults[0].url)
// 3. Generate answer with web content
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'user',
content: `Based on this web content, answer: ${question}\n\nContent:\n${scrapedContent}`
}
]
})
// 4. Optionally, store the answer for future use
await addDocument(text, { source: 'web', query: question })
return text
}
Key Takeaways
- Short-term memory = conversation history (limited by tokens)
- Long-term memory = vector database (persistent, searchable)
- RAG = Retrieve relevant docs + Generate answers
- Web research tools let agents access current information
- Combine internal knowledge + web search for best results
For a deeper dive into how vector databases work and when to use them versus traditional SQL, check out our comprehensive guide: What Are Vector Databases? How They Work.
Exercise: Build a Document Q&A System
Create a system that can answer questions about uploaded PDFs:
- Extract text from PDFs
- Chunk the text into paragraphs
- Embed and store in Supabase
- Build a RAG agent that can answer questions
- Add citation (return which document/page the answer came from)
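If you want a starting point for the citation step, here is a sketch that reuses searchDocuments and generateText from earlier. It assumes each chunk was stored with source and chunkIndex metadata (as in the chunking sketch above), which is just one possible layout:
// Answer a question and report which stored chunks supported the answer
async function answerWithCitations(question: string) {
  const docs = await searchDocuments(question)

  const context = docs
    .map((doc, i) => `[${i + 1}] (${doc.metadata.source}, chunk ${doc.metadata.chunkIndex}):\n${doc.content}`)
    .join('\n\n')

  const { text } = await generateText({
    model: openaiProvider('gpt-4-turbo'),
    messages: [
      {
        role: 'system',
        content: 'Answer using only the numbered context passages and cite them like [1].'
      },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` }
    ]
  })

  return {
    answer: text,
    citations: docs.map(d => ({ source: d.metadata.source, chunk: d.metadata.chunkIndex }))
  }
}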
Next up: Module 5, where we build beautiful user interfaces with Next.js and the Vercel AI SDK.

