Module 4: Memory & Context (RAG)
Making it Smart
Introduction: Beyond Conversation History
So far, our agents have been limited to:
- Their training data (frozen in time)
- The current conversation
- Real-time tool calls
But what if you need your agent to:
- Remember user preferences across sessions
- Answer questions about your company's documentation
- Recall previous interactions from weeks ago
- Search through thousands of documents instantly
This is where memory and Retrieval-Augmented Generation (RAG) come in.
4.1 Short-term vs. Long-term Memory
Short-term Memory: Conversation History
This is what we've been using:
const messages: CoreMessage[] = [
{ role: 'system', content: 'You are a helpful assistant' },
{ role: 'user', content: 'What is TypeScript?' },
{ role: 'assistant', content: 'TypeScript is a typed superset of JavaScript...' },
{ role: 'user', content: 'How does it help with errors?' }
// The LLM can see all previous messages
]
Pros:
- Simple to implement
- No external storage needed
- Perfect for single-session tasks
Cons:
- Limited by context window (typically 4K-128K tokens)
- Expensive (pay for all tokens every request)
- Lost when session ends
Long-term Memory: Vector Storage
Store information persistently and retrieve relevant pieces on demand.
Use cases:
- Chat with your company's documentation
- Remember user preferences
- Build a personal knowledge base
- Search historical conversations
How it works:
1. EMBED: Convert text to vector (array of numbers)
"TypeScript is a typed superset" → [0.23, -0.45, 0.67, ...]
2. STORE: Save vectors in a database
Supabase pgvector, Pinecone, Weaviate, etc.
3. RETRIEVE: Find similar vectors when needed
Query: "What is TypeScript?" → Find relevant docs
4. AUGMENT: Add retrieved docs to LLM context
Generate answer using both the query and docs
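To make the retrieval step concrete: similarity between vectors is usually measured with cosine similarity. Here is a minimal sketch using made-up 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions):
// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
// Made-up 4-dimensional embeddings for illustration
const queryVector = [0.23, -0.45, 0.67, 0.1]
const docVector = [0.25, -0.4, 0.62, 0.08]
console.log(cosineSimilarity(queryVector, docVector)) // ~0.99 → highly similar
Vector databases run this kind of comparison (or an approximation of it) across millions of stored embeddings at once.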
Managing CoreMessage[] Arrays
Best practices for conversation history:
interface ConversationManager {
messages: CoreMessage[]
maxMessages: number
maxTokens: number
}
function trimMessages(messages: CoreMessage[], maxTokens: number): CoreMessage[] {
// Always keep system message
const system = messages.find(m => m.role === 'system')
const conversationMessages = messages.filter(m => m.role !== 'system')
  // Estimate tokens (rough: 1 token ≈ 4 characters); content can be a string or an array of parts
  const estimateTokens = (content: CoreMessage['content']) =>
    Math.ceil((typeof content === 'string' ? content : JSON.stringify(content)).length / 4)
  let totalTokens = system ? estimateTokens(system.content) : 0
const kept: CoreMessage[] = []
// Keep most recent messages that fit in token budget
for (let i = conversationMessages.length - 1; i >= 0; i--) {
const msg = conversationMessages[i]
const tokens = estimateTokens(msg.content)
if (totalTokens + tokens <= maxTokens) {
kept.unshift(msg)
totalTokens += tokens
} else {
break
}
}
return system ? [system, ...kept] : kept
}
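A quick usage sketch (the 3,000-token budget is an arbitrary example):
// Cap the history at roughly 3,000 tokens before sending it to the model
const trimmed = trimMessages(messages, 3000)
console.log(`Kept ${trimmed.length} of ${messages.length} messages`)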
Summarization for Long Conversations
import { generateText, type CoreMessage } from 'ai'
import { openai } from '@ai-sdk/openai'

async function summarizeConversation(messages: CoreMessage[]): Promise<CoreMessage> {
  const { text } = await generateText({
    model: openai('gpt-4-turbo'),
    messages: [
      {
        role: 'user',
        content: `Summarize this conversation in 2-3 sentences:\n\n${
          messages.map(m => `${m.role}: ${m.content}`).join('\n')
        }`
      }
    ]
  })
  return {
    role: 'system',
    content: `Previous conversation summary: ${text}`
  }
}
// Use it
if (messages.length > 20) {
const summary = await summarizeConversation(messages.slice(0, -10))
messages = [summary, ...messages.slice(-10)]
}
4.2 The RAG Pipeline
What is RAG?
Retrieval-Augmented Generation combines:
- Retrieval: Finding relevant information from a knowledge base
- Generation: Using an LLM to answer based on that information
Example:
User: "What is our company's vacation policy?"
Without RAG:
AI: "I don't have access to your specific company policies."
With RAG:
1. Retrieve: Find "vacation policy" from company handbook vector DB
2. Generate: "According to the handbook, employees receive 15 days
of paid vacation per year, accruing at 1.25 days per month..."
The RAG Architecture
┌─────────────┐
│ User Query │
└──────┬──────┘
│
┌──────▼──────┐
│ Embed │ Convert query to vector
└──────┬──────┘
│
┌──────▼──────┐
│ Search │ Find similar vectors in DB
│ Vector DB │
└──────┬──────┘
│
┌──────▼──────┐
│ Retrieve │ Get top N relevant documents
│ Documents │
└──────┬──────┘
│
┌──────▼──────┐
│ LLM │ Generate answer using docs + query
└──────┬──────┘
│
┌──────▼──────┐
│ Answer │
└─────────────┘
Setting Up Supabase for RAG
1. Install dependencies:
npm install @supabase/supabase-js openai
2. Create a Supabase table:
-- Enable vector extension
create extension if not exists vector;
-- Create table
create table documents (
id bigserial primary key,
content text,
metadata jsonb,
embedding vector(1536)
);
-- Create index for fast similarity search
create index on documents using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
3. Embed and store documents:
import { createClient } from '@supabase/supabase-js'
import OpenAI from 'openai'
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_KEY!
)
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
async function addDocument(content: string, metadata: any = {}) {
// Generate embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: content
})
const embedding = embeddingResponse.data[0].embedding
// Store in database
const { data, error } = await supabase
.from('documents')
.insert({
content,
metadata,
embedding
})
if (error) throw error
return data
}
// Add documents
await addDocument(
'Our company offers 15 days of paid vacation per year.',
{ source: 'handbook', section: 'benefits' }
)
await addDocument(
'TypeScript is a typed superset of JavaScript that compiles to plain JavaScript.',
{ source: 'tech-docs', category: 'languages' }
)
4. Retrieve relevant documents:
async function searchDocuments(query: string, limit: number = 5) {
// Embed the query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query
})
const queryEmbedding = embeddingResponse.data[0].embedding
// Search for similar vectors
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: limit,
match_threshold: 0.7 // Similarity threshold (0-1)
})
if (error) throw error
return data
}
5. Create the RPC function in Supabase:
create or replace function match_documents (
query_embedding vector(1536),
match_count int default 5,
match_threshold float default 0.7
)
returns table (
id bigint,
content text,
metadata jsonb,
similarity float
)
language plpgsql
as $$
begin
return query
select
documents.id,
documents.content,
documents.metadata,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > match_threshold
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;
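One practical note before wiring everything together: the examples above embed each document whole, but real documents are usually split into smaller chunks first so each piece fits comfortably in an embedding and retrieval stays precise. Here is a minimal sketch that reuses addDocument from step 3 (the chunk size, overlap, and chunkIndex metadata field are arbitrary choices for illustration):
// Split long text into overlapping chunks so each piece fits in a single embedding
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize))
  }
  return chunks
}

// Embed and store each chunk, recording its position for later citation
async function addLongDocument(text: string, metadata: Record<string, unknown> = {}) {
  for (const [index, chunk] of chunkText(text).entries()) {
    await addDocument(chunk, { ...metadata, chunkIndex: index })
  }
}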
Complete RAG Implementation
import { generateText } from 'ai'
import { openai as openaiProvider } from '@ai-sdk/openai'
async function answerWithRAG(question: string) {
// 1. Retrieve relevant documents
const docs = await searchDocuments(question)
// 2. Build context from retrieved docs
const context = docs
.map((doc, i) => `Document ${i + 1}:\n${doc.content}`)
.join('\n\n')
// 3. Generate answer with context
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based on the provided context.
If the answer isn't in the context, say so.`
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}
]
})
return text
}
// Use it
const answer = await answerWithRAG('What is our vacation policy?')
console.log(answer)
// "According to our company handbook, employees receive 15 days of paid vacation per year."
4.3 Web Browsing: The Research Tool
Why Web Scraping for Agents?
LLMs have a knowledge cutoff date. To get current information, agents need to:
- Search the web
- Scrape and parse web pages
- Extract relevant information
Using Firecrawl
Firecrawl is a web scraping API designed for LLMs—it returns clean, markdown-formatted content.
Install:
npm install @mendable/firecrawl-js
Basic usage:
import FirecrawlApp from '@mendable/firecrawl-js'
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY })
async function scrapeWebsite(url: string) {
const result = await firecrawl.scrapeUrl(url, {
formats: ['markdown']
})
return result.markdown
}
// Use it
const content = await scrapeWebsite('https://example.com/article')
console.log(content)
Using Tavily (AI Search)
Tavily is a search API optimized for AI agents—it returns structured, relevant results.
Install:
npm install @tavily/core
Usage:
import { tavily } from '@tavily/core'
const client = tavily({ apiKey: process.env.TAVILY_API_KEY })
async function searchWeb(query: string) {
const response = await client.search(query, {
    searchDepth: 'advanced',
    maxResults: 5
})
return response.results.map(r => ({
title: r.title,
url: r.url,
content: r.content
}))
}
// Use it
const results = await searchWeb('Tesla Q4 2024 earnings')
console.log(results)
Building a Research Tool
Combine search + scraping:
import { tool } from 'ai'
import { z } from 'zod'
const researchTool = tool({
description: 'Research a topic by searching the web and reading relevant pages',
parameters: z.object({
query: z.string().describe('The research query')
}),
execute: async ({ query }) => {
// 1. Search the web
const searchResults = await searchWeb(query)
// 2. Scrape top 3 results
const scrapedContent = await Promise.all(
searchResults.slice(0, 3).map(async (result) => {
try {
const content = await scrapeWebsite(result.url)
return {
url: result.url,
title: result.title,
content: content.slice(0, 2000) // Limit content
}
} catch (error) {
return {
url: result.url,
title: result.title,
content: 'Could not scrape this page'
}
}
})
)
return {
query,
results: scrapedContent
}
}
})
Complete Research Agent
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'system',
content: 'You are a research assistant. Use the research tool to find current information.'
},
{
role: 'user',
content: 'What are the latest developments in AI agents?'
}
],
tools: { researchTool },
maxSteps: 5
})
console.log(text)
What happens:
1. Agent: "I need current info, let me search the web"
2. Calls researchTool({ query: 'latest AI agent developments 2024' })
3. Tool searches, scrapes top 3 articles
4. Agent reads the content
5. Agent: "Based on recent articles, the latest developments include..."
Combining RAG + Web Research
For the best of both worlds:
async function smartResearch(question: string) {
// 1. Check internal knowledge base first
const internalDocs = await searchDocuments(question)
if (internalDocs.length > 0 && internalDocs[0].similarity > 0.85) {
// High-confidence answer from internal docs
return answerWithRAG(question)
}
// 2. If not found internally, search the web
const webResults = await searchWeb(question)
const scrapedContent = await scrapeWebsite(webResults[0].url)
// 3. Generate answer with web content
const { text } = await generateText({
model: openaiProvider('gpt-4-turbo'),
messages: [
{
role: 'user',
content: `Based on this web content, answer: ${question}\n\nContent:\n${scrapedContent}`
}
]
})
// 4. Optionally, store the answer for future use
await addDocument(text, { source: 'web', query: question })
return text
}
Key Takeaways
- Short-term memory = conversation history (limited by tokens)
- Long-term memory = vector database (persistent, searchable)
- RAG = Retrieve relevant docs + Generate answers
- Web research tools let agents access current information
- Combine internal knowledge + web search for best results
For a deeper dive into how vector databases work and when to use them versus traditional SQL, check out our comprehensive guide: What Are Vector Databases? How They Work.
Exercise: Build a Document Q&A System
Create a system that can answer questions about uploaded PDFs:
- Extract text from PDFs
- Chunk the text into paragraphs
- Embed and store in Supabase
- Build a RAG agent that can answer questions
- Add citation (return which document/page the answer came from)
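If you want a starting point for the citation step, here is a sketch that reuses searchDocuments and generateText from earlier. It assumes each chunk was stored with source and chunkIndex metadata (as in the chunking sketch above), which is just one possible layout:
// Answer a question and report which stored chunks supported the answer
async function answerWithCitations(question: string) {
  const docs = await searchDocuments(question)

  const context = docs
    .map((doc, i) => `[${i + 1}] (${doc.metadata.source}, chunk ${doc.metadata.chunkIndex}):\n${doc.content}`)
    .join('\n\n')

  const { text } = await generateText({
    model: openaiProvider('gpt-4-turbo'),
    messages: [
      {
        role: 'system',
        content: 'Answer using only the numbered context passages and cite them like [1].'
      },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` }
    ]
  })

  return {
    answer: text,
    citations: docs.map(d => ({ source: d.metadata.source, chunk: d.metadata.chunkIndex }))
  }
}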
Next up: Module 5, where we build beautiful user interfaces with Next.js and the Vercel AI SDK.

