Module 5: Memory & RAG for Agents
Giving Agents Knowledge and Context
Introduction: Why Memory Matters
So far, our agents can reason, use tools, and follow stateful workflows. But they have a critical limitation: they forget everything between conversations and have no knowledge beyond their training data.
In this module, we solve both problems. We'll give agents short-term memory to maintain conversation context and long-term memory powered by vector stores to recall information across sessions. Then we'll combine these with Retrieval-Augmented Generation (RAG) to let agents reason over your own documents.
This is where agents go from impressive demos to genuinely useful systems.
5.1 Short-Term Memory: Conversation Context
The Problem
Without memory, every message to an agent is a fresh start:
# Without memory
agent.invoke({"input": "My name is Alice."})
# => "Nice to meet you, Alice!"
agent.invoke({"input": "What's my name?"})
# => "I don't know your name. Could you tell me?"
The agent has no recollection of previous messages. We need to pass conversation history along with each request.
ConversationBufferMemory
The simplest approach is to store the entire conversation and send it with every request (this is the pattern behind LangChain's classic ConversationBufferMemory):
from langchain_core.messages import HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
# Maintain a message history
conversation_history = []
def chat(user_input: str) -> str:
    conversation_history.append(HumanMessage(content=user_input))
    response = llm.invoke(conversation_history)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
# Now the agent remembers
print(chat("My name is Alice."))
# => "Nice to meet you, Alice!"
print(chat("What's my name?"))
# => "Your name is Alice!"
This works, but every message in the history consumes tokens. For long conversations, costs and latency grow quickly.
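A back-of-the-envelope sketch makes the growth concrete. Assuming (purely for illustration) that each exchange adds roughly 50 tokens and the full history is re-sent on every turn, the cumulative number of tokens processed grows quadratically with the number of turns:

```python
def tokens_sent(turns: int, tokens_per_exchange: int = 50) -> int:
    """Total tokens sent across all requests when the full
    history is re-sent on every turn (illustrative estimate)."""
    # Turn t re-sends all t exchanges so far: 1 + 2 + ... + turns
    return sum(t * tokens_per_exchange for t in range(1, turns + 1))

print(tokens_sent(10))   # 50 * (1 + 2 + ... + 10) = 2750
print(tokens_sent(100))  # 50 * 5050 = 252500
```

A 10-turn chat costs ~2,750 tokens total, but a 100-turn chat costs ~252,500: ten times the turns, nearly a hundred times the tokens. This is why the windowing and summarization strategies below exist.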
ConversationBufferWindowMemory
A practical improvement is to keep only the last N exchanges:
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
WINDOW_SIZE = 10 # Keep last 10 messages
system_message = SystemMessage(
    content="You are a helpful assistant. Be concise and friendly."
)
conversation_history = []
def chat(user_input: str) -> str:
    conversation_history.append(HumanMessage(content=user_input))
    # Trim to window size
    recent_messages = conversation_history[-WINDOW_SIZE:]
    # Always include system message
    messages = [system_message] + recent_messages
    response = llm.invoke(messages)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
ConversationSummaryMemory
For the best of both worlds, summarize older messages while keeping recent ones intact:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o")
conversation_history = []
summary = ""
def summarize_conversation(messages: list, existing_summary: str) -> str:
    """Use the LLM to summarize conversation history."""
    summary_prompt = f"""Progressively summarize the conversation, adding to the existing summary.
Current summary: {existing_summary}
New messages:
{chr(10).join(f'{m.type}: {m.content}' for m in messages)}
Updated summary:"""
    response = llm.invoke([HumanMessage(content=summary_prompt)])
    return response.content
def chat(user_input: str, window_size: int = 6) -> str:
    global summary
    conversation_history.append(HumanMessage(content=user_input))
    # If history exceeds window, summarize older messages
    if len(conversation_history) > window_size:
        older_messages = conversation_history[:-window_size]
        summary = summarize_conversation(older_messages, summary)
        # Keep only recent messages
        del conversation_history[:-window_size]
    system_content = "You are a helpful assistant."
    if summary:
        system_content += f"\n\nConversation summary so far: {summary}"
    messages = [SystemMessage(content=system_content)] + conversation_history
    response = llm.invoke(messages)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
This approach keeps costs predictable while preserving important context from earlier in the conversation.
5.2 Long-Term Memory: Vector Stores
The Concept
Short-term memory handles a single conversation. But what if your agent needs to:
- Remember user preferences across sessions?
- Access a knowledge base of thousands of documents?
- Search through company policies or product documentation?
This is where vector stores come in.
What Are Embeddings?
An embedding represents a piece of text as a list of numbers (a vector). Similar texts produce similar vectors:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Generate embeddings
vec1 = embeddings.embed_query("The cat sat on the mat")
vec2 = embeddings.embed_query("A feline rested on the rug")
vec3 = embeddings.embed_query("Stock markets closed higher today")
print(f"Vector dimensions: {len(vec1)}") # 1536
# vec1 and vec2 will be very similar (both about cats)
# vec1 and vec3 will be very different (unrelated topics)
The key insight: semantic similarity becomes mathematical distance. We can find relevant documents by finding vectors that are close to our query vector.
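"Close" is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means identical direction. Here is a minimal sketch using made-up 3-dimensional toy vectors (real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (values invented for illustration)
cat_sentence = [0.9, 0.8, 0.1]     # "The cat sat on the mat"
feline_sentence = [0.85, 0.75, 0.15]  # "A feline rested on the rug"
stocks_sentence = [0.1, 0.2, 0.9]  # "Stock markets closed higher today"

print(cosine_similarity(cat_sentence, feline_sentence))  # close to 1.0
print(cosine_similarity(cat_sentence, stocks_sentence))  # much lower
```

A vector store's `similarity_search` is essentially this computation applied across all stored vectors, with indexing tricks to avoid comparing against every document.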
Setting Up ChromaDB
ChromaDB is a lightweight, open-source vector database perfect for development:
pip install chromadb langchain-chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create a persistent vector store
vectorstore = Chroma(
    collection_name="my_knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
# Add documents
documents = [
    "Our return policy allows returns within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Premium members get 20% off all products.",
    "Customer support is available Monday through Friday, 9 AM to 5 PM.",
    "We accept Visa, Mastercard, and PayPal.",
]
vectorstore.add_texts(documents)
# Search for relevant documents
results = vectorstore.similarity_search("How do I return an item?", k=2)
for doc in results:
    print(doc.page_content)
# => "Our return policy allows returns within 30 days of purchase."
# => "Customer support is available Monday through Friday, 9 AM to 5 PM."
Setting Up FAISS
FAISS (Facebook AI Similarity Search) is another popular option, especially for large-scale applications:
pip install faiss-cpu langchain-community
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
documents = [
    "Python was created by Guido van Rossum in 1991.",
    "LangChain is a framework for building LLM applications.",
    "Vector databases store embeddings for similarity search.",
    "RAG combines retrieval with generation for accurate answers.",
]
# Create FAISS index from documents
vectorstore = FAISS.from_texts(documents, embeddings)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load from disk later
loaded_store = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,
)
# Search
results = loaded_store.similarity_search("What is LangChain?", k=2)
for doc in results:
    print(doc.page_content)
5.3 Retrieval-Augmented Generation (RAG)
The RAG Pattern
RAG is a three-step process:
1. RETRIEVE: Find relevant documents using vector similarity
2. AUGMENT: Add those documents to the LLM's context
3. GENERATE: Let the LLM answer using the retrieved information
This gives the LLM access to knowledge it was never trained on, with citations back to source material.
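Before wiring up real components, the pipeline shape can be sketched with no dependencies at all. The word-overlap "retriever" and the prompt-filling step below are crude stand-ins for a vector store and an LLM call; only the retrieve-augment-generate structure is the point:

```python
# Toy knowledge base (invented example documents)
KNOWLEDGE = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Support is open Monday to Friday.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Step 1: RETRIEVE. Score documents by word overlap with the question
    (a real system would use embedding similarity instead)."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(question: str, docs: list[str]) -> str:
    """Step 2: AUGMENT. Pack the retrieved documents into the prompt."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "When are returns accepted?"
prompt = augment(question, retrieve(question))
print(prompt)  # Step 3: GENERATE would send this prompt to the LLM
```

The real chain below follows exactly this shape, with a vector store doing the retrieval and an LLM doing the generation.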
Building a Basic RAG Chain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vectorstore with some knowledge
documents = [
    "LangGraph is a library for building stateful multi-agent applications.",
    "LangGraph uses a graph-based approach where nodes are functions and edges define flow.",
    "State in LangGraph is a TypedDict that flows through the graph.",
    "LangGraph supports conditional edges for dynamic routing between nodes.",
    "Human-in-the-loop can be implemented using interrupt_before in LangGraph.",
]
vectorstore = Chroma.from_texts(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Ask questions
answer = rag_chain.invoke("How does state work in LangGraph?")
print(answer)
# => "State in LangGraph is a TypedDict that flows through the graph..."
RAG with Source Citations
For production use, you often want to know which documents informed the answer:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(
    texts=["Returns accepted within 30 days.", "Free shipping over $50."],
    metadatas=[
        {"source": "returns-policy.pdf", "page": 1},
        {"source": "shipping-guide.pdf", "page": 3},
    ],
    embedding=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
def retrieve_with_sources(question: str):
    docs = retriever.invoke(question)
    context = "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
    template = """Answer based on the context below. Cite your sources.
Context:
{context}
Question: {question}"""
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": context, "question": question})
    sources = [doc.metadata.get("source") for doc in docs]
    return {"answer": answer, "sources": sources}
result = retrieve_with_sources("What is your return policy?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
5.4 RAG-Powered Agents
Agents That Reason Over Documents
The real power comes when you combine RAG with an agent that can decide when and how to search:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Build a knowledge base
product_docs = [
    "The Pro Plan costs $29/month and includes unlimited API calls.",
    "The Starter Plan is free and includes 1000 API calls per month.",
    "The Enterprise Plan has custom pricing and dedicated support.",
    "All plans include email support with 24-hour response time.",
    "Pro and Enterprise plans include priority chat support.",
    "API rate limits: Starter 10 req/s, Pro 100 req/s, Enterprise unlimited.",
]
vectorstore = Chroma.from_texts(product_docs, embeddings)
@tool
def search_knowledge_base(query: str) -> str:
    """Search the product knowledge base for relevant information."""
    docs = vectorstore.similarity_search(query, k=3)
    return "\n".join(doc.page_content for doc in docs)
@tool
def get_current_date() -> str:
    """Get the current date."""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")
# Create a RAG-powered agent
agent = create_react_agent(
    model=llm,
    tools=[search_knowledge_base, get_current_date],
    prompt="You are a helpful product support agent. Use the knowledge base to answer questions accurately.",
)
# The agent decides when to search
response = agent.invoke({
    "messages": [{"role": "user", "content": "What's the rate limit on the Pro plan?"}]
})
print(response["messages"][-1].content)
# The agent searches the KB and responds: "The Pro Plan has a rate limit of 100 requests per second."
5.5 Loading Real Documents
Processing PDFs
For real-world applications, you need to load and chunk actual documents:
pip install pypdf langchain-community
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()
print(f"Loaded {len(pages)} pages")
# Split into chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(pages)
print(f"Created {len(chunks)} chunks")
# Create vector store from chunks
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./handbook_db"
)
print("Knowledge base ready!")
Why Chunking Matters
Documents need to be split into smaller pieces because:
- Embedding quality: Shorter texts produce more focused embeddings
- Context windows: LLMs have token limits
- Precision: Smaller chunks mean more relevant retrieval results
The RecursiveCharacterTextSplitter tries to split at natural boundaries (paragraphs, sentences) rather than arbitrary character positions.
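The recursive idea can be sketched in a few lines: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This simplified version omits what the real splitter also does, namely merging small pieces back up to chunk_size and adding overlap:

```python
def recursive_split(text: str, max_len: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Simplified sketch of recursive splitting (no merging, no overlap)."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present, try a finer one
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_len, rest))
    return chunks

doc = "First paragraph about policy.\n\nSecond paragraph. It has two sentences."
for chunk in recursive_split(doc, max_len=40):
    print(repr(chunk))
```

Because paragraph breaks are tried before sentence breaks, the example splits at `\n\n` and each paragraph stays intact, which is exactly the "natural boundaries" behavior described above.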
Project: Document Q&A Agent
Let's build a complete document Q&A agent that loads PDFs, creates embeddings, and answers questions with citations.
Setup
mkdir document-qa-agent
cd document-qa-agent
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install langchain langchain-openai langchain-chroma langchain-community langgraph pypdf python-dotenv
Create .env:
OPENAI_API_KEY=your_api_key_here
The Complete Agent
Create document_qa_agent.py:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
load_dotenv()
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# --- Step 1: Load and index documents ---
def build_knowledge_base(pdf_paths: list[str], persist_dir: str = "./qa_db") -> Chroma:
    """Load PDFs and create a searchable vector store."""
    all_chunks = []
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    for pdf_path in pdf_paths:
        print(f"Loading: {pdf_path}")
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()
        chunks = text_splitter.split_documents(pages)
        all_chunks.extend(chunks)
        print(f" -> {len(chunks)} chunks created")
    print(f"\nTotal chunks: {len(all_chunks)}")
    print("Building vector store...")
    vectorstore = Chroma.from_documents(
        documents=all_chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )
    print("Knowledge base ready!\n")
    return vectorstore
# --- Step 2: Define tools ---
vectorstore = None # Will be initialized in main()
@tool
def search_documents(query: str) -> str:
    """Search the loaded documents for information relevant to the query.
    Returns the most relevant passages with source information."""
    if vectorstore is None:
        return "No documents have been loaded yet."
    results = vectorstore.similarity_search(query, k=4)
    if not results:
        return "No relevant information found."
    output = []
    for i, doc in enumerate(results, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        output.append(f"[Result {i} | Source: {source}, Page: {page}]")
        output.append(doc.page_content)
        output.append("")
    return "\n".join(output)
@tool
def list_loaded_documents() -> str:
    """List all documents that have been loaded into the knowledge base."""
    if vectorstore is None:
        return "No documents loaded."
    docs = vectorstore.get()
    sources = set()
    for metadata in docs["metadatas"]:
        sources.add(metadata.get("source", "unknown"))
    return f"Loaded documents: {', '.join(sorted(sources))}"
# --- Step 3: Create the agent ---
def create_qa_agent():
    """Create a document Q&A agent with RAG capabilities."""
    system_prompt = """You are a helpful document Q&A assistant. Your job is to answer
questions based on the loaded documents.
Rules:
1. ALWAYS search the documents before answering a question.
2. Base your answers ONLY on information found in the documents.
3. If the documents don't contain the answer, say so clearly.
4. Always cite which document and page the information came from.
5. If a question is ambiguous, search for multiple interpretations.
Be thorough, accurate, and helpful."""
    agent = create_react_agent(
        model=llm,
        tools=[search_documents, list_loaded_documents],
        prompt=system_prompt,
    )
    return agent
# --- Step 4: Interactive chat loop ---
def main():
    global vectorstore
    # Load your PDFs here
    pdf_files = []
    # Check for PDFs in a ./docs directory
    docs_dir = "./docs"
    if os.path.exists(docs_dir):
        pdf_files = [
            os.path.join(docs_dir, f)
            for f in os.listdir(docs_dir)
            if f.endswith(".pdf")
        ]
    if not pdf_files:
        print("No PDF files found in ./docs directory.")
        print("Create a ./docs folder and add PDF files, then restart.")
        print("\nStarting with an empty knowledge base for demo purposes.\n")
        vectorstore = Chroma(
            embedding_function=embeddings,
            persist_directory="./qa_db",
        )
    else:
        vectorstore = build_knowledge_base(pdf_files)
    agent = create_qa_agent()
    print("Document Q&A Agent Ready!")
    print("Ask questions about your documents. Type 'quit' to exit.\n")
    message_history = []
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            print("Goodbye!")
            break
        if not user_input:
            continue
        message_history.append({"role": "user", "content": user_input})
        response = agent.invoke({"messages": message_history})
        assistant_message = response["messages"][-1].content
        print(f"\nAgent: {assistant_message}\n")
        message_history.append({"role": "assistant", "content": assistant_message})
if __name__ == "__main__":
main()
Running the Agent
# Create a docs folder and add PDFs
mkdir docs
# Copy your PDF files into the docs/ folder
# Run the agent
python document_qa_agent.py
Example Interaction
Loading: ./docs/company_handbook.pdf
-> 47 chunks created
Loading: ./docs/product_guide.pdf
-> 32 chunks created
Total chunks: 79
Building vector store...
Knowledge base ready!
Document Q&A Agent Ready!
Ask questions about your documents. Type 'quit' to exit.
You: What is the vacation policy?
Agent: Based on the company handbook (page 12), employees receive:
- 15 days of paid vacation per year for the first 3 years
- 20 days after 3 years of service
- 25 days after 7 years of service
Vacation days do not roll over to the next year (Source: company_handbook.pdf, Page 12).
You: What products do we offer?
Agent: According to the product guide, the company offers three main products:
1. **DataSync Pro** - Enterprise data synchronization (Page 3)
2. **CloudWatch** - Infrastructure monitoring (Page 8)
3. **APIForge** - API management platform (Page 15)
(Source: product_guide.pdf)
Key Takeaways
- Short-term memory keeps conversation context within a session using message history
- Embeddings convert text into numerical vectors where semantic similarity equals mathematical proximity
- Vector stores (ChromaDB, FAISS) enable fast similarity search over large document collections
- RAG grounds LLM responses in your actual data by retrieving relevant context before generating
- Chunking strategy matters: split documents at natural boundaries with appropriate overlap
- RAG-powered agents combine autonomous reasoning with document knowledge for accurate, cited answers
Exercise: Extend the Document Q&A Agent
Before moving to Module 6, try these enhancements:
- Add a tool that lets the agent summarize an entire document, not just answer specific questions
- Implement conversation memory so the agent remembers follow-up questions
- Add metadata filtering so the agent can search within a specific document
- Experiment with different chunk sizes and overlap values to see how they affect answer quality
Next up: Module 6, where we build multi-agent systems with agents that collaborate and delegate tasks.

