Module 5: Memory & RAG for Agents
Giving Agents Knowledge and Context
Introduction: Why Memory Matters
So far, our agents can reason, use tools, and follow stateful workflows. But they have a critical limitation: they forget everything between conversations and have no knowledge beyond their training data.
In this module, we solve both problems. We'll give agents short-term memory to maintain conversation context and long-term memory powered by vector stores to recall information across sessions. Then we'll combine these with Retrieval-Augmented Generation (RAG) to let agents reason over your own documents.
This is where agents go from impressive demos to genuinely useful systems.
5.1 Short-Term Memory: Conversation Context
The Problem
Without memory, every message to an agent is a fresh start:
# Without memory
agent.invoke({"input": "My name is Alice."})
# => "Nice to meet you, Alice!"
agent.invoke({"input": "What's my name?"})
# => "I don't know your name. Could you tell me?"
The agent has no recollection of previous messages. We need to pass conversation history along with each request.
ConversationBufferMemory
The simplest approach is to store the entire conversation and send it with every request (this is the pattern behind LangChain's classic ConversationBufferMemory):
from langchain_core.messages import HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
# Maintain a message history
conversation_history = []
def chat(user_input: str) -> str:
    conversation_history.append(HumanMessage(content=user_input))
    response = llm.invoke(conversation_history)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
# Now the agent remembers
print(chat("My name is Alice."))
# => "Nice to meet you, Alice!"
print(chat("What's my name?"))
# => "Your name is Alice!"
This works, but every message in the history consumes tokens. For long conversations, costs and latency grow quickly.
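A back-of-the-envelope sketch makes the growth concrete. Assuming (purely for illustration) that each exchange adds roughly 50 tokens and the full history is re-sent on every turn, the cumulative number of tokens processed grows quadratically with the number of turns:

```python
def tokens_sent(turns: int, tokens_per_exchange: int = 50) -> int:
    """Total tokens sent across all requests when the full
    history is re-sent on every turn (illustrative estimate)."""
    # Turn t re-sends all t exchanges so far: 1 + 2 + ... + turns
    return sum(t * tokens_per_exchange for t in range(1, turns + 1))

print(tokens_sent(10))   # 50 * (1 + 2 + ... + 10) = 2750
print(tokens_sent(100))  # 50 * 5050 = 252500
```

A 10-turn chat costs ~2,750 tokens total, but a 100-turn chat costs ~252,500: ten times the turns, nearly a hundred times the tokens. This is why the windowing and summarization strategies below exist.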
ConversationBufferWindowMemory
A practical improvement is to keep only the last N exchanges:
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
WINDOW_SIZE = 10 # Keep last 10 messages
system_message = SystemMessage(
    content="You are a helpful assistant. Be concise and friendly."
)
conversation_history = []
def chat(user_input: str) -> str:
    conversation_history.append(HumanMessage(content=user_input))
    # Trim to window size
    recent_messages = conversation_history[-WINDOW_SIZE:]
    # Always include system message
    messages = [system_message] + recent_messages
    response = llm.invoke(messages)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
ConversationSummaryMemory
For the best of both worlds, summarize older messages while keeping recent ones intact:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o")
conversation_history = []
summary = ""
def summarize_conversation(messages: list, existing_summary: str) -> str:
    """Use the LLM to summarize conversation history."""
    summary_prompt = f"""Progressively summarize the conversation, adding to the existing summary.
Current summary: {existing_summary}
New messages:
{chr(10).join(f'{m.type}: {m.content}' for m in messages)}
Updated summary:"""
    response = llm.invoke([HumanMessage(content=summary_prompt)])
    return response.content
def chat(user_input: str, window_size: int = 6) -> str:
    global summary
    conversation_history.append(HumanMessage(content=user_input))
    # If history exceeds window, summarize older messages
    if len(conversation_history) > window_size:
        older_messages = conversation_history[:-window_size]
        summary = summarize_conversation(older_messages, summary)
        # Keep only recent messages
        del conversation_history[:-window_size]
    system_content = "You are a helpful assistant."
    if summary:
        system_content += f"\n\nConversation summary so far: {summary}"
    messages = [SystemMessage(content=system_content)] + conversation_history
    response = llm.invoke(messages)
    conversation_history.append(AIMessage(content=response.content))
    return response.content
This approach keeps costs predictable while preserving important context from earlier in the conversation.
5.2 Long-Term Memory: Vector Stores
The Concept
Short-term memory handles a single conversation. But what if your agent needs to:
- Remember user preferences across sessions?
- Access a knowledge base of thousands of documents?
- Search through company policies or product documentation?
This is where vector stores come in.
What Are Embeddings?
An embedding represents a piece of text as a list of numbers (a vector). Similar texts produce similar vectors:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Generate embeddings
vec1 = embeddings.embed_query("The cat sat on the mat")
vec2 = embeddings.embed_query("A feline rested on the rug")
vec3 = embeddings.embed_query("Stock markets closed higher today")
print(f"Vector dimensions: {len(vec1)}") # 1536
# vec1 and vec2 will be very similar (both about cats)
# vec1 and vec3 will be very different (unrelated topics)
The key insight: semantic similarity becomes mathematical distance. We can find relevant documents by finding vectors that are close to our query vector.
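"Close" is usually measured with cosine similarity: the cosine of the angle between two vectors, where 1.0 means identical direction. Here is a minimal sketch using made-up 3-dimensional toy vectors (real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (values invented for illustration)
cat_sentence = [0.9, 0.8, 0.1]     # "The cat sat on the mat"
feline_sentence = [0.85, 0.75, 0.15]  # "A feline rested on the rug"
stocks_sentence = [0.1, 0.2, 0.9]  # "Stock markets closed higher today"

print(cosine_similarity(cat_sentence, feline_sentence))  # close to 1.0
print(cosine_similarity(cat_sentence, stocks_sentence))  # much lower
```

A vector store's `similarity_search` is essentially this computation applied across all stored vectors, with indexing tricks to avoid comparing against every document.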
Setting Up ChromaDB
ChromaDB is a lightweight, open-source vector database perfect for development:
pip install chromadb langchain-chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create a persistent vector store
vectorstore = Chroma(
    collection_name="my_knowledge_base",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
# Add documents
documents = [
    "Our return policy allows returns within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Premium members get 20% off all products.",
    "Customer support is available Monday through Friday, 9 AM to 5 PM.",
    "We accept Visa, Mastercard, and PayPal.",
]
vectorstore.add_texts(documents)
# Search for relevant documents
results = vectorstore.similarity_search("How do I return an item?", k=2)
for doc in results:
    print(doc.page_content)
# => "Our return policy allows returns within 30 days of purchase."
# => "Customer support is available Monday through Friday, 9 AM to 5 PM."
Setting Up FAISS
FAISS (Facebook AI Similarity Search) is another popular option, especially for large-scale applications:
pip install faiss-cpu langchain-community
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
documents = [
    "Python was created by Guido van Rossum in 1991.",
    "LangChain is a framework for building LLM applications.",
    "Vector databases store embeddings for similarity search.",
    "RAG combines retrieval with generation for accurate answers.",
]
# Create FAISS index from documents
vectorstore = FAISS.from_texts(documents, embeddings)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load from disk later
loaded_store = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,
)
# Search
results = loaded_store.similarity_search("What is LangChain?", k=2)
for doc in results:
    print(doc.page_content)
5.3 Retrieval-Augmented Generation (RAG)
The RAG Pattern
RAG is a three-step process:
1. RETRIEVE: Find relevant documents using vector similarity
2. AUGMENT: Add those documents to the LLM's context
3. GENERATE: Let the LLM answer using the retrieved information
This gives the LLM access to knowledge it was never trained on, with citations back to source material.
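Before wiring up real components, the pipeline shape can be sketched with no dependencies at all. The word-overlap "retriever" and the prompt-filling step below are crude stand-ins for a vector store and an LLM call; only the retrieve-augment-generate structure is the point:

```python
# Toy knowledge base (invented example documents)
KNOWLEDGE = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping is free for orders over $50.",
    "Support is open Monday to Friday.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Step 1: RETRIEVE. Score documents by word overlap with the question
    (a real system would use embedding similarity instead)."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(question: str, docs: list[str]) -> str:
    """Step 2: AUGMENT. Pack the retrieved documents into the prompt."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "When are returns accepted?"
prompt = augment(question, retrieve(question))
print(prompt)  # Step 3: GENERATE would send this prompt to the LLM
```

The real chain below follows exactly this shape, with a vector store doing the retrieval and an LLM doing the generation.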
Building a Basic RAG Chain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vectorstore with some knowledge
documents = [
    "LangGraph is a library for building stateful multi-agent applications.",
    "LangGraph uses a graph-based approach where nodes are functions and edges define flow.",
    "State in LangGraph is a TypedDict that flows through the graph.",
    "LangGraph supports conditional edges for dynamic routing between nodes.",
    "Human-in-the-loop can be implemented using interrupt_before in LangGraph.",
]
vectorstore = Chroma.from_texts(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Ask questions
answer = rag_chain.invoke("How does state work in LangGraph?")
print(answer)
# => "State in LangGraph is a TypedDict that flows through the graph..."
RAG with Source Citations
For production use, you often want to know which documents informed the answer:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(
    texts=["Returns accepted within 30 days.", "Free shipping over $50."],
    metadatas=[
        {"source": "returns-policy.pdf", "page": 1},
        {"source": "shipping-guide.pdf", "page": 3},
    ],
    embedding=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
def retrieve_with_sources(question: str):
    docs = retriever.invoke(question)
    context = "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
    template = """Answer based on the context below. Cite your sources.
Context:
{context}
Question: {question}"""
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": context, "question": question})
    sources = [doc.metadata.get("source") for doc in docs]
    return {"answer": answer, "sources": sources}
result = retrieve_with_sources("What is your return policy?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
5.4 RAG-Powered Agents
Agents That Reason Over Documents
The real power comes when you combine RAG with an agent that can decide when and how to search:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Build a knowledge base
product_docs = [
    "The Pro Plan costs $29/month and includes unlimited API calls.",
    "The Starter Plan is free and includes 1000 API calls per month.",
    "The Enterprise Plan has custom pricing and dedicated support.",
    "All plans include email support with 24-hour response time.",
    "Pro and Enterprise plans include priority chat support.",
    "API rate limits: Starter 10 req/s, Pro 100 req/s, Enterprise unlimited.",
]
vectorstore = Chroma.from_texts(product_docs, embeddings)
@tool
def search_knowledge_base(query: str) -> str:
    """Search the product knowledge base for relevant information."""
    docs = vectorstore.similarity_search(query, k=3)
    return "\n".join(doc.page_content for doc in docs)
@tool
def get_current_date() -> str:
    """Get the current date."""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")
# Create a RAG-powered agent
agent = create_react_agent(
    model=llm,
    tools=[search_knowledge_base, get_current_date],
    prompt="You are a helpful product support agent. Use the knowledge base to answer questions accurately.",
)
# The agent decides when to search
response = agent.invoke({
    "messages": [{"role": "user", "content": "What's the rate limit on the Pro plan?"}]
})
print(response["messages"][-1].content)
# The agent searches the KB and responds: "The Pro Plan has a rate limit of 100 requests per second."
5.5 Loading Real Documents
Processing PDFs
For real-world applications, you need to load and chunk actual documents:
pip install pypdf langchain-community
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()
print(f"Loaded {len(pages)} pages")
# Split into chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(pages)
print(f"Created {len(chunks)} chunks")
# Create vector store from chunks
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./handbook_db"
)
print("Knowledge base ready!")
Why Chunking Matters
Documents need to be split into smaller pieces because:
- Embedding quality: Shorter texts produce more focused embeddings
- Context windows: LLMs have token limits
- Precision: Smaller chunks mean more relevant retrieval results
The RecursiveCharacterTextSplitter tries to split at natural boundaries (paragraphs, sentences) rather than arbitrary character positions.
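The recursive idea can be sketched in a few lines: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This simplified version omits what the real splitter also does, namely merging small pieces back up to chunk_size and adding overlap:

```python
def recursive_split(text: str, max_len: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Simplified sketch of recursive splitting (no merging, no overlap)."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present, try a finer one
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_len, rest))
    return chunks

doc = "First paragraph about policy.\n\nSecond paragraph. It has two sentences."
for chunk in recursive_split(doc, max_len=40):
    print(repr(chunk))
```

Because paragraph breaks are tried before sentence breaks, the example splits at `\n\n` and each paragraph stays intact, which is exactly the "natural boundaries" behavior described above.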
Project: Document Q&A Agent
Let's build a complete document Q&A agent that loads PDFs, creates embeddings, and answers questions with citations.
Setup
mkdir document-qa-agent
cd document-qa-agent
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install langchain langchain-openai langchain-chroma langchain-community langgraph pypdf python-dotenv
Create .env:
OPENAI_API_KEY=your_api_key_here
The Complete Agent
Create document_qa_agent.py:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
load_dotenv()
llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# --- Step 1: Load and index documents ---
def build_knowledge_base(pdf_paths: list[str], persist_dir: str = "./qa_db") -> Chroma:
    """Load PDFs and create a searchable vector store."""
    all_chunks = []
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    for pdf_path in pdf_paths:
        print(f"Loading: {pdf_path}")
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()
        chunks = text_splitter.split_documents(pages)
        all_chunks.extend(chunks)
        print(f" -> {len(chunks)} chunks created")
    print(f"\nTotal chunks: {len(all_chunks)}")
    print("Building vector store...")
    vectorstore = Chroma.from_documents(
        documents=all_chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )
    print("Knowledge base ready!\n")
    return vectorstore
# --- Step 2: Define tools ---
vectorstore = None # Will be initialized in main()
@tool
def search_documents(query: str) -> str:
    """Search the loaded documents for information relevant to the query.
    Returns the most relevant passages with source information."""
    if vectorstore is None:
        return "No documents have been loaded yet."
    results = vectorstore.similarity_search(query, k=4)
    if not results:
        return "No relevant information found."
    output = []
    for i, doc in enumerate(results, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        output.append(f"[Result {i} | Source: {source}, Page: {page}]")
        output.append(doc.page_content)
        output.append("")
    return "\n".join(output)
@tool
def list_loaded_documents() -> str:
    """List all documents that have been loaded into the knowledge base."""
    if vectorstore is None:
        return "No documents loaded."
    docs = vectorstore.get()
    sources = set()
    for metadata in docs["metadatas"]:
        sources.add(metadata.get("source", "unknown"))
    return f"Loaded documents: {', '.join(sorted(sources))}"
# --- Step 3: Create the agent ---
def create_qa_agent():
    """Create a document Q&A agent with RAG capabilities."""
    system_prompt = """You are a helpful document Q&A assistant. Your job is to answer
questions based on the loaded documents.
Rules:
1. ALWAYS search the documents before answering a question.
2. Base your answers ONLY on information found in the documents.
3. If the documents don't contain the answer, say so clearly.
4. Always cite which document and page the information came from.
5. If a question is ambiguous, search for multiple interpretations.
Be thorough, accurate, and helpful."""
    agent = create_react_agent(
        model=llm,
        tools=[search_documents, list_loaded_documents],
        prompt=system_prompt,
    )
    return agent
# --- Step 4: Interactive chat loop ---
def main():
    global vectorstore
    # Load your PDFs here
    pdf_files = []
    # Check for PDFs in a ./docs directory
    docs_dir = "./docs"
    if os.path.exists(docs_dir):
        pdf_files = [
            os.path.join(docs_dir, f)
            for f in os.listdir(docs_dir)
            if f.endswith(".pdf")
        ]
    if not pdf_files:
        print("No PDF files found in ./docs directory.")
        print("Create a ./docs folder and add PDF files, then restart.")
        print("\nStarting with an empty knowledge base for demo purposes.\n")
        vectorstore = Chroma(
            embedding_function=embeddings,
            persist_directory="./qa_db",
        )
    else:
        vectorstore = build_knowledge_base(pdf_files)
    agent = create_qa_agent()
    print("Document Q&A Agent Ready!")
    print("Ask questions about your documents. Type 'quit' to exit.\n")
    message_history = []
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            print("Goodbye!")
            break
        if not user_input:
            continue
        message_history.append({"role": "user", "content": user_input})
        response = agent.invoke({"messages": message_history})
        assistant_message = response["messages"][-1].content
        print(f"\nAgent: {assistant_message}\n")
        message_history.append({"role": "assistant", "content": assistant_message})
if __name__ == "__main__":
main()
Running the Agent
# Create a docs folder and add PDFs
mkdir docs
# Copy your PDF files into the docs/ folder
# Run the agent
python document_qa_agent.py
Example Interaction
Loading: ./docs/company_handbook.pdf
-> 47 chunks created
Loading: ./docs/product_guide.pdf
-> 32 chunks created
Total chunks: 79
Building vector store...
Knowledge base ready!
Document Q&A Agent Ready!
Ask questions about your documents. Type 'quit' to exit.
You: What is the vacation policy?
Agent: Based on the company handbook (page 12), employees receive:
- 15 days of paid vacation per year for the first 3 years
- 20 days after 3 years of service
- 25 days after 7 years of service
Vacation days do not roll over to the next year (Source: company_handbook.pdf, Page 12).
You: What products do we offer?
Agent: According to the product guide, the company offers three main products:
1. **DataSync Pro** - Enterprise data synchronization (Page 3)
2. **CloudWatch** - Infrastructure monitoring (Page 8)
3. **APIForge** - API management platform (Page 15)
(Source: product_guide.pdf)
Key Takeaways
- Short-term memory keeps conversation context within a session using message history
- Embeddings convert text into numerical vectors where semantic similarity equals mathematical proximity
- Vector stores (ChromaDB, FAISS) enable fast similarity search over large document collections
- RAG grounds LLM responses in your actual data by retrieving relevant context before generating
- Chunking strategy matters: split documents at natural boundaries with appropriate overlap
- RAG-powered agents combine autonomous reasoning with document knowledge for accurate, cited answers
Exercise: Extend the Document Q&A Agent
Before moving to Module 6, try these enhancements:
- Add a tool that lets the agent summarize an entire document, not just answer specific questions
- Implement conversation memory so the agent remembers follow-up questions
- Add metadata filtering so the agent can search within a specific document
- Experiment with different chunk sizes and overlap values to see how they affect answer quality
Next up: Module 6, where we build multi-agent systems with agents that collaborate and delegate tasks.

