Load and Chunk Your Documents
Now we start working with your actual files. Before a document can be searched, it has to be read into your program and then split into small, bite-sized pieces. This lesson covers both: loading text from PDF and TXT files, and a simple, reliable way to chunk it. This is the first real code you will save and run.
What You'll Learn
- How to set up a Python environment for the project
- How to read text from a .txt file and a .pdf file
- Why we split documents into chunks, and how big they should be
- A simple chunking function you can copy and use
First: Set Up Your Project
Create a folder for the project and install the few libraries we need. In your terminal:
mkdir my-knowledge-base
cd my-knowledge-base
pip install chromadb pypdf
- chromadb is the local database we will use to store and search embeddings (next lesson).
- pypdf reads text out of PDF files.
We already have Ollama and the models from earlier, so this is everything else you need.
Loading a Text File
Reading a plain text file is the simplest case. Save this as load.py:
def load_txt(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

text = load_txt("notes.txt")
print(text[:300])  # print the first 300 characters to check it worked
open(...) opens the file, f.read() returns its whole contents as one big string, and printing the first 300 characters confirms it loaded. Put any .txt file named notes.txt in your folder and run python load.py to test.
Loading a PDF File
PDFs store text in pages, so we read them page by page and join the results. Add this to load.py:
from pypdf import PdfReader

def load_pdf(path):
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)

text = load_pdf("lecture.pdf")
print(text[:300])
PdfReader opens the PDF, we pull the text out of each page with extract_text(), and we join all the pages into one string with newlines between them. The or "" guards against pages that have no extractable text (like a scanned image page) so the program does not crash.
Heads up: PDFs that are scanned images of pages contain no real text, so extract_text() returns nothing for them. For this course, use PDFs with selectable text (the kind where you can highlight words in a PDF reader). Your own typed notes and most downloaded articles work great.
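To check whether a PDF has pages like that, you could look for pages whose extracted text is empty. The sketch below works on a plain list of page strings, so it runs without a PDF; the helper name find_empty_pages and the sample data are invented for illustration.

```python
def find_empty_pages(page_texts):
    """Return 1-based page numbers whose extracted text is empty or only whitespace."""
    return [
        number
        for number, page_text in enumerate(page_texts, start=1)
        if not (page_text or "").strip()
    ]

# Stand-in for [page.extract_text() for page in reader.pages]
sample_pages = ["Intro text", "", None, "   \n", "Conclusion"]
print(find_empty_pages(sample_pages))  # → [2, 3, 4]
```

If most page numbers come back in that list, the PDF is probably a scan and will not work well for this course.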
Why We Chunk
You now have a document as one long string. Why not just hand the whole thing to the model? Two reasons:
- Models have a limited reading window. They can only consider so much text at once, and a long document such as a 40-page PDF may not fit.
- Retrieval is more precise with small pieces. If your question is about one paragraph, you want to retrieve that paragraph, not the whole document. Small chunks mean the model gets focused, relevant context instead of a wall of text.
So we split the document into chunks, small overlapping windows of text. Each chunk gets its own embedding and can be retrieved on its own.
- One long document: 40 pages of text
- Split into chunks: ~800 characters each
- Many small pieces: each searchable alone
How Big Should a Chunk Be?
A good starting point for beginners is about 800 characters per chunk, with a 100-character overlap between neighbors. The overlap matters: with these defaults, a sentence shorter than the 100-character overlap that sits right at the boundary between two chunks appears whole in at least one of them, instead of being cut in half. Longer sentences can still be split.
There is no single perfect number. Smaller chunks are more precise but can lose context; larger chunks keep more context but are less focused. The 800/100 setting is a sensible default for notes and articles, and you can adjust it later.
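To get a feel for how these two numbers translate into a chunk count, here is a rough estimate. It assumes a chunker that starts a new chunk every chunk_size - overlap characters; the function name estimate_chunk_count and the sample lengths are arbitrary illustrations, not measurements of real documents.

```python
import math

def estimate_chunk_count(n_chars, chunk_size=800, overlap=100):
    """Estimate chunks produced when a new chunk starts every (chunk_size - overlap) characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap  # distance between the starts of neighboring chunks
    return math.ceil(n_chars / step)

for n_chars in (500, 5_000, 100_000):
    print(n_chars, "characters ->", estimate_chunk_count(n_chars), "chunks")
# 500 characters -> 1 chunks
# 5000 characters -> 8 chunks
# 100000 characters -> 143 chunks
```

A larger overlap shrinks the step, so the same document produces more chunks to embed and store.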
A Simple Chunking Function
Here is a small, readable chunker. It walks through the text in steps, taking a window of chunk_size characters each time and moving forward by slightly less than that to create the overlap. Add it to your load.py:
def chunk_text(text, chunk_size=800, overlap=100):
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, leaving an overlap
    return chunks

pieces = chunk_text(text)
print(f"Made {len(pieces)} chunks")
print("First chunk:\n", pieces[0])
text[start:end] grabs one window of characters. We add it to the list, then advance start by chunk_size - overlap so the next window overlaps the previous one. The loop ends when we reach the end of the text.
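One side effect of this loop: when the text ends shortly after a window does, the last chunk can contain only characters that already appear at the end of the previous chunk. For example, an 800-character text with the defaults gives two chunks, and the second is just the final 100 characters again. This is harmless, but if you would rather avoid it, one possible variation stops as soon as a window reaches the end of the text. The name chunk_text_no_tail is hypothetical and this variant is not part of the lesson's load.py.

```python
def chunk_text_no_tail(text, chunk_size=800, overlap=100):
    """Like the lesson's chunker, but stops once a window reaches the end of the text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window already covers the rest of the text
        start += step
    return chunks

print(len(chunk_text_no_tail("x" * 800)))  # → 1
```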
See Chunking in Action
To build intuition, try the same logic on a short string with a small chunk size and overlap, for example the 26 letters of the alphabet with chunk_size=10 and overlap=3. Notice how each chunk shares a few characters with the one before it.
Watch the last characters of one chunk reappear at the start of the next. That is the overlap doing its job, reducing the chance that an idea is lost at a boundary.
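Here is a tiny version you can run yourself. It repeats the lesson's chunk_text so the snippet works on its own; the alphabet string and the 10/3 settings are chosen only to keep the output short.

```python
def chunk_text(text, chunk_size=800, overlap=100):
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, leaving an overlap
    return chunks

pieces = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for i, piece in enumerate(pieces):
    print(i, piece)
# 0 abcdefghij
# 1 hijklmnopq
# 2 opqrstuvwx
# 3 vwxyz
```

The letters hij, opq, and vwx each appear twice: once at the end of a chunk and once at the start of the next.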
Putting It Together
At this point your load.py can: read a TXT or PDF into one string, then split that string into a list of overlapping chunks. That list of chunks is exactly what the next lesson feeds into Chroma to be embedded and stored. You have built the first half of the pipeline.
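As a rough sketch of how the pieces could fit together, the code below picks a loader from the file extension and then chunks the result. The names load_document and load_and_chunk are invented for this example, and choosing the loader by extension is an assumption, not something the lesson's load.py does.

```python
from pathlib import Path

def load_document(path):
    """Load a .txt or .pdf file into one string, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    if suffix == ".pdf":
        from pypdf import PdfReader  # imported here so TXT-only use does not need pypdf
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported file type: {suffix!r}")

def chunk_text(text, chunk_size=800, overlap=100):
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, leaving an overlap
    return chunks

def load_and_chunk(path, chunk_size=800, overlap=100):
    """Read a document and return its list of overlapping chunks."""
    return chunk_text(load_document(path), chunk_size, overlap)
```

With a notes.txt file in your folder, pieces = load_and_chunk("notes.txt") would give you the list of chunks in one call.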
Key Takeaways
- Install the project tools with pip install chromadb pypdf.
- Read .txt files with open().read() and .pdf files with pypdf's PdfReader and extract_text().
- Use PDFs with selectable text; scanned-image PDFs have no extractable text.
- Chunking splits a long document into small pieces so retrieval is precise and fits the model's reading window.
- A good beginner default is about 800 characters per chunk with 100 characters of overlap, so a short sentence at a boundary is not cut in half.
- The output is a list of chunks, ready to be embedded and stored in the next lesson.

