Load and Chunk Your Documents

Now we start working with your actual files. Before a document can be searched, it has to be read into your program and then split into small, bite-sized pieces. This lesson covers both: loading text from PDF and TXT files, and a simple, reliable way to chunk it. This is the first real code you will save and run.

What You'll Learn

How to set up a Python environment for the project
How to read text from a .txt file and a .pdf file
Why we split documents into chunks, and how big they should be
A simple chunking function you can copy and use

First: Set Up Your Project

Create a folder for the project and install the few libraries we need. In your terminal:

mkdir my-knowledge-base
cd my-knowledge-base
pip install chromadb pypdf

chromadb is the local database we will use to store and search embeddings (next lesson).
pypdf reads text out of PDF files.

We already have Ollama and the models from earlier, so this is everything else you need.

Loading a Text File

Reading a plain text file is the simplest case. Save this as load.py:

def load_txt(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

text = load_txt("notes.txt")
print(text[:300])   # print the first 300 characters to check it worked

open(...) opens the file, f.read() returns its whole contents as one big string, and printing the first 300 characters confirms it loaded. Put any .txt file named notes.txt in your folder and run python load.py to test.
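If a notes file was saved in a different encoding, or the file name is mistyped, the simple loader above stops with an error. Here is a minimal sketch of a more forgiving variant. The name load_txt_safe is made up for this illustration and is not part of the lesson's load.py; it relies only on Python's standard pathlib module.

```python
import tempfile
from pathlib import Path

def load_txt_safe(path):
    """Read a text file as UTF-8, replacing undecodable bytes instead of raising."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    # errors="replace" swaps each bad byte for U+FFFD rather than
    # raising UnicodeDecodeError partway through the file.
    return file_path.read_text(encoding="utf-8", errors="replace")

# Demo on a throwaway file that contains one byte that is not valid UTF-8.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "notes.txt"
    sample.write_bytes(b"caf\xe9 notes")
    print(ascii(load_txt_safe(sample)))   # 'caf\ufffd notes'
```

Replacing bad bytes keeps the pipeline running at the cost of a few odd characters. If exact text matters, it is safer to re-save the file as UTF-8 instead.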

Loading a PDF File

PDFs store text in pages, so we read them page by page and join the results. Add this to load.py:

from pypdf import PdfReader

def load_pdf(path):
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)

text = load_pdf("lecture.pdf")
print(text[:300])

PdfReader opens the PDF. We pull the text out of each page with extract_text(), then join all the pages into one string with newlines between them. The or "" swaps a missing or empty result for an empty string, so a page with no extractable text (like a scanned image page) cannot break the join.

Heads up: PDFs that are scanned images of pages contain no real text, so extract_text() returns nothing for them. For this course, use PDFs with selectable text (the kind where you can highlight words in a PDF reader). Your own typed notes and most downloaded articles work great.
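One way to spot a scanned PDF early is to check which pages came back empty. The sketch below is an illustration, not part of the lesson's code: find_empty_pages is a hypothetical helper, and sample_pages is a hand-written stand-in for the list that load_pdf builds from extract_text().

```python
def find_empty_pages(page_texts):
    """Return 1-based page numbers whose extracted text is empty or only whitespace."""
    return [
        number
        for number, page_text in enumerate(page_texts, start=1)
        if not (page_text or "").strip()
    ]

# Stand-in for: [page.extract_text() or "" for page in reader.pages]
sample_pages = ["Intro text", "", "   \n", "Summary"]

empty = find_empty_pages(sample_pages)
if empty:
    print(f"No extractable text on pages: {empty}")
```

If most pages show up in that list, the PDF is probably a scan and needs a different source file.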

Why We Chunk

You now have a document as one long string. Why not just hand the whole thing to the model? Two reasons:

Models have a limited reading window. They can only consider so much text at once, and a 40-page PDF will often exceed that limit, especially for small models running on your own machine.
Retrieval is more precise with small pieces. If your question is about one paragraph, you want to retrieve that paragraph, not the whole document. Small chunks mean the model gets focused, relevant context instead of a wall of text.

So we split the document into chunks, small overlapping windows of text. Each chunk gets its own embedding and can be retrieved on its own.

One long document: 40 pages of text
Split into chunks: ~800 characters each
Many small pieces: each searchable alone

How Big Should a Chunk Be?

A good starting point for beginners is about 800 characters per chunk, with a 100-character overlap between neighbors. The overlap matters: if an important sentence sits right at the boundary between two chunks, the overlap lets it appear whole in at least one of them (as long as the sentence is no longer than the overlap), instead of being cut in half.

There is no single perfect number. Smaller chunks are more precise but can lose context; larger chunks keep more context but are less focused. The 800/100 setting is a sensible default for notes and articles, and you can adjust it later.
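To get a feel for the trade-off, it helps to do the arithmetic. Each window moves forward by chunk_size minus overlap characters, so the number of chunks is roughly the text length divided by that step. The sketch below assumes a hypothetical 100,000-character document. estimate_chunk_count is an illustrative helper, and its result can be off by one at the tail of the text.

```python
import math

def estimate_chunk_count(text_length, chunk_size=800, overlap=100):
    """Rough number of chunks for a text of text_length characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap   # how far each window moves forward
    return math.ceil(text_length / step)

# Compare three settings on a hypothetical 100,000-character document.
for size, overlap in [(400, 50), (800, 100), (1600, 200)]:
    count = estimate_chunk_count(100_000, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: about {count} chunks")
```

Halving the chunk size roughly doubles the number of pieces to embed and store, which is one practical cost of going smaller.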

A Simple Chunking Function

Here is a small, readable chunker. It walks through the text in steps, taking a window of chunk_size characters each time and moving forward by slightly less than that to create the overlap. Add it to your load.py:

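def chunk_text(text, chunk_size=800, overlap=100):
    # The step (chunk_size - overlap) must be positive, or the loop would never advance.
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])   # slicing past the end of a string is safe
        if end >= len(text):
            break   # this window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap   # step forward, leaving an overlap
    return chunks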

pieces = chunk_text(text)
print(f"Made {len(pieces)} chunks")
print("First chunk:\n", pieces[0])

text[start:end] grabs one window of characters. We add it to the list, then advance start by chunk_size - overlap so the next window overlaps the previous one. The loop ends when we reach the end of the text.

See Chunking in Action

A tiny version of the same logic, run on a short string, helps build intuition. Each chunk shares a few characters with the one before it: the last characters of one chunk reappear at the start of the next. That is the overlap doing its job, so that text near a boundary is not stranded in a cut-off fragment.
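Here is a small sketch you can run locally to see the overlap. It uses the same sliding-window idea with deliberately tiny numbers (10-character chunks, 3 characters of overlap) on a made-up string. tiny_chunks is an illustrative name, and this version stops as soon as a window reaches the end of the text.

```python
def tiny_chunks(text, chunk_size=10, overlap=3):
    """Sliding-window chunker with tiny sizes, for building intuition."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break                      # the last window reached the end of the text
        start += chunk_size - overlap  # move forward by 7, so 3 characters repeat
    return chunks

demo = tiny_chunks("abcdefghijklmnopqrstu")
for piece in demo:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstu
```

The first chunk ends with hij and the second begins with hij; the same happens with opq between the second and third.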

Putting It Together

At this point your load.py can: read a TXT or PDF into one string, then split that string into a list of overlapping chunks. That list of chunks is exactly what the next lesson feeds into Chroma to be embedded and stored. You have built the first half of the pipeline.
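The pieces can be wired into a single entry point. The sketch below is one possible arrangement, not the course's official layout. load_and_chunk is a hypothetical name, the file type is guessed from the extension, and the chunker here stops once a window reaches the end of the text. pypdf is imported inside load_pdf so that plain-text use does not require it.

```python
from pathlib import Path

def load_txt(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def load_pdf(path):
    from pypdf import PdfReader   # imported here so .txt-only use does not need pypdf
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text, chunk_size=800, overlap=100):
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break                       # window reached the end of the text
        start += chunk_size - overlap   # step forward, leaving an overlap
    return chunks

def load_and_chunk(path, chunk_size=800, overlap=100):
    """Load a .txt or .pdf file and return its list of overlapping chunks."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        text = load_txt(path)
    elif suffix == ".pdf":
        text = load_pdf(path)
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return chunk_text(text, chunk_size, overlap)
```

Calling load_and_chunk("notes.txt") or load_and_chunk("lecture.pdf") would then return the list of chunks for that file.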

Key Takeaways

Install the project tools with pip install chromadb pypdf.
Read .txt files with open().read() and .pdf files with pypdf's PdfReader and extract_text().
Use PDFs with selectable text; scanned-image PDFs have no extractable text.
Chunking splits a long document into small pieces so retrieval is precise and the retrieved text fits the model's reading window.
A good beginner default is about 800 characters per chunk with 100 characters of overlap, so a short sentence at a chunk boundary still appears whole in one chunk.
The output is a list of chunks, ready to be embedded and stored in the next lesson.
