Document Preparation: The Art of Chunking
Introduction
Before we can search our knowledge base, we need to build it. The first step is document preparation—taking raw documents and transforming them into chunks that are suitable for embedding and retrieval.
This process is often called "chunking," and it's more art than science. The choices you make here fundamentally affect your RAG system's quality. Poor chunking leads to poor retrieval, which leads to poor answers.
In this lesson, you'll learn why chunking matters, explore different chunking strategies, and understand how to choose the right approach for your use case.
The Data Ingestion Challenge
Diverse Document Formats
Real-world knowledge bases contain diverse document types:
Structured Documents:
- Markdown files with clear headings
- HTML pages with semantic tags
- JSON/YAML configuration with schemas
Semi-Structured Documents:
- PDFs with mixed layouts
- Word documents with inconsistent formatting
- Spreadsheets with embedded text
Unstructured Documents:
- Plain text files
- Scanned documents (OCR)
- Email archives
Each format requires different handling, but they all need to end up as clean text chunks with meaningful boundaries.
The Document Loading Pipeline
Before chunking, documents must be loaded and cleaned:
Raw File → Parse/Extract → Clean Text → Chunked Text → Ready for Embedding
Parsing Examples:
// Markdown: Usually clean, minimal processing needed
import { readFile } from 'fs/promises';
const markdown = await readFile('docs/guide.md', 'utf-8');
// PDF: Requires a text-extraction library (pdf-lib only creates and
// modifies PDFs; use something like pdf-parse to extract text)
import pdf from 'pdf-parse';
const pdfText = (await pdf(await readFile('manual.pdf'))).text;
// HTML: Strip tags, preserve structure
import { JSDOM } from 'jsdom';
const dom = new JSDOM(html);
const text = dom.window.document.body.textContent;
Cleaning Operations:
- Remove excessive whitespace
- Fix encoding issues
- Strip irrelevant content (headers, footers, navigation)
- Normalize formatting
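These operations are content-dependent, but a minimal cleaning pass for plain text might look like the sketch below (`cleanText` and its rules are illustrative, not a standard API):
function cleanText(raw: string): string {
  return raw
    // Normalize Windows/old-Mac line endings to '\n'
    .replace(/\r\n?/g, '\n')
    // Collapse runs of three or more newlines into one paragraph break
    .replace(/\n{3,}/g, '\n\n')
    // Collapse runs of spaces and tabs into a single space
    .replace(/[ \t]+/g, ' ')
    // Drop trailing whitespace at line ends
    .replace(/ +\n/g, '\n')
    .trim();
}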
Why Chunking Matters
Context Window Limits
LLMs have a maximum context window—the amount of text they can process at once. While modern models have large windows (128K+ tokens), there are good reasons not to fill them:
Cost: API pricing is often per-token. Stuffing the context with irrelevant text wastes money.
Quality: More context isn't always better. Irrelevant information can confuse the model and lead to worse answers.
Speed: Larger contexts take longer to process.
Retrieval Granularity
Chunking determines the granularity of your retrieval. Consider searching for "how to configure SSL":
Entire Document as One Chunk:
- You retrieve a 50-page document
- 49 pages are irrelevant
- Context is diluted
Paragraphs as Chunks:
- You retrieve the specific paragraph about SSL
- Context is focused
- LLM can answer precisely
Too Small (Sentences):
- You retrieve fragments
- Missing context needed for complete answer
- LLM struggles to synthesize
The Goldilocks Problem
Chunks must be:
- Large enough to contain complete, coherent information
- Small enough to be specific and focused
This balance depends on:
- The nature of your content
- The types of questions users ask
- Your embedding model's capabilities
Chunking Strategies
Let's explore the main approaches, from simple to sophisticated.
Fixed-Size Chunking
Approach: Split text at fixed character/token counts.
function fixedSizeChunk(text: string, chunkSize: number): string[] {
const chunks: string[] = [];
for (let i = 0; i < text.length; i += chunkSize) {
chunks.push(text.slice(i, i + chunkSize));
}
return chunks;
}
// Example (documentText holds the loaded document text)
const chunks = fixedSizeChunk(documentText, 1000); // 1000 characters each
Pros:
- Simple to implement
- Predictable chunk sizes
- Easy to reason about
Cons:
- Ignores semantic boundaries
- Can split mid-sentence
- Related content may span chunks
Best for:
- Uniform, flowing text
- Quick prototyping
- When semantic structure is unclear
Fixed-Size with Overlap
Approach: Fixed-size chunks with overlapping content between adjacent chunks.
function fixedSizeWithOverlap(
  text: string,
  chunkSize: number,
  overlap: number
): string[] {
  // Guard: a step of zero or less would loop forever
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break;
  }
  return chunks;
}
// Example: 1000-character chunks with 200 characters of overlap
const chunks = fixedSizeWithOverlap(documentText, 1000, 200);
Why overlap?
Overlap ensures that content near chunk boundaries isn't orphaned:
Chunk 1: [........context A........|overlap|]
Chunk 2: [|overlap|........context B........]
If a question relates to the overlap region, both chunks might be retrieved, providing complete context.
Typical overlap: 10-20% of chunk size (e.g., 200 characters for 1000-character chunks). With those settings, successive chunks start 800 characters apart, so each chunk repeats the last 200 characters of its predecessor.
Best for:
- Improving retrieval near boundaries
- Most general-purpose chunking
Recursive Character Splitting
Approach: Split on semantic boundaries (paragraphs, sentences, words) recursively until chunks meet size requirements.
function recursiveSplit(
  text: string,
  maxSize: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ']
): string[] {
  if (text.length <= maxSize) {
    return [text];
  }
  // Try each separator in order, from coarsest to finest
  for (const separator of separators) {
    const parts = text.split(separator);
    if (parts.length > 1) {
      // Greedily merge parts back together up to maxSize
      const chunks: string[] = [];
      let currentChunk = '';
      for (const part of parts) {
        const addition = currentChunk ? separator + part : part;
        if ((currentChunk + addition).length <= maxSize) {
          currentChunk += addition;
        } else {
          if (currentChunk) chunks.push(currentChunk);
          currentChunk = part;
        }
      }
      if (currentChunk) chunks.push(currentChunk);
      // Any single part still over the limit recurses with finer separators
      return chunks.flatMap(chunk =>
        chunk.length > maxSize
          ? recursiveSplit(chunk, maxSize, separators.slice(1))
          : [chunk]
      );
    }
  }
  // Fallback: force a character-level split
  return fixedSizeChunk(text, maxSize);
}
How it works:
- Try splitting on paragraph breaks (`\n\n`)
- If chunks are still too large, split on line breaks (`\n`)
- Then sentence boundaries (`. `)
- Then word boundaries (` `)
- Last resort: character-level split
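For example, reusing the function above (`documentText` is a placeholder for your loaded document text):
// Split into chunks of at most 800 characters
const chunks = recursiveSplit(documentText, 800);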
Pros:
- Respects semantic boundaries
- Flexible sizing
- Widely used and well-understood
Cons:
- More complex implementation
- Results vary with input structure
Best for:
- Documentation with clear structure
- Mixed content types
- Most production systems
Semantic Chunking
Approach: Use embedding similarity to find natural topic boundaries.
Concept:
- Split into sentences
- Embed each sentence
- Compare adjacent sentence embeddings
- Large similarity drops indicate topic changes
- Group sentences between drops into chunks
// Conceptual implementation
async function semanticChunk(text: string): Promise<string[]> {
// 1. Split into sentences
const sentences = text.match(/[^.!?]+[.!?]+/g) || [];
// 2. Embed each sentence
const embeddings = await Promise.all(
sentences.map(s => embedText(s))
);
// 3. Find similarity between adjacent sentences
const similarities: number[] = [];
for (let i = 0; i < embeddings.length - 1; i++) {
similarities.push(
cosineSimilarity(embeddings[i], embeddings[i + 1])
);
}
// 4. Find breakpoints (low similarity = topic change)
const threshold = calculateThreshold(similarities);
const breakpoints = similarities
.map((sim, i) => sim < threshold ? i + 1 : -1)
.filter(i => i !== -1);
// 5. Group sentences into chunks
const chunks: string[] = [];
let start = 0;
for (const breakpoint of breakpoints) {
chunks.push(sentences.slice(start, breakpoint).join(' '));
start = breakpoint;
}
chunks.push(sentences.slice(start).join(' '));
return chunks;
}
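The helpers above (`embedText`, `cosineSimilarity`, `calculateThreshold`) are placeholders. As one possible heuristic for the threshold, you can treat the lowest similarities as boundaries — a minimal sketch, assuming a percentile-based cutoff:
// Treat roughly the lowest 10% of adjacent-sentence similarities as
// topic boundaries. Mean minus one standard deviation is another common choice.
function calculateThreshold(similarities: number[]): number {
  if (similarities.length === 0) return 0;
  const sorted = [...similarities].sort((a, b) => a - b);
  const index = Math.min(Math.floor(sorted.length * 0.1), sorted.length - 1);
  return sorted[index];
}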
Pros:
- True semantic boundaries
- Optimal for topical content
- Chunks are coherent units of meaning
Cons:
- Expensive (requires embedding each sentence)
- Complex implementation
- May produce variable chunk sizes
Best for:
- Long-form content with topic shifts
- Academic papers
- Books and lengthy documentation
Structure-Aware Chunking
Approach: Use document structure (headings, sections) to define chunk boundaries.
// For Markdown
interface Chunk {
  title: string;
  content: string;
  source?: string;
}

function markdownChunk(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  // Split on H1-H3 headings; each section begins with its heading text
  const sections = markdown.split(/^#{1,3}\s+/m);
  for (const section of sections) {
    const lines = section.split('\n');
    const title = lines[0];
    const content = lines.slice(1).join('\n').trim();
    if (content.length > 0) {
      chunks.push({
        title,
        content,
        source: 'docs'
      });
    }
  }
  return chunks;
}
For HTML:
import { JSDOM } from 'jsdom';

function htmlChunk(html: string): Chunk[] {
  const dom = new JSDOM(html);
  const chunks: Chunk[] = [];
  // Chunk by semantic HTML elements
  const sections = dom.window.document.querySelectorAll(
    'section, article, .content-block'
  );
  sections.forEach(section => {
    chunks.push({
      title: section.querySelector('h1, h2, h3')?.textContent || '',
      content: section.textContent?.trim() || ''
    });
  });
  return chunks;
}
Pros:
- Leverages author's structure
- Natural boundaries
- Preserves section context
Cons:
- Requires structured input
- Chunk sizes can vary widely
- Not all documents have structure
Best for:
- Documentation with clear hierarchy
- Structured content (HTML, Markdown)
- API references
Choosing Your Strategy
Decision Framework
Ask these questions to choose a chunking strategy:
1. What does your content look like?
| Content Type | Recommended Strategy |
|---|---|
| Markdown documentation | Structure-aware |
| Plain text articles | Recursive character |
| Technical manuals | Structure-aware + size limits |
| Transcripts/logs | Fixed-size with overlap |
| Mixed formats | Recursive character (most flexible) |
2. How long are typical documents?
| Document Size | Consideration |
|---|---|
| < 1000 tokens | May not need chunking |
| 1000-5000 tokens | Simple strategies work |
| > 5000 tokens | Consider semantic boundaries |
3. What questions will users ask?
| Question Type | Chunk Strategy |
|---|---|
| Specific facts | Smaller chunks (300-500 tokens) |
| Conceptual explanations | Medium chunks (500-1000 tokens) |
| Procedural how-tos | Structure-aware (preserve steps) |
Recommended Starting Point
For most applications, start with:
const config = {
strategy: 'recursive',
maxChunkSize: 800, // tokens (roughly 600 words)
overlap: 100, // tokens
separators: ['\n\n', '\n', '. ', ' ']
};
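Since `recursiveSplit` above measures characters, one way to apply a token-based config is the common rough heuristic of about four characters per token for English text (an approximation, not an exact conversion):
const CHARS_PER_TOKEN = 4; // rough heuristic for English text

const chunks = recursiveSplit(
  documentText,
  config.maxChunkSize * CHARS_PER_TOKEN, // ~3200 characters
  config.separators
);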
Then iterate based on retrieval quality.
Chunk Metadata
Effective chunking isn't just about the text—it's about preserving context through metadata.
Essential Metadata Fields
interface ChunkMetadata {
// Source tracking
source: string; // File name or URL
title: string; // Document/section title
// Position tracking
chunkIndex: number; // Position in document
totalChunks: number; // Total chunks from document
// Hierarchy (if structure-aware)
parentSection?: string; // Parent heading
headingLevel?: number; // H1, H2, H3, etc.
// Timestamps
createdAt: Date;
documentDate?: Date; // Original document date
// Custom fields
category?: string;
tags?: string[];
}
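As an illustration of how these fields get populated, an ingestion pass might pair each chunk with its metadata like this (the file name and title here are placeholders):
const documentChunks = recursiveSplit(documentText, 3200);

const records = documentChunks.map((content, i) => {
  const metadata: ChunkMetadata = {
    source: 'docs/getting-started.md',
    title: 'Getting Started',
    chunkIndex: i,
    totalChunks: documentChunks.length,
    createdAt: new Date()
  };
  return { content, metadata };
});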
Why Metadata Matters
Attribution: When the LLM generates a response, metadata lets you show "Source: getting-started.md, Section: Installation"
Filtering: Metadata enables scoped search: "Search only in the API documentation"
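For example, given the `records` array from the sketch above, a scoped search could narrow candidates by source before ranking (real vector stores expose metadata filters that do this server-side):
// Keep only chunks from the API documentation
const apiChunks = records.filter(r =>
  r.metadata.source.startsWith('docs/api/')
);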
Context Reconstruction: Knowing chunk position helps retrieve surrounding chunks if needed:
// If chunk 5 is relevant, also fetch chunks 4 and 6
const surroundingChunks = await fetchChunks([
chunkIndex - 1,
chunkIndex,
chunkIndex + 1
]);
Parent-Child Relationships
For hierarchical documents, consider storing parent context:
interface HierarchicalChunk {
id: string;
content: string;
parentId?: string;
parentContent?: string; // Store summarized parent content
children?: string[];
}
This enables:
- Retrieving a chunk with its context
- Navigating document hierarchy
- More sophisticated retrieval strategies
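For instance, at retrieval time you might prepend the summarized parent content so the LLM sees the chunk in context — a minimal sketch over the interface above:
// Combine a chunk with its (summarized) parent content, if any
function withParentContext(chunk: HierarchicalChunk): string {
  return chunk.parentContent
    ? `${chunk.parentContent}\n\n${chunk.content}`
    : chunk.content;
}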
Summary
In this lesson, we explored the critical process of document preparation and chunking:
Key Takeaways:
- Chunking quality directly affects RAG quality: Poor chunks = poor retrieval = poor answers
- There's no one-size-fits-all strategy: Choose based on content type, document size, and query patterns
- Recursive character splitting is a solid default: Works well for most content types
- Overlap prevents information loss: Use 10-20% overlap to ensure boundary content is preserved
- Metadata is crucial: Source, title, and position enable attribution and filtering
- Start simple, iterate: Begin with a reasonable strategy and refine based on actual retrieval quality
Next Steps
In the next lesson, we'll take our chunks and convert them into vectors. You'll learn the complete vectorization and storage process—from generating embeddings to storing them efficiently in Supabase with pgvector.
"The way you organize information determines the way you can retrieve it." — Unknown

