Document Preparation: The Art of Chunking
Introduction
Before we can search our knowledge base, we need to build it. The first step is document preparation—taking raw documents and transforming them into chunks that are suitable for embedding and retrieval.
This process is often called "chunking," and it's more art than science. The choices you make here fundamentally affect your RAG system's quality. Poor chunking leads to poor retrieval, which leads to poor answers.
In this lesson, you'll learn why chunking matters, explore different chunking strategies, and understand how to choose the right approach for your use case.
The Data Ingestion Challenge
Diverse Document Formats
Real-world knowledge bases contain diverse document types:
Structured Documents:
- Markdown files with clear headings
- HTML pages with semantic tags
- JSON/YAML configuration with schemas
Semi-Structured Documents:
- PDFs with mixed layouts
- Word documents with inconsistent formatting
- Spreadsheets with embedded text
Unstructured Documents:
- Plain text files
- Scanned documents (OCR)
- Email archives
Each format requires different handling, but they all need to end up as clean text chunks with meaningful boundaries.
The Document Loading Pipeline
Before chunking, documents must be loaded and cleaned:
Raw File → Parse/Extract → Clean Text → Chunked Text → Ready for Embedding
Parsing Examples:
// Markdown: Usually clean, minimal processing needed
import { readFile } from 'fs/promises';
const markdown = await readFile('docs/guide.md', 'utf-8');
// PDF: Requires a text-extraction library (pdf-lib only creates and
// modifies PDFs; use something like pdf-parse to extract text)
import pdf from 'pdf-parse';
const pdfText = (await pdf(await readFile('manual.pdf'))).text;
// HTML: Strip tags, preserve structure
import { JSDOM } from 'jsdom';
const dom = new JSDOM(html);
const text = dom.window.document.body.textContent;
Cleaning Operations:
- Remove excessive whitespace
- Fix encoding issues
- Strip irrelevant content (headers, footers, navigation)
- Normalize formatting
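These operations are content-dependent, but a minimal cleaning pass for plain text might look like the sketch below (`cleanText` and its rules are illustrative, not a standard API):
function cleanText(raw: string): string {
  return raw
    // Normalize Windows/old-Mac line endings to '\n'
    .replace(/\r\n?/g, '\n')
    // Collapse runs of three or more newlines into one paragraph break
    .replace(/\n{3,}/g, '\n\n')
    // Collapse runs of spaces and tabs into a single space
    .replace(/[ \t]+/g, ' ')
    // Drop trailing whitespace at line ends
    .replace(/ +\n/g, '\n')
    .trim();
}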
Why Chunking Matters
Context Window Limits
LLMs have a maximum context window—the amount of text they can process at once. While modern models have large windows (128K+ tokens), there are good reasons not to fill them:
Cost: API pricing is often per-token. Stuffing the context with irrelevant text wastes money.
Quality: More context isn't always better. Irrelevant information can confuse the model and lead to worse answers.
Speed: Larger contexts take longer to process.
Retrieval Granularity
Chunking determines the granularity of your retrieval. Consider searching for "how to configure SSL":
Entire Document as One Chunk:
- You retrieve a 50-page document
- 49 pages are irrelevant
- Context is diluted
Paragraphs as Chunks:
- You retrieve the specific paragraph about SSL
- Context is focused
- LLM can answer precisely
Too Small (Sentences):
- You retrieve fragments
- Missing context needed for complete answer
- LLM struggles to synthesize
The Goldilocks Problem
Chunks must be:
- Large enough to contain complete, coherent information
- Small enough to be specific and focused
This balance depends on:
- The nature of your content
- The types of questions users ask
- Your embedding model's capabilities
Chunking Strategies
Let's explore the main approaches, from simple to sophisticated.
Fixed-Size Chunking
Approach: Split text at fixed character/token counts.
function fixedSizeChunk(text: string, chunkSize: number): string[] {
const chunks: string[] = [];
for (let i = 0; i < text.length; i += chunkSize) {
chunks.push(text.slice(i, i + chunkSize));
}
return chunks;
}
// Example (documentText holds the loaded document text)
const chunks = fixedSizeChunk(documentText, 1000); // 1000 characters each
Pros:
- Simple to implement
- Predictable chunk sizes
- Easy to reason about
Cons:
- Ignores semantic boundaries
- Can split mid-sentence
- Related content may span chunks
Best for:
- Uniform, flowing text
- Quick prototyping
- When semantic structure is unclear
Fixed-Size with Overlap
Approach: Fixed-size chunks with overlapping content between adjacent chunks.
function fixedSizeWithOverlap(
  text: string,
  chunkSize: number,
  overlap: number
): string[] {
  // Guard: a step of zero or less would loop forever
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break;
  }
  return chunks;
}
// Example: 1000-character chunks with 200 characters of overlap
const chunks = fixedSizeWithOverlap(documentText, 1000, 200);
Why overlap?
Overlap ensures that content near chunk boundaries isn't orphaned:
Chunk 1: [........context A........|overlap|]
Chunk 2: [|overlap|........context B........]
If a question relates to the overlap region, both chunks might be retrieved, providing complete context.
Typical overlap: 10-20% of chunk size (e.g., 200 characters for 1000-character chunks). With those settings, successive chunks start 800 characters apart, so each chunk repeats the last 200 characters of its predecessor.
Best for:
- Improving retrieval near boundaries
- Most general-purpose chunking
Recursive Character Splitting
Approach: Split on semantic boundaries (paragraphs, sentences, words) recursively until chunks meet size requirements.
function recursiveSplit(
  text: string,
  maxSize: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ']
): string[] {
  if (text.length <= maxSize) {
    return [text];
  }
  // Try each separator in order, from coarsest to finest
  for (const separator of separators) {
    const parts = text.split(separator);
    if (parts.length > 1) {
      // Greedily merge parts back together up to maxSize
      const chunks: string[] = [];
      let currentChunk = '';
      for (const part of parts) {
        const addition = currentChunk ? separator + part : part;
        if ((currentChunk + addition).length <= maxSize) {
          currentChunk += addition;
        } else {
          if (currentChunk) chunks.push(currentChunk);
          currentChunk = part;
        }
      }
      if (currentChunk) chunks.push(currentChunk);
      // Any single part still over the limit recurses with finer separators
      return chunks.flatMap(chunk =>
        chunk.length > maxSize
          ? recursiveSplit(chunk, maxSize, separators.slice(1))
          : [chunk]
      );
    }
  }
  // Fallback: force a character-level split
  return fixedSizeChunk(text, maxSize);
}
How it works:
- Try splitting on paragraph breaks (`\n\n`)
- If chunks are still too large, split on line breaks (`\n`)
- Then sentence boundaries (`. `)
- Then word boundaries (` `)
- Last resort: character-level split
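For example, reusing the function above (`documentText` is a placeholder for your loaded document text):
// Split into chunks of at most 800 characters
const chunks = recursiveSplit(documentText, 800);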
Pros:
- Respects semantic boundaries
- Flexible sizing
- Widely used and well-understood
Cons:
- More complex implementation
- Results vary with input structure
Best for:
- Documentation with clear structure
- Mixed content types
- Most production systems
Semantic Chunking
Approach: Use embedding similarity to find natural topic boundaries.
Concept:
- Split into sentences
- Embed each sentence
- Compare adjacent sentence embeddings
- Large similarity drops indicate topic changes
- Group sentences between drops into chunks
// Conceptual implementation
async function semanticChunk(text: string): Promise<string[]> {
// 1. Split into sentences
const sentences = text.match(/[^.!?]+[.!?]+/g) || [];
// 2. Embed each sentence
const embeddings = await Promise.all(
sentences.map(s => embedText(s))
);
// 3. Find similarity between adjacent sentences
const similarities: number[] = [];
for (let i = 0; i < embeddings.length - 1; i++) {
similarities.push(
cosineSimilarity(embeddings[i], embeddings[i + 1])
);
}
// 4. Find breakpoints (low similarity = topic change)
const threshold = calculateThreshold(similarities);
const breakpoints = similarities
.map((sim, i) => sim < threshold ? i + 1 : -1)
.filter(i => i !== -1);
// 5. Group sentences into chunks
const chunks: string[] = [];
let start = 0;
for (const breakpoint of breakpoints) {
chunks.push(sentences.slice(start, breakpoint).join(' '));
start = breakpoint;
}
chunks.push(sentences.slice(start).join(' '));
return chunks;
}
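The helpers above (`embedText`, `cosineSimilarity`, `calculateThreshold`) are placeholders. As one possible heuristic for the threshold, you can treat the lowest similarities as boundaries — a minimal sketch, assuming a percentile-based cutoff:
// Treat roughly the lowest 10% of adjacent-sentence similarities as
// topic boundaries. Mean minus one standard deviation is another common choice.
function calculateThreshold(similarities: number[]): number {
  if (similarities.length === 0) return 0;
  const sorted = [...similarities].sort((a, b) => a - b);
  const index = Math.min(Math.floor(sorted.length * 0.1), sorted.length - 1);
  return sorted[index];
}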
Pros:
- True semantic boundaries
- Optimal for topical content
- Chunks are coherent units of meaning
Cons:
- Expensive (requires embedding each sentence)
- Complex implementation
- May produce variable chunk sizes
Best for:
- Long-form content with topic shifts
- Academic papers
- Books and lengthy documentation
Structure-Aware Chunking
Approach: Use document structure (headings, sections) to define chunk boundaries.
// For Markdown
interface Chunk {
  title: string;
  content: string;
  source?: string;
}

function markdownChunk(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  // Split on H1-H3 headings; each section begins with its heading text
  const sections = markdown.split(/^#{1,3}\s+/m);
  for (const section of sections) {
    const lines = section.split('\n');
    const title = lines[0];
    const content = lines.slice(1).join('\n').trim();
    if (content.length > 0) {
      chunks.push({
        title,
        content,
        source: 'docs'
      });
    }
  }
  return chunks;
}
For HTML:
import { JSDOM } from 'jsdom';

function htmlChunk(html: string): Chunk[] {
  const dom = new JSDOM(html);
  const chunks: Chunk[] = [];
  // Chunk by semantic HTML elements
  const sections = dom.window.document.querySelectorAll(
    'section, article, .content-block'
  );
  sections.forEach(section => {
    chunks.push({
      title: section.querySelector('h1, h2, h3')?.textContent || '',
      content: section.textContent?.trim() || ''
    });
  });
  return chunks;
}
Pros:
- Leverages author's structure
- Natural boundaries
- Preserves section context
Cons:
- Requires structured input
- Chunk sizes can vary widely
- Not all documents have structure
Best for:
- Documentation with clear hierarchy
- Structured content (HTML, Markdown)
- API references
Choosing Your Strategy
Decision Framework
Ask these questions to choose a chunking strategy:
1. What does your content look like?
| Content Type | Recommended Strategy |
|---|---|
| Markdown documentation | Structure-aware |
| Plain text articles | Recursive character |
| Technical manuals | Structure-aware + size limits |
| Transcripts/logs | Fixed-size with overlap |
| Mixed formats | Recursive character (most flexible) |
2. How long are typical documents?
| Document Size | Consideration |
|---|---|
| < 1000 tokens | May not need chunking |
| 1000-5000 tokens | Simple strategies work |
| > 5000 tokens | Consider semantic boundaries |
3. What questions will users ask?
| Question Type | Chunk Strategy |
|---|---|
| Specific facts | Smaller chunks (300-500 tokens) |
| Conceptual explanations | Medium chunks (500-1000 tokens) |
| Procedural how-tos | Structure-aware (preserve steps) |
Recommended Starting Point
For most applications, start with:
const config = {
strategy: 'recursive',
maxChunkSize: 800, // tokens (roughly 600 words)
overlap: 100, // tokens
separators: ['\n\n', '\n', '. ', ' ']
};
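Since `recursiveSplit` above measures characters, one way to apply a token-based config is the common rough heuristic of about four characters per token for English text (an approximation, not an exact conversion):
const CHARS_PER_TOKEN = 4; // rough heuristic for English text

const chunks = recursiveSplit(
  documentText,
  config.maxChunkSize * CHARS_PER_TOKEN, // ~3200 characters
  config.separators
);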
Then iterate based on retrieval quality.
Chunk Metadata
Effective chunking isn't just about the text—it's about preserving context through metadata.
Essential Metadata Fields
interface ChunkMetadata {
// Source tracking
source: string; // File name or URL
title: string; // Document/section title
// Position tracking
chunkIndex: number; // Position in document
totalChunks: number; // Total chunks from document
// Hierarchy (if structure-aware)
parentSection?: string; // Parent heading
headingLevel?: number; // H1, H2, H3, etc.
// Timestamps
createdAt: Date;
documentDate?: Date; // Original document date
// Custom fields
category?: string;
tags?: string[];
}
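As an illustration of how these fields get populated, an ingestion pass might pair each chunk with its metadata like this (the file name and title here are placeholders):
const documentChunks = recursiveSplit(documentText, 3200);

const records = documentChunks.map((content, i) => {
  const metadata: ChunkMetadata = {
    source: 'docs/getting-started.md',
    title: 'Getting Started',
    chunkIndex: i,
    totalChunks: documentChunks.length,
    createdAt: new Date()
  };
  return { content, metadata };
});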
Why Metadata Matters
Attribution: When the LLM generates a response, metadata lets you show "Source: getting-started.md, Section: Installation"
Filtering: Metadata enables scoped search: "Search only in the API documentation"
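For example, given the `records` array from the sketch above, a scoped search could narrow candidates by source before ranking (real vector stores expose metadata filters that do this server-side):
// Keep only chunks from the API documentation
const apiChunks = records.filter(r =>
  r.metadata.source.startsWith('docs/api/')
);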
Context Reconstruction: Knowing chunk position helps retrieve surrounding chunks if needed:
// If chunk 5 is relevant, also fetch chunks 4 and 6
const surroundingChunks = await fetchChunks([
chunkIndex - 1,
chunkIndex,
chunkIndex + 1
]);
Parent-Child Relationships
For hierarchical documents, consider storing parent context:
interface HierarchicalChunk {
id: string;
content: string;
parentId?: string;
parentContent?: string; // Store summarized parent content
children?: string[];
}
This enables:
- Retrieving a chunk with its context
- Navigating document hierarchy
- More sophisticated retrieval strategies
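For instance, at retrieval time you might prepend the summarized parent content so the LLM sees the chunk in context — a minimal sketch over the interface above:
// Combine a chunk with its (summarized) parent content, if any
function withParentContext(chunk: HierarchicalChunk): string {
  return chunk.parentContent
    ? `${chunk.parentContent}\n\n${chunk.content}`
    : chunk.content;
}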
Summary
In this lesson, we explored the critical process of document preparation and chunking:
Key Takeaways:
- Chunking quality directly affects RAG quality: Poor chunks = poor retrieval = poor answers
- There's no one-size-fits-all strategy: Choose based on content type, document size, and query patterns
- Recursive character splitting is a solid default: Works well for most content types
- Overlap prevents information loss: Use 10-20% overlap to ensure boundary content is preserved
- Metadata is crucial: Source, title, and position enable attribution and filtering
- Start simple, iterate: Begin with a reasonable strategy and refine based on actual retrieval quality
Next Steps
In the next lesson, we'll take our chunks and convert them into vectors. You'll learn the complete vectorization and storage process—from generating embeddings to storing them efficiently in Supabase with pgvector.
"The way you organize information determines the way you can retrieve it." — Unknown

