Module 7: Setting Up Chroma (Local/Free)
The Developer-Friendly Vector Database
Introduction
Chroma is an open-source vector database designed for developer experience. It's perfect for local development, prototyping, and small-to-medium production deployments.
By the end of this module, you'll have:
- Chroma running locally
- Working code for the JavaScript client
- An understanding of when to use Chroma
7.1 Why Chroma?
Advantages
- Dead simple to start: Single line to get running
- Embedded mode: Runs in-process, no server needed (Python client)
- Persistent storage: Data survives restarts
- Free and open source: No costs, full control
- Great documentation: Well-maintained docs
Limitations
- No managed cloud: You host it yourself
- Scaling limits: Not designed for massive scale
- Fewer features: Less advanced than Pinecone/Qdrant
- Single-node only: No built-in clustering
When to Choose Chroma
- Local development and testing
- Prototyping AI features
- Small production datasets (< 1M vectors)
- When you want zero infrastructure
- Learning vector databases
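The "< 1M vectors" guideline is easy to sanity-check with back-of-envelope arithmetic. A sketch, assuming 4-byte float32 components and OpenAI's 1536-dimension text-embedding-3-small (your model's dimension may differ, and this ignores index and metadata overhead):

```typescript
// Rough raw memory footprint of a vector collection held in RAM.
// Assumes 4-byte float32 components; excludes index/metadata overhead.
function estimateVectorMemoryGB(numVectors: number, dimensions: number): number {
  const bytes = numVectors * dimensions * 4
  return bytes / 1e9
}

// 1M vectors at 1536 dimensions ≈ 6.1 GB of raw vector data
console.log(estimateVectorMemoryGB(1_000_000, 1536)) // → 6.144
```

At that scale a single node is still workable, but it shows why Chroma's sweet spot is datasets that fit comfortably in one machine's memory.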
7.2 Installation Options
Option 1: Embedded (Easiest)
In embedded mode, Chroma runs inside your process with no server to manage. Note that embedded mode is a feature of the Python client; the JavaScript client always connects to a Chroma server (see Options 2 and 3):
pip install chromadb
import chromadb
# Runs embedded in-process; data persists to disk
client = chromadb.PersistentClient(path='./chroma-data')
For JavaScript projects, install the client and run a server using one of the options below:
npm install chromadb chromadb-default-embed
Option 2: Docker (Server Mode)
docker run -p 8000:8000 chromadb/chroma
import { ChromaClient } from 'chromadb'
const client = new ChromaClient({
path: 'http://localhost:8000'
})
Option 3: Chroma CLI
pip install chromadb
chroma run --path ./chroma-data --port 8000
7.3 Chroma Concepts
Collections
A collection is like a table—a group of related embeddings:
// Get or create a collection
const collection = await client.getOrCreateCollection({
name: 'documents',
metadata: { description: 'My document embeddings' }
})
Documents
Each document has:
- ids: Unique identifiers (required)
- documents: Original text (optional; Chroma can embed it for you)
- embeddings: Pre-computed embeddings (optional if documents are provided)
- metadatas: Key-value pairs for filtering
await collection.add({
ids: ['doc-1', 'doc-2'],
documents: ['First document text', 'Second document text'],
metadatas: [
{ category: 'tutorial' },
{ category: 'guide' }
]
})
Built-in Embedding
Chroma can generate embeddings automatically:
// Uses default embedding function
const collection = await client.getOrCreateCollection({
name: 'documents'
// No embedding function = uses default
})
// Just add documents, Chroma embeds them
await collection.add({
ids: ['doc-1'],
documents: ['This will be embedded automatically']
})
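Under the hood, an embedding function in the JS client is essentially an object with a generate method mapping an array of texts to an array of vectors. A toy implementation with that shape, useful for offline tests (the character-hash scheme below is purely illustrative and captures no semantics):

```typescript
// A deterministic toy embedding function with the shape the Chroma JS
// client expects: generate(texts) -> Promise<number[][]>.
// The char-code hashing below is illustrative only; never use it in
// place of a real embedding model.
class ToyEmbeddingFunction {
  constructor(private dimensions: number = 8) {}

  async generate(texts: string[]): Promise<number[][]> {
    return texts.map(text => {
      const vector = new Array(this.dimensions).fill(0)
      for (let i = 0; i < text.length; i++) {
        vector[i % this.dimensions] += text.charCodeAt(i) / 1000
      }
      return vector
    })
  }
}

// Hypothetical usage:
// const collection = await client.getOrCreateCollection({
//   name: 'test',
//   embeddingFunction: new ToyEmbeddingFunction()
// })
```

Swapping in a stub like this lets you exercise add/query plumbing without paying for API calls.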
7.4 Basic Operations
Setting Up with OpenAI Embeddings
import { ChromaClient } from 'chromadb'
import { OpenAIEmbeddingFunction } from 'chromadb'
import dotenv from 'dotenv'
dotenv.config()
const client = new ChromaClient()
// Use OpenAI embeddings
const embeddingFunction = new OpenAIEmbeddingFunction({
openai_api_key: process.env.OPENAI_API_KEY!,
openai_model: 'text-embedding-3-small'
})
const collection = await client.getOrCreateCollection({
name: 'documents',
embeddingFunction
})
Adding Documents
// Add with automatic embedding
await collection.add({
ids: ['doc-1', 'doc-2', 'doc-3'],
documents: [
'Vector databases store embeddings for similarity search',
'Chroma is an open-source embedding database',
'Machine learning models create embeddings from text'
],
metadatas: [
{ source: 'tutorial', category: 'databases' },
{ source: 'documentation', category: 'tools' },
{ source: 'textbook', category: 'ml' }
]
})
// Add with pre-computed embeddings
await collection.add({
ids: ['doc-4'],
embeddings: [[0.1, -0.2, 0.3, ...]], // Your pre-computed embedding
metadatas: [{ source: 'api' }]
})
Querying
// Query with text (auto-embedded)
const results = await collection.query({
queryTexts: ['How do I store vectors?'],
nResults: 5
})
console.log(results)
// {
// ids: [['doc-1', 'doc-2', ...]],
// documents: [['Vector databases store...', 'Chroma is...']],
// metadatas: [[{ source: 'tutorial' }, ...]],
// distances: [[0.23, 0.45, ...]]
// }
// Query with a pre-computed embedding (queryEmbedding comes from your model)
const embeddingResults = await collection.query({
  queryEmbeddings: [queryEmbedding],
  nResults: 5
})
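The distances array deserves a note: lower means more similar. By default a Chroma collection uses squared L2 distance (configurable per collection via the hnsw:space metadata key to cosine or ip). For intuition, the two most common metrics look like this:

```typescript
// Squared L2 distance — Chroma's default metric (hnsw:space = 'l2').
function squaredL2(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0)
}

// Cosine distance (1 - cosine similarity) — used when hnsw:space = 'cosine'.
function cosineDistance(a: number[], b: number[]): number {
  const dot = a.reduce((s, ai, i) => s + ai * b[i], 0)
  const normA = Math.sqrt(a.reduce((s, ai) => s + ai * ai, 0))
  const normB = Math.sqrt(b.reduce((s, bi) => s + bi * bi, 0))
  return 1 - dot / (normA * normB)
}

console.log(squaredL2([1, 0], [0, 1]))      // → 2
console.log(cosineDistance([1, 0], [1, 0])) // → 0
```

Because the metrics have different ranges, don't compare raw distances across collections configured with different spaces.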
Filtering
// Filter by metadata
const results = await collection.query({
queryTexts: ['vector search'],
nResults: 10,
where: { category: 'databases' }
})
// Complex filters
const filteredResults = await collection.query({
  queryTexts: ['machine learning'],
  nResults: 10,
  where: {
    $and: [
      { category: { $eq: 'ml' } },
      { source: { $in: ['tutorial', 'documentation'] } }
    ]
  }
})
Supported operators:
- $eq, $ne: Equal, not equal
- $gt, $gte, $lt, $lte: Comparisons
- $in, $nin: In list, not in list
- $and, $or: Logical operators
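To build intuition for how where clauses behave, here is a minimal in-memory evaluator for a subset of the operators above. This is a mental model for the filter grammar, not Chroma's actual implementation:

```typescript
type Metadata = Record<string, string | number | boolean>

// Evaluates a subset of Chroma's where-filter grammar against one
// metadata record. A bare { key: value } is shorthand for { key: { $eq: value } }.
function matchesWhere(meta: Metadata, where: Record<string, any>): boolean {
  return Object.entries(where).every(([key, cond]) => {
    if (key === '$and') return cond.every((c: any) => matchesWhere(meta, c))
    if (key === '$or') return cond.some((c: any) => matchesWhere(meta, c))
    if (typeof cond !== 'object') return meta[key] === cond // shorthand $eq
    if ('$eq' in cond) return meta[key] === cond.$eq
    if ('$ne' in cond) return meta[key] !== cond.$ne
    if ('$in' in cond) return cond.$in.includes(meta[key])
    if ('$nin' in cond) return !cond.$nin.includes(meta[key])
    throw new Error(`Unsupported operator in ${JSON.stringify(cond)}`)
  })
}

const meta = { category: 'ml', source: 'tutorial' }
console.log(matchesWhere(meta, { category: 'ml' })) // → true
console.log(matchesWhere(meta, {
  $and: [{ category: { $eq: 'ml' } }, { source: { $in: ['tutorial'] } }]
})) // → true
```

Keys at one level combine with AND semantics; explicit $or is how you express alternatives.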
Updating Documents
// Update documents
await collection.update({
ids: ['doc-1'],
documents: ['Updated document text'],
metadatas: [{ category: 'updated' }]
})
Deleting Documents
// Delete by IDs
await collection.delete({
ids: ['doc-1', 'doc-2']
})
// Delete by filter
await collection.delete({
where: { category: 'deprecated' }
})
Getting Documents by ID
const docs = await collection.get({
ids: ['doc-1', 'doc-2'],
include: ['documents', 'metadatas', 'embeddings']
})
7.5 Complete Example: Document Q&A System
import { ChromaClient, OpenAIEmbeddingFunction } from 'chromadb'
import OpenAI from 'openai'
import dotenv from 'dotenv'
dotenv.config()
const client = new ChromaClient()
const openai = new OpenAI()
interface Document {
id: string
title: string
content: string
source: string
}
class DocumentQA {
private collection: any
async initialize() {
const embeddingFunction = new OpenAIEmbeddingFunction({
openai_api_key: process.env.OPENAI_API_KEY!,
openai_model: 'text-embedding-3-small'
})
this.collection = await client.getOrCreateCollection({
name: 'knowledge-base',
embeddingFunction
})
console.log('DocumentQA initialized')
}
async addDocuments(documents: Document[]) {
// Chunk large documents
const chunkedDocs: { id: string; text: string; metadata: object }[] = []
for (const doc of documents) {
const chunks = this.chunkText(doc.content, 500)
chunks.forEach((chunk, i) => {
chunkedDocs.push({
id: `${doc.id}-chunk-${i}`,
text: chunk,
metadata: {
documentId: doc.id,
title: doc.title,
source: doc.source,
chunkIndex: i
}
})
})
}
// Add to collection
await this.collection.add({
ids: chunkedDocs.map(d => d.id),
documents: chunkedDocs.map(d => d.text),
metadatas: chunkedDocs.map(d => d.metadata)
})
console.log(`Added ${chunkedDocs.length} chunks from ${documents.length} documents`)
}
async askQuestion(question: string): Promise<string> {
// Find relevant chunks
const results = await this.collection.query({
queryTexts: [question],
nResults: 5
})
if (!results.documents[0]?.length) {
return "I don't have enough information to answer that question."
}
// Build context
const context = results.documents[0]
.map((doc: string, i: number) => {
const meta = results.metadatas[0][i]
return `[${meta.title}]: ${doc}`
})
.join('\n\n---\n\n')
// Generate answer
const completion = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based on the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
${context}`
},
{ role: 'user', content: question }
],
temperature: 0.3
})
return completion.choices[0].message.content ?? 'Unable to generate answer'
}
async searchDocuments(query: string, filters?: object) {
const results = await this.collection.query({
queryTexts: [query],
nResults: 10,
where: filters
})
return results.documents[0].map((doc: string, i: number) => ({
content: doc,
metadata: results.metadatas[0][i],
distance: results.distances?.[0][i]
}))
}
private chunkText(text: string, maxWords: number): string[] {
  // Split on sentence boundaries and drop empty fragments
  const sentences = text.split(/[.!?]+/).map(s => s.trim()).filter(Boolean)
  const chunks: string[] = []
  let currentChunk: string[] = []
  let wordCount = 0
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/)
    if (wordCount + words.length > maxWords && currentChunk.length > 0) {
      chunks.push(currentChunk.join('. ') + '.')
      currentChunk = []
      wordCount = 0
    }
    currentChunk.push(sentence)
    wordCount += words.length
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join('. ') + '.')
  }
  return chunks
}
}
// Usage
async function main() {
const qa = new DocumentQA()
await qa.initialize()
// Add sample documents
await qa.addDocuments([
{
id: 'doc-1',
title: 'Introduction to Vector Databases',
content: `Vector databases are specialized database systems designed to store
and query high-dimensional vectors. They are essential for AI applications
that need to find similar items based on semantic meaning rather than
exact keyword matches. Common use cases include semantic search,
recommendation systems, and RAG applications.`,
source: 'tutorial'
},
{
id: 'doc-2',
title: 'Chroma Overview',
content: `Chroma is an open-source embedding database that makes it easy to
build AI applications. It supports automatic embedding generation,
metadata filtering, and persistence. Chroma is particularly well-suited
for local development and prototyping due to its simple API and
embedded mode capabilities.`,
source: 'documentation'
}
])
// Ask questions
const answer1 = await qa.askQuestion('What are vector databases used for?')
console.log('Q: What are vector databases used for?')
console.log('A:', answer1)
const answer2 = await qa.askQuestion('What makes Chroma good for development?')
console.log('\nQ: What makes Chroma good for development?')
console.log('A:', answer2)
}
main().catch(console.error)
7.6 Persistence
In-Memory (Python, Default)
import chromadb
client = chromadb.Client()  # Ephemeral: data lost on restart
Persistent Storage (Python, Embedded)
# Data persists to disk
client = chromadb.PersistentClient(path='./chroma-data')
For the JavaScript client, persistence is determined by how the server is started, as shown next.
Server Mode
# Start server with persistent storage
chroma run --path ./chroma-data
const client = new ChromaClient({
path: 'http://localhost:8000'
})
7.7 Best Practices
Collection Management
// List collections
const collections = await client.listCollections()
// Delete collection
await client.deleteCollection({ name: 'old-collection' })
// Reset everything (careful!)
await client.reset()
Batch Operations
// Add in batches for large datasets
const batchSize = 5000
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize)
await collection.add({
ids: batch.map(d => d.id),
documents: batch.map(d => d.text),
metadatas: batch.map(d => d.metadata)
})
console.log(`Processed ${Math.min(i + batchSize, documents.length)}/${documents.length}`)
}
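The slice-based loop above generalizes into a small reusable helper (the 5000 figure is a conservative batch size; the exact maximum Chroma accepts per call varies by version):

```typescript
// Splits any array into fixed-size batches; reusable for add/upsert loops.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('batch size must be positive')
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

console.log(toBatches([1, 2, 3, 4, 5], 2)) // → [[1,2],[3,4],[5]]

// Hypothetical usage with a collection:
// for (const batch of toBatches(documents, 5000)) {
//   await collection.add({ ids: batch.map(d => d.id), documents: batch.map(d => d.text) })
// }
```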
Error Handling
// Depending on the Chroma version, adding an existing ID either throws
// or is skipped with a warning. For insert-or-update semantics, prefer
// upsert over catching the error:
await collection.upsert({
  ids: ['maybe-duplicate-id'],
  documents: ['Some text']
})
// When using add, still handle failures explicitly:
try {
  await collection.add({
    ids: ['doc-1'],
    documents: ['Some text']
  })
} catch (error) {
  console.error('Add failed:', (error as Error).message)
}
Key Takeaways
- Chroma is simple—perfect for getting started quickly
- Embedded mode means no server to manage
- Built-in embedding makes prototyping fast
- Persistence keeps your data between restarts
- Best for local dev, small deployments, and learning
Exercise: Build a Note Search App
- Create a Chroma collection for personal notes
- Add 20+ sample notes with categories
- Implement semantic search
- Add filtering by category and date
// Starter template
const notes = [
{
id: 'note-1',
content: 'Meeting notes from the product review...',
category: 'work',
date: '2024-01-15'
},
// Add more notes...
]
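For the date filter, note that Chroma metadata filters compare numbers, not date strings, so a common trick is to store an epoch timestamp alongside the ISO string. A sketch (the dateMs field name is my own):

```typescript
// Store dates as epoch milliseconds so $gte/$lt range filters work.
function toNoteMetadata(category: string, isoDate: string) {
  return {
    category,
    date: isoDate,                       // human-readable
    dateMs: new Date(isoDate).getTime()  // numeric, range-filterable
  }
}

// Example where-clause for "work notes from January 2024":
const where = {
  $and: [
    { category: { $eq: 'work' } },
    { dateMs: { $gte: new Date('2024-01-01').getTime() } },
    { dateMs: { $lt: new Date('2024-02-01').getTime() } }
  ]
}
console.log(toNoteMetadata('work', '2024-01-15'))
```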
Next up: Module 8 - Indexing Strategies

