Lesson 3.7: Why Indexing Matters More with AI Pipelines
AI Workload Characteristics
High read concurrency:
- ML inference: 10,000 feature lookups/second
- RAG systems: 1,000 vector searches/second
- A/B tests: thousands of metric queries/hour
Without proper indexes, the database becomes the bottleneck.
Real-World Impact: Feature Store
Scenario: Real-time fraud detection
-- Bad: No indexes (200ms query)
SELECT feature_1, feature_2, ... feature_50
FROM user_features
WHERE user_id = 12345;
-- At 10k requests/second:
-- 200ms × 10k = 2,000 seconds of DB time per wall-clock second
-- Impossible to serve: you would need ~2,000 busy connections
-- Good: Covering index (2ms query)
CREATE INDEX user_features_cover_idx ON user_features(user_id)
INCLUDE (feature_1, feature_2, ... feature_50);
-- At 10k requests/second:
-- 2ms × 10k = 20 seconds of DB time per wall-clock second
-- Requires ~20 database connections (doable)
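To confirm the covering index is actually used, inspect the plan. A minimal sketch, assuming the table and index above exist (the two-column SELECT is just illustrative):

EXPLAIN (ANALYZE, BUFFERS)
SELECT feature_1, feature_2
FROM user_features
WHERE user_id = 12345;
-- Expect "Index Only Scan using user_features_cover_idx"
-- with "Heap Fetches: 0". Nonzero heap fetches usually mean a
-- stale visibility map; run VACUUM on the table to fix it.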
Bulk Write Performance
AI pipelines generate massive writes:
-- Embedding generation: 1M vectors/hour
-- Without index: Fast writes
INSERT INTO embeddings VALUES (...); -- 0.1ms per row
-- With indexes: Slower writes
INSERT INTO embeddings VALUES (...); -- 0.5ms per row (5x slower)
-- Strategy: bulk load without indexes, then build
BEGIN;
-- Note: DROP INDEX takes an exclusive lock on the table,
-- so this transaction blocks other access until COMMIT
DROP INDEX embeddings_vector_idx;
COPY embeddings FROM '/data/embeddings.csv' WITH (FORMAT csv);
CREATE INDEX embeddings_vector_idx ON embeddings
    USING hnsw (vector vector_cosine_ops);
COMMIT;
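For million-row loads, the rebuild step dominates. A hedged tweak using two standard PostgreSQL settings (the values below are illustrative, and recent pgvector versions can parallelize HNSW builds):

-- Give the index build more memory and workers before CREATE INDEX
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4;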
Index Strategy for AI Systems
1. Feature stores (read-heavy):
-- Covering index for fast lookups
CREATE INDEX features_cover_idx ON user_features(user_id)
INCLUDE (feature_vector, computed_at);
2. Model logs (write-heavy):
-- Partial index on recent data only. The predicate must be a
-- constant (PostgreSQL rejects the non-immutable NOW() here),
-- so use a fixed cutoff and rotate it; see the sketch after this list.
CREATE INDEX logs_recent_idx ON model_logs(model_id, created_at)
WHERE created_at > '2025-06-01 00:00:00+00';  -- illustrative cutoff
3. RAG systems (hybrid):
-- B-tree on metadata
CREATE INDEX docs_user_cat_idx ON documents(user_id, category);
-- HNSW on vectors (an example hybrid query follows this list)
CREATE INDEX embeddings_hnsw_idx ON embeddings USING hnsw(vector vector_cosine_ops);
4. Training data (read-once):
-- Minimal indexing (data read sequentially)
-- Only index query columns, not training features
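Two sketches tie the strategies above together. First, because the partial-index cutoff in strategy 2 is frozen at creation time, the index must be rebuilt as time advances; a hypothetical maintenance job (names and cutoff are illustrative) can rotate it without blocking writers:

-- Run outside a transaction: CONCURRENTLY cannot be used inside BEGIN/COMMIT
CREATE INDEX CONCURRENTLY logs_recent_idx_new
    ON model_logs(model_id, created_at)
    WHERE created_at > '2025-06-08 00:00:00+00';  -- fresh cutoff
DROP INDEX CONCURRENTLY logs_recent_idx;
ALTER INDEX logs_recent_idx_new RENAME TO logs_recent_idx;

Second, a sketch of the hybrid RAG query from strategy 3, assuming a documents table keyed by id and an embeddings table with a document_id column (the schema and the toy 3-dimensional query vector are assumptions):

-- B-tree index narrows by metadata; HNSW index orders by similarity
SELECT d.id, d.title
FROM documents d
JOIN embeddings e ON e.document_id = d.id
WHERE d.user_id = 42
  AND d.category = 'support'
ORDER BY e.vector <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
-- Caveat: selective filters can keep the planner off the ANN index;
-- always verify the plan with EXPLAIN ANALYZE.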
Module 3 Summary
Key Takeaways
- B-trees are the foundation—understand page structure and traversal
- Covering indexes eliminate heap fetches (25x faster for some queries)
- Partial indexes reduce index size by indexing only relevant rows
- Expression indexes enable indexing computed values and transformations
- GIN indexes excel for full-text search, JSONB, and arrays
- Vector indexes (IVFFlat/HNSW) are essential for similarity search
- Hybrid queries benefit from combining B-tree and vector indexes
- AI workloads amplify the importance of proper indexing
Indexing Checklist
For any query:
- WHERE clause columns indexed
- JOIN columns indexed
- Frequently fetched columns in INCLUDE clause (covering index)
- Common filter predicates in the index WHERE clause (partial index)
- Computed expressions indexed (expression index)
- JSONB queries use GIN
- Vector queries use IVFFlat/HNSW
- Verified with EXPLAIN ANALYZE
What's Next
Module 4 covers schema design for AI systems:
- Normalization vs denormalization tradeoffs
- How AI pipelines break naive schemas
- JSONB and semi-structured data
- Designing for scale and performance
Continue to Module 4: Designing Robust SQL Schemas

