Lesson 5.1: Identifying Slow Query Patterns

Introduction

AI systems push databases to their limits. Feature stores need to serve 10k reads/second. ML pipelines write millions of rows/hour. Real-time inference can't tolerate slow queries.

This lesson teaches you how to identify performance bottlenecks.

Using pg_stat_statements

-- Enable extension (add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slowest queries
SELECT
    query,
    calls,
    total_exec_time / 1000 AS total_seconds,
    mean_exec_time AS avg_ms,
    max_exec_time AS max_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Common Slow Patterns

1. Missing indexes:

-- Full table scan
SELECT * FROM users WHERE email = 'test@example.com';

-- Fix: Add index
CREATE INDEX users_email_idx ON users(email);

2. N+1 queries:

# Bad: N+1 queries
users = db.execute("SELECT id, name FROM users")
for user in users:
    orders = db.execute(f"SELECT * FROM orders WHERE user_id = {user.id}")  # N queries!

# Good: Single JOIN
SELECT u.id, u.name, o.id AS order_id, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

3. Unbounded queries:

-- Bad: Returns entire table
SELECT * FROM model_predictions WHERE model_id = 'gpt-4';  -- 10M rows!

-- Good: Add LIMIT and WHERE created_at
SELECT * FROM model_predictions
WHERE model_id = 'gpt-4'
    AND created_at > NOW() - INTERVAL '7 days'
LIMIT 1000;

4. Function in WHERE clause:

-- Bad: Can't use index
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- Good: Expression index or store normalized
CREATE INDEX users_email_lower_idx ON users(LOWER(email));

EXPLAIN Deep Dive

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT d.content, e.vector <=> query_vector AS dist
FROM documents d
JOIN embeddings e ON d.id = e.document_id
WHERE d.user_id = 123
ORDER BY dist
LIMIT 10;

Key metrics:

actual time: Real execution time
rows: Estimated vs actual row counts
Buffers: shared hit: Cache hits (good)
Buffers: shared read: Disk reads (slow)
Heap Fetches: Rows read from table (minimize)

Key Takeaways

Enable pg_stat_statements to track query performance
Common slow patterns: missing indexes, N+1 queries, unbounded queries, functions in WHERE
Use EXPLAIN ANALYZE to diagnose slow queries
Focus on queries with high total_exec_time (calls × avg_time)
Buffer metrics reveal I/O bottlenecks
Expression indexes solve function-in-WHERE problems

Using pg_stat_statements

-- Enable extension (add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements') CREATE EXTENSION IF NOT EXISTS pg_stat_statements; -- Find slowest queries SELECT query, calls, total_exec_time / 1000 AS total_seconds, mean_exec_time AS avg_ms, max_exec_time AS max_ms FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

Common Slow Patterns

1. Missing indexes:

-- Full table scan SELECT * FROM users WHERE email = 'test@example.com'; -- Fix: Add index CREATE INDEX users_email_idx ON users(email);

2. N+1 queries:

# Bad: N+1 queries users = db.execute("SELECT id, name FROM users") for user in users: orders = db.execute(f"SELECT * FROM orders WHERE user_id = {user.id}") # N queries! # Good: Single JOIN SELECT u.id, u.name, o.id AS order_id, o.total FROM users u LEFT JOIN orders o ON u.id = o.user_id;

3. Unbounded queries:

-- Bad: Returns entire table SELECT * FROM model_predictions WHERE model_id = 'gpt-4'; -- 10M rows! -- Good: Add LIMIT and WHERE created_at SELECT * FROM model_predictions WHERE model_id = 'gpt-4' AND created_at > NOW() - INTERVAL '7 days' LIMIT 1000;

4. Function in WHERE clause:

-- Bad: Can't use index SELECT * FROM users WHERE LOWER(email) = 'test@example.com'; -- Good: Expression index or store normalized CREATE INDEX users_email_lower_idx ON users(LOWER(email));

EXPLAIN Deep Dive

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT d.content, e.vector <=> query_vector AS dist FROM documents d JOIN embeddings e ON d.id = e.document_id WHERE d.user_id = 123 ORDER BY dist LIMIT 10;

Key metrics:

actual time: Real execution time

rows: Estimated vs actual row counts

Buffers: shared hit: Cache hits (good)

Buffers: shared read: Disk reads (slow)

Heap Fetches: Rows read from table (minimize)

Key Takeaways

Enable pg_stat_statements to track query performance

Common slow patterns: missing indexes, N+1 queries, unbounded queries, functions in WHERE

Use EXPLAIN ANALYZE to diagnose slow queries

Focus on queries with high total_exec_time (calls × avg_time)

Buffer metrics reveal I/O bottlenecks

Expression indexes solve function-in-WHERE problems