Lesson 4.1: Normalization vs Denormalization
Introduction
Schema design is where many AI systems go wrong. A schema that works well for 10k rows collapses at 10M rows. A schema optimized for writes becomes unqueryable. A schema designed for one use case breaks when ML engineers need ad-hoc analytics.
This lesson teaches you how to balance normalization and denormalization for real-world AI workloads.
The Normalization Spectrum
Highly Normalized (3NF):
-- Separate tables, no redundancy
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL
);
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
name TEXT,
bio TEXT
);
CREATE TABLE user_preferences (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
theme TEXT,
language TEXT
);
Pros:
- No data redundancy
- Easy to update (single source of truth)
- Smaller total storage
Cons:
- Requires joins for common queries (see the example below)
- Slower reads (three-table JOIN for a full profile)
- More complex queries to write and maintain
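For instance, assembling one user's full profile under this layout takes a three-table join over the tables defined above:
-- Fetch one user's profile and preferences: two joins for a single logical record
SELECT u.email, p.name, p.bio, pref.theme, pref.language
FROM users u
LEFT JOIN user_profiles p ON p.user_id = u.id
LEFT JOIN user_preferences pref ON pref.user_id = u.id
WHERE u.id = $1;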
Denormalized:
-- Everything in one table
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
name TEXT,
bio TEXT,
theme TEXT,
language TEXT,
last_login TIMESTAMPTZ,
order_count INT,
total_spent NUMERIC
);
Pros:
- Fast reads (no joins)
- Simple queries
- Single table scans
Cons:
- Data redundancy
- Update anomalies: order_count and total_spent must be recalculated on every write (see the sketch below)
- Larger storage
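To make the trade-off concrete, here is a minimal sketch using the denormalized users table above; $1 and $2 stand in for the user id and the new order's total:
-- Read path: one primary-key lookup, no joins
SELECT name, theme, language, order_count, total_spent
FROM users
WHERE id = $1;

-- Write path: every new order must also update the users row,
-- or order_count and total_spent silently drift out of date
UPDATE users
SET order_count = order_count + 1,
    total_spent = total_spent + $2
WHERE id = $1;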
The AI Era Sweet Spot
Hybrid approach:
-- Core entities: Normalized
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Hot data: Denormalized for speed
CREATE TABLE user_features (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
-- Computed features (denormalized)
order_count_7d INT,
order_count_30d INT,
avg_order_value NUMERIC,
last_order_date DATE,
-- Refreshed by a background job (see the refresh sketch below)
computed_at TIMESTAMPTZ DEFAULT NOW()
);
-- Historical data: Normalized
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
total NUMERIC NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
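One way to keep user_features current is a scheduled job that recomputes the aggregates from orders and upserts them. A minimal sketch, assuming PostgreSQL's INSERT ... ON CONFLICT and FILTER syntax:
-- Recompute rolling aggregates from orders and upsert into user_features
INSERT INTO user_features (user_id, order_count_7d, order_count_30d,
                           avg_order_value, last_order_date, computed_at)
SELECT
    o.user_id,
    COUNT(*) FILTER (WHERE o.created_at >= NOW() - INTERVAL '7 days'),
    COUNT(*) FILTER (WHERE o.created_at >= NOW() - INTERVAL '30 days'),
    AVG(o.total),
    MAX(o.created_at)::date,
    NOW()
FROM orders o
GROUP BY o.user_id
ON CONFLICT (user_id) DO UPDATE SET
    order_count_7d  = EXCLUDED.order_count_7d,
    order_count_30d = EXCLUDED.order_count_30d,
    avg_order_value = EXCLUDED.avg_order_value,
    last_order_date = EXCLUDED.last_order_date,
    computed_at     = EXCLUDED.computed_at;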
Strategy:
- Transactional data: Normalized (users, orders, products)
- ML features: Denormalized (pre-computed for speed; see the lookup example below)
- Logs/events: Semi-normalized (append-only)
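At inference time, the serving path can then be a single primary-key lookup against the denormalized feature table, with no joins or aggregation on the hot path:
-- Feature lookup at inference time: one row, no joins, no aggregation
SELECT order_count_7d, order_count_30d, avg_order_value, last_order_date
FROM user_features
WHERE user_id = $1;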
Key Takeaways
- Full normalization (3NF) minimizes redundancy but requires joins
- Full denormalization speeds reads but causes update anomalies
- AI systems benefit from a hybrid approach: normalize transactions, denormalize features
- Pre-compute ML features in denormalized tables for fast inference
- Choose normalization level based on read/write patterns
- Transactional data stays normalized, analytical data gets denormalized

