Lesson 2.1: Storage Engines - How Data Lives on Disk
Introduction
Ever wondered what actually happens when you run a SQL query? Why does SELECT * FROM users WHERE id = 123 return in 2ms, but SELECT * FROM logs WHERE created_at > '2024-01-01' takes 30 seconds?
The answer lies in understanding how databases work under the hood: storage engines, indexes, query planners, and caching.
What Is a Storage Engine?
A storage engine is the component that manages how data is physically stored and retrieved from disk. Different engines optimize for different workloads.
Common Storage Engine Types
Heap Storage (Row-Based)
- How it works: Records stored in insertion order, no clustering
- Used by: PostgreSQL (default), older MySQL versions
- Best for: OLTP workloads with random lookups
- Example:
-- Data stored sequentially as inserted
INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
-- Physically stored: [1, 'Alice', 'alice@example.com'] → Block 1
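To make the "insertion order, no clustering" point concrete, here is a toy Python sketch of a heap file (all names and the tiny block size are illustrative, not PostgreSQL internals): rows are appended to the last block, and without an index a lookup must scan every block.

```python
BLOCK_SIZE = 2  # rows per block; tiny on purpose for demonstration

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # list of blocks, each a list of row tuples

    def insert(self, row):
        if len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])       # start a new block when the last is full
        self.blocks[-1].append(row)      # rows land in insertion order

    def seq_scan(self, predicate):
        # Without an index, every block must be read: O(n)
        return [row for block in self.blocks for row in block if predicate(row)]

heap = HeapFile()
heap.insert((1, "Alice", "alice@example.com"))
heap.insert((2, "Bob", "bob@example.com"))
heap.insert((3, "Carol", "carol@example.com"))

print(heap.seq_scan(lambda r: r[0] == 3))  # scans all blocks to find id=3
```

This is why heap tables pair naturally with separate B-tree indexes: the index maps `id` to a block location so the sequential scan can be skipped.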
B-Tree Clustered Storage
- How it works: Data sorted by primary key on disk
- Used by: MySQL InnoDB (default), SQL Server
- Best for: Range queries on primary key
- Example:
-- InnoDB stores rows sorted by id
-- Disk layout: id=1→id=2→id=3... (physically ordered)
SELECT * FROM users WHERE id BETWEEN 100 AND 200; -- Very fast
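Why is that `BETWEEN` query fast? Because the rows are physically sorted by `id`, a range query is one binary search plus one contiguous read. A minimal Python sketch of the idea (real InnoDB stores rows in B+tree leaf pages, but the access pattern is the same):

```python
import bisect

# Rows kept physically sorted by primary key, as in clustered storage
rows = sorted((i, f"user{i}") for i in range(1, 1001))
keys = [r[0] for r in rows]

def range_query(lo, hi):
    start = bisect.bisect_left(keys, lo)   # O(log n) to find the first match
    end = bisect.bisect_right(keys, hi)    # matching rows are adjacent on disk
    return rows[start:end]                 # one sequential read, no scattered I/O

result = range_query(100, 200)
print(len(result))  # 101 rows, read as one contiguous run
```

With heap storage, the same 101 rows could be scattered across 101 different blocks.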
LSM Tree (Log-Structured Merge)
- How it works: Writes buffered in memory, merged to disk periodically
- Used by: Cassandra, RocksDB, some modern SQL variants
- Best for: Write-heavy workloads (logs, time-series)
- Tradeoff: Slower reads, faster writes
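The write/read tradeoff above can be sketched in a few lines of Python. Writes go to an in-memory memtable (fast, no disk seek); when it fills, it is flushed as an immutable sorted "SSTable". Reads must check the memtable and then every SSTable, newest first, which is exactly why reads are slower. Names and the flush threshold here are illustrative, not any real engine's.

```python
class LSMStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # flushed tables, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # O(1) in-memory write
        if len(self.memtable) >= self.limit:
            # Flush: one sequential write of a sorted, immutable table
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then SSTables, newest to oldest
            if key in table:
                return table[key]
        return None

db = LSMStore()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 99)                  # newer version lives in the memtable
print(db.get("a"))  # 99 -- the memtable shadows the flushed SSTable
```

Real engines add background compaction, which merges SSTables so reads touch fewer files; that is the "merged to disk periodically" part.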
Column Store
- How it works: Each column stored separately
- Used by: ClickHouse, BigQuery, Redshift
- Best for: OLAP (aggregations over millions of rows)
- Example:
-- Row store: [id=1,name='Alice',age=30], [id=2,name='Bob',age=25]
-- Column store: [id: 1,2], [name: 'Alice','Bob'], [age: 30,25]
-- This query only reads the 'age' column
SELECT AVG(age) FROM users; -- Extremely fast in column store
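A small Python sketch (with made-up data) shows why that `AVG(age)` is so much cheaper in a column layout: the column store reads one dense array, while the row store must materialize every field of every row.

```python
# Row layout: one record per row, all fields together
row_store = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob",   "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]

# Column layout: the same data, one contiguous array per column
column_store = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [30, 25, 35],
}

# Row store: every row (including unused name/id fields) is touched
avg_row = sum(r["age"] for r in row_store) / len(row_store)

# Column store: only the 'age' array is read
ages = column_store["age"]
avg_col = sum(ages) / len(ages)

print(avg_row, avg_col)  # same answer, very different I/O
```

The dense per-column arrays also compress far better than mixed-type rows, which compounds the I/O savings at OLAP scale.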
PostgreSQL Storage Architecture
MVCC (Multi-Version Concurrency Control)
PostgreSQL doesn't lock rows for reads. Instead, it keeps multiple versions:
-- Transaction 1
BEGIN;
UPDATE users SET balance = 100 WHERE id = 1;
-- Old version still visible to other transactions
-- New version only visible after COMMIT
COMMIT;
How it works:
- Each row version has hidden columns: xmin (the transaction that created it) and xmax (the transaction that deleted it)
- Readers see whichever version is valid for their transaction's snapshot
- VACUUM process cleans up old versions
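The visibility rule above can be sketched as a toy Python check modeled on those hidden xmin/xmax columns. This is a simplification, assuming all transactions are committed; real PostgreSQL visibility rules also account for in-progress and aborted transactions.

```python
INFINITY = float("inf")  # xmax for a live (not yet deleted) version

def visible(version, snapshot_xid):
    """A version is visible if created at or before the snapshot
    and not yet deleted as of the snapshot."""
    xmin, xmax, _ = version
    return xmin <= snapshot_xid < xmax

# The UPDATE from the example: transaction 20 marks the old version
# dead (xmax=20) and writes a new version (xmin=20). Both coexist on disk.
versions = [
    (10, 20, {"id": 1, "balance": 50}),        # old version
    (20, INFINITY, {"id": 1, "balance": 100}), # new version
]

reader_before = [v[2] for v in versions if visible(v, 15)]
reader_after  = [v[2] for v in versions if visible(v, 25)]
print(reader_before)  # old snapshot still sees balance=50
print(reader_after)   # later snapshot sees balance=100
```

Note that both versions remain on disk until VACUUM reclaims the dead one; that is the maintenance cost mentioned below.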
Why this matters for AI:
- Enables high-concurrency reads (ML inference can't block writes)
- Allows long-running analytics without blocking OLTP
- Tradeoff: Requires regular VACUUM maintenance
Storage Layout Example
-- Create a table (the VECTOR type requires the pgvector extension)
CREATE TABLE embeddings (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT NOT NULL,
vector VECTOR(1536),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Insert 1 million rows
INSERT INTO embeddings (document_id, vector)
SELECT
    (random() * 100000)::bigint,
    -- Build a 1536-element random array per the column's dimension.
    -- (The planner may reuse one array for every row; fine for a size demo.)
    (SELECT array_agg(random()) FROM generate_series(1, 1536))::vector(1536)
FROM generate_series(1, 1000000);
-- Check physical size
SELECT pg_size_pretty(pg_total_relation_size('embeddings'));
-- Result: roughly 6 GB (1536 floats × 4 bytes × 1M rows + overhead)
Understanding storage:
- Main table: Actual row data
- TOAST table: Large values (vectors) stored separately
- Indexes: B-tree on id, potentially IVFFlat on vector
- Visibility map: Tracks which pages need VACUUM
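A quick back-of-envelope check of that size estimate, assuming pgvector's 4-byte float storage; the per-tuple overhead figures here are rough approximations, and index size is excluded.

```python
rows = 1_000_000
dims = 1536

vector_bytes = dims * 4 + 8           # 4-byte floats plus a small pgvector header
tuple_overhead = 24 + 8 + 8 + 8       # approx: tuple header + id + document_id + timestamp

total = rows * (vector_bytes + tuple_overhead)
print(f"{total / 1024**3:.1f} GiB")   # in the ballpark of the measured size
```

The vectors dominate: everything except the `vector` column accounts for well under 1% of the total.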
Key Takeaways
- Storage engines determine how data is physically organized on disk
- Heap storage (PostgreSQL) stores rows in insertion order
- B-Tree clustered storage (MySQL InnoDB) stores rows sorted by primary key
- Column stores are optimized for analytics, row stores for transactions
- MVCC enables high concurrency by keeping multiple row versions
- Understanding storage helps explain query performance characteristics

