Lesson 2.1: Storage Engines - How Data Lives on Disk
Introduction
Ever wondered what actually happens when you run a SQL query? Why does SELECT * FROM users WHERE id = 123 return in 2ms, but SELECT * FROM logs WHERE created_at > '2024-01-01' takes 30 seconds?
The answer lies in understanding how databases work under the hood: storage engines, indexes, query planners, and caching.
What Is a Storage Engine?
A storage engine is the component that manages how data is physically stored and retrieved from disk. Different engines optimize for different workloads.
Common Storage Engine Types
Heap Storage (Row-Based)
- How it works: Records stored in insertion order, no clustering
- Used by: PostgreSQL (default), older MySQL versions
- Best for: OLTP workloads with random lookups
- Example:
-- Data stored sequentially as inserted
INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
-- Physically stored: [1, 'Alice', 'alice@example.com'] → Block 1
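To make the "insertion order, no clustering" point concrete, here is a toy Python sketch of a heap file (all names and the tiny block size are illustrative, not PostgreSQL internals): rows are appended to the last block, and without an index a lookup must scan every block.

```python
BLOCK_SIZE = 2  # rows per block; tiny on purpose for demonstration

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # list of blocks, each a list of row tuples

    def insert(self, row):
        if len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])       # start a new block when the last is full
        self.blocks[-1].append(row)      # rows land in insertion order

    def seq_scan(self, predicate):
        # Without an index, every block must be read: O(n)
        return [row for block in self.blocks for row in block if predicate(row)]

heap = HeapFile()
heap.insert((1, "Alice", "alice@example.com"))
heap.insert((2, "Bob", "bob@example.com"))
heap.insert((3, "Carol", "carol@example.com"))

print(heap.seq_scan(lambda r: r[0] == 3))  # scans all blocks to find id=3
```

This is why heap tables pair naturally with separate B-tree indexes: the index maps `id` to a block location so the sequential scan can be skipped.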
B-Tree Clustered Storage
- How it works: Data sorted by primary key on disk
- Used by: MySQL InnoDB (default), SQL Server
- Best for: Range queries on primary key
- Example:
-- InnoDB stores rows sorted by id
-- Disk layout: id=1→id=2→id=3... (physically ordered)
SELECT * FROM users WHERE id BETWEEN 100 AND 200; -- Very fast
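Why is that `BETWEEN` query fast? Because the rows are physically sorted by `id`, a range query is one binary search plus one contiguous read. A minimal Python sketch of the idea (real InnoDB stores rows in B+tree leaf pages, but the access pattern is the same):

```python
import bisect

# Rows kept physically sorted by primary key, as in clustered storage
rows = sorted((i, f"user{i}") for i in range(1, 1001))
keys = [r[0] for r in rows]

def range_query(lo, hi):
    start = bisect.bisect_left(keys, lo)   # O(log n) to find the first match
    end = bisect.bisect_right(keys, hi)    # matching rows are adjacent on disk
    return rows[start:end]                 # one sequential read, no scattered I/O

result = range_query(100, 200)
print(len(result))  # 101 rows, read as one contiguous run
```

With heap storage, the same 101 rows could be scattered across 101 different blocks.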
LSM Tree (Log-Structured Merge)
- How it works: Writes buffered in memory, merged to disk periodically
- Used by: Cassandra, RocksDB, some modern SQL variants
- Best for: Write-heavy workloads (logs, time-series)
- Tradeoff: Slower reads, faster writes
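The write/read tradeoff above can be sketched in a few lines of Python. Writes go to an in-memory memtable (fast, no disk seek); when it fills, it is flushed as an immutable sorted "SSTable". Reads must check the memtable and then every SSTable, newest first, which is exactly why reads are slower. Names and the flush threshold here are illustrative, not any real engine's.

```python
class LSMStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # flushed tables, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # O(1) in-memory write
        if len(self.memtable) >= self.limit:
            # Flush: one sequential write of a sorted, immutable table
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then SSTables, newest to oldest
            if key in table:
                return table[key]
        return None

db = LSMStore()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 99)                  # newer version lives in the memtable
print(db.get("a"))  # 99 -- the memtable shadows the flushed SSTable
```

Real engines add background compaction, which merges SSTables so reads touch fewer files; that is the "merged to disk periodically" part.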
Column Store
- How it works: Each column stored separately
- Used by: ClickHouse, BigQuery, Redshift
- Best for: OLAP (aggregations over millions of rows)
- Example:
-- Row store: [id=1,name='Alice',age=30], [id=2,name='Bob',age=25]
-- Column store: [id: 1,2], [name: 'Alice','Bob'], [age: 30,25]
-- This query only reads the 'age' column
SELECT AVG(age) FROM users; -- Extremely fast in column store
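A small Python sketch (with made-up data) shows why that `AVG(age)` is so much cheaper in a column layout: the column store reads one dense array, while the row store must materialize every field of every row.

```python
# Row layout: one record per row, all fields together
row_store = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob",   "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]

# Column layout: the same data, one contiguous array per column
column_store = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [30, 25, 35],
}

# Row store: every row (including unused name/id fields) is touched
avg_row = sum(r["age"] for r in row_store) / len(row_store)

# Column store: only the 'age' array is read
ages = column_store["age"]
avg_col = sum(ages) / len(ages)

print(avg_row, avg_col)  # same answer, very different I/O
```

The dense per-column arrays also compress far better than mixed-type rows, which compounds the I/O savings at OLAP scale.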
PostgreSQL Storage Architecture
MVCC (Multi-Version Concurrency Control)
PostgreSQL doesn't lock rows for reads. Instead, it keeps multiple versions:
-- Transaction 1
BEGIN;
UPDATE users SET balance = 100 WHERE id = 1;
-- Old version still visible to other transactions
-- New version only visible after COMMIT
COMMIT;
How it works:
- Each row version has hidden columns: xmin (the transaction that created it) and xmax (the transaction that deleted it)
- Readers see whichever version is valid for their transaction's snapshot
- VACUUM process cleans up old versions
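The visibility rule above can be sketched as a toy Python check modeled on those hidden xmin/xmax columns. This is a simplification, assuming all transactions are committed; real PostgreSQL visibility rules also account for in-progress and aborted transactions.

```python
INFINITY = float("inf")  # xmax for a live (not yet deleted) version

def visible(version, snapshot_xid):
    """A version is visible if created at or before the snapshot
    and not yet deleted as of the snapshot."""
    xmin, xmax, _ = version
    return xmin <= snapshot_xid < xmax

# The UPDATE from the example: transaction 20 marks the old version
# dead (xmax=20) and writes a new version (xmin=20). Both coexist on disk.
versions = [
    (10, 20, {"id": 1, "balance": 50}),        # old version
    (20, INFINITY, {"id": 1, "balance": 100}), # new version
]

reader_before = [v[2] for v in versions if visible(v, 15)]
reader_after  = [v[2] for v in versions if visible(v, 25)]
print(reader_before)  # old snapshot still sees balance=50
print(reader_after)   # later snapshot sees balance=100
```

Note that both versions remain on disk until VACUUM reclaims the dead one; that is the maintenance cost mentioned below.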
Why this matters for AI:
- Enables high-concurrency reads (ML inference can't block writes)
- Allows long-running analytics without blocking OLTP
- Tradeoff: Requires regular VACUUM maintenance
Storage Layout Example
-- Create a table (the VECTOR type requires the pgvector extension)
CREATE TABLE embeddings (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT NOT NULL,
vector VECTOR(1536),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Insert 1 million rows
INSERT INTO embeddings (document_id, vector)
SELECT
    (random() * 100000)::bigint,
    -- Build a 1536-element random array per the column's dimension.
    -- (The planner may reuse one array for every row; fine for a size demo.)
    (SELECT array_agg(random()) FROM generate_series(1, 1536))::vector(1536)
FROM generate_series(1, 1000000);
-- Check physical size
SELECT pg_size_pretty(pg_total_relation_size('embeddings'));
-- Result: roughly 6 GB (1536 floats × 4 bytes × 1M rows + overhead)
Understanding storage:
- Main table: Actual row data
- TOAST table: Large values (vectors) stored separately
- Indexes: B-tree on id, potentially IVFFlat on vector
- Visibility map: Tracks which pages need VACUUM
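A quick back-of-envelope check of that size estimate, assuming pgvector's 4-byte float storage; the per-tuple overhead figures here are rough approximations, and index size is excluded.

```python
rows = 1_000_000
dims = 1536

vector_bytes = dims * 4 + 8           # 4-byte floats plus a small pgvector header
tuple_overhead = 24 + 8 + 8 + 8       # approx: tuple header + id + document_id + timestamp

total = rows * (vector_bytes + tuple_overhead)
print(f"{total / 1024**3:.1f} GiB")   # in the ballpark of the measured size
```

The vectors dominate: everything except the `vector` column accounts for well under 1% of the total.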
Key Takeaways
- Storage engines determine how data is physically organized on disk
- Heap storage (PostgreSQL) stores rows in insertion order
- B-Tree clustered storage (MySQL InnoDB) stores rows sorted by primary key
- Column stores are optimized for analytics, row stores for transactions
- MVCC enables high concurrency by keeping multiple row versions
- Understanding storage helps explain query performance characteristics

