Lesson 2.2: B-Trees and Hash Indexes

The Fundamental Data Structure: B-Tree

B-trees are the reason databases are fast. Understanding them is critical.

What is a B-Tree?

A balanced tree structure that keeps data sorted and allows searches, insertions, and deletions in O(log n) time.

Structure:

                [50 | 100]              ← Root (level 0)
               /    |    \
              /     |     \
     [10|20|30]  [60|70]  [110|120]    ← Internal nodes (level 1)
       / | | \    / |  \    / |  \
     [...rows...] [...rows...] [...]   ← Leaf nodes (level 2, actual data pointers)

Key properties:

Each node has multiple keys (typically 100-1000)
Tree height stays small (4 levels = 1 billion rows)
Nodes are sized to match disk pages (8KB in Postgres)

Example: Finding id = 75

1. Read root node: [50 | 100] → Go right (75 > 50, 75 < 100)
2. Read node: [60 | 70] → Go right (75 > 70)
3. Read leaf node → Find row
Total disk reads: 3 (usually all cached after first query)

Index Types in PostgreSQL

B-Tree Index (Default)

-- Create index
CREATE INDEX users_email_idx ON users(email);

-- How it works
-- Index stores: [email → row_location]
-- Sorted: 'alice@...' → 'bob@...' → 'charlie@...'

-- Fast queries
SELECT * FROM users WHERE email = 'alice@example.com';  -- Index seek
SELECT * FROM users WHERE email LIKE 'alice%';  -- Index range scan
SELECT * FROM users ORDER BY email LIMIT 10;  -- Index ordered scan

-- Slow queries (can't use index)
SELECT * FROM users WHERE email LIKE '%alice%';  -- Full table scan

Hash Index

-- Create hash index
CREATE INDEX users_email_hash_idx ON users USING HASH(email);

-- Use cases
-- Only supports equality (=), not ranges or LIKE
-- Slightly faster than B-tree for exact matches
-- Smaller index size

-- Fast
SELECT * FROM users WHERE email = 'alice@example.com';

-- Can't use hash index
SELECT * FROM users WHERE email > 'alice@example.com';  -- Needs ordering

When to use hash indexes:

Equality lookups only (no ranges, no sorting)
Very large text columns (UUIDs, hashes)
Space-constrained environments

Most of the time, stick with B-tree.

How Indexes Speed Up Queries

Without index:

-- Full table scan
SELECT * FROM users WHERE email = 'alice@example.com';

-- Postgres scans every row (1 million rows)
-- Read all pages from disk: 8KB × 125,000 pages = 1GB read
-- Time: ~2-5 seconds

With index:

-- Index seek
CREATE INDEX users_email_idx ON users(email);
SELECT * FROM users WHERE email = 'alice@example.com';

-- Postgres:
-- 1. Reads 3-4 index pages (B-tree traversal)
-- 2. Gets row location
-- 3. Reads 1 heap page (actual row)
-- Total: 4-5 page reads = ~40KB
-- Time: 1-2ms (500x faster)

Key Takeaways

B-trees keep data sorted and enable O(log n) searches
B-tree height stays small even with billions of rows (typically 3-4 levels)
B-tree indexes support equality, ranges, sorting, and prefix matching
Hash indexes only support equality but are slightly faster for exact matches
Indexes dramatically reduce disk I/O by avoiding full table scans
Most queries should use B-tree indexes as the default choice