Principal Component Analysis (PCA)
Modern AI datasets routinely have hundreds or thousands of features. An image is tens of thousands of pixel values. A genomics dataset might have 20,000 gene expression levels per sample. Working with all these dimensions is slow, memory-intensive, and often counterproductive. Principal Component Analysis (PCA) uses the eigenvectors and eigenvalues from the previous lesson to solve this problem, finding the few directions in data that carry nearly all the information.
The Curse of Dimensionality
More features sounds like it should mean better predictions, but in practice the opposite is often true. As dimensions increase:
- Data becomes sparse -- points spread out and pairwise distances become nearly uniform, so they lose much of their discriminative power
- Models need exponentially more training samples to learn reliable patterns
- Overfitting increases because models latch onto noise in high-dimensional spaces
- Training becomes slower and more memory-intensive
This is called the curse of dimensionality, and it is one of the most important practical challenges in machine learning.
PCA Intuition: Finding the Best Viewing Angle
Imagine you are photographing a 3D sculpture. Some angles capture almost everything about its shape in a single flat photo. Other angles are useless -- they hide all the interesting detail. PCA finds the best angles to view your high-dimensional data so that you capture the maximum amount of information in the fewest dimensions.
Mathematically, PCA finds the directions along which your data varies the most. These directions of maximum variance are exactly the eigenvectors of the data's covariance matrix.
The PCA Algorithm Step by Step
PCA follows four clear steps:
Step 1: Center the data -- Subtract the mean of each feature so the data is centered at the origin.
Original sample:  [5, 3, 8]
Feature means:    [4, 2, 7]
Centered sample:  [1, 1, 1]
Step 2: Compute the covariance matrix -- This matrix captures how every pair of features varies together. For n features, you get an n x n symmetric matrix.
C = | var(x₁)     cov(x₁,x₂)  cov(x₁,x₃) |
    | cov(x₂,x₁)  var(x₂)     cov(x₂,x₃) |
    | cov(x₃,x₁)  cov(x₃,x₂)  var(x₃)    |
Step 3: Find the eigenvectors and eigenvalues -- The eigenvectors of the covariance matrix are the principal components. The eigenvalues tell you how much variance each component captures.
Step 4: Project the data -- Multiply your centered data by the top eigenvectors to transform it into the new, lower-dimensional space.
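The four steps translate almost directly into NumPy. Below is a minimal sketch, not a production implementation; the small data matrix X and the choice of k = 2 components are made up purely for illustration.

```python
import numpy as np

# Toy data: 6 samples, 3 features (values made up for illustration)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.6],
])

# Step 1: center the data (subtract each feature's mean)
X_centered = X - X.mean(axis=0)

# Step 2: compute the covariance matrix (columns are features -> rowvar=False)
C = np.cov(X_centered, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is meant for symmetric matrices
order = np.argsort(eigenvalues)[::-1]           # sort largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 4: project the centered data onto the top k eigenvectors
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]
print(X_reduced.shape)   # (6, 2)
```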
The Key Insight: Eigenvectors ARE the Principal Components
This is where the previous lesson connects directly. The covariance matrix is symmetric, so its eigenvectors are orthogonal (perpendicular to each other) and its eigenvalues are all real numbers. Each eigenvector points in a direction of variance in the data, and its corresponding eigenvalue measures how much variance that direction captures.
| Component | Eigenvector | Eigenvalue | Meaning |
|---|---|---|---|
| PC1 | v₁ | λ₁ = 45.2 | Direction of greatest variance |
| PC2 | v₂ | λ₂ = 12.8 | Direction of second greatest variance |
| PC3 | v₃ | λ₃ = 1.3 | Direction of least variance |
The eigenvector with the largest eigenvalue is PC1 -- the single most informative direction in the dataset.
Choosing How Many Components to Keep
You rarely need all the components. The explained variance ratio tells you what fraction of the total variance each component captures:
Explained variance ratio for PCₖ = λₖ / (λ₁ + λ₂ + ... + λₙ)
A common strategy is to keep enough components to explain 95% of the total variance. For example:
| Components Kept | Cumulative Variance Explained |
|---|---|
| 1 | 72% |
| 2 | 85% |
| 5 | 93% |
| 10 | 97% |
| 100 (original) | 100% |
In this example, keeping just 10 components in place of the original 100 features preserves 97% of the variance. You have reduced the dimensionality by 90% while losing almost nothing meaningful.
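In code, the explained variance ratio and the 95% rule take only a few lines. The eigenvalue array below is hypothetical and assumed to be already sorted in descending order.

```python
import numpy as np

# Hypothetical eigenvalues, sorted largest first
eigenvalues = np.array([45.2, 12.8, 5.1, 2.4, 1.3, 0.7, 0.3, 0.2])

explained_ratio = eigenvalues / eigenvalues.sum()   # λk / (λ1 + ... + λn)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(explained_ratio.round(3))
print(cumulative.round(3))
print(f"Keep {k} components to explain at least 95% of the variance")
```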
The Scree Plot
A scree plot graphs eigenvalues (or explained variance) against component number. You look for the elbow point where eigenvalues drop sharply and then level off. Components before the elbow carry real signal; components after it are mostly noise.
Variance
 |
 |  *
 |     *
 |        *
 |           *
 |              *  *  *  *  *  *   <-- elbow here (keep ~4 components)
 +---------------------------------
    1  2  3  4  5  6  7  8  9  10    Component
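A scree plot is easy to produce with matplotlib. This sketch uses hypothetical eigenvalues chosen so the elbow falls around the fourth component.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sorted eigenvalues with an elbow around component 4
eigenvalues = np.array([45.2, 20.1, 9.5, 4.0, 1.3, 1.1, 1.0, 0.9, 0.8, 0.8])
components = np.arange(1, len(eigenvalues) + 1)

plt.plot(components, eigenvalues, "o-")
plt.xlabel("Component")
plt.ylabel("Eigenvalue (variance captured)")
plt.title("Scree plot")
plt.xticks(components)
plt.show()
```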
Concrete Example: From 100 Features to 10
Suppose you have a dataset of 10,000 customer records with 100 features (demographics, purchase history, browsing behavior). Training a model directly on 100 features is slow and prone to overfitting.
Apply PCA:
- Center the 10,000 x 100 data matrix
- Compute the 100 x 100 covariance matrix
- Find 100 eigenvectors and eigenvalues
- The top 10 eigenvalues account for 95% of total variance
- Project data onto these 10 eigenvectors, creating a 10,000 x 10 matrix
The result: 10 features instead of 100, training is roughly 10x faster, and the model often generalizes better because much of the noise has been stripped away.
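In practice you would rarely run the eigendecomposition by hand; scikit-learn's PCA wraps the whole pipeline. The sketch below uses synthetic low-rank data as a stand-in for the customer records (the random matrices are made up for the example), and passing n_components=0.95 asks PCA to keep just enough components to explain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for 10,000 customers x 100 features:
# 10 latent factors plus a little noise, so ~10 components carry most of the variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(10_000, 10))
mixing = rng.normal(size=(10, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(10_000, 100))

# PCA centers the data internally; n_components=0.95 keeps enough
# components to reach 95% explained variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))
```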
AI Applications of PCA
PCA appears throughout machine learning and data science:
- Feature reduction: Compress high-dimensional data before training classifiers or regression models
- Visualization: Project data to 2D or 3D for human-interpretable plots (scatter plots of clusters)
- Noise removal: Discard low-variance components that represent measurement noise
- Preprocessing for deep learning: Reduce input dimensionality for faster training
- Image compression: Represent images using fewer principal components
- Anomaly detection: Points that are far from the principal component subspace may be outliers
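As a small illustration of the anomaly-detection idea in the last bullet, a point's distance from the principal-component subspace can be measured by its reconstruction error. The data and the 99th-percentile cutoff below are made up for the example; the flagged indices are whatever the model reconstructs worst.

```python
import numpy as np
from sklearn.decomposition import PCA

# Mostly low-rank data plus a few injected outliers (made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 20))   # lies near a 2D subspace
X[:5] += rng.normal(scale=8.0, size=(5, 20))               # 5 anomalous rows

pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Reconstruction error = distance from each point to the PCA subspace
errors = np.linalg.norm(X - X_reconstructed, axis=1)
threshold = np.percentile(errors, 99)   # arbitrary cutoff for the example
print("Flagged as possible outliers:", np.where(errors > threshold)[0])
```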
Summary
- The curse of dimensionality makes high-dimensional data slow, sparse, and prone to overfitting
- PCA finds the directions of maximum variance in data by computing eigenvectors of the covariance matrix
- The four steps are: center data, compute covariance matrix, find eigenvectors, project
- Eigenvalues rank components by how much variance each captures
- The explained variance ratio guides how many components to keep (often 95% threshold)
- A scree plot visualizes the elbow point for choosing components
- PCA is used for feature reduction, visualization, noise removal, and preprocessing across AI
Next, we will expand beyond PCA to see how eigendecomposition and its generalization, Singular Value Decomposition, power some of the most famous AI systems ever built -- from Netflix recommendations to Google PageRank.

