Principal Component Analysis (PCA)
Modern AI datasets routinely have hundreds or thousands of features. An image is tens of thousands of pixel values. A genomics dataset might have 20,000 gene expression levels per sample. Working with all these dimensions is slow, memory-intensive, and often counterproductive. Principal Component Analysis (PCA) uses the eigenvectors and eigenvalues from the previous lesson to solve this problem, finding the few directions in data that carry nearly all the information.
The Curse of Dimensionality
More features sounds like it should mean better predictions, but in practice the opposite is often true. As dimensions increase:
- Data becomes sparse -- points spread out and pairwise distances become nearly uniform, so they lose much of their discriminative power
- Models need exponentially more training samples to learn reliable patterns
- Overfitting increases because models latch onto noise in high-dimensional spaces
- Training becomes slower and more memory-intensive
This is called the curse of dimensionality, and it is one of the most important practical challenges in machine learning.
PCA Intuition: Finding the Best Viewing Angle
Imagine you are photographing a 3D sculpture. Some angles capture almost everything about its shape in a single flat photo. Other angles are useless -- they hide all the interesting detail. PCA finds the best angles to view your high-dimensional data so that you capture the maximum amount of information in the fewest dimensions.
Mathematically, PCA finds the directions along which your data varies the most. These directions of maximum variance are exactly the eigenvectors of the data's covariance matrix.
The PCA Algorithm Step by Step
PCA follows four clear steps:
Step 1: Center the data -- Subtract the mean of each feature so the data is centered at the origin.
Original sample:  [5, 3, 8]
Feature means:    [4, 2, 7]
Centered sample:  [1, 1, 1]
Step 2: Compute the covariance matrix -- This matrix captures how every pair of features varies together. For n features, you get an n x n symmetric matrix.
C = | var(x₁)     cov(x₁,x₂)  cov(x₁,x₃) |
    | cov(x₂,x₁)  var(x₂)     cov(x₂,x₃) |
    | cov(x₃,x₁)  cov(x₃,x₂)  var(x₃)    |
Step 3: Find the eigenvectors and eigenvalues -- The eigenvectors of the covariance matrix are the principal components. The eigenvalues tell you how much variance each component captures.
Step 4: Project the data -- Multiply your centered data by the top eigenvectors to transform it into the new, lower-dimensional space.
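The four steps translate almost directly into NumPy. Below is a minimal sketch, not a production implementation; the small data matrix X and the choice of k = 2 components are made up purely for illustration.

```python
import numpy as np

# Toy data: 6 samples, 3 features (values made up for illustration)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.6],
])

# Step 1: center the data (subtract each feature's mean)
X_centered = X - X.mean(axis=0)

# Step 2: compute the covariance matrix (columns are features -> rowvar=False)
C = np.cov(X_centered, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is meant for symmetric matrices
order = np.argsort(eigenvalues)[::-1]           # sort largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 4: project the centered data onto the top k eigenvectors
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]
print(X_reduced.shape)   # (6, 2)
```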
The Key Insight: Eigenvectors ARE the Principal Components
This is where the previous lesson connects directly. The covariance matrix is symmetric, so its eigenvectors are orthogonal (perpendicular to each other) and its eigenvalues are all real numbers. Each eigenvector points in a direction of variance in the data, and its corresponding eigenvalue measures how much variance that direction captures.
| Component | Eigenvector | Eigenvalue | Meaning |
|---|---|---|---|
| PC1 | v₁ | λ₁ = 45.2 | Direction of greatest variance |
| PC2 | v₂ | λ₂ = 12.8 | Direction of second greatest variance |
| PC3 | v₃ | λ₃ = 1.3 | Direction of least variance |
The eigenvector with the largest eigenvalue is PC1 -- the single most informative direction in the dataset.
Choosing How Many Components to Keep
You rarely need all the components. The explained variance ratio tells you what fraction of the total variance each component captures:
Explained variance ratio for PCₖ = λₖ / (λ₁ + λ₂ + ... + λₙ)
A common strategy is to keep enough components to explain 95% of the total variance. For example:
| Components Kept | Cumulative Variance Explained |
|---|---|
| 1 | 72% |
| 2 | 85% |
| 5 | 93% |
| 10 | 97% |
| 100 (original) | 100% |
In this example, keeping just 10 components in place of the original 100 features preserves 97% of the variance. You have reduced the dimensionality by 90% while losing almost nothing meaningful.
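In code, the explained variance ratio and the 95% rule take only a few lines. The eigenvalue array below is hypothetical and assumed to be already sorted in descending order.

```python
import numpy as np

# Hypothetical eigenvalues, sorted largest first
eigenvalues = np.array([45.2, 12.8, 5.1, 2.4, 1.3, 0.7, 0.3, 0.2])

explained_ratio = eigenvalues / eigenvalues.sum()   # λk / (λ1 + ... + λn)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(explained_ratio.round(3))
print(cumulative.round(3))
print(f"Keep {k} components to explain at least 95% of the variance")
```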
The Scree Plot
A scree plot graphs eigenvalues (or explained variance) against component number. You look for the elbow point where eigenvalues drop sharply and then level off. Components before the elbow carry real signal; components after it are mostly noise.
Variance
 |
 |  *
 |     *
 |        *
 |           *
 |              *  *  *  *  *  *   <-- elbow here (keep ~4 components)
 +---------------------------------
    1  2  3  4  5  6  7  8  9  10    Component
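A scree plot is easy to produce with matplotlib. This sketch uses hypothetical eigenvalues chosen so the elbow falls around the fourth component.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sorted eigenvalues with an elbow around component 4
eigenvalues = np.array([45.2, 20.1, 9.5, 4.0, 1.3, 1.1, 1.0, 0.9, 0.8, 0.8])
components = np.arange(1, len(eigenvalues) + 1)

plt.plot(components, eigenvalues, "o-")
plt.xlabel("Component")
plt.ylabel("Eigenvalue (variance captured)")
plt.title("Scree plot")
plt.xticks(components)
plt.show()
```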
Concrete Example: From 100 Features to 10
Suppose you have a dataset of 10,000 customer records with 100 features (demographics, purchase history, browsing behavior). Training a model directly on 100 features is slow and prone to overfitting.
Apply PCA:
- Center the 10,000 x 100 data matrix
- Compute the 100 x 100 covariance matrix
- Find 100 eigenvectors and eigenvalues
- The top 10 eigenvalues account for 95% of total variance
- Project data onto these 10 eigenvectors, creating a 10,000 x 10 matrix
The result: 10 features instead of 100, training is roughly 10x faster, and the model often generalizes better because much of the noise has been stripped away.
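In practice you would rarely run the eigendecomposition by hand; scikit-learn's PCA wraps the whole pipeline. The sketch below uses synthetic low-rank data as a stand-in for the customer records (the random matrices are made up for the example), and passing n_components=0.95 asks PCA to keep just enough components to explain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for 10,000 customers x 100 features:
# 10 latent factors plus a little noise, so ~10 components carry most of the variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(10_000, 10))
mixing = rng.normal(size=(10, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(10_000, 100))

# PCA centers the data internally; n_components=0.95 keeps enough
# components to reach 95% explained variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))
```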
AI Applications of PCA
PCA appears throughout machine learning and data science:
- Feature reduction: Compress high-dimensional data before training classifiers or regression models
- Visualization: Project data to 2D or 3D for human-interpretable plots (scatter plots of clusters)
- Noise removal: Discard low-variance components that represent measurement noise
- Preprocessing for deep learning: Reduce input dimensionality for faster training
- Image compression: Represent images using fewer principal components
- Anomaly detection: Points that are far from the principal component subspace may be outliers
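As a small illustration of the anomaly-detection idea in the last bullet, a point's distance from the principal-component subspace can be measured by its reconstruction error. The data and the 99th-percentile cutoff below are made up for the example; the flagged indices are whatever the model reconstructs worst.

```python
import numpy as np
from sklearn.decomposition import PCA

# Mostly low-rank data plus a few injected outliers (made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 20))   # lies near a 2D subspace
X[:5] += rng.normal(scale=8.0, size=(5, 20))               # 5 anomalous rows

pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Reconstruction error = distance from each point to the PCA subspace
errors = np.linalg.norm(X - X_reconstructed, axis=1)
threshold = np.percentile(errors, 99)   # arbitrary cutoff for the example
print("Flagged as possible outliers:", np.where(errors > threshold)[0])
```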
Summary
- The curse of dimensionality makes high-dimensional data slow, sparse, and prone to overfitting
- PCA finds the directions of maximum variance in data by computing eigenvectors of the covariance matrix
- The four steps are: center data, compute covariance matrix, find eigenvectors, project
- Eigenvalues rank components by how much variance each captures
- The explained variance ratio guides how many components to keep (often 95% threshold)
- A scree plot visualizes the elbow point for choosing components
- PCA is used for feature reduction, visualization, noise removal, and preprocessing across AI
Next, we will expand beyond PCA to see how eigendecomposition and its generalization, Singular Value Decomposition, power some of the most famous AI systems ever built -- from Netflix recommendations to Google PageRank.

