Computational Efficiency and GPUs
We have seen that transformers and neural networks are built from matrix multiplications -- thousands of them per forward pass, across billions of parameters. The question is no longer "what math does AI use?" but "how do we compute this much math fast enough to be useful?" The answer to this question shapes the entire AI industry.
The Cost of Matrix Multiplication
Multiplying two n x n matrices using the straightforward algorithm requires O(n^3) operations. That means doubling the matrix size increases computation by roughly 8 times.
| Matrix Size | Multiplications | Approximate Time (single core) |
|---|---|---|
| 100 x 100 | 1,000,000 | Instant |
| 1,000 x 1,000 | 1,000,000,000 | A few seconds |
| 10,000 x 10,000 | 1,000,000,000,000 | Hours |
| 100,000 x 100,000 | 1,000,000,000,000,000 | Weeks |
Modern AI models involve matrices with tens of thousands of dimensions, and they must perform these multiplications thousands of times per input, across millions of training examples. On a single CPU core, this would take years.
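To make the O(n^3) growth concrete, here is a small Python sketch (the size and timings are only illustrative, and the naive version is deliberately unoptimized) comparing a textbook triple-loop multiply against NumPy's optimized matmul on the same matrices:

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Textbook O(n^3) multiply: one explicit dot product per output element."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            total = 0.0
            for p in range(k):
                total += A[i, p] * B[p, j]
            C[i, j] = total
    return C

n = 150   # kept small so the pure-Python loops finish in a few seconds
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.perf_counter()
C_naive = naive_matmul(A, B)
print(f"naive triple loop: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
C_blas = A @ B            # same O(n^3) math, but an optimized BLAS routine
print(f"np.matmul (BLAS):  {time.perf_counter() - start:.4f} s")

assert np.allclose(C_naive, C_blas)   # identical math, vastly different speed
```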
How GPUs Solve the Problem
A CPU is designed for general-purpose computing -- it has a few powerful cores (typically 8 to 24) optimized for complex, sequential tasks. A GPU is designed for parallel computing -- it has thousands of smaller cores optimized for doing the same simple operation across many data points simultaneously.
Matrix multiplication is an ideal workload for GPUs because each element of the result matrix can be computed independently: element (i, j) is just the dot product of row i of the first matrix and column j of the second, and all of those dot products can happen at the same time.
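That independence is easy to check directly. In this small NumPy sketch (shapes chosen arbitrarily), the output elements are computed one at a time, in a shuffled order, and the result still matches A @ B:

```python
import random
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Each output element C[i, j] is the dot product of row i of A and column j of B.
# Compute the 20 elements in a shuffled order to show that none of them depends
# on any other -- which is exactly what lets a GPU hand them out to thousands
# of cores at once.
C = np.empty((4, 5))
cells = [(i, j) for i in range(4) for j in range(5)]
random.shuffle(cells)
for i, j in cells:
    C[i, j] = A[i, :] @ B[:, j]

assert np.allclose(C, A @ B)
print("all 20 dot products computed independently; result matches A @ B")
```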
| Feature | CPU | GPU |
|---|---|---|
| Core count | 8-24 | 5,000-16,000+ |
| Clock speed per core | High (4-5 GHz) | Lower (1-2 GHz) |
| Best at | Sequential, complex logic | Parallel, repetitive math |
| Matrix multiply speed | Baseline | 10x-100x faster |
| Memory bandwidth | ~50 GB/s | ~2,000 GB/s |
A modern data center GPU like the NVIDIA H100 can perform on the order of 1,000 trillion floating-point operations per second (roughly 1 petaFLOPS) of half-precision matrix math. This is what makes training billion-parameter models feasible.
Batch Processing: Parallelism in Another Dimension
GPUs do not just parallelize a single matrix multiplication -- they also enable batch processing. Instead of processing one input at a time, we stack multiple inputs into a single large matrix and process them all simultaneously.
Single input: W x x (matrix times vector)
Batch of 32: W x X (matrix times matrix, where X has 32 columns)
The weight matrix W is the same for every input. By batching 32 inputs together, the GPU performs one large matrix multiplication instead of 32 smaller ones. This is far more efficient because it better utilizes the GPU's parallel cores and memory bandwidth.
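A quick NumPy check of that equivalence (dimensions and batch size here are arbitrary; on a CPU with an optimized BLAS the speed difference is modest, but on a GPU the single large multiplication is what keeps the cores and memory bandwidth busy):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, batch = 1024, 1024, 32

W = rng.standard_normal((d_out, d_in)).astype(np.float32)             # shared weights
xs = [rng.standard_normal(d_in).astype(np.float32) for _ in range(batch)]

# One matrix-vector product per input
ys_loop = [W @ x for x in xs]

# The same 32 inputs stacked as columns of a single matrix X
X = np.stack(xs, axis=1)     # shape (d_in, 32)
Y = W @ X                    # shape (d_out, 32): one matrix-matrix product

# Column j of Y is the output for input j
assert np.allclose(np.stack(ys_loop, axis=1), Y, rtol=1e-4, atol=1e-4)
print("one batched multiply matches 32 individual multiplies")
```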
Batch sizes in AI training typically range from 32 to thousands, depending on available GPU memory.
Memory: The Other Bottleneck
Raw computation speed is only half the story. AI models must also fit in GPU memory. Every parameter is a number that must be stored, and during training, we also store gradients and optimizer states.
A model's memory footprint depends on the number of parameters and the precision of each number:
| Model | Parameters | Size (float32) | Size (float16) |
|---|---|---|---|
| BERT-base | 110 million | 440 MB | 220 MB |
| GPT-2 | 1.5 billion | 6 GB | 3 GB |
| LLaMA 70B | 70 billion | 280 GB | 140 GB |
| GPT-4 (estimated) | ~1.8 trillion | ~7.2 TB | ~3.6 TB |
During training, memory usage is roughly 3-4 times the model size, because gradients and optimizer states must be stored alongside the weights (activations add more on top). This is why large models require clusters of many GPUs working together.
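As a back-of-the-envelope sketch of these numbers (the helper name is made up for illustration; activations and framework overhead are ignored, and the 3-4x training multiplier is just the rule of thumb above):

```python
# Bytes per parameter for the common storage formats discussed in the next section.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def approx_memory_gb(n_params, dtype="float32", training=False, overhead=4.0):
    """Rough footprint in GB: parameters times bytes per parameter, multiplied
    by ~3-4x during training to cover gradients and optimizer states.
    Activations are not included."""
    total_bytes = n_params * BYTES_PER_PARAM[dtype]
    if training:
        total_bytes *= overhead
    return total_bytes / 1e9

print(f"LLaMA 70B, float16, inference: {approx_memory_gb(70e9, 'float16'):.0f} GB")
print(f"LLaMA 70B, float32, training:  {approx_memory_gb(70e9, 'float32', training=True):.0f} GB")
```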
Mixed Precision: Trading Accuracy for Speed
Every number in a matrix can be stored at different levels of precision:
- float32 (32 bits): Full precision, standard for scientific computing
- float16 (16 bits): Half precision, half the memory, roughly 2x faster on GPUs
- bfloat16 (16 bits): Brain floating point, same range as float32 but less precision
- int8 (8 bits): Quarter the memory of float32, used for inference
Modern AI training uses mixed precision: most computations happen in float16 or bfloat16, while critical accumulations (like loss values) stay in float32. This nearly doubles training speed with minimal impact on model quality.
Standard: float32 weights -> float32 multiply -> float32 result
Mixed precision: float16 weights -> float16 multiply -> float32 accumulate
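Why the accumulation matters can be seen without a GPU. The NumPy sketch below only illustrates the rounding effect (real frameworks keep the accumulator in float32 inside the matrix-multiply kernels): it sums 100,000 copies of 0.01 stored in float16, and the float16 running total stalls far short of the true answer while the float32 accumulation does not.

```python
import numpy as np

values = np.full(100_000, 0.01, dtype=np.float16)   # float16 inputs, true sum ~1000

# Accumulate in float16: once the running total is large enough, adding 0.01
# falls below half the gap between adjacent representable float16 values and
# is rounded away entirely.
total_fp16 = np.float16(0.0)
for v in values:
    total_fp16 = total_fp16 + v

# Same float16 inputs, but the running total is kept in float32.
total_fp32 = values.astype(np.float32).sum()

print(f"float16 accumulation: {float(total_fp16):8.2f}")   # stalls around 32
print(f"float32 accumulation: {float(total_fp32):8.2f}")   # ~1000
```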
Modern GPUs include dedicated hardware to accelerate these lower-precision operations. The NVIDIA H100's Tensor Cores deliver roughly an order of magnitude more matrix-multiply throughput in float16 than the chip's standard float32 units.
Why Bigger Models Need More GPUs
When a model is too large to fit on a single GPU, it must be split across multiple GPUs. There are two main strategies:
- Data parallelism: each GPU holds a full copy of the model but processes a different slice of each batch of data. Gradients are synchronized (averaged) across GPUs after each step (see the sketch after this list).
- Model parallelism: the model itself is split across GPUs. Different layers (or even different parts of a single matrix) live on different GPUs.
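Below is a toy, single-machine simulation of the data-parallel idea (the linear model, batch size, and worker count are arbitrary; real systems synchronize gradients with an all-reduce across GPUs): four workers each compute a gradient on their own shard of a batch, and averaging those gradients reproduces the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8)                 # toy model: prediction = x . w
X = rng.standard_normal((128, 8))          # one global batch of 128 examples
y = rng.standard_normal(128)

def grad(w, Xb, yb):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism, simulated: 4 "GPUs" each hold a full copy of w and
# compute a gradient on their own 32-example shard of the batch ...
shards = np.split(np.arange(128), 4)
local_grads = [grad(w, X[s], y[s]) for s in shards]

# ... then the gradients are synchronized by averaging (the role an all-reduce
# plays on a real cluster).
synced = np.mean(local_grads, axis=0)

# The averaged shard gradients equal the gradient over the full batch.
assert np.allclose(synced, grad(w, X, y))
print("averaged shard gradients == full-batch gradient")
```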
Training GPT-3 required a cluster of thousands of GPUs running for weeks. The cost was estimated at several million dollars in compute alone. This is a direct consequence of the sheer volume of matrix multiplications involved.
Why Linear Algebra Efficiency Determines AI Progress
The history of AI breakthroughs maps directly to advances in matrix multiplication efficiency:
- 2012: AlexNet used GPUs for deep learning, outperforming all prior image recognition systems
- 2017: Transformers replaced recurrent networks with architectures that better exploit GPU parallelism for matrix operations
- 2020-present: Scaling laws show that model performance improves predictably with more compute (more matrix multiplications)
Every major advancement in AI hardware -- from Tensor Cores to TPUs to custom AI chips -- is fundamentally an advancement in how quickly we can multiply matrices. The speed of matrix multiplication is the speed limit of AI progress.
Summary
- Matrix multiplication has O(n^3) complexity, making it extremely expensive for large matrices
- GPUs solve this with thousands of parallel cores, achieving 10x-100x speedup over CPUs for matrix operations
- Batch processing turns multiple inputs into a single large matrix multiplication, maximizing GPU utilization
- Memory is a critical bottleneck -- large models require hundreds of gigabytes just to store their parameters
- Mixed precision (float16/bfloat16) nearly doubles speed and halves memory usage with minimal quality loss
- Models too large for one GPU are split across many using data and model parallelism
- The efficiency of matrix multiplication is the fundamental constraint on AI progress
With a solid understanding of matrix multiplication -- its mechanics, its role in transformers, and the engineering required to compute it at scale -- we are ready to explore another fundamental operation in Module 4: the dot product and how it measures similarity.

