Computational Efficiency and GPUs
We have seen that transformers and neural networks are built from matrix multiplications -- thousands of them per forward pass, across billions of parameters. The question is no longer "what math does AI use?" but "how do we compute this much math fast enough to be useful?" The answer to this question shapes the entire AI industry.
The Cost of Matrix Multiplication
Multiplying two n x n matrices using the straightforward algorithm requires O(n^3) operations. That means doubling the matrix size increases computation by roughly 8 times.
| Matrix Size | Multiplications | Approximate Time (single core) |
|---|---|---|
| 100 x 100 | 1,000,000 | Instant |
| 1,000 x 1,000 | 1,000,000,000 | A few seconds |
| 10,000 x 10,000 | 1,000,000,000,000 | Hours |
| 100,000 x 100,000 | 1,000,000,000,000,000 | Weeks |
Modern AI models involve matrices with tens of thousands of dimensions, and they must perform these multiplications thousands of times per input, across millions of training examples. On a single CPU core, this would take years.
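To make the O(n^3) growth concrete, here is a small Python sketch (the size and timings are only illustrative, and the naive version is deliberately unoptimized) comparing a textbook triple-loop multiply against NumPy's optimized matmul on the same matrices:

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Textbook O(n^3) multiply: one explicit dot product per output element."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            total = 0.0
            for p in range(k):
                total += A[i, p] * B[p, j]
            C[i, j] = total
    return C

n = 150   # kept small so the pure-Python loops finish in a few seconds
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.perf_counter()
C_naive = naive_matmul(A, B)
print(f"naive triple loop: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
C_blas = A @ B            # same O(n^3) math, but an optimized BLAS routine
print(f"np.matmul (BLAS):  {time.perf_counter() - start:.4f} s")

assert np.allclose(C_naive, C_blas)   # identical math, vastly different speed
```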
How GPUs Solve the Problem
A CPU is designed for general-purpose computing -- it has a few powerful cores (typically 8 to 24) optimized for complex, sequential tasks. A GPU is designed for parallel computing -- it has thousands of smaller cores optimized for doing the same simple operation across many data points simultaneously.
Matrix multiplication is an ideal workload for GPUs because each element of the result matrix can be computed independently: element (i, j) is just the dot product of row i of the first matrix and column j of the second, and all of those dot products can happen at the same time.
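That independence is easy to check directly. In this small NumPy sketch (shapes chosen arbitrarily), the output elements are computed one at a time, in a shuffled order, and the result still matches A @ B:

```python
import random
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Each output element C[i, j] is the dot product of row i of A and column j of B.
# Compute the 20 elements in a shuffled order to show that none of them depends
# on any other -- which is exactly what lets a GPU hand them out to thousands
# of cores at once.
C = np.empty((4, 5))
cells = [(i, j) for i in range(4) for j in range(5)]
random.shuffle(cells)
for i, j in cells:
    C[i, j] = A[i, :] @ B[:, j]

assert np.allclose(C, A @ B)
print("all 20 dot products computed independently; result matches A @ B")
```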
| Feature | CPU | GPU |
|---|---|---|
| Core count | 8-24 | 5,000-16,000+ |
| Clock speed per core | High (4-5 GHz) | Lower (1-2 GHz) |
| Best at | Sequential, complex logic | Parallel, repetitive math |
| Matrix multiply speed | Baseline | 10x-100x faster |
| Memory bandwidth | ~50 GB/s | ~2,000 GB/s |
A modern data center GPU like the NVIDIA H100 can perform on the order of 1,000 trillion floating-point operations per second (roughly 1 petaFLOPS) of half-precision matrix math. This is what makes training billion-parameter models feasible.
Batch Processing: Parallelism in Another Dimension
GPUs do not just parallelize a single matrix multiplication -- they also enable batch processing. Instead of processing one input at a time, we stack multiple inputs into a single large matrix and process them all simultaneously.
Single input: W x x (matrix times vector)
Batch of 32: W x X (matrix times matrix, where X has 32 columns)
The weight matrix W is the same for every input. By batching 32 inputs together, the GPU performs one large matrix multiplication instead of 32 smaller ones. This is far more efficient because it better utilizes the GPU's parallel cores and memory bandwidth.
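A quick NumPy check of that equivalence (dimensions and batch size here are arbitrary; on a CPU with an optimized BLAS the speed difference is modest, but on a GPU the single large multiplication is what keeps the cores and memory bandwidth busy):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, batch = 1024, 1024, 32

W = rng.standard_normal((d_out, d_in)).astype(np.float32)             # shared weights
xs = [rng.standard_normal(d_in).astype(np.float32) for _ in range(batch)]

# One matrix-vector product per input
ys_loop = [W @ x for x in xs]

# The same 32 inputs stacked as columns of a single matrix X
X = np.stack(xs, axis=1)     # shape (d_in, 32)
Y = W @ X                    # shape (d_out, 32): one matrix-matrix product

# Column j of Y is the output for input j
assert np.allclose(np.stack(ys_loop, axis=1), Y, rtol=1e-4, atol=1e-4)
print("one batched multiply matches 32 individual multiplies")
```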
Batch sizes in AI training typically range from 32 to thousands, depending on available GPU memory.
Memory: The Other Bottleneck
Raw computation speed is only half the story. AI models must also fit in GPU memory. Every parameter is a number that must be stored, and during training, we also store gradients and optimizer states.
A model's memory footprint depends on the number of parameters and the precision of each number:
| Model | Parameters | Size (float32) | Size (float16) |
|---|---|---|---|
| BERT-base | 110 million | 440 MB | 220 MB |
| GPT-2 | 1.5 billion | 6 GB | 3 GB |
| LLaMA 70B | 70 billion | 280 GB | 140 GB |
| GPT-4 (estimated) | ~1.8 trillion | ~7.2 TB | ~3.6 TB |
During training, memory usage is roughly 3-4 times the model size, because gradients and optimizer states must be stored alongside the weights (activations add more on top). This is why large models require clusters of many GPUs working together.
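As a back-of-the-envelope sketch of these numbers (the helper name is made up for illustration; activations and framework overhead are ignored, and the 3-4x training multiplier is just the rule of thumb above):

```python
# Bytes per parameter for the common storage formats discussed in the next section.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def approx_memory_gb(n_params, dtype="float32", training=False, overhead=4.0):
    """Rough footprint in GB: parameters times bytes per parameter, multiplied
    by ~3-4x during training to cover gradients and optimizer states.
    Activations are not included."""
    total_bytes = n_params * BYTES_PER_PARAM[dtype]
    if training:
        total_bytes *= overhead
    return total_bytes / 1e9

print(f"LLaMA 70B, float16, inference: {approx_memory_gb(70e9, 'float16'):.0f} GB")
print(f"LLaMA 70B, float32, training:  {approx_memory_gb(70e9, 'float32', training=True):.0f} GB")
```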
Mixed Precision: Trading Accuracy for Speed
Every number in a matrix can be stored at different levels of precision:
- float32 (32 bits): Full precision, standard for scientific computing
- float16 (16 bits): Half precision, half the memory, roughly 2x faster on GPUs
- bfloat16 (16 bits): Brain floating point, same range as float32 but less precision
- int8 (8 bits): Quarter the memory of float32, used for inference
Modern AI training uses mixed precision: most computations happen in float16 or bfloat16, while critical accumulations (like loss values) stay in float32. This nearly doubles training speed with minimal impact on model quality.
Standard: float32 weights -> float32 multiply -> float32 result
Mixed precision: float16 weights -> float16 multiply -> float32 accumulate
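Why the accumulation matters can be seen without a GPU. The NumPy sketch below only illustrates the rounding effect (real frameworks keep the accumulator in float32 inside the matrix-multiply kernels): it sums 100,000 copies of 0.01 stored in float16, and the float16 running total stalls far short of the true answer while the float32 accumulation does not.

```python
import numpy as np

values = np.full(100_000, 0.01, dtype=np.float16)   # float16 inputs, true sum ~1000

# Accumulate in float16: once the running total is large enough, adding 0.01
# falls below half the gap between adjacent representable float16 values and
# is rounded away entirely.
total_fp16 = np.float16(0.0)
for v in values:
    total_fp16 = total_fp16 + v

# Same float16 inputs, but the running total is kept in float32.
total_fp32 = values.astype(np.float32).sum()

print(f"float16 accumulation: {float(total_fp16):8.2f}")   # stalls around 32
print(f"float32 accumulation: {float(total_fp32):8.2f}")   # ~1000
```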
Modern GPUs include dedicated hardware to accelerate these lower-precision operations. The NVIDIA H100's Tensor Cores deliver roughly an order of magnitude more matrix-multiply throughput in float16 than the chip's standard float32 units.
Why Bigger Models Need More GPUs
When a model is too large to fit on a single GPU, it must be split across multiple GPUs. There are two main strategies:
- Data parallelism: each GPU holds a full copy of the model but processes a different slice of each batch of data. Gradients are synchronized (averaged) across GPUs after each step (see the sketch after this list).
- Model parallelism: the model itself is split across GPUs. Different layers (or even different parts of a single matrix) live on different GPUs.
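Below is a toy, single-machine simulation of the data-parallel idea (the linear model, batch size, and worker count are arbitrary; real systems synchronize gradients with an all-reduce across GPUs): four workers each compute a gradient on their own shard of a batch, and averaging those gradients reproduces the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8)                 # toy model: prediction = x . w
X = rng.standard_normal((128, 8))          # one global batch of 128 examples
y = rng.standard_normal(128)

def grad(w, Xb, yb):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism, simulated: 4 "GPUs" each hold a full copy of w and
# compute a gradient on their own 32-example shard of the batch ...
shards = np.split(np.arange(128), 4)
local_grads = [grad(w, X[s], y[s]) for s in shards]

# ... then the gradients are synchronized by averaging (the role an all-reduce
# plays on a real cluster).
synced = np.mean(local_grads, axis=0)

# The averaged shard gradients equal the gradient over the full batch.
assert np.allclose(synced, grad(w, X, y))
print("averaged shard gradients == full-batch gradient")
```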
Training GPT-3 required a cluster of thousands of GPUs running for weeks. The cost was estimated at several million dollars in compute alone. This is a direct consequence of the sheer volume of matrix multiplications involved.
Why Linear Algebra Efficiency Determines AI Progress
The history of AI breakthroughs maps directly to advances in matrix multiplication efficiency:
- 2012: AlexNet used GPUs for deep learning, outperforming all prior image recognition systems
- 2017: Transformers replaced recurrent networks with architectures that better exploit GPU parallelism for matrix operations
- 2020-present: Scaling laws show that model performance improves predictably with more compute (more matrix multiplications)
Every major advancement in AI hardware -- from Tensor Cores to TPUs to custom AI chips -- is fundamentally an advancement in how quickly we can multiply matrices. The speed of matrix multiplication is the speed limit of AI progress.
Summary
- Matrix multiplication has O(n^3) complexity, making it extremely expensive for large matrices
- GPUs solve this with thousands of parallel cores, achieving 10x-100x speedup over CPUs for matrix operations
- Batch processing turns multiple inputs into a single large matrix multiplication, maximizing GPU utilization
- Memory is a critical bottleneck -- large models require hundreds of gigabytes just to store their parameters
- Mixed precision (float16/bfloat16) nearly doubles speed and halves memory usage with minimal quality loss
- Models too large for one GPU are split across many using data and model parallelism
- The efficiency of matrix multiplication is the fundamental constraint on AI progress
With a solid understanding of matrix multiplication -- its mechanics, its role in transformers, and the engineering required to compute it at scale -- we are ready to explore another fundamental operation in Module 4: the dot product and how it measures similarity.

