What is Maximum Likelihood Estimation?
Maximum Likelihood Estimation (MLE) is the mathematical foundation of how AI learns from data. It provides a principled way to find the best parameters for a model by asking: "What parameters make the observed data most probable?"
The Core Idea
Given observed data, MLE finds the parameter values that maximize the probability of seeing that data.
Question: "Given what I observed, what parameters would have made this outcome most likely?"
Answer: The maximum likelihood estimates.
A Simple Example: Coin Flipping
You flip a coin 10 times and observe: HHTHHTHHTH (7 heads, 3 tails)
Question: What's the most likely value of p (probability of heads)?
Intuition
- If p = 0.5 (fair coin): Getting 7 heads is possible but not the most likely outcome
- If p = 0.7: Getting 7 heads in 10 flips is more likely
- If p = 1.0: We'd never see tails, so this can't be right
The Math
The probability of this specific sequence, given p:
P(data | p) = p^7 × (1-p)^3
This is the likelihood function L(p).
To maximize, take the derivative and set it to zero:
d/dp [p^7 × (1-p)^3] = 7p^6 × (1-p)^3 − 3p^7 × (1-p)^2 = 0
Dividing both sides by p^6 × (1-p)^2 gives 7(1-p) = 3p, so:
Solution: p = 7/10 = 0.7
The MLE estimate is p̂ = 0.7 — exactly the observed proportion!
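We can sanity-check this numerically with a simple grid search over candidate values of p (the data here is just the 7-heads/3-tails example above):

```python
# Numerically confirm that p = 0.7 maximizes L(p) = p^7 * (1 - p)^3.
def likelihood(p: float) -> float:
    return p**7 * (1 - p) ** 3

# Grid search over candidate values of p in (0, 1).
candidates = [i / 1000 for i in range(1, 1000)]
best_p = max(candidates, key=likelihood)
print(best_p)  # 0.7
```

The grid maximum lands exactly on the observed proportion, matching the calculus result.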
The Likelihood Function
The likelihood function L(θ) gives the probability of observed data given parameters θ:
L(θ) = P(data | θ)
Key insight: Same formula as probability, but we're now treating the data as fixed and the parameters as variable.
Probability vs. Likelihood
Probability: Fixed parameters, ask about different data outcomes
- "If the coin has p=0.5, what's the probability of 7 heads?"
Likelihood: Fixed data, ask about different parameter values
- "Given we saw 7 heads, how likely is p=0.5 vs. p=0.7?"
Log-Likelihood
For numerical stability and mathematical convenience, we often work with the log-likelihood:
ℓ(θ) = log L(θ)
Since log is monotonically increasing, maximizing log-likelihood gives the same answer as maximizing likelihood.
Why Log?
- Products become sums: log(a × b) = log(a) + log(b)
- Numerical stability: Avoids underflow with many small probabilities
- Easier derivatives: Sums are easier to differentiate than products
Log-Likelihood for Coin Example
L(p) = p^7 × (1-p)^3
ℓ(p) = 7×log(p) + 3×log(1-p)
Much easier to work with!
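The numerical-stability point is easy to see in practice. A quick sketch, scaling the coin data up by a factor of 1000 (an illustrative choice): the raw likelihood underflows to zero, while the log-likelihood stays a well-behaved finite number.

```python
import math

# Scaled-up coin data: 7,000 heads and 3,000 tails.
n_heads, n_tails = 7_000, 3_000
p = 0.7

# The raw likelihood underflows to 0.0 in floating point...
raw = p**n_heads * (1 - p) ** n_tails
print(raw)  # 0.0

# ...while the log-likelihood is a perfectly ordinary number.
log_l = n_heads * math.log(p) + n_tails * math.log(1 - p)
print(log_l)  # roughly -6108.6
```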
MLE for Multiple Observations
With independent observations x₁, x₂, ..., xₙ:
L(θ) = P(x₁ | θ) × P(x₂ | θ) × ... × P(xₙ | θ)
= ∏ P(xᵢ | θ)
Log-likelihood:
ℓ(θ) = Σ log P(xᵢ | θ)
Common MLE Examples
Bernoulli (Binary Outcomes)
Data: k successes in n trials
MLE: p̂ = k/n
The sample proportion is the MLE.
Normal Distribution
Data: x₁, x₂, ..., xₙ
MLE: μ̂ = sample mean = (1/n) Σ xᵢ
MLE: σ̂² = sample variance = (1/n) Σ (xᵢ - μ̂)²
Note: MLE for variance divides by n, not (n-1). The (n-1) version is unbiased but not MLE.
Exponential Distribution
Data: x₁, x₂, ..., xₙ (times until events)
MLE: λ̂ = n / Σ xᵢ = 1 / mean(x)
Poisson Distribution
Data: counts k₁, k₂, ..., kₙ
MLE: λ̂ = mean of counts = (1/n) Σ kᵢ
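Each of these closed forms is a one-liner to compute. A small sketch with made-up sample data (the numbers are arbitrary, chosen only for illustration):

```python
# Bernoulli: k successes in n trials -> p_hat = k / n
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # the coin data above (1 = heads)
p_hat = sum(flips) / len(flips)
print(p_hat)  # 0.7

# Normal: mu_hat = sample mean, sigma2_hat divides by n (not n - 1)
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat = sum(xs) / len(xs)
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)

# Exponential: lambda_hat = n / sum(x) = 1 / mean(x)
waits = [0.5, 1.2, 0.8, 2.0]
lam_hat = len(waits) / sum(waits)

# Poisson: lambda_hat = mean of the counts
counts = [3, 1, 4, 2, 0]
poisson_hat = sum(counts) / len(counts)
print(poisson_hat)  # 2.0
```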
MLE for Machine Learning Models
Linear Regression
Model: y = wx + b + ε, where ε ~ Normal(0, σ²)
Likelihood: Product of Gaussian probabilities
MLE for w, b: Minimizes sum of squared errors!
MLE = argmin Σ (yᵢ - (wxᵢ + b))²
Key insight: Least squares regression IS maximum likelihood under Gaussian noise.
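For one feature, the least-squares (and hence MLE) solution has a closed form. A minimal sketch with invented data sampled near y = 2x + 1:

```python
# Ordinary least squares for y = wx + b, which is the MLE under Gaussian noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # roughly y = 2x + 1 plus noise

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form slope and intercept.
w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
b = y_mean - w * x_mean
print(round(w, 2), round(b, 2))  # w close to 2, b close to 1
```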
Logistic Regression
Model: P(y=1 | x) = sigmoid(wx + b)
Likelihood: Product of Bernoulli probabilities
Log-likelihood:
ℓ(w, b) = Σ [yᵢ × log(p̂ᵢ) + (1-yᵢ) × log(1-p̂ᵢ)]
Maximizing this is the same as minimizing binary cross-entropy loss!
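The equivalence is just a sign flip (and usually an average over examples). A sketch with toy data and arbitrary candidate parameters:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one feature, binary labels; w, b are arbitrary candidates.
xs = [-2.0, -1.0, 0.5, 1.5, 2.5]
ys = [0, 0, 1, 1, 1]
w, b = 1.0, 0.0

# Bernoulli log-likelihood for logistic regression.
log_lik = sum(
    y * math.log(sigmoid(w * x + b)) + (1 - y) * math.log(1 - sigmoid(w * x + b))
    for x, y in zip(xs, ys)
)

# Binary cross-entropy loss is the negated mean log-likelihood.
bce = -log_lik / len(xs)
print(log_lik, bce)
```

Maximizing `log_lik` over (w, b) and minimizing `bce` yield the same parameters.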
Neural Networks
For classification, we maximize:
ℓ(θ) = Σ log P(correct class | input; θ)
Which is equivalent to minimizing cross-entropy loss.
The MLE Recipe
- Write down the likelihood: P(data | parameters)
- Take the log: ℓ(θ) = log L(θ)
- Take the derivative: dℓ/dθ
- Set to zero and solve: dℓ/dθ = 0
For complex models (like neural networks), we use gradient ascent instead of solving analytically.
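The recipe can be sketched end-to-end on the coin example, using gradient ascent instead of the analytic solution (learning rate and iteration count are illustrative choices):

```python
# Gradient ascent on the coin log-likelihood l(p) = 7 log(p) + 3 log(1 - p).
# Its derivative is dl/dp = 7/p - 3/(1 - p); ascent should converge to p = 0.7.
p = 0.5      # initial guess
lr = 0.01    # learning rate
for _ in range(2000):
    grad = 7 / p - 3 / (1 - p)
    p += lr * grad
    p = min(max(p, 1e-6), 1 - 1e-6)   # keep p strictly inside (0, 1)
print(round(p, 3))  # 0.7
```

This mirrors what happens at scale: neural-network training ascends the log-likelihood (equivalently, descends the loss) by gradient steps rather than solving dℓ/dθ = 0 in closed form.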
Properties of MLE
Consistency
As sample size n → ∞, the MLE converges in probability to the true parameter values (under mild regularity conditions).
Asymptotic Normality
For large n, MLE is approximately normally distributed around the true value.
Efficiency
Asymptotically, MLE attains the Cramér-Rao lower bound: for large samples, no consistent estimator has lower variance.
Invariance
If θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ).
Example: If σ̂² is the MLE for variance, then σ̂ = √(σ̂²) is the MLE for standard deviation.
Limitations of MLE
Overfitting
With limited data, MLE can overfit:
- Seeing 1 head in 1 flip → MLE says p = 1.0
- This is likely wrong!
Solution: Regularization or Bayesian methods.
No Uncertainty Quantification
MLE gives point estimates, not distributions.
Solution: Bayesian inference provides full posterior distributions.
Local Optima
For complex models, there may be multiple local maxima.
Solution: Multiple random initializations, careful optimization.
From MLE to Modern Deep Learning
The connection is direct:
| MLE Concept | Deep Learning Equivalent |
|---|---|
| Likelihood | Forward pass probability |
| Log-likelihood | Negative of the loss |
| Maximizing likelihood | Minimizing loss |
| Gradient of log-likelihood | Gradients in backprop |
| MLE parameters | Trained weights |
When you train a neural network with cross-entropy loss, you're doing MLE!
Summary
- MLE finds parameters that maximize the probability of observed data
- The likelihood function treats data as fixed, parameters as variable
- Log-likelihood is more convenient: products become sums
- MLE for Gaussians gives least squares regression
- MLE for classification gives cross-entropy loss
- Deep learning training is essentially MLE with gradient descent
- MLE is consistent and efficient but can overfit with little data
Next, we'll see MLE in practice—how it's used to train AI models.

