What is Maximum Likelihood Estimation?
Maximum Likelihood Estimation (MLE) is the mathematical foundation of how AI learns from data. It provides a principled way to find the best parameters for a model by asking: "What parameters make the observed data most probable?"
The Core Idea
Given observed data, MLE finds the parameter values that maximize the probability of seeing that data.
Question: "Given what I observed, what parameters would have made this outcome most likely?"
Answer: The maximum likelihood estimates.
A Simple Example: Coin Flipping
You flip a coin 10 times and observe: HHTHHTHHTH (7 heads, 3 tails)
Question: What's the most likely value of p (probability of heads)?
Intuition
- If p = 0.5 (fair coin): Getting 7 heads is possible but not the most likely outcome
- If p = 0.7: Getting 7 heads in 10 flips is more likely
- If p = 1.0: We'd never see tails, so this can't be right
The Math
The probability of this specific sequence, given p:
P(data | p) = p^7 × (1-p)^3
This is the likelihood function L(p).
To maximize, take the derivative and set it to zero:
d/dp [p^7 × (1-p)^3] = 7p^6 × (1-p)^3 − 3p^7 × (1-p)^2 = 0
Dividing both sides by p^6 × (1-p)^2 gives 7(1-p) = 3p, so:
Solution: p = 7/10 = 0.7
The MLE estimate is p̂ = 0.7 — exactly the observed proportion!
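We can sanity-check this numerically with a simple grid search over candidate values of p (the data here is just the 7-heads/3-tails example above):

```python
# Numerically confirm that p = 0.7 maximizes L(p) = p^7 * (1 - p)^3.
def likelihood(p: float) -> float:
    return p**7 * (1 - p) ** 3

# Grid search over candidate values of p in (0, 1).
candidates = [i / 1000 for i in range(1, 1000)]
best_p = max(candidates, key=likelihood)
print(best_p)  # 0.7
```

The grid maximum lands exactly on the observed proportion, matching the calculus result.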
The Likelihood Function
The likelihood function L(θ) gives the probability of observed data given parameters θ:
L(θ) = P(data | θ)
Key insight: Same formula as probability, but we're now treating the data as fixed and the parameters as variable.
Probability vs. Likelihood
Probability: Fixed parameters, ask about different data outcomes
- "If the coin has p=0.5, what's the probability of 7 heads?"
Likelihood: Fixed data, ask about different parameter values
- "Given we saw 7 heads, how likely is p=0.5 vs. p=0.7?"
Log-Likelihood
For numerical stability and mathematical convenience, we often work with the log-likelihood:
ℓ(θ) = log L(θ)
Since log is monotonically increasing, maximizing log-likelihood gives the same answer as maximizing likelihood.
Why Log?
- Products become sums: log(a × b) = log(a) + log(b)
- Numerical stability: Avoids underflow with many small probabilities
- Easier derivatives: Sums are easier to differentiate than products
Log-Likelihood for Coin Example
L(p) = p^7 × (1-p)^3
ℓ(p) = 7×log(p) + 3×log(1-p)
Much easier to work with!
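The numerical-stability point is easy to see in practice. A quick sketch, scaling the coin data up by a factor of 1000 (an illustrative choice): the raw likelihood underflows to zero, while the log-likelihood stays a well-behaved finite number.

```python
import math

# Scaled-up coin data: 7,000 heads and 3,000 tails.
n_heads, n_tails = 7_000, 3_000
p = 0.7

# The raw likelihood underflows to 0.0 in floating point...
raw = p**n_heads * (1 - p) ** n_tails
print(raw)  # 0.0

# ...while the log-likelihood is a perfectly ordinary number.
log_l = n_heads * math.log(p) + n_tails * math.log(1 - p)
print(log_l)  # roughly -6108.6
```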
MLE for Multiple Observations
With independent observations x₁, x₂, ..., xₙ:
L(θ) = P(x₁ | θ) × P(x₂ | θ) × ... × P(xₙ | θ)
= ∏ P(xᵢ | θ)
Log-likelihood:
ℓ(θ) = Σ log P(xᵢ | θ)
Common MLE Examples
Bernoulli (Binary Outcomes)
Data: k successes in n trials
MLE: p̂ = k/n
The sample proportion is the MLE.
Normal Distribution
Data: x₁, x₂, ..., xₙ
MLE: μ̂ = sample mean = (1/n) Σ xᵢ
MLE: σ̂² = sample variance = (1/n) Σ (xᵢ - μ̂)²
Note: MLE for variance divides by n, not (n-1). The (n-1) version is unbiased but not MLE.
Exponential Distribution
Data: x₁, x₂, ..., xₙ (times until events)
MLE: λ̂ = n / Σ xᵢ = 1 / mean(x)
Poisson Distribution
Data: counts k₁, k₂, ..., kₙ
MLE: λ̂ = mean of counts = (1/n) Σ kᵢ
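Each of these closed forms is a one-liner to compute. A small sketch with made-up sample data (the numbers are arbitrary, chosen only for illustration):

```python
# Bernoulli: k successes in n trials -> p_hat = k / n
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # the coin data above (1 = heads)
p_hat = sum(flips) / len(flips)
print(p_hat)  # 0.7

# Normal: mu_hat = sample mean, sigma2_hat divides by n (not n - 1)
xs = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat = sum(xs) / len(xs)
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)

# Exponential: lambda_hat = n / sum(x) = 1 / mean(x)
waits = [0.5, 1.2, 0.8, 2.0]
lam_hat = len(waits) / sum(waits)

# Poisson: lambda_hat = mean of the counts
counts = [3, 1, 4, 2, 0]
poisson_hat = sum(counts) / len(counts)
print(poisson_hat)  # 2.0
```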
MLE for Machine Learning Models
Linear Regression
Model: y = wx + b + ε, where ε ~ Normal(0, σ²)
Likelihood: Product of Gaussian probabilities
MLE for w, b: Minimizes sum of squared errors!
MLE = argmin Σ (yᵢ - (wxᵢ + b))²
Key insight: Least squares regression IS maximum likelihood under Gaussian noise.
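For one feature, the least-squares (and hence MLE) solution has a closed form. A minimal sketch with invented data sampled near y = 2x + 1:

```python
# Ordinary least squares for y = wx + b, which is the MLE under Gaussian noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # roughly y = 2x + 1 plus noise

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form slope and intercept.
w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
b = y_mean - w * x_mean
print(round(w, 2), round(b, 2))  # w close to 2, b close to 1
```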
Logistic Regression
Model: P(y=1 | x) = sigmoid(wx + b)
Likelihood: Product of Bernoulli probabilities
Log-likelihood:
ℓ(w, b) = Σ [yᵢ × log(p̂ᵢ) + (1-yᵢ) × log(1-p̂ᵢ)]
Maximizing this is the same as minimizing binary cross-entropy loss!
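The equivalence is just a sign flip (and usually an average over examples). A sketch with toy data and arbitrary candidate parameters:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one feature, binary labels; w, b are arbitrary candidates.
xs = [-2.0, -1.0, 0.5, 1.5, 2.5]
ys = [0, 0, 1, 1, 1]
w, b = 1.0, 0.0

# Bernoulli log-likelihood for logistic regression.
log_lik = sum(
    y * math.log(sigmoid(w * x + b)) + (1 - y) * math.log(1 - sigmoid(w * x + b))
    for x, y in zip(xs, ys)
)

# Binary cross-entropy loss is the negated mean log-likelihood.
bce = -log_lik / len(xs)
print(log_lik, bce)
```

Maximizing `log_lik` over (w, b) and minimizing `bce` yield the same parameters.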
Neural Networks
For classification, we maximize:
ℓ(θ) = Σ log P(correct class | input; θ)
Which is equivalent to minimizing cross-entropy loss.
The MLE Recipe
- Write down the likelihood: P(data | parameters)
- Take the log: ℓ(θ) = log L(θ)
- Take the derivative: dℓ/dθ
- Set to zero and solve: dℓ/dθ = 0
For complex models (like neural networks), we use gradient ascent instead of solving analytically.
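The recipe can be sketched end-to-end on the coin example, using gradient ascent instead of the analytic solution (learning rate and iteration count are illustrative choices):

```python
# Gradient ascent on the coin log-likelihood l(p) = 7 log(p) + 3 log(1 - p).
# Its derivative is dl/dp = 7/p - 3/(1 - p); ascent should converge to p = 0.7.
p = 0.5      # initial guess
lr = 0.01    # learning rate
for _ in range(2000):
    grad = 7 / p - 3 / (1 - p)
    p += lr * grad
    p = min(max(p, 1e-6), 1 - 1e-6)   # keep p strictly inside (0, 1)
print(round(p, 3))  # 0.7
```

This mirrors what happens at scale: neural-network training ascends the log-likelihood (equivalently, descends the loss) by gradient steps rather than solving dℓ/dθ = 0 in closed form.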
Properties of MLE
Consistency
As sample size n → ∞, the MLE converges in probability to the true parameter values (under mild regularity conditions).
Asymptotic Normality
For large n, MLE is approximately normally distributed around the true value.
Efficiency
Asymptotically, MLE attains the Cramér-Rao lower bound: for large samples, no consistent estimator has lower variance.
Invariance
If θ̂ is the MLE for θ, then g(θ̂) is the MLE for g(θ).
Example: If σ̂² is the MLE for variance, then σ̂ = √(σ̂²) is the MLE for standard deviation.
Limitations of MLE
Overfitting
With limited data, MLE can overfit:
- Seeing 1 head in 1 flip → MLE says p = 1.0
- This is likely wrong!
Solution: Regularization or Bayesian methods.
No Uncertainty Quantification
MLE gives point estimates, not distributions.
Solution: Bayesian inference provides full posterior distributions.
Local Optima
For complex models, there may be multiple local maxima.
Solution: Multiple random initializations, careful optimization.
From MLE to Modern Deep Learning
The connection is direct:
| MLE Concept | Deep Learning Equivalent |
|---|---|
| Likelihood | Forward pass probability |
| Log-likelihood | Negative of the loss |
| Maximizing likelihood | Minimizing loss |
| Gradient of log-likelihood | Gradients in backprop |
| MLE parameters | Trained weights |
When you train a neural network with cross-entropy loss, you're doing MLE!
Summary
- MLE finds parameters that maximize the probability of observed data
- The likelihood function treats data as fixed, parameters as variable
- Log-likelihood is more convenient: products become sums
- MLE for Gaussians gives least squares regression
- MLE for classification gives cross-entropy loss
- Deep learning training is essentially MLE with gradient descent
- MLE is consistent and efficient but can overfit with little data
Next, we'll see MLE in practice—how it's used to train AI models.

