Introduction to Bayes' Theorem
Bayes' theorem is one of the most important formulas in AI and machine learning. It provides a mathematically rigorous way to update our beliefs when we receive new evidence, which is exactly what AI systems need to do.
The Problem Bayes Solves
We often know P(B | A) but need P(A | B):
- We know P(Positive Test | Disease) — how often sick people test positive
- We need P(Disease | Positive Test) — how likely the disease is given a positive test
- We know P(Evidence | Hypothesis) — how the world looks if our hypothesis is true
- We need P(Hypothesis | Evidence) — how likely our hypothesis is given what we observe
Bayes' theorem lets us flip the conditioning.
The Formula
Bayes' Theorem:
P(A | B) = P(B | A) × P(A) / P(B)
Or more explicitly:
P(Hypothesis | Evidence) = P(Evidence | Hypothesis) × P(Hypothesis) / P(Evidence)
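As a minimal sketch, the formula maps directly onto a few lines of Python; the function name `bayes_posterior` and its argument names are illustrative, not taken from any particular library.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b
```

The worked examples below plug different numbers into this same one-line calculation.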
Understanding Each Term
| Term | Name | Meaning |
|---|---|---|
| P(A \| B) | Posterior | Probability of A after seeing B |
| P(B \| A) | Likelihood | Probability of B if A is true |
| P(A) | Prior | Probability of A before seeing B |
| P(B) | Evidence | Total probability of B occurring |
The Story Behind the Names
- Prior: What we believed before seeing any evidence
- Likelihood: How well the evidence fits each hypothesis
- Evidence: How likely the evidence is overall (normalizing factor)
- Posterior: Our updated belief after seeing the evidence
A Classic Example: Medical Diagnosis
Let's work through the classic medical-testing example that shows why Bayes' theorem is so important.
Given:
- 1% of the population has a disease: P(Disease) = 0.01
- The test correctly identifies sick people 99% of the time: P(Positive | Disease) = 0.99
- The test correctly identifies healthy people 95% of the time: P(Negative | No Disease) = 0.95
Question: If someone tests positive, what's the probability they have the disease?
Most people guess 99% or 95%. The real answer is much lower.
Step-by-Step Solution
First, let's find P(Positive). Since the test correctly identifies healthy people 95% of the time, the false-positive rate is P(Positive | No Disease) = 1 - 0.95 = 0.05:
P(Positive) = P(Positive | Disease) × P(Disease) + P(Positive | No Disease) × P(No Disease)
= 0.99 × 0.01 + 0.05 × 0.99
= 0.0099 + 0.0495
= 0.0594
Now apply Bayes' theorem:
P(Disease | Positive) = P(Positive | Disease) × P(Disease) / P(Positive)
= 0.99 × 0.01 / 0.0594
= 0.0099 / 0.0594
= 0.167 (about 17%)
Only a 17% chance of having the disease despite a positive test!
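Here is the same calculation as a short Python sketch, with the numbers copied from the example above (variable names are illustrative):

```python
p_disease = 0.01                          # prior: 1% of the population is sick
p_pos_given_disease = 0.99                # test sensitivity
p_pos_given_healthy = 1 - 0.95            # 5% false-positive rate

# Law of total probability: every way a positive test can occur.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))      # 0.0594

# Bayes' theorem flips the conditioning.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))      # 0.167, i.e. about 17%
```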
Why This Happens
The disease is rare (1% of people). Even with a good test, most positive results come from the 99% of healthy people (5% of whom test positive) rather than the 1% of sick people (99% of whom test positive).
This is called the base rate fallacy—ignoring how common or rare something is when interpreting evidence.
Bayes' Theorem with Total Probability
A useful form combines Bayes with the law of total probability:
P(A | B) = P(B | A) × P(A) / [P(B | A) × P(A) + P(B | not A) × P(not A)]
This explicitly shows how the denominator is calculated by considering all ways B can occur.
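A small sketch of this expanded form as a reusable function, assuming the same two-hypothesis setup (A versus not A); the name `posterior_two_hypotheses` is illustrative:

```python
def posterior_two_hypotheses(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B), with P(B) expanded by the law of total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

print(posterior_two_hypotheses(0.99, 0.01, 0.05))   # the medical example again: ~0.167
```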
Odds Form of Bayes
For comparing hypotheses, the odds form is powerful:
Posterior Odds = Likelihood Ratio × Prior Odds
Where:
- Odds = P(A) / P(not A)
- Likelihood Ratio = P(B | A) / P(B | not A)
Example: Email Spam
Prior odds of spam: 1:4 (20% of emails are spam)
If the email contains "FREE MONEY":
- P("FREE MONEY" | Spam) = 0.8
- P("FREE MONEY" | Not Spam) = 0.01
Likelihood ratio = 0.8 / 0.01 = 80
Posterior odds = 80 × (1/4) = 20:1
P(Spam | "FREE MONEY") = 20/21 ≈ 95.2%
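The same arithmetic as a brief Python sketch (numbers taken from the example; variable names are illustrative):

```python
prior_odds = 0.20 / 0.80                          # 1:4, i.e. 0.25
likelihood_ratio = 0.8 / 0.01                     # P(phrase | spam) / P(phrase | not spam) = 80
posterior_odds = likelihood_ratio * prior_odds    # 20, i.e. 20:1

# Convert odds back to a probability: odds / (1 + odds).
p_spam = posterior_odds / (1 + posterior_odds)
print(round(p_spam, 3))                           # 0.952
```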
Bayes as Learning
You can think of Bayes' theorem as a learning process:
What I believe now = (How well evidence fits belief × What I believed before) / Evidence strength
- Start with a prior belief
- Observe evidence
- Update to get a posterior belief
- The posterior becomes the new prior for future updates
This is exactly how Bayesian machine learning works!
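A minimal sketch of that loop, assuming the simple two-hypothesis setup from the medical example, where each posterior is fed back in as the next prior (function and variable names are illustrative):

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayesian update: return the posterior P(H | E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

belief = 0.01   # start from the prior
for p_e_given_h, p_e_given_not_h in [(0.99, 0.05), (0.99, 0.05)]:   # two positive tests
    belief = update(belief, p_e_given_h, p_e_given_not_h)           # posterior becomes the new prior
    print(round(belief, 3))   # 0.167, then 0.798
```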
Intuitive Understanding
Here are some intuitions about Bayes' theorem:
Strong prior, weak evidence: Your belief doesn't change much
- You're very confident it won't rain (strong prior)
- You see one dark cloud (weak evidence)
- You still think it probably won't rain
Weak prior, strong evidence: Your belief changes a lot
- You're unsure if an email is spam (weak prior)
- It contains "Nigerian prince" (strong evidence)
- You're now confident it's spam
Evidence that could support multiple hypotheses: Less conclusive
- A cough could mean a cold, flu, or allergies
- The cough alone doesn't strongly update toward any single hypothesis
Why AI Needs Bayes
Bayes' theorem appears throughout AI:
| Application | How Bayes is Used |
|---|---|
| Spam filters | P(Spam \| Email features) |
| Medical diagnosis | P(Disease \| Symptoms) |
| Recommendation systems | P(Like item \| User history) |
| Language models | P(Word \| Context) |
| Robot localization | P(Position \| Sensor readings) |
| A/B testing | P(Version better \| Click data) |
Common Pitfalls
Ignoring Base Rates
Always consider the prior! When an event is rare, even a fairly reliable positive result can leave it unlikely, as the medical test example shows.
Confusing Likelihoods
P(Evidence | Hypothesis) ≠ P(Hypothesis | Evidence)
Just because evidence is common when the hypothesis is true doesn't mean the hypothesis is likely when evidence appears.
Forgetting to Normalize
The denominator P(B) ensures the posterior is a valid probability (between 0 and 1).
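As a quick illustration, assuming the medical-test numbers from earlier: the unnormalized scores (likelihood times prior) only become probabilities once they are divided by their sum, which is exactly P(B).

```python
# Unnormalized scores: likelihood * prior for each hypothesis.
scores = {"disease": 0.99 * 0.01, "no disease": 0.05 * 0.99}

# Dividing by the sum (= P(Positive)) yields posteriors that add up to 1.
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}
print(posteriors)   # {'disease': 0.1666..., 'no disease': 0.8333...}
```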
Summary
- Bayes' theorem lets us compute P(A | B) from P(B | A)
- Prior: Initial belief. Likelihood: How well evidence fits. Posterior: Updated belief
- The base rate (prior) is crucial and often overlooked
- Bayes' theorem is the foundation for learning from data
- AI systems constantly use Bayesian reasoning to update predictions
Next, we'll see how to update beliefs sequentially as new evidence arrives.

