Introduction to Bayes' Theorem
Bayes' theorem is one of the most important formulas in AI and machine learning. It provides a mathematically rigorous way to update our beliefs when we receive new evidence, which is exactly what AI systems need to do.
The Problem Bayes Solves
We often know P(B | A) but need P(A | B):
- We know P(Positive Test | Disease) — how often sick people test positive
- We need P(Disease | Positive Test) — how likely the disease is given a positive test
- We know P(Evidence | Hypothesis) — how the world looks if our hypothesis is true
- We need P(Hypothesis | Evidence) — how likely our hypothesis is given what we observe
Bayes' theorem lets us flip the conditioning.
The Formula
Bayes' Theorem:
P(A | B) = P(B | A) × P(A) / P(B)
Or more explicitly:
P(Hypothesis | Evidence) = P(Evidence | Hypothesis) × P(Hypothesis) / P(Evidence)
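As a minimal sketch, the formula maps directly onto a few lines of Python; the function name `bayes_posterior` and its argument names are illustrative, not taken from any particular library.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b
```

The worked examples below plug different numbers into this same one-line calculation.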
Understanding Each Term
| Term | Name | Meaning |
|---|---|---|
| P(A \| B) | Posterior | Probability of A after seeing B |
| P(B \| A) | Likelihood | Probability of B if A is true |
| P(A) | Prior | Probability of A before seeing B |
| P(B) | Evidence | Total probability of B occurring |
The Story Behind the Names
- Prior: What we believed before seeing any evidence
- Likelihood: How well the evidence fits each hypothesis
- Evidence: How likely the evidence is overall (normalizing factor)
- Posterior: Our updated belief after seeing the evidence
A Classic Example: Medical Diagnosis
Let's work through the classic medical-testing example that shows why Bayes' theorem is so important.
Given:
- 1% of the population has a disease: P(Disease) = 0.01
- The test correctly identifies sick people 99% of the time: P(Positive | Disease) = 0.99
- The test correctly identifies healthy people 95% of the time: P(Negative | No Disease) = 0.95
Question: If someone tests positive, what's the probability they have the disease?
Most people guess 99% or 95%. The real answer is much lower.
Step-by-Step Solution
First, let's find P(Positive). Since the test correctly identifies healthy people 95% of the time, the false-positive rate is P(Positive | No Disease) = 1 - 0.95 = 0.05:
P(Positive) = P(Positive | Disease) × P(Disease) + P(Positive | No Disease) × P(No Disease)
= 0.99 × 0.01 + 0.05 × 0.99
= 0.0099 + 0.0495
= 0.0594
Now apply Bayes' theorem:
P(Disease | Positive) = P(Positive | Disease) × P(Disease) / P(Positive)
= 0.99 × 0.01 / 0.0594
= 0.0099 / 0.0594
= 0.167 (about 17%)
Only a 17% chance of having the disease despite a positive test!
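Here is the same calculation as a short Python sketch, with the numbers copied from the example above (variable names are illustrative):

```python
p_disease = 0.01                          # prior: 1% of the population is sick
p_pos_given_disease = 0.99                # test sensitivity
p_pos_given_healthy = 1 - 0.95            # 5% false-positive rate

# Law of total probability: every way a positive test can occur.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))      # 0.0594

# Bayes' theorem flips the conditioning.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))      # 0.167, i.e. about 17%
```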
Why This Happens
The disease is rare (1% of people). Even with a good test, most positive results come from the 99% of healthy people (5% of whom test positive) rather than the 1% of sick people (99% of whom test positive).
This is called the base rate fallacy—ignoring how common or rare something is when interpreting evidence.
Bayes' Theorem with Total Probability
A useful form combines Bayes with the law of total probability:
P(A | B) = P(B | A) × P(A) / [P(B | A) × P(A) + P(B | not A) × P(not A)]
This explicitly shows how the denominator is calculated by considering all ways B can occur.
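A small sketch of this expanded form as a reusable function, assuming the same two-hypothesis setup (A versus not A); the name `posterior_two_hypotheses` is illustrative:

```python
def posterior_two_hypotheses(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B), with P(B) expanded by the law of total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

print(posterior_two_hypotheses(0.99, 0.01, 0.05))   # the medical example again: ~0.167
```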
Odds Form of Bayes
For comparing hypotheses, the odds form is powerful:
Posterior Odds = Likelihood Ratio × Prior Odds
Where:
- Odds = P(A) / P(not A)
- Likelihood Ratio = P(B | A) / P(B | not A)
Example: Email Spam
Prior odds of spam: 1:4 (20% of emails are spam)
If the email contains "FREE MONEY":
- P("FREE MONEY" | Spam) = 0.8
- P("FREE MONEY" | Not Spam) = 0.01
Likelihood ratio = 0.8 / 0.01 = 80
Posterior odds = 80 × (1/4) = 20:1
P(Spam | "FREE MONEY") = 20/21 ≈ 95.2%
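The same arithmetic as a brief Python sketch (numbers taken from the example; variable names are illustrative):

```python
prior_odds = 0.20 / 0.80                          # 1:4, i.e. 0.25
likelihood_ratio = 0.8 / 0.01                     # P(phrase | spam) / P(phrase | not spam) = 80
posterior_odds = likelihood_ratio * prior_odds    # 20, i.e. 20:1

# Convert odds back to a probability: odds / (1 + odds).
p_spam = posterior_odds / (1 + posterior_odds)
print(round(p_spam, 3))                           # 0.952
```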
Bayes as Learning
You can think of Bayes' theorem as a learning process:
What I believe now = (How well evidence fits belief × What I believed before) / Evidence strength
- Start with a prior belief
- Observe evidence
- Update to get a posterior belief
- The posterior becomes the new prior for future updates
This is exactly how Bayesian machine learning works!
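A minimal sketch of that loop, assuming the simple two-hypothesis setup from the medical example, where each posterior is fed back in as the next prior (function and variable names are illustrative):

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayesian update: return the posterior P(H | E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

belief = 0.01   # start from the prior
for p_e_given_h, p_e_given_not_h in [(0.99, 0.05), (0.99, 0.05)]:   # two positive tests
    belief = update(belief, p_e_given_h, p_e_given_not_h)           # posterior becomes the new prior
    print(round(belief, 3))   # 0.167, then 0.798
```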
Intuitive Understanding
Here are some intuitions about Bayes' theorem:
Strong prior, weak evidence: Your belief doesn't change much
- You're very confident it won't rain (strong prior)
- You see one dark cloud (weak evidence)
- You still think it probably won't rain
Weak prior, strong evidence: Your belief changes a lot
- You're unsure if an email is spam (weak prior)
- It contains "Nigerian prince" (strong evidence)
- You're now confident it's spam
Evidence that could support multiple hypotheses: Less conclusive
- A cough could mean a cold, flu, or allergies
- The cough alone doesn't strongly update toward any single hypothesis
Why AI Needs Bayes
Bayes' theorem appears throughout AI:
| Application | How Bayes is Used |
|---|---|
| Spam filters | P(Spam \| Email features) |
| Medical diagnosis | P(Disease \| Symptoms) |
| Recommendation systems | P(Like item \| User history) |
| Language models | P(Word \| Context) |
| Robot localization | P(Position \| Sensor readings) |
| A/B testing | P(Version better \| Click data) |
Common Pitfalls
Ignoring Base Rates
Always consider the prior! When an event is rare, even a fairly reliable positive result can leave it unlikely, as the medical test example shows.
Confusing Likelihoods
P(Evidence | Hypothesis) ≠ P(Hypothesis | Evidence)
Just because evidence is common when the hypothesis is true doesn't mean the hypothesis is likely when evidence appears.
Forgetting to Normalize
The denominator P(B) ensures the posterior is a valid probability (between 0 and 1).
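As a quick illustration, assuming the medical-test numbers from earlier: the unnormalized scores (likelihood times prior) only become probabilities once they are divided by their sum, which is exactly P(B).

```python
# Unnormalized scores: likelihood * prior for each hypothesis.
scores = {"disease": 0.99 * 0.01, "no disease": 0.05 * 0.99}

# Dividing by the sum (= P(Positive)) yields posteriors that add up to 1.
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}
print(posteriors)   # {'disease': 0.1666..., 'no disease': 0.8333...}
```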
Summary
- Bayes' theorem lets us compute P(A | B) from P(B | A)
- Prior: Initial belief. Likelihood: How well evidence fits. Posterior: Updated belief
- The base rate (prior) is crucial and often overlooked
- Bayes' theorem is the foundation for learning from data
- AI systems constantly use Bayesian reasoning to update predictions
Next, we'll see how to update beliefs sequentially as new evidence arrives.

