Updating Beliefs with Evidence
Bayes' theorem isn't just a one-time calculation—it's a process. As new evidence arrives, we continuously update our beliefs. This sequential updating is fundamental to how AI systems learn and adapt.
The Bayesian Update Process
- Start with a prior belief
- Observe new evidence
- Calculate the posterior using Bayes' theorem
- The posterior becomes the new prior
- Repeat when more evidence arrives
Prior₁ → Evidence₁ → Posterior₁ = Prior₂ → Evidence₂ → Posterior₂ → ...
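The loop above can be sketched in a few lines of Python. The two coin hypotheses and their head probabilities are illustrative assumptions, not part of the text:

```python
# Minimal sketch of the prior -> posterior -> new prior loop for a
# discrete set of hypotheses. Names and numbers are illustrative.

def update(prior, likelihood):
    """One Bayesian update: posterior ∝ likelihood × prior, then normalize."""
    unnormalized = {h: likelihood[h] * p for h, p in prior.items()}
    z = sum(unnormalized.values())          # P(E), the normalizing constant
    return {h: u / z for h, u in unnormalized.items()}

# Two hypotheses about a coin: fair, or biased toward heads (P(heads) = 0.8).
belief = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}

for flip in ["H", "H", "T"]:                # evidence arrives one flip at a time
    lik = p_heads if flip == "H" else {h: 1 - p for h, p in p_heads.items()}
    belief = update(belief, lik)            # the posterior becomes the new prior

print(belief)
```

Each pass through the loop is one application of Bayes' theorem; nothing about the function changes between updates, only the prior it is fed.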
Sequential Updates Example
Let's track how our belief about a coin evolves as we flip it.
Initial belief (prior): The coin is fair, P(Heads) = 0.5
Observation 1: First flip is Heads
We don't actually need fancy math here—one head doesn't tell us much. But let's say we're slightly more inclined to think the coin might favor heads:
- P(biased toward heads | 1 head) slightly increases
Observation 2: Second flip is also Heads
Two heads in a row. Now we're a bit more suspicious:
- P(biased toward heads | 2 heads) increases more
Observations 3-10: Eight more heads!
After 10 consecutive heads:
- P(biased toward heads | 10 heads) is now very high
- A fair coin would do this only 1/1024 ≈ 0.1% of the time
Each observation nudged our belief, and cumulatively, we've dramatically updated our view.
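To make the 10-heads case concrete, here is the arithmetic with explicit numbers. The "biased" hypothesis (P(heads) = 0.9) and the 50/50 prior are assumptions chosen for illustration:

```python
# Posterior that the coin is biased after 10 straight heads, assuming a
# single alternative hypothesis with P(heads) = 0.9 and a 50/50 prior.

p_fair = 0.5 ** 10            # likelihood of 10 heads from a fair coin = 1/1024
p_biased = 0.9 ** 10          # likelihood of 10 heads from the biased coin
prior_biased = 0.5

posterior_biased = (p_biased * prior_biased) / (
    p_biased * prior_biased + p_fair * (1 - prior_biased)
)
print(f"P(10 heads | fair)   = {p_fair:.5f}")   # ≈ 0.00098, the ~0.1% above
print(f"P(biased | 10 heads) = {posterior_biased:.3f}")
```

Under these assumed numbers the posterior on "biased" ends up above 0.99, matching the qualitative claim that ten consecutive heads is dramatic evidence.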
Mathematical Formulation
For sequential evidence E₁, E₂, E₃, ..., we update:
P(H | E₁) ∝ P(E₁ | H) × P(H)
P(H | E₁, E₂) ∝ P(E₂ | H) × P(H | E₁)
P(H | E₁, E₂, E₃) ∝ P(E₃ | H) × P(H | E₁, E₂)
Each posterior becomes the prior for the next update.
If evidence is conditionally independent given H:
P(H | E₁, E₂, ..., Eₙ) ∝ P(H) × P(E₁ | H) × P(E₂ | H) × ... × P(Eₙ | H)
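The equivalence between step-by-step updating and one batch update with the product of likelihoods can be checked numerically. The coin hypotheses and flip sequence below are illustrative:

```python
# Check that, for conditionally independent evidence, a single batch update
# with the product of likelihoods matches sequential updating.

import math

prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}
flips = ["H", "H", "T", "H"]

def lik(h, flip):
    return p_heads[h] if flip == "H" else 1 - p_heads[h]

# Sequential: the posterior after each flip becomes the next prior.
seq = dict(prior)
for flip in flips:
    unnorm = {h: lik(h, flip) * p for h, p in seq.items()}
    z = sum(unnorm.values())
    seq = {h: u / z for h, u in unnorm.items()}

# Batch: multiply all likelihoods at once, normalize once.
unnorm = {h: prior[h] * math.prod(lik(h, f) for f in flips) for h in prior}
z = sum(unnorm.values())
batch = {h: u / z for h, u in unnorm.items()}

assert all(abs(seq[h] - batch[h]) < 1e-12 for h in prior)
```

This equivalence is exactly why the order of conditionally independent observations does not matter.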
Online Learning
This sequential updating is called online learning—the model updates its beliefs as data arrives, rather than waiting for all data.
Advantages of Online Learning:
- Can adapt to new patterns
- Memory efficient (don't need all historical data)
- Natural for streaming data
- Matches how humans learn
Examples in AI:
- Spam filters updating with each email you mark
- Recommendation systems learning from each click
- Ad systems adjusting bids in real-time
- Fraud detection updating with each transaction
Belief Trajectories
We can visualize how beliefs evolve over time:
Probability of Hypothesis A
1.0 |                            *****
    |                      ******
0.8 |                 *****
    |           ******
0.6 |       ****
    |    ***
0.4 |****
    |
0.2 |
    |
0.0 +---+---+---+---+---+---+---+---+---+---+
    0   1   2   3   4   5   6   7   8   9  10
                 Evidence Count
With consistent evidence favoring A, belief in A grows over time.
Evidence Strength Matters
Not all evidence is created equal. Strong evidence causes larger updates:
Weak evidence: Consistent with multiple hypotheses
P(H | weak evidence) ≈ P(H) // Small update
Strong evidence: Much more likely under one hypothesis
P(H | strong evidence) significantly differs from P(H)
Example: Is It Raining?
Weak evidence: Someone carrying an umbrella
- Many people carry umbrellas on cloudy-but-dry days
- Small update toward "raining"
Strong evidence: Everyone on the street is wet
- Very unlikely unless it's raining
- Large update toward "raining"
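Putting illustrative numbers on the rain example shows how the likelihood ratio P(E | H) / P(E | not H) controls the update size; all probabilities below are assumptions for the sake of the sketch:

```python
# The likelihood ratio P(E | raining) / P(E | not raining) determines how
# far the posterior moves. All numbers here are illustrative assumptions.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) via Bayes' theorem for a binary hypothesis."""
    num = p_e_given_h * prior
    return num / (num + p_e_given_not_h * (1 - prior))

prior_rain = 0.3

# Weak evidence: umbrellas are common even on dry days (ratio ≈ 2).
weak = posterior(prior_rain, 0.6, 0.3)
# Strong evidence: wet pedestrians are rare unless it rains (ratio = 45).
strong = posterior(prior_rain, 0.9, 0.02)

print(f"after seeing an umbrella:  {weak:.2f}")    # modest shift from 0.30
print(f"after seeing a wet street: {strong:.2f}")  # near-certain
```

With these assumed numbers, the umbrella nudges the belief from 0.30 to about 0.46, while the wet street pushes it above 0.95.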
Prior Strength
Your prior also affects how much you update:
Strong prior (high confidence): Requires more evidence to change
"I'm almost certain the earth is round."
One person saying it's flat barely changes this belief.
Weak prior (uncertain): Changes easily with evidence
"I have no idea if this coin is fair."
A few flips significantly update my belief.
Mathematical Interpretation
A strong prior can be thought of as equivalent to having already seen a lot of evidence. If your prior represents "100 coin flips worth" of information, one more flip doesn't change much.
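The pseudo-count view can be demonstrated with simple arithmetic (this anticipates the Beta-Binomial model below; the specific counts are illustrative):

```python
# A strong prior behaves like pseudo-observations already in the tally.
# The prior is encoded as (pseudo_heads, pseudo_tails); numbers are illustrative.

def mean_after(pseudo_heads, pseudo_tails, heads, tails):
    """Point estimate of P(heads) after folding observations into pseudo-counts."""
    return (pseudo_heads + heads) / (pseudo_heads + pseudo_tails + heads + tails)

# Weak prior ("one flip each way"): a single head moves the estimate a lot.
print(mean_after(1, 1, 1, 0))    # 2/3 ≈ 0.67

# Strong prior ("100 flips worth"): the same single head barely moves it.
print(mean_after(50, 50, 1, 0))  # 51/101 ≈ 0.50
```

The same observation shifts the weak-prior estimate from 0.50 to 0.67, but the strong-prior estimate only to about 0.505.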
Confirmation and Disconfirmation
Evidence can support or oppose a hypothesis:
Confirming evidence: P(E | H) > P(E | not H)
- Posterior increases
Disconfirming evidence: P(E | H) < P(E | not H)
- Posterior decreases
Neutral evidence: P(E | H) = P(E | not H)
- Posterior unchanged (evidence is uninformative)
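The three cases can be verified directly from Bayes' theorem; the prior and likelihoods below are arbitrary illustrative values:

```python
# The posterior moves up, down, or not at all depending on whether
# P(E | H) is greater than, less than, or equal to P(E | not H).

def posterior(prior, p_e_h, p_e_not_h):
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

prior = 0.4
assert posterior(prior, 0.8, 0.2) > prior                  # confirming
assert posterior(prior, 0.2, 0.8) < prior                  # disconfirming
assert abs(posterior(prior, 0.5, 0.5) - prior) < 1e-12     # neutral
```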
Beta-Binomial Example
A classic model for tracking binary outcomes:
We want to estimate the probability θ that a coin lands heads.
Prior: Beta distribution with parameters (α, β)
- α represents "prior heads"
- β represents "prior tails"
- Mean = α / (α + β)
After observing h heads and t tails:
Posterior: Beta distribution with parameters (α + h, β + t)
- Just add the counts!
This is called a conjugate prior—the posterior has the same form as the prior, making updates simple.
Example: Estimating Click-Through Rate
Starting belief: α=1, β=1 (uniform prior, no preference)
After 10 clicks and 90 non-clicks:
- Posterior: Beta(11, 91)
- Mean estimate: 11/102 ≈ 10.8%
After 100 more clicks and 900 more non-clicks:
- Posterior: Beta(111, 991)
- Mean estimate: 111/1102 ≈ 10.1%
Our uncertainty decreases as we see more data.
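The click-through-rate numbers above can be reproduced with a small conjugate-update sketch (the `BetaBelief` class name is ours, not a library API):

```python
# Conjugate Beta-Binomial updating, reproducing the CTR numbers above.

class BetaBelief:
    """A Beta(alpha, beta) belief over a binary success probability."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, successes, failures):
        # Conjugacy: just add the observed counts to the parameters.
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)

ctr = BetaBelief(1, 1)            # uniform prior: no preference
ctr.update(10, 90)
print(f"{ctr.mean:.3f}")          # 11/102 ≈ 0.108
ctr.update(100, 900)
print(f"{ctr.mean:.3f}")          # 111/1102 ≈ 0.101
```

Note that the second update could equally have been a thousand single-observation updates; conjugacy makes the result identical either way.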
Handling Contradictory Evidence
What happens when evidence points in different directions?
Scenario: First 5 coin flips are heads, next 5 are tails.
The updates partially cancel out:
- After heads: Belief shifts toward "biased to heads"
- After tails: Belief shifts back toward "fair" or "biased to tails"
- Final: Similar to starting point, but with more confidence it's somewhere near fair
This is appropriate—contradictory evidence suggests neither extreme hypothesis is correct.
Forgetting and Recency
Sometimes older evidence should count less:
Stationary world: All evidence equally relevant
- Standard Bayesian updating works fine
Non-stationary world: Patterns change over time
- Recent evidence should matter more
- Use "exponential forgetting" or "sliding windows"
Example: User Preferences
A user's music preferences 5 years ago may not predict today's preferences. Modern recommendation systems discount old interactions.
Practical Considerations
Computational Efficiency
Storing full distributions is expensive. Common approximations:
- Keep only mean and variance (moment matching)
- Use particle filters (samples from distribution)
- Variational inference (approximate with simpler distribution)
Numerical Stability
Multiplying many small probabilities causes underflow. Solutions:
- Work in log space: log P(H | E) = log P(E | H) + log P(H) - log P(E)
- Normalize periodically
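The underflow problem and the log-space fix are easy to demonstrate. The scenario below (1000 observations, each with likelihood 10⁻³) is an illustrative extreme:

```python
# Log-space updating avoids underflow when multiplying many likelihoods.
# Scenario: 1000 observations, each with likelihood 1e-3 (illustrative).

import math

log_prior = math.log(0.5)
log_likelihoods = [math.log(1e-3)] * 1000

# Unnormalized log-posterior: sum the logs instead of multiplying.
log_post = log_prior + sum(log_likelihoods)
print(log_post)                          # ≈ -6908.4, perfectly representable

# The direct product underflows to exactly 0.0 in double precision:
direct = 0.5
for _ in range(1000):
    direct *= 1e-3
print(direct)                            # 0.0
```

Whenever a normalized probability is actually needed, the log-sum-exp trick can recover it from the log-space values without reintroducing underflow.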
Summary
- Bayesian updating is a continuous process, not a one-time calculation
- Each posterior becomes the prior for the next update
- Online learning updates beliefs as data streams in
- Strong evidence and weak priors lead to larger updates
- Evidence can confirm, disconfirm, or be neutral
- Conjugate priors (like Beta-Binomial) make updates mathematically elegant
- Non-stationary environments may require forgetting old evidence
Next, we'll see how Bayes' theorem powers real AI applications.

