Updating Beliefs with Evidence
Bayes' theorem isn't just a one-time calculation—it's a process. As new evidence arrives, we continuously update our beliefs. This sequential updating is fundamental to how AI systems learn and adapt.
The Bayesian Update Process
- Start with a prior belief
- Observe new evidence
- Calculate the posterior using Bayes' theorem
- The posterior becomes the new prior
- Repeat when more evidence arrives
Prior₁ → Evidence₁ → Posterior₁ = Prior₂ → Evidence₂ → Posterior₂ → ...
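The loop above can be sketched in a few lines of Python. The two coin hypotheses and their head probabilities are illustrative assumptions, not part of the text:

```python
# Minimal sketch of the prior -> posterior -> new prior loop for a
# discrete set of hypotheses. Names and numbers are illustrative.

def update(prior, likelihood):
    """One Bayesian update: posterior ∝ likelihood × prior, then normalize."""
    unnormalized = {h: likelihood[h] * p for h, p in prior.items()}
    z = sum(unnormalized.values())          # P(E), the normalizing constant
    return {h: u / z for h, u in unnormalized.items()}

# Two hypotheses about a coin: fair, or biased toward heads (P(heads) = 0.8).
belief = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}

for flip in ["H", "H", "T"]:                # evidence arrives one flip at a time
    lik = p_heads if flip == "H" else {h: 1 - p for h, p in p_heads.items()}
    belief = update(belief, lik)            # the posterior becomes the new prior

print(belief)
```

Each pass through the loop is one application of Bayes' theorem; nothing about the function changes between updates, only the prior it is fed.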
Sequential Updates Example
Let's track how our belief about a coin evolves as we flip it.
Initial belief (prior): The coin is fair, P(Heads) = 0.5
Observation 1: First flip is Heads
We don't actually need fancy math here—one head doesn't tell us much. But let's say we're slightly more inclined to think the coin might favor heads:
- P(biased toward heads | 1 head) slightly increases
Observation 2: Second flip is also Heads
Two heads in a row. Now we're a bit more suspicious:
- P(biased toward heads | 2 heads) increases more
Observations 3-10: Eight more heads!
After 10 consecutive heads:
- P(biased toward heads | 10 heads) is now very high
- A fair coin would do this only 1/1024 ≈ 0.1% of the time
Each observation nudged our belief, and cumulatively, we've dramatically updated our view.
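To make the 10-heads case concrete, here is the arithmetic with explicit numbers. The "biased" hypothesis (P(heads) = 0.9) and the 50/50 prior are assumptions chosen for illustration:

```python
# Posterior that the coin is biased after 10 straight heads, assuming a
# single alternative hypothesis with P(heads) = 0.9 and a 50/50 prior.

p_fair = 0.5 ** 10            # likelihood of 10 heads from a fair coin = 1/1024
p_biased = 0.9 ** 10          # likelihood of 10 heads from the biased coin
prior_biased = 0.5

posterior_biased = (p_biased * prior_biased) / (
    p_biased * prior_biased + p_fair * (1 - prior_biased)
)
print(f"P(10 heads | fair)   = {p_fair:.5f}")   # ≈ 0.00098, the ~0.1% above
print(f"P(biased | 10 heads) = {posterior_biased:.3f}")
```

Under these assumed numbers the posterior on "biased" ends up above 0.99, matching the qualitative claim that ten consecutive heads is dramatic evidence.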
Mathematical Formulation
For sequential evidence E₁, E₂, E₃, ..., we update:
P(H | E₁) ∝ P(E₁ | H) × P(H)
P(H | E₁, E₂) ∝ P(E₂ | H) × P(H | E₁)
P(H | E₁, E₂, E₃) ∝ P(E₃ | H) × P(H | E₁, E₂)
Each posterior becomes the prior for the next update.
If evidence is conditionally independent given H:
P(H | E₁, E₂, ..., Eₙ) ∝ P(H) × P(E₁ | H) × P(E₂ | H) × ... × P(Eₙ | H)
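The equivalence between step-by-step updating and one batch update with the product of likelihoods can be checked numerically. The coin hypotheses and flip sequence below are illustrative:

```python
# Check that, for conditionally independent evidence, a single batch update
# with the product of likelihoods matches sequential updating.

import math

prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}
flips = ["H", "H", "T", "H"]

def lik(h, flip):
    return p_heads[h] if flip == "H" else 1 - p_heads[h]

# Sequential: the posterior after each flip becomes the next prior.
seq = dict(prior)
for flip in flips:
    unnorm = {h: lik(h, flip) * p for h, p in seq.items()}
    z = sum(unnorm.values())
    seq = {h: u / z for h, u in unnorm.items()}

# Batch: multiply all likelihoods at once, normalize once.
unnorm = {h: prior[h] * math.prod(lik(h, f) for f in flips) for h in prior}
z = sum(unnorm.values())
batch = {h: u / z for h, u in unnorm.items()}

assert all(abs(seq[h] - batch[h]) < 1e-12 for h in prior)
```

This equivalence is exactly why the order of conditionally independent observations does not matter.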
Online Learning
This sequential updating is called online learning—the model updates its beliefs as data arrives, rather than waiting for all data.
Advantages of Online Learning:
- Can adapt to new patterns
- Memory efficient (don't need all historical data)
- Natural for streaming data
- Matches how humans learn
Examples in AI:
- Spam filters updating with each email you mark
- Recommendation systems learning from each click
- Ad systems adjusting bids in real-time
- Fraud detection updating with each transaction
Belief Trajectories
We can visualize how beliefs evolve over time:
Probability of Hypothesis A
1.0 |                            *****
    |                      ******
0.8 |                 *****
    |           ******
0.6 |       ****
    |    ***
0.4 |****
    |
0.2 |
    |
0.0 +---+---+---+---+---+---+---+---+---+---+
    0   1   2   3   4   5   6   7   8   9  10
                 Evidence Count
With consistent evidence favoring A, belief in A grows over time.
Evidence Strength Matters
Not all evidence is created equal. Strong evidence causes larger updates:
Weak evidence: Consistent with multiple hypotheses
P(H | weak evidence) ≈ P(H) // Small update
Strong evidence: Much more likely under one hypothesis
P(H | strong evidence) significantly differs from P(H)
Example: Is It Raining?
Weak evidence: Someone carrying an umbrella
- Many people carry umbrellas on cloudy-but-dry days
- Small update toward "raining"
Strong evidence: Everyone on the street is wet
- Very unlikely unless it's raining
- Large update toward "raining"
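Putting illustrative numbers on the rain example shows how the likelihood ratio P(E | H) / P(E | not H) controls the update size; all probabilities below are assumptions for the sake of the sketch:

```python
# The likelihood ratio P(E | raining) / P(E | not raining) determines how
# far the posterior moves. All numbers here are illustrative assumptions.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) via Bayes' theorem for a binary hypothesis."""
    num = p_e_given_h * prior
    return num / (num + p_e_given_not_h * (1 - prior))

prior_rain = 0.3

# Weak evidence: umbrellas are common even on dry days (ratio ≈ 2).
weak = posterior(prior_rain, 0.6, 0.3)
# Strong evidence: wet pedestrians are rare unless it rains (ratio = 45).
strong = posterior(prior_rain, 0.9, 0.02)

print(f"after seeing an umbrella:  {weak:.2f}")    # modest shift from 0.30
print(f"after seeing a wet street: {strong:.2f}")  # near-certain
```

With these assumed numbers, the umbrella nudges the belief from 0.30 to about 0.46, while the wet street pushes it above 0.95.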
Prior Strength
Your prior also affects how much you update:
Strong prior (high confidence): Requires more evidence to change
"I'm almost certain the earth is round."
One person saying it's flat barely changes this belief.
Weak prior (uncertain): Changes easily with evidence
"I have no idea if this coin is fair."
A few flips significantly update my belief.
Mathematical Interpretation
A strong prior can be thought of as equivalent to having already seen a lot of evidence. If your prior represents "100 coin flips worth" of information, one more flip doesn't change much.
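The pseudo-count view can be demonstrated with simple arithmetic (this anticipates the Beta-Binomial model below; the specific counts are illustrative):

```python
# A strong prior behaves like pseudo-observations already in the tally.
# The prior is encoded as (pseudo_heads, pseudo_tails); numbers are illustrative.

def mean_after(pseudo_heads, pseudo_tails, heads, tails):
    """Point estimate of P(heads) after folding observations into pseudo-counts."""
    return (pseudo_heads + heads) / (pseudo_heads + pseudo_tails + heads + tails)

# Weak prior ("one flip each way"): a single head moves the estimate a lot.
print(mean_after(1, 1, 1, 0))    # 2/3 ≈ 0.67

# Strong prior ("100 flips worth"): the same single head barely moves it.
print(mean_after(50, 50, 1, 0))  # 51/101 ≈ 0.50
```

The same observation shifts the weak-prior estimate from 0.50 to 0.67, but the strong-prior estimate only to about 0.505.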
Confirmation and Disconfirmation
Evidence can support or oppose a hypothesis:
Confirming evidence: P(E | H) > P(E | not H)
- Posterior increases
Disconfirming evidence: P(E | H) < P(E | not H)
- Posterior decreases
Neutral evidence: P(E | H) = P(E | not H)
- Posterior unchanged (evidence is uninformative)
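The three cases can be verified directly from Bayes' theorem; the prior and likelihoods below are arbitrary illustrative values:

```python
# The posterior moves up, down, or not at all depending on whether
# P(E | H) is greater than, less than, or equal to P(E | not H).

def posterior(prior, p_e_h, p_e_not_h):
    return p_e_h * prior / (p_e_h * prior + p_e_not_h * (1 - prior))

prior = 0.4
assert posterior(prior, 0.8, 0.2) > prior                  # confirming
assert posterior(prior, 0.2, 0.8) < prior                  # disconfirming
assert abs(posterior(prior, 0.5, 0.5) - prior) < 1e-12     # neutral
```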
Beta-Binomial Example
A classic model for tracking binary outcomes:
We want to estimate the probability θ that a coin lands heads.
Prior: Beta distribution with parameters (α, β)
- α represents "prior heads"
- β represents "prior tails"
- Mean = α / (α + β)
After observing h heads and t tails:
Posterior: Beta distribution with parameters (α + h, β + t)
- Just add the counts!
This is called a conjugate prior—the posterior has the same form as the prior, making updates simple.
Example: Estimating Click-Through Rate
Starting belief: α=1, β=1 (uniform prior, no preference)
After 10 clicks and 90 non-clicks:
- Posterior: Beta(11, 91)
- Mean estimate: 11/102 ≈ 10.8%
After 100 more clicks and 900 more non-clicks:
- Posterior: Beta(111, 991)
- Mean estimate: 111/1102 ≈ 10.1%
Our uncertainty decreases as we see more data.
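The click-through-rate numbers above can be reproduced with a small conjugate-update sketch (the `BetaBelief` class name is ours, not a library API):

```python
# Conjugate Beta-Binomial updating, reproducing the CTR numbers above.

class BetaBelief:
    """A Beta(alpha, beta) belief over a binary success probability."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, successes, failures):
        # Conjugacy: just add the observed counts to the parameters.
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)

ctr = BetaBelief(1, 1)            # uniform prior: no preference
ctr.update(10, 90)
print(f"{ctr.mean:.3f}")          # 11/102 ≈ 0.108
ctr.update(100, 900)
print(f"{ctr.mean:.3f}")          # 111/1102 ≈ 0.101
```

Note that the second update could equally have been a thousand single-observation updates; conjugacy makes the result identical either way.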
Handling Contradictory Evidence
What happens when evidence points in different directions?
Scenario: First 5 coin flips are heads, next 5 are tails.
The updates partially cancel out:
- After heads: Belief shifts toward "biased to heads"
- After tails: Belief shifts back toward "fair" or "biased to tails"
- Final: Similar to starting point, but with more confidence it's somewhere near fair
This is appropriate—contradictory evidence suggests neither extreme hypothesis is correct.
Forgetting and Recency
Sometimes older evidence should count less:
Stationary world: All evidence equally relevant
- Standard Bayesian updating works fine
Non-stationary world: Patterns change over time
- Recent evidence should matter more
- Use "exponential forgetting" or "sliding windows"
Example: User Preferences
A user's music preferences 5 years ago may not predict today's preferences. Modern recommendation systems discount old interactions.
Practical Considerations
Computational Efficiency
Storing full distributions is expensive. Common approximations:
- Keep only mean and variance (moment matching)
- Use particle filters (samples from distribution)
- Variational inference (approximate with simpler distribution)
Numerical Stability
Multiplying many small probabilities causes underflow. Solutions:
- Work in log space: log P(H | E) = log P(E | H) + log P(H) - log P(E)
- Normalize periodically
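The underflow problem and the log-space fix are easy to demonstrate. The scenario below (1000 observations, each with likelihood 10⁻³) is an illustrative extreme:

```python
# Log-space updating avoids underflow when multiplying many likelihoods.
# Scenario: 1000 observations, each with likelihood 1e-3 (illustrative).

import math

log_prior = math.log(0.5)
log_likelihoods = [math.log(1e-3)] * 1000

# Unnormalized log-posterior: sum the logs instead of multiplying.
log_post = log_prior + sum(log_likelihoods)
print(log_post)                          # ≈ -6908.4, perfectly representable

# The direct product underflows to exactly 0.0 in double precision:
direct = 0.5
for _ in range(1000):
    direct *= 1e-3
print(direct)                            # 0.0
```

Whenever a normalized probability is actually needed, the log-sum-exp trick can recover it from the log-space values without reintroducing underflow.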
Summary
- Bayesian updating is a continuous process, not a one-time calculation
- Each posterior becomes the prior for the next update
- Online learning updates beliefs as data streams in
- Strong evidence and weak priors lead to larger updates
- Evidence can confirm, disconfirm, or be neutral
- Conjugate priors (like Beta-Binomial) make updates mathematically elegant
- Non-stationary environments may require forgetting old evidence
Next, we'll see how Bayes' theorem powers real AI applications.

