Probability

Bayesian Inference

Bayesian inference updates a prior distribution over unknowns into a posterior by multiplying by the likelihood and normalizing.

status: published | importance: important | difficulty: 3/5 | math: undergraduate | read: 17m | live demo
Editorial probability illustration of prior and likelihood curves combining into a sharper posterior belief.

01 Intuition

Build the mental picture first so the rest of the page has something to attach to.

Maximum likelihood asks: which parameter makes the data look most probable?

Bayesian inference asks a different question: after seeing the data, what should we believe about each possible parameter?

That difference matters when data is scarce or uncertainty matters. The likelihood is a score of parameters by data fit. A prior is what you believed before the data. A posterior is the updated distribution after both forces are combined.

For a coin with unknown head probability $\theta$, maximum likelihood might pick one number such as $\hat\theta=0.8$. Bayesian inference keeps a whole distribution over plausible $\theta$ values. With little data, the prior can still matter. With lots of data, and with a prior that gives nonzero density near the data-favored values, the likelihood usually concentrates the posterior near the parameters that explain the observations.

The mental model is not "Bayes is MLE plus vibes." It is a different object: MLE returns an estimate, while Bayesian inference returns a distribution over the unknown quantity.
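A minimal sketch of that contrast, under stated assumptions: the Beta(2, 2) prior, the two datasets, and the helper name `posterior_on_grid` are all illustrative choices, not values from this page. Both datasets give the same MLE, yet the posterior puts very different amounts of mass near it.

```python
import numpy as np

def posterior_on_grid(heads, tails, prior_a=2.0, prior_b=2.0, m=2000):
    """Normalized posterior weights over a midpoint grid on (0, 1).

    The Beta(prior_a, prior_b) prior is an assumption made for illustration.
    """
    grid = (np.arange(m) + 0.5) / m
    # Log of prior shape times likelihood shape, up to a constant.
    log_post = (prior_a - 1 + heads) * np.log(grid) + (prior_b - 1 + tails) * np.log(1 - grid)
    weights = np.exp(log_post - log_post.max())
    return grid, weights / weights.sum()

mass = {}
# Two hypothetical datasets with the same observed fraction of heads.
for name, (heads, tails) in {"small": (4, 1), "large": (80, 20)}.items():
    grid, w = posterior_on_grid(heads, tails)
    mle = heads / (heads + tails)
    # Posterior probability that theta lies within 0.1 of the MLE.
    near = float(w[(grid > 0.7) & (grid < 0.9)].sum())
    mass[name] = near
    print(f"{name}: MLE={mle:.2f}  posterior mass in (0.7, 0.9)={near:.3f}")
```

The MLE cannot distinguish the two cases; the posterior can, because it carries the remaining uncertainty along with the location.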

02 Math

Translate the story into symbols, assumptions, and a derivation you can inspect.

Let $\theta\in\Theta$ be an unknown parameter and let $D$ be observed data. Bayesian inference starts with a prior distribution or density $p(\theta)$ and a likelihood $p(D\mid\theta)$. Bayes' rule gives the posterior:

$$p(\theta\mid D)=\frac{p(D\mid\theta)p(\theta)}{p(D)}.$$

For a continuous parameter, the denominator

$$p(D)=\int p(D\mid\theta)p(\theta)\,d\theta$$

is the evidence or marginal likelihood. Its job is to normalize the posterior so it integrates to $1$.

If $\theta$ lives in a discrete set, replace the integral with a sum. The job is the same: add up prior-weighted likelihood over all possible parameter values.
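As a rough illustration of the discrete case, the sketch below assumes a parameter set of only three candidate head probabilities. The candidate values, prior weights, and counts are hypothetical and chosen only to make the sum visible.

```python
import numpy as np

# Hypothetical discrete parameter set: three candidate head probabilities.
thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([0.25, 0.50, 0.25])   # assumed prior weights, summing to 1

heads, tails = 8, 2                     # assumed observations

# Likelihood of the observed sequence under each candidate theta.
likelihood = thetas**heads * (1 - thetas)**tails

# Evidence: the sum that replaces the integral in the discrete case.
evidence = np.sum(likelihood * prior)

# Bayes' rule, applied elementwise.
posterior = likelihood * prior / evidence

for t, p in zip(thetas, posterior):
    print(f"theta={t:.1f}  posterior={p:.4f}")
```

Dividing by `evidence` is the only step that makes the three numbers sum to one; the ranking of candidates is already fixed by `likelihood * prior`.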

For parameter comparison, the key proportional form is

$$p(\theta\mid D)\propto p(D\mid\theta)p(\theta).$$

This is the central contrast with maximum likelihood. The likelihood $p(D\mid\theta)$ is a function of $\theta$, but it is not automatically a probability distribution over $\theta$. Multiplying by a prior and normalizing turns it into the posterior distribution.

For a coin with unknown head probability $\theta\in[0,1]$, suppose the data has $h$ heads and $t$ tails, with $n=h+t$. The likelihood shape is

$$p(D\mid\theta)\propto \theta^h(1-\theta)^t.$$

If $D$ records only the count of heads, the full binomial likelihood also has a factor $\binom{n}{h}$. That factor does not depend on $\theta$, so it disappears in proportional calculations.
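The cancellation can be checked numerically. This sketch assumes a uniform prior over a grid of $\theta$ values (an illustrative choice) and normalizes the posterior twice, once with the binomial coefficient and once without.

```python
import math
import numpy as np

heads, tails = 8, 2            # assumed observations
n = heads + tails

# Grid of candidate theta values, kept away from the endpoints 0 and 1.
grid = np.linspace(0.005, 0.995, 199)

# Assumed prior for illustration: equal weight on every grid point.
prior = np.full_like(grid, 1.0 / grid.size)

kernel = grid**heads * (1 - grid)**tails      # likelihood shape only
binomial = math.comb(n, heads) * kernel       # count likelihood with the coefficient

def normalize(weights):
    """Rescale nonnegative weights so they sum to 1."""
    return weights / weights.sum()

post_kernel = normalize(kernel * prior)
post_binomial = normalize(binomial * prior)

print("max abs difference:", float(np.max(np.abs(post_kernel - post_binomial))))
```

Any constant multiplier on the likelihood appears in both the numerator and the normalizer, so the two posteriors agree up to floating-point rounding.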

If the prior is a beta distribution with $\alpha,\beta>0$,

$$\theta\sim\operatorname{Beta}(\alpha,\beta),$$

then

$$p(\theta)\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}.$$

Multiplying prior and likelihood gives

$$p(\theta\mid D)\propto \theta^{\alpha+h-1}(1-\theta)^{\beta+t-1},$$

so the posterior is

$$\theta\mid D\sim\operatorname{Beta}(\alpha+h,\beta+t).$$
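One way to sanity-check the conjugate result is to normalize the product of prior and likelihood numerically and compare it with the closed-form density. The sketch below assumes $\alpha=\beta=2$ and 8 heads, 2 tails; the helper `beta_pdf` and the midpoint grid are illustrative choices.

```python
import math
import numpy as np

def beta_pdf(theta, a, b):
    """Beta(a, b) density on an array of theta values strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

alpha, beta = 2.0, 2.0       # assumed prior parameters
heads, tails = 8, 2           # assumed data

# Midpoint grid over (0, 1), so no point touches the endpoints.
m = 2000
width = 1.0 / m
grid = (np.arange(m) + 0.5) * width

# Multiply prior density by the likelihood shape, then normalize numerically.
unnormalized = beta_pdf(grid, alpha, beta) * grid**heads * (1 - grid)**tails
numeric_posterior = unnormalized / (unnormalized.sum() * width)

# Closed-form conjugate answer: Beta(alpha + h, beta + t).
closed_form = beta_pdf(grid, alpha + heads, beta + tails)

max_gap = float(np.max(np.abs(numeric_posterior - closed_form)))
print("largest gap between numeric and closed-form density:", max_gap)
```

The numeric normalizer here is a midpoint-rule estimate of the integral that defines the evidence, so the small remaining gap is quadrature error rather than a disagreement with the formula.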

The maximum-likelihood estimate is

$$\hat\theta_{\mathrm{MLE}}=\frac{h}{h+t}$$

when $h+t>0$. The posterior mean is

$$\mathbb E[\theta\mid D]=\frac{\alpha+h}{\alpha+\beta+h+t}.$$

When $n>0$, the posterior mean can also be written as

$$\mathbb E[\theta\mid D]=\frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta}+\frac{n}{\alpha+\beta+n}\cdot\frac{h}{n}.$$

This is the clean bridge to maximum likelihood: the posterior mean is a weighted compromise between the prior mean and the MLE. The posterior distribution is the richer object; the mean is only one summary of it.

It is safe to think of $\alpha+\beta$ as prior strength for this averaging formula. The density exponents are $\alpha-1$ and $\beta-1$, so pseudo-count language is only a mnemonic. The exact conjugate update is the parameter update $\operatorname{Beta}(\alpha,\beta)\to\operatorname{Beta}(\alpha+h,\beta+t)$.
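The weighted form can be checked against the direct formula with a few lines of arithmetic. This sketch assumes a Beta(2, 2) prior and four hypothetical datasets that all have 80% heads but differ in size.

```python
alpha, beta = 2.0, 2.0            # assumed prior parameters
prior_mean = alpha / (alpha + beta)
strength = alpha + beta           # prior strength in the averaging formula

rows = []
# Hypothetical datasets with the same head fraction and growing n.
for heads, tails in [(4, 1), (8, 2), (80, 20), (800, 200)]:
    n = heads + tails
    mle = heads / n
    w_prior = strength / (strength + n)    # weight on the prior mean
    w_data = n / (strength + n)            # weight on the MLE
    blended = w_prior * prior_mean + w_data * mle
    direct = (alpha + heads) / (alpha + beta + n)
    rows.append((n, w_prior, blended, direct))
    print(f"n={n:4d}  prior weight={w_prior:.3f}  blended={blended:.4f}  direct={direct:.4f}")
```

The two columns agree for every row, and the prior weight shrinks as $n$ grows, which is the sense in which data eventually dominates the posterior mean.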

03 Code

Keep the implementation aligned with the notation so the algorithm is legible.

import math
import numpy as np

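def log_beta_fn(a, b):
    """Log of the beta function B(a, b), computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at a scalar theta; -inf outside (0, 1)."""
    if theta <= 0 or theta >= 1:
        return -math.inf
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta_fn(a, b)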

alpha, beta = 2.0, 2.0
heads, tails = 8, 2

post_alpha = alpha + heads
post_beta = beta + tails
n = heads + tails

mle = None if n == 0 else heads / n
prior_mean = alpha / (alpha + beta)
posterior_mean = post_alpha / (post_alpha + post_beta)

print("MLE:", "undefined" if mle is None else round(mle, 3))
print("prior mean:", round(prior_mean, 3))
print("posterior mean:", round(posterior_mean, 3))
print("posterior:", f"Beta({post_alpha:.1f}, {post_beta:.1f})")

# Grid approximation to show prior * likelihood -> posterior shape.
grid = np.linspace(0.01, 0.99, 99)
log_prior = np.array([log_beta_pdf(theta, alpha, beta) for theta in grid])
log_likelihood = heads * np.log(grid) + tails * np.log(1 - grid)
log_unnormalized_posterior = log_prior + log_likelihood

# Normalize on the grid for inspection.
weights = np.exp(log_unnormalized_posterior - log_unnormalized_posterior.max())
weights = weights / weights.sum()
grid_mean = float(np.sum(grid * weights))

print("grid posterior mean approx:", round(grid_mean, 3))

The code mirrors the math: the prior is a density over $\theta$, the likelihood scores the observed heads and tails for each $\theta$, and the posterior beta parameters add the observed counts to the prior parameters.

04 Interactive Demo

Use direct manipulation to connect the explanation to a moving system.

Use the sliders and presets to compare prior strength against observed data. The prior and posterior curves are densities over $\theta$. The likelihood curve is normalized only for display so its shape can be compared on the same plot.

Try the "strong prior, little data" preset, then the "data wins" preset. The MLE only follows the observed fraction. The posterior mean moves between prior belief and data fit, and the full posterior curve shows how much uncertainty remains.
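For readers without access to the demo, the two regimes can be approximated offline. The settings below are hypothetical stand-ins; the demo's real preset values are not given on this page, and `summarize` is an illustrative helper built on the Beta-Bernoulli update from the Math section.

```python
import math

def summarize(alpha, beta, heads, tails):
    """Return (MLE, posterior mean, posterior sd) for a Beta-Bernoulli update."""
    a, b = alpha + heads, beta + tails
    mle = heads / (heads + tails)
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mle, mean, sd

# Hypothetical settings: (alpha, beta, heads, tails). Not the demo's actual presets.
scenarios = {
    "strong prior, little data": (50.0, 50.0, 4, 1),
    "data wins": (2.0, 2.0, 80, 20),
}

results = {name: summarize(*args) for name, args in scenarios.items()}
for name, (mle, mean, sd) in results.items():
    print(f"{name}: MLE={mle:.2f}  posterior mean={mean:.3f}  posterior sd={sd:.3f}")
```

In both scenarios the MLE equals the observed fraction of heads. Under the strong prior the posterior mean stays near the prior mean of 0.5, while in the second scenario it sits close to the MLE.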

Source Grounding

Canonical references for the mechanism on this page.

Deisenroth, Faisal, and Ong. Mathematics for Machine Learning. Book, 2020.

Grounds the probability notation and Bayes-rule prerequisites needed for the page.

Murphy. Probabilistic Machine Learning: An Introduction. Book, 2022.

Grounds Bayesian inference as posterior updating under probabilistic models.

Claim Review

Bayesian inference updates a prior distribution over unknowns into a posterior by multiplying by the likelihood and normalizing.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references

deisenroth-2020-mml, murphy-2022-probabilistic-ml

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Bayesian inference updates a prior distribution over unknown parameters by multiplying it with the likelihood of observed data and normalizing by evidence, producing a posterior distribution rather than a single maximum-likelihood estimate.

Claim metadata: source checked

Deisenroth et al. ground Bayes' rule as posterior = likelihood times prior divided by evidence and define evidence as the posterior normalizer; Murphy grounds Bayesian parameter inference as updating p(theta) with p(D|theta) and normalizing by marginal likelihood, while contrasting this with MLE as a point estimate.

Sources: Mathematics for Machine Learning; Probabilistic Machine Learning: An Introduction

Checks posterior updating and MLE contrast only; not approximate inference, hierarchy, asymptotics, posterior predictive decisions, or universal prior-strength advice. Code/demo are beta-Bernoulli witnesses; omitted binomial coefficients are theta-independent.

A bounded review summary is present; still check caveats and exact source scope.

MML supports Bayes-rule notation with prior, likelihood, evidence, and posterior labels, defines evidence/marginal likelihood as the normalizer, and notes that the full posterior retains more information than focusing on a maximum/statistic. Murphy directly supports parameter-level Bayesian updating as p(theta|D) proportional to p(theta)p(D|theta) normalized by p(D), contrasts this with MLE as a single likelihood-maximizing parameter estimate, and gives the beta-Bernoulli/binomial posterior update used by the page.

Reviewer: codex+oracle; reviewed 2026-05-07
