Generative Models

Variational Autoencoders

A latent-variable model trained by maximizing an evidence lower bound; the gap is KL(q_phi(z|x) || p_theta(z|x)), so the encoder is learned inference rather than just compression.

status: published; importance: important; difficulty: 4/5; math: undergraduate; read: 20m; live demo

Intuition

Maximum likelihood asks for a model that assigns high log probability to the observed data:

$$\log p_\theta(x).$$

A variational autoencoder is a maximum-likelihood model with a hidden cause. It imagines that each observation $x$ was produced from an unobserved latent variable $z$:

$$\begin{aligned} z &\sim p(z), \\ x &\sim p_\theta(x\mid z). \end{aligned}$$

Training would be easy if we could compute the marginal likelihood

$$p_\theta(x) = \int p_\theta(x,z)\,dz.$$

The problem is that this integral is usually hard for neural decoders. Bayesian inference tells us what we would like to use:

$$p_\theta(z\mid x).$$

But that posterior contains the same hard evidence term $p_\theta(x)$. The VAE move is to learn an encoder

$$q_\phi(z\mid x)$$

that acts as a tractable approximate posterior.
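To make the difficulty concrete, the sketch below estimates the marginal likelihood by brute force in a scalar linear-Gaussian model where the exact answer happens to be known. The model, its parameter values, and the helper names are illustrative assumptions rather than part of any VAE library. Sampling z from the prior is affordable here only because z is one-dimensional; most prior draws explain x poorly, which is the waste a learned approximate posterior is meant to avoid.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(w*z + b, var_x).
w, b, var_x = 1.2, -0.1, 0.09
x = 1.5

# Exact evidence, available only because the model is linear-Gaussian.
exact_log_px = math.log(normal_pdf(x, b, w * w + var_x))

def prior_mc_log_px(num_samples, seed=0):
    """Estimate log p(x) by averaging p(x | z) over draws z ~ p(z)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, w * z + b, var_x)
    return math.log(total / num_samples)

estimate = prior_mc_log_px(50_000)
print(round(exact_log_px, 3), round(estimate, 3))
```

With a neural decoder there is no closed form to compare against, and in higher latent dimensions the fraction of useful prior samples shrinks quickly, so this estimator stops being practical.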

The central mechanism is not "an autoencoder with noise." It is: replace the intractable posterior with a learned distribution $q_\phi(z\mid x)$, then optimize a lower bound whose looseness is exactly a KL divergence from $q_\phi$ to the true posterior.

That is why VAEs sit at the intersection of maximum likelihood, Bayesian inference, and KL divergence.

Two natural next moves branch from this page. Normalizing flows ask what changes when the latent transformation is invertible and likelihood stays exact. Diffusion asks what changes when generation is learned as a gradual denoising process rather than as one sampled latent code.

Math

Latent variable likelihood

For one observed datapoint $x$, assume a prior $p(z)$ and a decoder likelihood $p_\theta(x\mid z)$. The latent-variable model, marginal likelihood, and true posterior are

$$\begin{aligned} p_\theta(x,z)&=p(z)\,p_\theta(x\mid z),\\ p_\theta(x)&=\int p_\theta(x,z)\,dz,\\ p_\theta(z\mid x)&=\frac{p_\theta(x,z)}{p_\theta(x)}. \end{aligned}$$

The evidence $p_\theta(x)$ appears in the denominator, so exact posterior inference and exact likelihood training are tied to the same difficult integral. The claim-bearing VAE move is summarized by

$$\begin{aligned} \log p_\theta(x) &= \mathcal L(\theta,\phi;x) + \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right),\\ \mathcal L(\theta,\phi;x) &= \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right),\\ \epsilon&\sim\mathcal N(0,I), \qquad z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon. \end{aligned}$$

The first line says the ELBO gap is exactly the KL from the learned encoder $q_\phi(z\mid x)$ to the true posterior. The second line separates reconstruction likelihood from prior pressure. The third line keeps randomness in external noise $\epsilon$, so gradients of sampled reconstruction terms can flow through $z$ into the encoder outputs.

The ELBO identity

Let $q_\phi(z\mid x)$ be any approximate posterior whose support is compatible with the true posterior. To keep the equations readable, write $q(z)$ for $q_\phi(z\mid x)$ and $\mathbb E_q$ for expectation under that distribution:

$$\mathbb E_q[\cdot] \equiv \mathbb E_{q(z)}[\cdot].$$

Using Bayes' rule and rearranging gives the exact identity

$$\log p_\theta(x)=\mathcal L+G.$$

Here $\mathcal L$ is the ELBO and $G$ is the inference gap. Let $p_*(z)$ denote the true posterior $p_\theta(z\mid x)$. Then the gap is

$$G=\mathrm{KL}(q(z)\,\|\,p_*(z)).$$

Equivalently,

$$G=\mathbb E_q[\log q(z)-\log p_*(z)].$$

The ELBO term is

$$\mathcal L = \mathbb E_q[\log p_\theta(x,z)-\log q(z)].$$

The gap can also be read by subtraction:

$$G=\log p_\theta(x)-\mathcal L.$$

KL is nonnegative, so

$$\mathcal L(\theta,\phi;x)\le \log p_\theta(x).$$

Maximizing the ELBO improves this lower bound. The bound is tight exactly when the encoder distribution equals the true posterior, up to probability-zero events.
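The rearrangement behind the identity fits in a few lines. The following is a standard derivation written in the shorthand of this section ($q(z)$ for the encoder distribution, $p_*(z)$ for the true posterior). It uses only two facts: $\log p_\theta(x)$ does not depend on $z$, and Bayes' rule gives $p_\theta(x)=p_\theta(x,z)/p_*(z)$.

```latex
\begin{aligned}
\log p_\theta(x)
&= \mathbb E_q[\log p_\theta(x)] \\
&= \mathbb E_q\!\left[\log \frac{p_\theta(x,z)}{p_*(z)}\right] \\
&= \mathbb E_q[\log p_\theta(x,z) - \log q(z)]
 + \mathbb E_q[\log q(z) - \log p_*(z)] \\
&= \mathcal L + G.
\end{aligned}
```

The third line adds and subtracts $\log q(z)$ inside the expectation, which is where the support condition on $q$ is needed.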

Reconstruction likelihood plus prior KL

Because

$$\log p_\theta(x,z) = \log p_\theta(x\mid z)+\log p(z),$$

the ELBO can be rewritten as

$$\mathcal L = \mathbb E_q[\log p_\theta(x\mid z)] - \mathrm{KL}(q(z)\,\|\,p(z)).$$

The first term is a reconstruction log likelihood, not generically a pixel distance. The second term keeps the encoder's posterior close to the prior used for generation. At generation time we sample $z\sim p(z)$, not $z\sim q_\phi(z\mid x)$ for a training example, so the prior match is load-bearing.

Reconstruction likelihood is not always MSE

The reconstruction term is

$$\mathbb E_q[\log p_\theta(x\mid z)].$$

It becomes a familiar loss only after choosing a decoder distribution. With a fixed-variance Gaussian decoder centered at $f_\theta(z)$, the negative log likelihood is a scaled squared error plus a constant. So MSE appears as a Gaussian negative log likelihood only after that modeling choice.

If $x$ is binary and

$$p_\theta(x\mid z)=\operatorname{Bernoulli}(\pi_\theta(z)),$$

then the negative reconstruction log likelihood is binary cross-entropy.
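Both correspondences can be checked numerically. The sketch below uses made-up decoder outputs for a single three-dimensional datapoint; the function names and all numbers are illustrative assumptions. It confirms that the fixed-variance Gaussian negative log likelihood equals the squared error divided by twice the variance plus a constant, and that the Bernoulli negative log likelihood equals binary cross-entropy.

```python
import math

def gaussian_nll(x, mean, var):
    """Negative log likelihood of x under independent N(mean_i, var)."""
    return sum(0.5 * (math.log(2 * math.pi * var) + (xi - mi) ** 2 / var)
               for xi, mi in zip(x, mean))

def bernoulli_nll(x, probs):
    """Negative log likelihood of binary x under independent Bernoulli(p_i)."""
    return -sum(xi * math.log(p) + (1 - xi) * math.log(1 - p)
                for xi, p in zip(x, probs))

# Hypothetical Gaussian decoder mean f_theta(z) with fixed variance.
x_real, f_z, var = [0.2, -1.0, 0.7], [0.0, -0.8, 1.1], 0.25
sse = sum((xi - mi) ** 2 for xi, mi in zip(x_real, f_z))
const = 0.5 * len(x_real) * math.log(2 * math.pi * var)
scaled_sse_plus_const = sse / (2 * var) + const

# Hypothetical Bernoulli decoder probabilities pi_theta(z).
x_bin, pi_z = [1, 0, 1], [0.9, 0.2, 0.6]
bce = -(math.log(0.9) + math.log(1 - 0.2) + math.log(0.6))

print(gaussian_nll(x_real, f_z, var), scaled_sse_plus_const)
print(bernoulli_nll(x_bin, pi_z), bce)
```

The scale factor 1/(2 * var) matters in practice: it sets the relative weight of reconstruction against the prior KL, so plain MSE silently assumes a particular decoder variance.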

Reparameterization trick

For a diagonal Gaussian encoder, the network outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$. We sample with external noise:

$$\epsilon\sim\mathcal N(0,I), \qquad z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon.$$

This keeps the randomness in $\epsilon$ and makes the sample differentiable with respect to the encoder outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$.
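A small Monte Carlo check shows why this matters for gradients of expectations. The sketch below takes a scalar Gaussian and the toy integrand z**2, both assumptions chosen because the answer is known in closed form: the expectation is mu**2 + sigma**2, so the true gradients are 2*mu and 2*sigma. Averaging the per-sample derivatives through z = mu + sigma * eps should recover them.

```python
import random

def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo gradients of E_q[z**2] for q = N(mu, sigma**2).

    Writes z = mu + sigma * eps with eps ~ N(0, 1), so that
    d(z**2)/dmu = 2*z and d(z**2)/dsigma = 2*z*eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness does not depend on mu or sigma
        z = mu + sigma * eps        # differentiable in mu and sigma
        grad_mu += 2 * z
        grad_sigma += 2 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples

mu, sigma = 0.3, 0.7
g_mu, g_sigma = pathwise_gradients(mu, sigma)
# Closed form for comparison: gradients of mu**2 + sigma**2.
print(round(g_mu, 2), round(g_sigma, 2), 2 * mu, 2 * sigma)
```

In a VAE the integrand is the decoder log likelihood rather than z**2, and an automatic differentiation library computes the per-sample derivatives, but the estimator has the same structure.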

Beta-VAE objective

A beta-VAE changes the training objective to

$$\mathcal L_\beta = \mathbb E_q[\log p_\theta(x\mid z)] - \beta\,\mathrm{KL}(q(z)\,\|\,p(z)).$$

When $\beta=1$, this is the standard VAE ELBO. When $\beta\ne 1$, it is a modified objective rather than the original ELBO. For $\beta>1$, it remains a lower bound because it subtracts an extra nonnegative KL penalty, but it is no longer the standard maximum-likelihood ELBO. For $\beta<1$, it is generally not guaranteed to lower-bound $\log p_\theta(x)$.

Larger $\beta$ puts more pressure on $z$ to carry less information and stay close to the prior. This can encourage more factor-like representations in some settings, but it does not guarantee disentanglement. If the pressure is too strong, the encoder may approach $q_\phi(z\mid x)\approx p(z)$, so $z$ carries little information about $x$.
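The bound behavior for different values of beta can be checked in the same scalar linear-Gaussian setup used in the Code section. As an illustrative assumption, the sketch sets q equal to the exact posterior, so the beta = 1 objective equals log p(x). Any beta below 1 then exceeds log p(x), and any beta above 1 falls below it. Function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) for a scalar Gaussian."""
    return 0.5 * (math.exp(logvar) + mu ** 2 - 1.0 - logvar)

def beta_objective(x, mu, logvar, beta, w, b, var_x):
    """E_q[log p(x|z)] - beta * KL(q || p) for x | z ~ N(w*z + b, var_x)."""
    var_q = math.exp(logvar)
    recon = -0.5 * (math.log(2 * math.pi * var_x)
                    + ((x - b - w * mu) ** 2 + w * w * var_q) / var_x)
    return recon - beta * kl_to_standard_normal(mu, logvar)

x, w, b, var_x = 1.5, 1.2, -0.1, 0.09

# Exact log evidence and exact posterior of the linear-Gaussian model.
total_var = w * w + var_x
log_px = -0.5 * (math.log(2 * math.pi * total_var) + (x - b) ** 2 / total_var)
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x

results = {}
for beta in (0.5, 1.0, 2.0):
    results[beta] = beta_objective(x, post_mu, math.log(post_var), beta, w, b, var_x)
    print(beta, round(results[beta], 3), round(log_px, 3))
```

For a poorly fitted q, a beta below 1 can still land under log p(x); the point is only that the guarantee is lost.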

Code

This scalar linear-Gaussian example is small enough that we can compute the exact marginal likelihood, exact posterior, ELBO, and inference gap, then check a one-sample reparameterized reconstruction gradient by finite differences.

import math

def log_norm(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_norm(mu0, var0, mu1, var1):
    # KL(N(mu0, var0) || N(mu1, var1)) for scalar Gaussians.
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + math.log(var1 / var0))

x, w, b, sigma_x = 1.5, 1.2, -0.1, 0.3  # z~N(0,1), x|z~N(wz+b,sigma_x^2)
var_x = sigma_x ** 2
mu_q, logvar_q = 0.3, -0.7              # q(z|x)=N(mu_q, exp(logvar_q))
var_q = math.exp(logvar_q)

# Exact evidence and posterior, available because the model is linear-Gaussian.
log_px = log_norm(x, b, w * w + var_x)
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x
# Closed-form E_q[log p(x|z)], then the ELBO and its gap to log p(x).
recon = -0.5 * (math.log(2 * math.pi * var_x) + ((x - b - w * mu_q) ** 2 + w * w * var_q) / var_x)
kl_prior = kl_norm(mu_q, var_q, 0, 1)
elbo = recon - kl_prior
gap = log_px - elbo
kl_post = kl_norm(mu_q, var_q, post_mu, post_var)
# The gap must equal KL(q || true posterior).
assert abs(gap - kl_post) < 1e-10

# One-sample reparameterized reconstruction term with the noise held fixed.
eps = -0.4
def sampled_recon(mu, logvar):
    sigma = math.exp(0.5 * logvar)
    z = mu + sigma * eps
    return log_norm(x, w * z + b, var_x), z, sigma

sample_loglik, z, sigma = sampled_recon(mu_q, logvar_q)
# Pathwise gradients by the chain rule through z = mu + sigma * eps.
dloglik_dz = (x - (w * z + b)) * w / var_x
path_mu = dloglik_dz
path_logvar = dloglik_dz * 0.5 * sigma * eps
# Central finite differences as an independent check.
h = 1e-5
fd_mu = (sampled_recon(mu_q + h, logvar_q)[0] - sampled_recon(mu_q - h, logvar_q)[0]) / (2 * h)
fd_logvar = (sampled_recon(mu_q, logvar_q + h)[0] - sampled_recon(mu_q, logvar_q - h)[0]) / (2 * h)
assert abs(path_mu - fd_mu) < 1e-7
assert abs(path_logvar - fd_logvar) < 1e-7

print(round(elbo, 3), round(log_px, 3), round(gap, 3), round(sample_loglik, 3))

For a batched neural VAE, typical shapes are:

  • x: (B, D)
  • encoder outputs mu, logvar: (B, K)
  • noise eps: (S, B, K)
  • latent samples z = mu[None, :, :] + exp(0.5 * logvar)[None, :, :] * eps: (S, B, K)
  • decoder log likelihood per sample: (S, B)
  • Monte Carlo reconstruction estimate: (B,)
  • analytic diagonal-Gaussian KL: (B,)
  • per-example ELBO: (B,)
  • optimization loss -elbo.mean(): scalar
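The shape bookkeeping above can be traced in a short NumPy sketch. The linear maps stand in for the encoder and decoder networks and, like the sizes and the Bernoulli decoder, are illustrative assumptions; nothing is trained here. Each comment gives the shape produced by that line.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, K, S = 4, 6, 2, 3   # batch size, data dim, latent dim, samples per example

# Hypothetical linear encoder and decoder weights (stand-ins for networks).
W_mu = rng.normal(size=(D, K))
W_lv = 0.1 * rng.normal(size=(D, K))
W_dec = rng.normal(size=(K, D))
b_dec = np.zeros(D)

x = rng.integers(0, 2, size=(B, D)).astype(float)   # binary data, (B, D)
mu = x @ W_mu                                        # (B, K)
logvar = x @ W_lv                                    # (B, K)

eps = rng.standard_normal((S, B, K))                 # (S, B, K)
z = mu[None, :, :] + np.exp(0.5 * logvar)[None, :, :] * eps   # (S, B, K)

logits = z @ W_dec + b_dec                           # (S, B, D)
# Bernoulli log likelihood from logits, summed over the data dimension.
log_lik = (x[None, :, :] * logits - np.logaddexp(0.0, logits)).sum(axis=-1)  # (S, B)

recon = log_lik.mean(axis=0)                         # (B,) Monte Carlo estimate
kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(axis=-1)   # (B,) analytic
elbo = recon - kl                                    # (B,)
loss = -elbo.mean()                                  # scalar

print(z.shape, log_lik.shape, recon.shape, kl.shape, elbo.shape, float(loss))
```

Averaging over the sample axis before subtracting the KL keeps the KL term exact, since it does not depend on the drawn noise.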

Interactive Demo

The demo keeps the latent variable one-dimensional so the hidden inference geometry can be revealed after a prediction. Move the approximate posterior $q(z\mid x)$, inspect the prior and likelihood shape, then predict what the ELBO gap will diagnose.

After reveal, compare the target curve and identity values with your prediction. The point is to feel the ELBO as a lower bound whose looseness is measured by the mismatch between $q$ and the hidden target distribution.

Source Grounding

Canonical references for the mechanism on this page.

Kingma and Welling (2013), "Auto-Encoding Variational Bayes" (paper). Introduces the reparameterized VAE objective and the evidence lower bound training view.

Rezende, Mohamed, and Wierstra (2014), "Stochastic Backpropagation and Approximate Inference in Deep Generative Models" (paper). Grounds stochastic backpropagation for latent-variable generative models with learned inference networks.

Claim Review

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references

kingma-2013-auto-encoding-variational-bayes, rezende-2014-stochastic-backprop

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: A VAE fits a latent-variable generative model by maximizing the ELBO: q_phi(z|x) approximates the usually intractable posterior, E_q[log p_theta(x|z)] rewards reconstruction likelihood, KL(q_phi(z|x)||p(z)) penalizes deviation from the prior, and reparameterization gives pathwise gradients through z samples. Claim metadata: source checked.

Kingma/Welling and Rezende/Mohamed/Wierstra both ground VAEs/deep latent-variable models in ELBO optimization with learned inference networks and reparameterized or stochastic backpropagation estimators. Local math now exports the ELBO identity, ELBO decomposition, and reparameterized sample, while the code and demo check the finite ELBO-gap geometry.

Sources: Auto-Encoding Variational Bayes; Stochastic Backpropagation and Approximate Inference in Deep Generative Models.

The reconstruction term is an expectation under q_phi and often Monte Carlo estimated. Pathwise gradients require a differentiable transform of parameter-independent noise, so this does not cover discrete or non-reparameterizable latents.

A bounded review summary is present; still check caveats and exact source scope.

Kingma/Welling support q_phi(z|x) as a learned recognition model for the intractable posterior, derive log p_theta(x)=ELBO+KL(q||posterior), decompose the ELBO into E_q log p_theta(x|z)-KL(q||p(z)), and give Gaussian reparameterization. Rezende et al. independently support the recognition-model, lower-bound/free-energy, reconstruction/regularization, and stochastic-backprop structure. Local math, code, and demo now witness the ELBO gap, prior KL, reparameterized sample, and prediction-gated posterior diagnostics.

Reviewer: codex+oracle+codex-5.3; reviewed 2026-05-08
