Generative Models

Variational Autoencoders

A latent-variable model trained by maximizing an evidence lower bound; the gap is KL(q_phi(z|x) || p_theta(z|x)), so the encoder is learned inference rather than just compression.

Status: published · Importance: important · Difficulty: 4/5 · Math: undergraduate · Read: 20m · Live demo


Editorial generative-model illustration of data encoded into a latent distribution and decoded back into reconstructed samples.




Intuition

Build the mental picture first so the rest of the page has something to attach to.


Maximum likelihood asks for a model that assigns high log probability to the observed data:

\log p_\theta(x).

A variational autoencoder is a maximum-likelihood model with a hidden cause. It imagines that each observation $x$ was produced from an unobserved latent variable $z$:

\begin{aligned} z &\sim p(z), \\ x &\sim p_\theta(x\mid z). \end{aligned}

Training would be easy if we could compute the marginal likelihood

\begin{aligned} p_\theta(x) &= \int p_\theta(x,z)\,dz. \end{aligned}

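The problem is that this integral is usually intractable when the decoder is a neural network: it has no closed form, and numerical integration over $z$ scales poorly with the latent dimension. Bayesian inference tells us what we would like to use: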

p_\theta(z\mid x).

But that posterior has the same hard evidence term $p_\theta(x)$ in its denominator. The VAE move is to learn an encoder

q_\phi(z\mid x)

that acts as a tractable approximate posterior.
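To make the role of the approximate posterior concrete, here is a minimal sketch in a scalar linear-Gaussian toy model. The model, its numbers, and the helper names `estimate_from_prior` and `estimate_with_q` are illustrative assumptions, chosen only because the exact evidence is known in closed form there. It compares a naive Monte Carlo estimate of $\log p_\theta(x)$ using prior samples with an importance-sampling estimate that draws from a proposal $q$. Here the proposal is set to the exact posterior, which is available only because the toy model is linear-Gaussian; in that case every importance weight equals $p_\theta(x)$.

```python
import math
import random

def log_norm(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(w*z + b, var_x).
x, w, b, var_x = 1.5, 1.2, -0.1, 0.09

# Exact evidence, available only because the model is linear-Gaussian.
exact_log_px = log_norm(x, b, w * w + var_x)

def estimate_from_prior(n, rng):
    """Naive Monte Carlo: average p(x|z) over z drawn from the prior."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        total += math.exp(log_norm(x, w * z + b, var_x))
    return math.log(total / n)

def estimate_with_q(n, rng, mu_q, var_q):
    """Importance sampling: draw z from q and weight by p(x, z) / q(z)."""
    total = 0.0
    sd_q = math.sqrt(var_q)
    for _ in range(n):
        z = rng.gauss(mu_q, sd_q)
        log_weight = (log_norm(z, 0.0, 1.0)
                      + log_norm(x, w * z + b, var_x)
                      - log_norm(z, mu_q, var_q))
        total += math.exp(log_weight)
    return math.log(total / n)

# Exact posterior of the toy model, used here as the proposal q.
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x

rng = random.Random(0)
prior_est = estimate_from_prior(20000, rng)
q_est = estimate_with_q(20000, rng, post_mu, post_var)
print(exact_log_px, prior_est, q_est)
```

The prior-based estimate is noisy because most prior samples land where the likelihood is tiny. With a well-matched $q$ the weights barely vary, which is the practical reason a learned encoder is worth the effort.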

The central mechanism is not "an autoencoder with noise." It is: replace the intractable posterior with a learned distribution $q_\phi(z\mid x)$, then optimize a lower bound whose looseness is exactly a KL divergence from $q_\phi$ to the true posterior.

That is why VAEs sit at the intersection of maximum likelihood, Bayesian inference, and KL divergence.

Two natural next moves branch from this page. Normalizing flows ask what changes when the latent transformation is invertible and likelihood stays exact. Diffusion asks what changes when generation is learned as a gradual denoising process rather than as one sampled latent code.

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


Latent variable likelihood

For one observed datapoint $x$, assume a prior $p(z)$ and a decoder likelihood $p_\theta(x\mid z)$. The latent-variable model, marginal likelihood, and true posterior are

\begin{aligned} p_\theta(x,z)&=p(z)p_\theta(x\mid z),\\ p_\theta(x)&=\int p_\theta(x,z)\,dz,\\ p_\theta(z\mid x)&=\frac{p_\theta(x,z)}{p_\theta(x)}. \end{aligned}

The evidence $p_\theta(x)$ appears in the denominator, so exact posterior inference and exact likelihood training are tied to the same difficult integral. The central VAE identities are summarized by

\begin{aligned} \log p_\theta(x) &= \mathcal L(\theta,\phi;x) + \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right),\\ \mathcal L(\theta,\phi;x) &= \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right),\\ \epsilon&\sim\mathcal N(0,I), \qquad z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon. \end{aligned}

The first line says the ELBO gap is exactly the KL from the learned encoder $q_\phi(z\mid x)$ to the true posterior. The second line separates reconstruction likelihood from prior pressure. The third line keeps randomness in external noise $\epsilon$, so gradients of sampled reconstruction terms can flow through $z$ into the encoder outputs.

The ELBO identity

Let $q_\phi(z\mid x)$ be any approximate posterior whose support is contained in the support of the true posterior, so that the KL divergence below is well defined. To keep the equations readable, write $q(z)$ for $q_\phi(z\mid x)$ and $\mathbb E_q$ for expectation under that distribution:

\mathbb E_q[\cdot] \equiv \mathbb E_{q(z)}[\cdot].

Using Bayes' rule and rearranging gives the exact identity

\log p_\theta(x)=\mathcal L+G.

Here $\mathcal L$ is the ELBO and $G$ is the inference gap. Let $p_*(z)$ denote the true posterior $p_\theta(z\mid x)$. Then the gap is

G=\mathrm{KL}(q(z)\,\|\,p_*(z)).

Equivalently,

G=\mathbb E_q[\log q(z)-\log p_*(z)].

The ELBO term is

\mathcal L = \mathbb E_q[\log p_\theta(x,z)-\log q(z)].

The gap can also be read by subtraction:

G=\log p_\theta(x)-\mathcal L.

KL is nonnegative, so

\mathcal L(\theta,\phi;x)\le \log p_\theta(x).

Maximizing the ELBO improves this lower bound. The bound is tight exactly when the encoder distribution equals the true posterior, up to probability-zero events.
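The "Bayes' rule and rearranging" step can be written out in a few lines. This is a sketch of the standard derivation in the page's shorthand ($q(z)$ for the encoder, $p_*(z)$ for the true posterior). The first line uses the fact that $\log p_\theta(x)$ does not depend on $z$, the second substitutes $p_\theta(x)=p_\theta(x,z)/p_*(z)$, and the third multiplies and divides by $q(z)$ inside the logarithm.

```latex
\begin{aligned}
\log p_\theta(x)
&= \mathbb E_q[\log p_\theta(x)] \\
&= \mathbb E_q\!\left[\log \frac{p_\theta(x,z)}{p_*(z)}\right] \\
&= \mathbb E_q\!\left[\log \frac{p_\theta(x,z)}{q(z)}\right]
 + \mathbb E_q\!\left[\log \frac{q(z)}{p_*(z)}\right] \\
&= \mathcal L + \mathrm{KL}(q(z)\,\|\,p_*(z)).
\end{aligned}
```

The first expectation in the third line is the ELBO as defined above, and the second is the gap $G$.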

Reconstruction likelihood plus prior KL

Because

\log p_\theta(x,z) = \log p_\theta(x\mid z)+\log p(z),

the ELBO can be rewritten as

\begin{aligned} \mathcal L &= \mathbb E_q[\log p_\theta(x\mid z)] \\ &\quad- \mathrm{KL}(q(z)\,\|\,p(z)). \end{aligned}

The first term is a reconstruction log likelihood, not generically a pixel distance. The second term keeps the encoder's posterior close to the prior used for generation. At generation time we sample $z\sim p(z)$, not $z\sim q_\phi(z\mid x)$ for a training example, so the prior match is load-bearing.
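When the encoder is a diagonal Gaussian and the prior is a standard normal (both are assumptions of this sketch, not requirements of the ELBO), the prior KL term has a closed form. The function name `kl_diag_gaussian_to_standard_normal` and the example inputs are illustrative.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions.

    Per dimension: 0.5 * (mu^2 + var - 1 - log var), with var = exp(logvar).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

# q equal to the prior: the penalty vanishes.
kl_zero = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# q shifted and rescaled away from the prior: the penalty is positive.
kl_example = kl_diag_gaussian_to_standard_normal([0.3, -1.0], [-0.7, 0.2])

print(kl_zero, kl_example)
```

Parameterizing the encoder by `logvar` rather than the variance keeps the variance positive without a constraint, which is why the formula is written in terms of it.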

Reconstruction likelihood is not always MSE

The reconstruction term is

\mathbb E_q[\log p_\theta(x\mid z)].

It becomes a familiar loss only after choosing a decoder distribution. With a fixed-variance Gaussian decoder centered at $f_\theta(z)$, the negative log likelihood is a scaled squared error plus a constant. So MSE appears as a Gaussian negative log likelihood only after that modeling choice.

If $x$ is binary and

p_\theta(x\mid z)=\operatorname{Bernoulli}(\pi_\theta(z)),

then the negative reconstruction log likelihood is binary cross-entropy.
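Both correspondences can be checked numerically. The sketch below uses made-up decoder outputs (`f_z` for the Gaussian mean and `probs` for the Bernoulli parameters are hypothetical values, not outputs of any trained model) and confirms that the Gaussian negative log likelihood is squared error divided by $2\sigma^2$ plus a constant, while the Bernoulli negative log likelihood is binary cross-entropy.

```python
import math

def gaussian_nll(x, mean, var):
    # Negative log likelihood of independent N(mean_i, var) observations.
    return sum(0.5 * (math.log(2 * math.pi * var) + (xi - mi) ** 2 / var)
               for xi, mi in zip(x, mean))

def bernoulli_nll(x, probs):
    # Negative log likelihood of independent Bernoulli(p_i) observations.
    return -sum(xi * math.log(p) + (1 - xi) * math.log(1 - p)
                for xi, p in zip(x, probs))

# Fixed-variance Gaussian decoder: NLL = squared error / (2 * var) + constant.
x_real = [0.2, -1.0, 0.5]
f_z = [0.0, -0.5, 0.4]          # hypothetical decoder mean f_theta(z)
var = 0.25
sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x_real, f_z))
const = 0.5 * len(x_real) * math.log(2 * math.pi * var)
gauss = gaussian_nll(x_real, f_z, var)

# Bernoulli decoder: NLL is the usual binary cross-entropy.
x_bin = [1, 0, 1]
probs = [0.9, 0.2, 0.6]         # hypothetical decoder probabilities pi_theta(z)
bce = bernoulli_nll(x_bin, probs)

print(gauss, sq_err / (2 * var) + const, bce)
```

Note that the scale $1/(2\sigma^2)$ sets the relative weight of reconstruction against the prior KL, so the decoder variance is a modeling choice with consequences, not a cosmetic constant.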

Reparameterization trick

For a diagonal Gaussian encoder, the network outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$. We sample with external noise:

\epsilon\sim\mathcal N(0,I), \qquad z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon.

This keeps the randomness in $\epsilon$ and makes the sample differentiable with respect to the encoder outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$.
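A small numerical check can make the pathwise gradient tangible. This sketch uses a toy objective $f(z)=z^2$ (an assumption chosen because $\mathbb E[z^2]=\mu^2+\sigma^2$ has known gradients $2\mu$ and $2\sigma$) instead of a decoder log likelihood. The function name `pathwise_grads` and the parameter values are illustrative.

```python
import random

def pathwise_grads(mu, sigma, n, rng):
    """Estimate d/dmu and d/dsigma of E[z^2], where z = mu + sigma * eps."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        g_mu += df_dz               # chain rule with dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule with dz/dsigma = eps
    return g_mu / n, g_sigma / n

mu, sigma = 0.3, 0.7
est_mu, est_sigma = pathwise_grads(mu, sigma, 100_000, random.Random(0))

# Analytic gradients of mu^2 + sigma^2 for comparison.
print(est_mu, 2 * mu)
print(est_sigma, 2 * sigma)
```

The same chain-rule pattern is what automatic differentiation applies when $f$ is a decoder log likelihood and $\mu$, $\sigma$ are encoder outputs.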

Beta-VAE objective

A beta-VAE changes the training objective to

\begin{aligned} \mathcal L_\beta &= \mathbb E_q[\log p_\theta(x\mid z)] \\ &\quad- \beta\,\mathrm{KL}(q(z)\,\|\,p(z)). \end{aligned}

When $\beta=1$, this is the standard VAE ELBO. When $\beta\ne 1$, it is a modified objective rather than the original ELBO. For $\beta>1$, it remains a lower bound because it subtracts extra nonnegative KL penalty, but it is no longer the standard maximum-likelihood ELBO. For $\beta<1$, it is generally not guaranteed to lower-bound $\log p_\theta(x)$.

Larger $\beta$ puts more pressure on $z$ to carry less information and stay close to the prior. This can encourage more factor-like representations in some settings, but it does not guarantee disentanglement. If the pressure is too strong, the encoder may approach $q_\phi(z\mid x)\approx p(z)$, so $z$ carries little information about $x$.
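The bound statements can be checked in a scalar linear-Gaussian toy model where $\log p_\theta(x)$ is exact. This is a sketch under assumed numbers. The helper `beta_objective` and the choice of a narrow $q$ centered at the posterior mean are illustrative, picked so that the $\beta=0$ case visibly exceeds the log evidence.

```python
import math

def log_norm(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_norm(mu0, var0, mu1, var1):
    # KL(N(mu0, var0) || N(mu1, var1)) for scalar Gaussians.
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + math.log(var1 / var0))

# Assumed toy model: z ~ N(0, 1), x | z ~ N(w*z + b, var_x).
x, w, b, var_x = 1.5, 1.2, -0.1, 0.09
log_px = log_norm(x, b, w * w + var_x)
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x

def beta_objective(mu_q, var_q, beta):
    """E_q[log p(x|z)] - beta * KL(q || p(z)) for q = N(mu_q, var_q)."""
    recon = -0.5 * (math.log(2 * math.pi * var_x)
                    + ((x - b - w * mu_q) ** 2 + w * w * var_q) / var_x)
    return recon - beta * kl_norm(mu_q, var_q, 0.0, 1.0)

# A narrow q centered at the posterior mean reconstructs well but is far from the prior.
mu_q, var_q = post_mu, 1e-3
for beta in (0.0, 0.5, 1.0, 4.0):
    print(beta, beta_objective(mu_q, var_q, beta), log_px)
```

For this $q$, the $\beta=0$ objective is larger than $\log p_\theta(x)$, while the $\beta=1$ and $\beta=4$ objectives stay below it, with $\beta=4$ the lowest.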

Code

Keep the implementation aligned with the notation so the algorithm is legible.


This scalar linear-Gaussian example is small enough that we can compute the exact marginal likelihood, exact posterior, ELBO, and inference gap, then check a one-sample reparameterized reconstruction gradient by finite differences.

import math

def log_norm(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_norm(mu0, var0, mu1, var1):
    # KL(N(mu0, var0) || N(mu1, var1)) for scalar Gaussians.
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + math.log(var1 / var0))

x, w, b, sigma_x = 1.5, 1.2, -0.1, 0.3  # z~N(0,1), x|z~N(wz+b,sigma_x^2)
var_x = sigma_x ** 2
mu_q, logvar_q = 0.3, -0.7              # q(z|x)=N(mu_q, exp(logvar_q))
var_q = math.exp(logvar_q)

# Exact quantities, available in closed form because the model is linear-Gaussian.
log_px = log_norm(x, b, w * w + var_x)  # marginal: x ~ N(b, w^2 + var_x)
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x
# E_q[log p(x|z)] in closed form; w^2 * var_q is the variance of w*z under q.
recon = -0.5 * (math.log(2 * math.pi * var_x) + ((x - b - w * mu_q) ** 2 + w * w * var_q) / var_x)
kl_prior = kl_norm(mu_q, var_q, 0, 1)
elbo = recon - kl_prior
gap = log_px - elbo
kl_post = kl_norm(mu_q, var_q, post_mu, post_var)
assert abs(gap - kl_post) < 1e-10       # ELBO gap equals KL(q || true posterior)

# One-sample reparameterized reconstruction term with the noise eps held fixed.
eps = -0.4
def sampled_recon(mu, logvar):
    sigma = math.exp(0.5 * logvar)
    z = mu + sigma * eps
    return log_norm(x, w * z + b, var_x), z, sigma

sample_loglik, z, sigma = sampled_recon(mu_q, logvar_q)
dloglik_dz = (x - (w * z + b)) * w / var_x    # d/dz log N(x; w*z + b, var_x)
path_mu = dloglik_dz                          # chain rule with dz/dmu = 1
path_logvar = dloglik_dz * 0.5 * sigma * eps  # chain rule with dz/dlogvar = 0.5 * sigma * eps
h = 1e-5
# Central finite differences as an independent check of the pathwise gradients.
fd_mu = (sampled_recon(mu_q + h, logvar_q)[0] - sampled_recon(mu_q - h, logvar_q)[0]) / (2 * h)
fd_logvar = (sampled_recon(mu_q, logvar_q + h)[0] - sampled_recon(mu_q, logvar_q - h)[0]) / (2 * h)
assert abs(path_mu - fd_mu) < 1e-7
assert abs(path_logvar - fd_logvar) < 1e-7

print(round(elbo, 3), round(log_px, 3), round(gap, 3), round(sample_loglik, 3))

For a batched neural VAE, typical shapes are:

x: (B, D)
encoder outputs mu, logvar: (B, K)
noise eps: (S, B, K)
latent samples z = mu[None, :, :] + exp(0.5 * logvar)[None, :, :] * eps: (S, B, K)
decoder log likelihood per sample: (S, B)
Monte Carlo reconstruction estimate: (B,)
analytic diagonal-Gaussian KL: (B,)
per-example ELBO: (B,)
optimization loss -elbo.mean(): scalar
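The shape bookkeeping above can be traced with NumPy. This is a sketch, not a trained model: the encoder outputs are random stand-ins, and the linear Bernoulli decoder (`W`, `c`) is a hypothetical placeholder for a neural network. Only the shapes and the ELBO arithmetic are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
S, B, D, K = 4, 8, 5, 2   # samples per example, batch size, data dim, latent dim

x = rng.integers(0, 2, size=(B, D)).astype(float)   # binary data, (B, D)

# Stand-ins for encoder outputs; a real encoder would compute these from x.
mu = rng.normal(size=(B, K))                         # (B, K)
logvar = rng.normal(scale=0.1, size=(B, K))          # (B, K)

# Hypothetical linear Bernoulli decoder: logits = z @ W + c.
W = rng.normal(scale=0.5, size=(K, D))
c = np.zeros(D)

eps = rng.standard_normal((S, B, K))                             # (S, B, K)
z = mu[None, :, :] + np.exp(0.5 * logvar)[None, :, :] * eps      # (S, B, K)

logits = z @ W + c                                               # (S, B, D)
# Bernoulli log likelihood in logit form: x * logit - softplus(logit).
log_px_given_z = (x[None, :, :] * logits - np.logaddexp(0.0, logits)).sum(axis=-1)  # (S, B)

recon = log_px_given_z.mean(axis=0)                              # (B,)
kl = 0.5 * (mu ** 2 + np.exp(logvar) - 1.0 - logvar).sum(axis=-1)  # (B,)
elbo = recon - kl                                                # (B,)
loss = -elbo.mean()                                              # scalar

print(z.shape, log_px_given_z.shape, recon.shape, kl.shape, elbo.shape, float(loss))
```

Averaging over the sample axis `S` before subtracting the KL gives the Monte Carlo reconstruction estimate per example, while the KL term needs no sampling because it is analytic for this encoder and prior.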

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


The demo keeps the latent variable one-dimensional so the hidden inference geometry can be revealed after a prediction. Move the approximate posterior $q(z\mid x)$, inspect the prior and likelihood shape, then predict what the ELBO gap will diagnose.

After reveal, compare the target curve and identity values with your prediction. The point is to feel the ELBO as a lower bound whose looseness is measured by the mismatch between $q$ and the hidden target distribution.
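For readers without access to the live demo, the same experiment can be approximated offline. The sketch below is not the demo's code. It assumes a scalar linear-Gaussian toy model with illustrative numbers, sweeps the mean of $q$, and prints the ELBO gap next to the KL to the exact posterior.

```python
import math

def log_norm(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_norm(mu0, var0, mu1, var1):
    # KL(N(mu0, var0) || N(mu1, var1)) for scalar Gaussians.
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + math.log(var1 / var0))

# Assumed toy model: z ~ N(0, 1), x | z ~ N(w*z + b, var_x).
x, w, b, var_x = 1.5, 1.2, -0.1, 0.09
log_px = log_norm(x, b, w * w + var_x)
post_var = 1 / (1 + w * w / var_x)
post_mu = post_var * w * (x - b) / var_x

def elbo(mu_q, var_q):
    # Closed-form ELBO for q = N(mu_q, var_q) in the toy model.
    recon = -0.5 * (math.log(2 * math.pi * var_x)
                    + ((x - b - w * mu_q) ** 2 + w * w * var_q) / var_x)
    return recon - kl_norm(mu_q, var_q, 0.0, 1.0)

def gap(mu_q, var_q):
    # Looseness of the bound: log p(x) minus the ELBO.
    return log_px - elbo(mu_q, var_q)

# Sweep the mean of q while holding its variance at the posterior variance.
for mu_q in (0.0, 0.5, 1.0, post_mu, 1.5):
    print(round(mu_q, 3), gap(mu_q, post_var), kl_norm(mu_q, post_var, post_mu, post_var))
```

The two printed columns agree at every setting, and the gap vanishes when $q$ matches the exact posterior in both mean and variance.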


Source Grounding

Canonical references for the mechanism on this page.

paper · 2013 · Auto-Encoding Variational Bayes · Kingma and Welling

Introduces the reparameterized VAE objective and the evidence lower bound training view.

paper · 2014 · Stochastic Backpropagation and Approximate Inference in Deep Generative Models · Rezende, Mohamed, and Wierstra

Grounds stochastic backpropagation for latent-variable generative models with learned inference networks.

Claim Review


Substantively reviewed: A VAE fits a latent-variable generative model by maximizing the ELBO: q_phi(z|x) approximates the usually intractable posterior, E_q[log p_theta(x|z)] rewards reconstruction likelihood, KL(q_phi(z|x)||p(z)) penalizes deviation from the prior, and reparameterization gives pathwise gradients through z samples.

Claim metadata: source checked

Kingma/Welling and Rezende/Mohamed/Wierstra both ground VAEs/deep latent-variable models in ELBO optimization with learned inference networks and reparameterized or stochastic backpropagation estimators. Local math now exports the ELBO identity, ELBO decomposition, and reparameterized sample, while the code and demo check the finite ELBO-gap geometry.

Sources: Auto-Encoding Variational Bayes; Stochastic Backpropagation and Approximate Inference in Deep Generative Models

Caveat: The reconstruction term is an expectation under q_phi and is often Monte Carlo estimated. Pathwise gradients require a differentiable transform of parameter-independent noise, so this does not cover discrete or non-reparameterizable latents.

A bounded review summary is present; still check caveats and exact source scope.

Kingma/Welling support q_phi(z|x) as a learned recognition model for the intractable posterior, derive log p_theta(x)=ELBO+KL(q||posterior), decompose the ELBO into E_q log p_theta(x|z)-KL(q||p(z)), and give Gaussian reparameterization. Rezende et al. independently support the recognition-model, lower-bound/free-energy, reconstruction/regularization, and stochastic-backprop structure. Local math, code, and demo now witness the ELBO gap, prior KL, reparameterized sample, and prediction-gated posterior diagnostics.

Reviewer: codex+oracle+codex-5.3; reviewed 2026-05-08



Domain

Generative Models

generative-models, vae, variational-inference, elbo, kl-divergence

Prerequisites

Maximum Likelihood; Bayesian Inference; KL Divergence (Relative Entropy)

Leads To

Normalizing Flows: Tractable Density via Invertible Transforms; Diffusion, Score-Based Models & Flow Matching

Within this domain

Score Matching & Score-Based Generative Models