Legacy Concept Lab

Optimal Transport & Wasserstein Distance

Wasserstein distance explains why WGANs are more stable than vanilla GANs—it provides gradients even when distributions do not overlap

Concept 38 of 100TheoryPhase 8
#38OT/WassersteinTheory
key equationW_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}
Phase 8: Scaling, theory & multimodalConcept 38 of 100

Why It Matters for Modern Models

  • Wasserstein distance explains why WGANs are more stable than vanilla GANs—it provides gradients even when distributions do not overlap
  • Flow matching and rectified flows are built on OT: they learn the optimal transport map directly
  • OT provides a principled way to measure distance between distributions that respects geometry

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • KL divergence is infinite when supports do not overlap; Wasserstein is finite and measures "how far to move mass"
  • The "earth mover" intuition: imagine distributions as piles of dirt, OT finds the cheapest way to reshape one into the other
  • Entropic regularization (Sinkhorn) makes OT tractable: adds −εH(γ) to get differentiable, GPU-friendly algorithms

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
Wp(μ,ν)=(infγΓ(μ,ν)xypdγ(x,y))1/pW_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}

Optimal transport finds the minimum-cost way to move mass from distribution μ\mu to ν\nu:

Wp(μ,ν)=(infγΓ(μ,ν)xypdγ(x,y))1/pW_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}

where Γ(μ,ν)\Gamma(\mu, \nu) is the set of joint distributions with marginals μ,ν\mu, \nu.

Kantorovich duality (key for computation):

W1(μ,ν)=supfL1Exμ[f(x)]Eyν[f(y)]W_1(\mu, \nu) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]

Brenier's theorem: For absolutely continuous μ\mu, the optimal map is T=ϕT = \nabla \phi for convex ϕ\phi.

Canonical Papers

Computational Optimal Transport

Peyré & Cuturi2019Foundations and Trends in ML
Read paper →

Wasserstein GAN

Arjovsky, Chintala, Bottou2017ICML
Read paper →

Connections

Next Moves

Explore this concept from different angles — like a mathematician would.