Legacy Concept Lab
Optimal Transport & Wasserstein Distance
Wasserstein distance helps explain why WGANs are more stable than vanilla GANs: it provides gradients even when distributions do not overlap
#38 · OT/Wasserstein · Theory
key equation
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}$$
Phase 8: Scaling, theory & multimodal · Concept 38 of 100
Why It Matters for Modern Models
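- Wasserstein distance helps explain why WGANs tend to train more stably than vanilla GANs: the original GAN objective corresponds to a Jensen-Shannon divergence that saturates when the generator and data distributions do not overlap, whereas Wasserstein distance still varies smoothly and provides usable gradients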
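- Flow matching and rectified flows draw on OT: they learn a velocity field that transports a simple base distribution to the data, and their straight-path and OT-coupled variants aim to approximate (though not in general exactly recover) an optimal transport map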
- OT provides a principled way to measure distance between distributions that respects geometry
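The contrast between an overlap-based divergence and a geometry-aware distance can be sketched on a toy problem. The example below is an illustration only, assuming two point masses on a shared, evenly spaced 1D grid; the helper names `w1_on_grid`, `kl`, and `point_mass` are hypothetical and not taken from any library. It uses the fact that in one dimension W_1 equals the area between the two cumulative distribution functions.

```python
import math

def w1_on_grid(p, q, spacing=1.0):
    """W_1 between two histograms on the same evenly spaced 1D grid.

    In 1D, W_1 is the integral of |CDF_p - CDF_q|, so a running sum suffices.
    """
    cdf_gap, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        total += abs(cdf_gap) * spacing
    return total

def kl(p, q):
    """KL(p || q); infinite as soon as p puts mass where q has none."""
    out = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        out += p_i * math.log(p_i / q_i)
    return out

def point_mass(index, size=10):
    """Histogram with all mass in one grid cell (hypothetical helper)."""
    return [1.0 if i == index else 0.0 for i in range(size)]

target = point_mass(0)
for shift in (1, 3, 6):
    moved = point_mass(shift)
    # W_1 grows with the shift; KL is infinite for every nonzero shift.
    print(shift, w1_on_grid(moved, target), kl(moved, target))
```

Because W_1 keeps changing as the point mass moves, it gives a direction in which to improve, while KL reports the same infinite value for every shift that removes the overlap.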
What Tutorials Skip
What is still poorly explained in textbooks and papers:
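- KL divergence KL(μ‖ν) is infinite whenever μ puts mass where ν has none, so it is infinite for non-overlapping supports; Wasserstein distance stays finite (given finite p-th moments) and measures "how far mass must move"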
- The "earth mover" intuition: imagine the distributions as piles of dirt; OT finds the cheapest way to reshape one pile into the other, where cost is mass moved times distance traveled
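- Entropic regularization (Sinkhorn) makes OT tractable: it adds −εH(γ) to the transport cost, where H(γ) is the entropy of the coupling, which yields a differentiable, GPU-friendly algorithm based on alternating matrix scaling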
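The Sinkhorn idea mentioned in the last bullet can be sketched in a few lines. This is a minimal, standard-library-only illustration for small discrete problems, not a production implementation: the function names `sinkhorn_plan` and `transport_cost`, the value of `eps`, and the iteration count are all assumptions chosen for the toy example. It works in the plain (non-log) domain, so a very small `eps` would underflow; practical implementations typically use log-domain updates.

```python
import math

def sinkhorn_plan(a, b, cost, eps=0.1, n_iters=500):
    """Approximate entropic OT plan between histograms a and b.

    Alternately rescales rows and columns of the Gibbs kernel exp(-cost/eps)
    so that the plan's marginals match a and b.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row scaling: make row sums of diag(u) K diag(v) equal to a.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: make column sums equal to b.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost <plan, cost> of a discrete coupling."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

# Toy problem: two support points at 0 and 1, squared-distance cost.
xs, ys = [0.0, 1.0], [0.0, 1.0]
a, b = [0.5, 0.5], [0.25, 0.75]
cost = [[(x - y) ** 2 for y in ys] for x in xs]

plan = sinkhorn_plan(a, b, cost)
# The unregularized optimum moves 0.25 mass over distance 1, for a cost of 0.25;
# the entropic plan should land close to that value.
print(transport_cost(plan, cost))
```

Every step is a product, a sum, or a division, which is why the same loop maps naturally onto batched matrix operations and automatic differentiation.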
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
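Optimal transport finds the minimum-cost way to move mass from distribution $\mu$ to $\nu$:
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}$$
where $\Gamma(\mu, \nu)$ is the set of joint distributions with marginals $\mu$ and $\nu$.
Kantorovich duality (key for computation), stated here for $p = 1$:
$$W_1(\mu, \nu) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]$$
where the supremum is over 1-Lipschitz functions $f$.
Brenier's theorem: For absolutely continuous $\mu$ and quadratic cost ($p = 2$), the optimal map is $T = \nabla \varphi$ for convex $\varphi$.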
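One dimension gives a concrete case where the optimal coupling is known in closed form: for costs |x − y|^p with p ≥ 1, matching the sorted samples of one distribution to the sorted samples of the other is optimal. This monotone matching is the 1D counterpart of Brenier's gradient-of-a-convex-function map, since a nondecreasing function is the derivative of a convex one. The sketch below assumes two empirical distributions with the same number of equally weighted samples; `wasserstein_1d` is a hypothetical helper name.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between two equal-size empirical distributions on the real line.

    Sorting both samples and pairing them in order is the optimal coupling
    in 1D for p >= 1, so no optimization is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples on each side")
    if not xs:
        raise ValueError("need at least one sample")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1.0 / p)

# A pure shift by 3: every sample travels distance 3, regardless of input order.
source = [0.0, 1.0, 2.0]
shifted = [5.0, 3.0, 4.0]
print(wasserstein_1d(source, shifted, p=1))
print(wasserstein_1d(source, shifted, p=2))
```

In higher dimensions no such sorting shortcut exists, which is where the dual formulation and entropic regularization become useful.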