Legacy Concept Lab

Natural Gradient & Riemannian Optimization

Concept 56 of 100 | Optimization | Phase 10: Mathematical foundations & information geometry

Why It Matters for Modern Models

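Natural gradient is coordinate-invariant: in the limit of small steps, it produces the same change in the model's distribution regardless of parameterization
TRPO can be read as a natural gradient update with a KL trust region for policy optimization; PPO is a cheaper first-order surrogate for the same constraint
Adam can be loosely viewed as a diagonal preconditioner related to natural gradient, although it scales by the square root of a running average of squared gradients rather than by the Fisher diagonal itself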

What is still poorly explained in textbooks and papers:

The gradient is a covector, not a vector; a metric is needed to turn it into a direction of steepest descent
The Euclidean gradient direction depends on how you parameterize the model; the natural gradient direction depends only on the distributions themselves
Natural gradient can escape plateaus faster because it accounts for local curvature in distribution space
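
The parameterization point can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original walkthrough: it assumes a Bernoulli model with success probability p, a cross-entropy loss against a target probability q, and two coordinate choices (theta = p and theta = logit(p)). The function names and the values of p0, q, and eta are hypothetical. It takes one small step in each coordinate system and reports the resulting change in p.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def step_in_p(p, q, eta, natural):
    """One step with theta = p; returns the resulting change in p."""
    grad = (p - q) / (p * (1.0 - p))      # dL/dp for the cross-entropy loss
    fisher = 1.0 / (p * (1.0 - p))        # Fisher information of Bernoulli(p) in p
    direction = grad / fisher if natural else grad
    return -eta * direction

def step_in_logit(p, q, eta, natural):
    """One step with theta = logit(p); returns the resulting change in p."""
    z = logit(p)
    grad = p - q                          # dL/dz by the chain rule
    fisher = p * (1.0 - p)                # Fisher information in the logit coordinate
    direction = grad / fisher if natural else grad
    return sigmoid(z - eta * direction) - p

# Illustrative values only (assumptions, not from the source).
p0, q, eta = 0.3, 0.8, 0.01
euclid_p = step_in_p(p0, q, eta, natural=False)
euclid_z = step_in_logit(p0, q, eta, natural=False)
natural_p = step_in_p(p0, q, eta, natural=True)
natural_z = step_in_logit(p0, q, eta, natural=True)

print(f"Euclidean: dp = {euclid_p:.5f} (theta=p) vs {euclid_z:.5f} (theta=logit)")
print(f"Natural:   dp = {natural_p:.5f} (theta=p) vs {natural_z:.5f} (theta=logit)")
```

The two Euclidean steps move p by very different amounts, while the two natural steps agree up to a term that is second order in eta. That residual is why the invariance is exact only in the small-step limit.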

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\tilde{\nabla} L = F(\theta)^{-1} \nabla_\theta L

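Natural gradient measures distances with the Fisher information metric instead of the Euclidean one, where F(\theta) is the Fisher information matrix of the model p_\theta:

F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ \nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^\top \right]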

Update rule:

\theta_{t+1} = \theta_t - \eta F(\theta_t)^{-1} \nabla_\theta L(\theta_t)
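
As a minimal sketch of this update rule, the example below fits a univariate Gaussian N(mu, sigma^2) by natural gradient descent. Everything specific here is an assumption made for illustration: the (mu, sigma) coordinates, the expected negative log-likelihood loss against data with mean m and variance s2, the step size, and all function names. In these coordinates the Fisher matrix is diagonal, diag(1/sigma^2, 2/sigma^2), so applying F^{-1} reduces to an elementwise division.

```python
def grad_nll(mu, sigma, m, s2):
    """Gradient of the expected negative log-likelihood of N(mu, sigma^2)
    under data with mean m and variance s2."""
    spread = s2 + (m - mu) ** 2
    d_mu = (mu - m) / sigma**2
    d_sigma = 1.0 / sigma - spread / sigma**3
    return d_mu, d_sigma

def fisher_diag(sigma):
    """Diagonal of the Fisher matrix of N(mu, sigma^2) in (mu, sigma) coordinates."""
    return 1.0 / sigma**2, 2.0 / sigma**2

def natural_gradient_descent(mu, sigma, m, s2, eta=0.5, steps=100):
    for _ in range(steps):
        g_mu, g_sigma = grad_nll(mu, sigma, m, s2)
        f_mu, f_sigma = fisher_diag(sigma)
        # theta <- theta - eta * F^{-1} grad; F is diagonal here, so the
        # inverse is an elementwise division.
        mu -= eta * g_mu / f_mu
        sigma -= eta * g_sigma / f_sigma
    return mu, sigma

# Hypothetical starting point and target moments, chosen for illustration.
mu_hat, sigma_hat = natural_gradient_descent(mu=0.0, sigma=5.0, m=3.0, s2=4.0)
print(mu_hat, sigma_hat)
```

With the Fisher preconditioner, the mean update becomes mu <- mu - eta * (mu - m), which no longer depends on sigma. A plain gradient step on mu is scaled by 1/sigma^2 and slows down when sigma is large. For models with many parameters, forming and inverting F explicitly is usually impractical, so practical methods approximate this solve.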

Variational characterization (why it's "natural"):

\delta^* = \arg\min_\delta \langle \nabla L, \delta \rangle \quad \text{s.t.} \quad \text{KL}(p_\theta \| p_{\theta+\delta}) \leq \epsilon

\Rightarrow \delta^* \propto -F(\theta)^{-1} \nabla L
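
A sketch of the step from the constrained problem to the descent direction, assuming \delta is small enough that the KL term can be replaced by its second-order expansion (the zeroth- and first-order terms vanish) and introducing a Lagrange multiplier \lambda that does not appear in the original:

```latex
\text{KL}(p_\theta \| p_{\theta+\delta}) \approx \tfrac{1}{2} \delta^\top F(\theta) \delta

\mathcal{L}(\delta, \lambda) = \langle \nabla L, \delta \rangle + \lambda \left( \tfrac{1}{2} \delta^\top F \delta - \epsilon \right)

\nabla_\delta \mathcal{L} = \nabla L + \lambda F \delta = 0
\quad \Rightarrow \quad
\delta^* = -\tfrac{1}{\lambda} F^{-1} \nabla L, \qquad \lambda > 0

\tfrac{1}{2} \delta^{*\top} F \delta^* = \epsilon
\quad \Rightarrow \quad
\lambda = \sqrt{\frac{\nabla L^\top F^{-1} \nabla L}{2 \epsilon}}
```

The minimizer points along -F^{-1} \nabla L, which matches the sign of the update rule, and the trust-region radius \epsilon only sets the step length.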

Amari (1998), Neural Computation

Schulman et al. (2015), ICML