Legacy Concept Lab

Fisher Information & Information Geometry

Fisher gives the natural metric on probability distributions—not Euclidean distance in parameters

Concept 55 of 100 · Theory · Phase 10: Mathematical foundations & information geometry

Why It Matters for Modern Models

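  • The Fisher information matrix is the natural Riemannian metric on a family of probability distributions; Euclidean distance between parameter vectors is not
  • Explains why KL penalties in RLHF/PPO are geometric constraints, not arbitrary regularization: a small KL budget bounds how far the policy moves as a distribution
  • Connects curvature to uncertainty: high Fisher information means the log-likelihood is sharply curved, so the data identify the parameters well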
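
To make the first point concrete, here is a minimal sketch using a Bernoulli distribution as a stand-in example (the choice of distribution, the step size, and the function names are all assumptions for illustration). It takes the same Euclidean step in the parameter p at two locations and compares how far the distribution actually moves in KL.

```python
import math

def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(p):
    """Fisher information of Bernoulli(p) with respect to p."""
    return 1.0 / (p * (1.0 - p))

step = 0.005  # identical Euclidean step in the parameter p (illustrative value)
kl_center = kl_bernoulli(0.5, 0.5 + step)
kl_edge = kl_bernoulli(0.01, 0.01 + step)

print(f"KL near p=0.50: {kl_center:.2e}  (Fisher = {fisher_bernoulli(0.5):.1f})")
print(f"KL near p=0.01: {kl_edge:.2e}  (Fisher = {fisher_bernoulli(0.01):.1f})")
```

The step near p = 0.01 costs more than ten times as much KL as the same step near p = 0.5, which is what the larger Fisher information there predicts.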

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Distance in parameter space should mean "distinguishability of distributions"; the Fisher metric captures exactly this
  • The Cramér-Rao bound: the variance of any unbiased estimator is ≥ 1/Fisher (in the matrix case, its covariance is ⪰ F⁻¹), so more information means tighter estimates
  • Fisher is the Hessian of KL(p_θ₀ ‖ p_θ) with respect to θ, evaluated at θ = θ₀, making it a second-order object without needing the loss Hessian
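
The Cramér-Rao bound can be checked numerically. The sketch below assumes a Bernoulli(p) model with n independent draws, for which the total Fisher information is n / (p(1 − p)), and uses the sample mean as the unbiased estimator; the function names and the values of p, n, and the trial count are illustrative choices, not part of the original text.

```python
import random

def bernoulli_fisher(p):
    """Fisher information of a single Bernoulli(p) draw: 1 / (p (1 - p))."""
    return 1.0 / (p * (1.0 - p))

def sample_mean_variance(p, n, trials, seed=0):
    """Empirical variance of the sample mean over repeated experiments."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / (trials - 1)

p, n = 0.3, 50
# Information adds over independent draws, so the bound is 1 / (n * I(p)).
cramer_rao_bound = 1.0 / (n * bernoulli_fisher(p))
empirical_variance = sample_mean_variance(p, n, trials=20000)

print(f"Cramér-Rao bound:   {cramer_rao_bound:.5f}")
print(f"Empirical variance: {empirical_variance:.5f}")
```

For this model the sample mean attains the bound, so the two numbers should agree up to Monte Carlo noise. Biased estimators are not covered by this form of the bound.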

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
F_{ij}(\theta) = \mathbb{E}\left[\partial_i \log p_\theta \cdot \partial_j \log p_\theta\right]

The Fisher Information Matrix measures how distinguishable distributions are:

F_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right]
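
The expectation in this definition can be estimated by sampling. As a minimal sketch, assume a univariate Gaussian parameterized by θ = (μ, σ), whose Fisher matrix has the known closed form diag(1/σ², 2/σ²); the helper names `gaussian_score` and `fisher_monte_carlo` are hypothetical, chosen only for this example.

```python
import random

def gaussian_score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return (d_mu, d_sigma)

def fisher_monte_carlo(mu, sigma, n_samples=200000, seed=0):
    """Average the outer product of the score over samples x ~ p_theta."""
    rng = random.Random(seed)
    F = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        s = gaussian_score(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                F[i][j] += s[i] * s[j] / n_samples
    return F

mu, sigma = 1.0, 2.0
F_estimate = fisher_monte_carlo(mu, sigma)
F_exact = [[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]

for row_est, row_exact in zip(F_estimate, F_exact):
    print([round(v, 3) for v in row_est], row_exact)
```

The samples must come from p_θ itself, at the same θ where the score is evaluated. Averaging over a fixed dataset instead gives the so-called empirical Fisher, which is a different quantity.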

Equivalently (under regularity conditions that allow differentiating under the integral sign):

F_{ij}(\theta) = -\mathbb{E}_{x \sim p_\theta}\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta_i \partial \theta_j}\right]
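
A short derivation sketch of why the two forms agree, assuming p_θ(x) > 0 and that derivatives with respect to θ can be exchanged with the integral over x (the shorthand ∂_i for ∂/∂θ_i follows the key equation):

```latex
\begin{align*}
\partial_i \partial_j \log p_\theta(x)
  &= \frac{\partial_i \partial_j p_\theta(x)}{p_\theta(x)}
     - \frac{\partial_i p_\theta(x)\, \partial_j p_\theta(x)}{p_\theta(x)^2} \\
  &= \frac{\partial_i \partial_j p_\theta(x)}{p_\theta(x)}
     - \partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x), \\[4pt]
\mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial_i \partial_j p_\theta(x)}{p_\theta(x)}\right]
  &= \int \partial_i \partial_j p_\theta(x)\, dx
   = \partial_i \partial_j \int p_\theta(x)\, dx
   = \partial_i \partial_j 1 = 0, \\[4pt]
-\mathbb{E}_{x \sim p_\theta}\!\left[\partial_i \partial_j \log p_\theta(x)\right]
  &= \mathbb{E}_{x \sim p_\theta}\!\left[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\right]
   = F_{ij}(\theta).
\end{align*}
```

The middle line is where regularity is used: the normalization of p_θ does not depend on θ, so its second derivative vanishes.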

KL as local metric: For small parameter changes:

\text{KL}(p_\theta \| p_{\theta + d\theta}) \approx \frac{1}{2} d\theta^\top F(\theta) d\theta
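
This quadratic approximation can be checked against a closed-form KL. The sketch below assumes a univariate Gaussian with θ = (μ, σ), for which F(θ) = diag(1/σ², 2/σ²), and shrinks the step to watch the ratio of exact KL to the quadratic form approach 1; the function names and step sizes are illustrative.

```python
import math

def kl_gaussian(mu0, sigma0, mu1, sigma1):
    """KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats."""
    return (math.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1) ** 2) / (2 * sigma1**2)
            - 0.5)

def fisher_quadratic_form(sigma, d_mu, d_sigma):
    """0.5 * dtheta^T F dtheta with F = diag(1/sigma^2, 2/sigma^2)."""
    return 0.5 * (d_mu**2 / sigma**2 + 2 * d_sigma**2 / sigma**2)

mu, sigma = 0.0, 1.0
ratios = {}
for step in (0.1, 0.01, 0.001):
    exact = kl_gaussian(mu, sigma, mu + step, sigma + step)
    approx = fisher_quadratic_form(sigma, step, step)
    ratios[step] = exact / approx
    print(f"step={step}: exact={exact:.3e}  quadratic={approx:.3e}  ratio={ratios[step]:.4f}")
```

The leftover error is third order in the step, which is why the approximation only holds for small parameter changes.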

Canonical Papers

Information Geometry and Its Applications

Amari, 2016, Springer

Natural Gradient Works Efficiently in Learning

Amari, 1998, Neural Computation
