Legacy Concept Lab

Fisher Information & Information Geometry

Fisher gives the natural metric on probability distributions—not Euclidean distance in parameters

Concept 55 of 100 · Theory · Phase 10: Mathematical foundations & information geometry

Why It Matters for Modern Models

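The Fisher information matrix is the natural (Riemannian) metric on a family of probability distributions; Euclidean distance between parameter vectors is not
It explains why KL penalties in RLHF/PPO act as geometric constraints: to second order, a KL budget bounds the Fisher-metric length of a policy update, so the penalty is not arbitrary regularization
It connects curvature to uncertainty: high Fisher information means the data pin the parameters down tightly (they are well identified)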
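
To make the first point concrete, here is a minimal sketch. It assumes a Bernoulli model with success probability theta, chosen here only for illustration; the helper names `kl_bernoulli` and `fisher_bernoulli` are hypothetical. The same Euclidean step of 0.05 is taken from two starting points, and the resulting KL divergence is compared with the Fisher information at each start.

```python
import math

def kl_bernoulli(a, b):
    """KL(Bernoulli(a) || Bernoulli(b)) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of Bernoulli(theta) with respect to theta."""
    return 1.0 / (theta * (1.0 - theta))

step = 0.05  # identical Euclidean step in both cases (illustrative value)

kl_center = kl_bernoulli(0.50, 0.50 + step)  # start in the middle of [0, 1]
kl_edge = kl_bernoulli(0.02, 0.02 + step)    # start near the boundary

print(f"center: KL = {kl_center:.5f}, Fisher = {fisher_bernoulli(0.50):.1f}")
print(f"edge:   KL = {kl_edge:.5f}, Fisher = {fisher_bernoulli(0.02):.1f}")
```

The step near the boundary changes the distribution far more than the same step in the middle, which is what the larger Fisher information there predicts.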

What is still poorly explained in textbooks and papers:

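Distance between parameters should mean "distinguishability of the distributions they define"; the Fisher metric measures exactly this, to second order in the parameter change
The Cramér-Rao bound: the variance of any unbiased estimator of θ from n i.i.d. samples is at least 1/(n·F(θ)), or F(θ)⁻¹/n in the matrix sense for several parameters, so more information means tighter estimates
Fisher is the Hessian of KL(p_θ₀ ‖ p_θ) with respect to θ, evaluated at θ=θ₀, making it a second-order object that does not require the loss Hessian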
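
The Cramér-Rao statement can be checked on a case where the bound is attained. The sketch below again assumes a Bernoulli model (an illustrative choice) and computes the exact variance of the sample mean by summing over the binomial distribution, then compares it with 1/(n·F). The values `theta = 0.3` and `n = 40` are arbitrary.

```python
import math

def fisher_bernoulli(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def sample_mean_variance(theta, n):
    """Exact variance of the sample mean k/n, where k ~ Binomial(n, theta)."""
    pmf = [math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
           for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(pmf))
    return sum(p * (k / n - mean) ** 2 for k, p in enumerate(pmf))

theta, n = 0.3, 40  # illustrative values
bound = 1.0 / (n * fisher_bernoulli(theta))   # Cramér-Rao lower bound
variance = sample_mean_variance(theta, n)     # variance of an unbiased estimator

print(f"Cramér-Rao bound:        {bound:.6f}")
print(f"sample-mean variance:    {variance:.6f}")
```

The sample mean is unbiased for theta and its variance equals the bound up to floating-point error, so for this model no unbiased estimator can do better.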


Key Equation

F_{ij}(\theta) = \mathbb{E}\left[\partial_i \log p_\theta \cdot \partial_j \log p_\theta\right]

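The Fisher information matrix measures how distinguishable p_\theta is from nearby distributions in the same family. Written out in full, with the expectation taken under the model itself: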

F_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right]

Equivalently, under regularity conditions that allow differentiation and integration to be exchanged:

F_{ij}(\theta) = -\mathbb{E}_{x \sim p_\theta}\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta_i \partial \theta_j}\right]
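
The agreement of the two forms can be verified numerically. The following sketch assumes a three-outcome categorical distribution parameterized by softmax logits (a model picked for illustration; all function names are hypothetical). The outer-product form uses the analytic softmax score, and the Hessian form uses central finite differences, so the two are computed independently.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, x):
    return math.log(softmax(logits)[x])

def fisher_outer(logits):
    """E_x[score score^T], using the softmax score: indicator(x) - p."""
    p = softmax(logits)
    k = len(p)
    F = [[0.0] * k for _ in range(k)]
    for x in range(k):
        score = [(1.0 if i == x else 0.0) - p[i] for i in range(k)]
        for i in range(k):
            for j in range(k):
                F[i][j] += p[x] * score[i] * score[j]
    return F

def fisher_neg_hessian(logits, h=1e-4):
    """-E_x[Hessian of log p(x)], Hessian by central finite differences."""
    p = softmax(logits)
    k = len(p)
    F = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            for x in range(k):
                def f(di, dj):
                    z = list(logits)
                    z[i] += di
                    z[j] += dj
                    return log_prob(z, x)
                d2 = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h * h)
                F[i][j] -= p[x] * d2
    return F

theta = [0.5, -1.0, 2.0]  # illustrative logits
F_outer = fisher_outer(theta)
F_hess = fisher_neg_hessian(theta)
max_gap = max(abs(F_outer[i][j] - F_hess[i][j])
              for i in range(3) for j in range(3))
print(f"largest entrywise difference: {max_gap:.2e}")
```

The remaining difference comes from finite-difference error and shrinks or grows with the step `h`; it is not a discrepancy between the two definitions.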

KL as local metric: for a small parameter change d\theta, to second order:

\text{KL}(p_\theta \| p_{\theta + d\theta}) \approx \frac{1}{2} d\theta^\top F(\theta) d\theta
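
A quick numerical check of this approximation, as a sketch under stated assumptions: the model is a categorical distribution with softmax logits, for which the Fisher matrix is diag(p) - p p^T, and the logits and step direction are arbitrary illustrative values. The ratio of exact KL to the quadratic form should approach 1 as the step shrinks.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_softmax(logits):
    """Fisher matrix of a softmax-parameterized categorical: diag(p) - p p^T."""
    p = softmax(logits)
    k = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(k)]
            for i in range(k)]

def half_quadratic_form(F, d):
    """(1/2) d^T F d."""
    k = len(d)
    return 0.5 * sum(d[i] * F[i][j] * d[j] for i in range(k) for j in range(k))

def kl_ratio(theta, direction, eps):
    """Exact KL divided by its second-order approximation for step eps * direction."""
    d = [eps * v for v in direction]
    shifted = [t + di for t, di in zip(theta, d)]
    exact = kl(softmax(theta), softmax(shifted))
    approx = half_quadratic_form(fisher_softmax(theta), d)
    return exact / approx

theta = [0.5, -1.0, 2.0]       # illustrative logits
direction = [1.0, -2.0, 0.5]   # illustrative step direction
for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps = {eps:g}: KL / quadratic form = {kl_ratio(theta, direction, eps):.6f}")
```

The leftover gap at larger steps is the third-order term that the quadratic approximation drops.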

Amari (2016), Springer

Amari (1998), Neural Computation
