Legacy Concept Lab

Bregman Divergence & Mirror Descent

Explains why different geometries suit different problems—simplex needs KL, not Euclidean

Concept 70 of 100TheoryPhase 10

#70BregmanTheory

key equationD_\Phi(p, q) = \Phi(p) - \Phi(q) - \langle \nabla\Phi(q), p - q \rangle

Phase 10: Mathematical foundations & information geometryConcept 70 of 100

Why It Matters for Modern Models

Explains why different geometries suit different problems—simplex needs KL, not Euclidean
Exponentiated gradient (softmax updates) is mirror descent with entropy potential
Natural gradient is Bregman geometry with Fisher information as the potential

What is still poorly explained in textbooks and papers:

Euclidean gradient descent is ONE choice; mirror descent is the general framework
The "mirror map" transforms to dual coordinates where steps are linear
KL divergence is the Bregman divergence for probability distributions

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

D_\Phi(p, q) = \Phi(p) - \Phi(q) - \langle \nabla\Phi(q), p - q \rangle

Bregman divergence generated by strictly convex $\Phi$ :

D_\Phi(p, q) = \Phi(p) - \Phi(q) - \langle \nabla\Phi(q), p - q \rangle

Mirror descent:

x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \langle g_t, x \rangle + \frac{1}{\eta} D_\Phi(x, x_t) \right\}

Special cases:

$\Phi(x) = \frac{1}{2}\|x\|^2$ → Euclidean GD, $D_\Phi$ = squared distance
$\Phi(x) = \sum_i x_i \log x_i$ → KL divergence on simplex (exponentiated gradient)

Beck & Teboulle2003Operations Research Letters

Cesa-Bianchi & Lugosi2006Cambridge University Press

Explore this concept from different angles — like a mathematician would.