Optimization

Gradient descent as physics in the loss landscape

The Loss Landscape

Neural network training is optimization in high-dimensional space. The loss function defines a landscape, and we seek to find its valleys — points where the model performs well.

But this landscape is not simple. It has saddle points, narrow ravines, and sharp minima that generalize poorly. The optimizer we choose determines how we navigate this terrain.

Vanilla Gradient Descent

The simplest approach: move opposite to the gradient. The update rule is almost trivially simple:

SGD Update

θ_{t+1} = θ_t − η ∇L(θ_t)

Watch how vanilla SGD struggles with the narrow ravine of the Rosenbrock function. It oscillates perpendicular to the valley while making slow progress along it.
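
For readers who prefer code to sliders, here is a minimal sketch (NumPy, not the demo's implementation) of vanilla gradient descent on the 2-D Rosenbrock function; the starting point, learning rate, and step count are arbitrary illustrative choices.

import numpy as np

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    # f(x, y) = (a - x)^2 + b * (y - x^2)^2
    dx = -2 * (a - x) - 4 * b * x * (y - x * x)
    dy = 2 * b * (y - x * x)
    return np.array([dx, dy])

theta = np.array([-1.0, 1.0])   # arbitrary starting point in the curved valley
eta = 0.001
for step in range(10000):
    theta = theta - eta * rosenbrock_grad(*theta)
print(theta)   # creeps slowly along the valley toward the minimum at (1, 1)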


Momentum

The physics analogy: a ball rolling downhill accumulates velocity. Momentum smooths out oscillations and accelerates through flat regions:

Momentum Update

v_t = β v_{t−1} + η ∇L(θ_t)
θ_{t+1} = θ_t − v_t

Select "Momentum" above and observe how the trajectory becomes smoother, cutting through the ravine instead of bouncing between walls.


Adam: Adaptive Moments

Adam combines momentum with adaptive per-parameter learning rates. It maintains both first and second moment estimates of the gradient:

Adam Update

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²
m̂_t = m_t / (1 − β₁ᵗ),   v̂_t = v_t / (1 − β₂ᵗ)   (bias correction)
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)

The second moment rescales the step size per parameter: dimensions with consistently large gradients get smaller steps, preventing overshooting.
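
A minimal sketch of the update above on the same kind of ill-conditioned quadratic; the hyperparameters are the usual defaults and the toy loss is only an illustration:

import numpy as np

def grad(theta):
    return np.array([100.0 * theta[0], theta[1]])   # steep in x, shallow in y

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)   # both coordinates shrink at comparable rates despite the 100x curvature gap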

Muon: Orthogonal Updates

Muon takes a different approach: orthogonalize the update matrix. This prevents features from collapsing and maintains diversity in what neurons learn.

Muon Philosophy

Instead of rescaling gradients element-wise, Muon orthogonalizes each weight matrix's momentum, replacing it with (an approximation of) the nearest semi-orthogonal matrix so that every singular direction of the update receives a step of comparable magnitude.

This is particularly powerful for large language models, where maintaining diverse feature representations is crucial.
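
A minimal sketch of the orthogonalization step at the heart of this idea. Here the nearest semi-orthogonal matrix is computed exactly with an SVD for clarity; Muon itself approximates it with a cheap Newton-Schulz iteration applied to each weight matrix's momentum. The shapes and learning rate below are illustrative assumptions.

import numpy as np

def orthogonalize(update):
    # Replace the update's singular values with 1, keeping its directions.
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad_momentum = rng.normal(size=(64, 32))    # momentum buffer for one layer
ortho_update = orthogonalize(grad_momentum)
weight = rng.normal(size=(64, 32))
lr = 0.02
weight -= lr * ortho_update                  # every singular direction gets an equal-magnitude step

# Sanity check: the orthogonalized update has (approximately) unit singular values.
print(np.linalg.svd(ortho_update, compute_uv=False)[:3])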

Edge of Stability

A surprising phenomenon: neural networks tend to train at the "edge of stability", where the largest eigenvalue of the loss Hessian hovers around 2/η (η being the learning rate):

Sharpness Dynamics

λ_max(∇²L) ≈ 2/η during training
Loss decreases non-monotonically

The optimizer self-organizes to this critical point, where conventional stability analysis predicts divergence but training succeeds anyway.
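
Sharpness in this sense is usually measured by power iteration on Hessian-vector products. A minimal sketch on a toy quadratic (where the largest eigenvalue is known in advance) shows the measurement being compared against the 2/η threshold:

import numpy as np

def sharpness(theta, grad_fn, iters=50, r=1e-3):
    # Power iteration on finite-difference Hessian-vector products.
    v = np.random.default_rng(0).normal(size=theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(theta + r * v) - grad_fn(theta - r * v)) / (2 * r)
        lam = float(v @ hv)                  # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Toy quadratic L = 0.5 * theta^T H theta, so the true sharpness is 10.
H = np.diag([10.0, 1.0, 0.1])
eta = 0.05
lam_max = sharpness(np.ones(3), lambda t: H @ t)
print(f"lambda_max ~ {lam_max:.2f} vs. stability threshold 2/eta = {2 / eta:.2f}")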

Grokking

Sometimes models suddenly generalize long after perfectly fitting training data. This "grokking" reveals a phase transition in learning:

Phase Transition

Phase 1: Memorization (training loss → 0, test loss high)
Phase 2: Comprehension (test loss suddenly drops)

The model first memorizes, then discovers the underlying pattern. Weight decay and longer training push models toward generalizing solutions.

DPO vs RLHF

Both methods align language models to human preferences. RLHF first trains a reward model and then optimizes the policy against it with reinforcement learning; DPO optimizes directly on preference pairs:

DPO Loss

L = −log σ( β log(π(y_w)/π_ref(y_w)) − β log(π(y_l)/π_ref(y_l)) )

DPO is simpler: no separate reward model and no RL loop. The reward is defined implicitly by the policy's log-ratio against the reference model, which makes alignment training more stable and efficient.
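
A minimal sketch of the loss above for a single preference pair; the summed log-probabilities are made-up numbers standing in for real model outputs:

import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards are the beta-scaled log-ratios against the reference model.
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# Example numbers (hypothetical): the policy already prefers the chosen response y_w.
print(dpo_loss(logp_w=-12.0, logp_l=-20.0, ref_logp_w=-15.0, ref_logp_l=-18.0))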

3D Loss Landscape

Visualizing the high-dimensional loss surface by projecting onto random directions. Sharp minima may generalize poorly; flat minima are more robust:

Sharpness & Generalization

L(θ + δ) ≈ L(θ) + ½ δᵀHδ   (near a minimum, where ∇L(θ) ≈ 0)
Large eigenvalues of H → sharp minimum

The landscape visualization helps understand why certain optimizers find better solutions and how regularization affects the geometry of minima.
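
A minimal sketch of the random-direction probe: draw two directions in weight space, normalize them, and evaluate the loss on a 2-D grid around the current parameters. loss_fn and theta are placeholders, and the per-filter normalization used for real networks is omitted for brevity:

import numpy as np

def landscape_slice(loss_fn, theta, extent=1.0, steps=25, seed=0):
    rng = np.random.default_rng(seed)
    d1 = rng.normal(size=theta.shape)
    d2 = rng.normal(size=theta.shape)
    d1 /= np.linalg.norm(d1)    # simple whole-vector normalization
    d2 /= np.linalg.norm(d2)
    alphas = np.linspace(-extent, extent, steps)
    return np.array([[loss_fn(theta + a * d1 + b * d2) for b in alphas]
                     for a in alphas])

# Toy example: a quadratic bowl in 10 dimensions.
grid = landscape_slice(lambda t: float(t @ t), np.zeros(10))
print(grid.shape)   # (25, 25); plot as a surface or contour map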

Backprop Through Attention

Gradients flow through the attention mechanism in complex patterns. Understanding this flow reveals why certain architectures train better:

Gradient Flow

∂L/∂Q = ∂L/∂A · ∂A/∂Q, where A = softmax(QKᵀ/√d)
Gradients must flow through softmax Jacobian

Skip connections provide direct gradient highways, explaining why residual architectures are essential for deep transformers.
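
A minimal sketch of that backward pass done by hand, showing the row-wise softmax Jacobian-vector product and the resulting gradient with respect to Q. The shapes are illustrative, and in practice autograd handles all of this:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_backward_q(Q, K, dA):
    # dA is the upstream gradient dL/dA.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # pre-softmax scores
    A = softmax(S)
    # Row-wise softmax Jacobian-vector product:
    dS = A * (dA - (dA * A).sum(axis=-1, keepdims=True))
    return dS @ K / np.sqrt(d)                     # dL/dQ

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
dA = rng.normal(size=(4, 6))
print(attention_backward_q(Q, K, dA).shape)        # (4, 8), same shape as Q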

Task Vectors

Fine-tuned models differ from base models by a "task vector" in weight space. These vectors support arithmetic operations:

Model Arithmetic

τ = θ_fine − θ_base
θ_new = θ_base + α·τ₁ + β·τ₂

Add task vectors to combine capabilities. Negate to remove behaviors. This enables editing model knowledge without retraining.
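
A minimal sketch of this arithmetic on parameter dictionaries; the keys, models, and scaling factors are hypothetical:

import numpy as np

def task_vector(theta_fine, theta_base):
    return {k: theta_fine[k] - theta_base[k] for k in theta_base}

def apply_task_vectors(theta_base, vectors_and_scales):
    new = {k: v.copy() for k, v in theta_base.items()}
    for tau, scale in vectors_and_scales:
        for k in new:
            new[k] += scale * tau[k]
    return new

rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4))}
fine_a = {"layer.weight": base["layer.weight"] + 0.1}   # stand-in for a fine-tuned model A
fine_b = {"layer.weight": base["layer.weight"] - 0.2}   # stand-in for a fine-tuned model B

tau_a, tau_b = task_vector(fine_a, base), task_vector(fine_b, base)
combined = apply_task_vectors(base, [(tau_a, 1.0), (tau_b, 0.5)])   # add capabilities
negated = apply_task_vectors(base, [(tau_a, -1.0)])                 # remove behavior A
print(combined["layer.weight"][0, 0], negated["layer.weight"][0, 0])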

Neural Scaling Laws

Model performance follows predictable power laws with compute, data, and parameters:

Chinchilla Scaling

L(N, D) = E + A/N^α + B/D^β
Optimal allocation: N_opt ∝ C^0.5, D_opt ∝ C^0.5

These laws enable predicting performance of larger models and optimal allocation of compute budget between model size and training data.
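
A minimal sketch of compute-optimal allocation under the commonly cited approximations C ≈ 6ND and roughly 20 training tokens per parameter; both are rules of thumb, not exact values from the scaling-law fits:

import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    # C ~ 6 * N * D and D ~ 20 * N  =>  N ~ sqrt(C / 120), D ~ 20 * N
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e22, 1e23):
    n, d = chinchilla_allocation(c)
    print(f"C={c:.0e}: N ~ {n:.2e} params, D ~ {d:.2e} tokens")   # both scale as C^0.5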