Optimization
Gradient descent as physics in the loss landscape
The Loss Landscape
Neural network training is optimization in high-dimensional space. The loss function defines a landscape, and we seek to find its valleys — points where the model performs well.
But this landscape is not simple. It has saddle points, narrow ravines, and sharp minima that generalize poorly. The optimizer we choose determines how we navigate this terrain.
Vanilla Gradient Descent
The simplest approach: step directly opposite the gradient. The update rule is a single line:
θₜ₊₁ = θₜ - η∇L(θₜ)
Watch how vanilla SGD struggles with the narrow ravine of the Rosenbrock function. It oscillates perpendicular to the valley while making slow progress along it.
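A minimal NumPy sketch of this update on the Rosenbrock function; the starting point, learning rate, and step count are illustrative choices, not the demo's exact settings:

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    """Classic banana-shaped valley with its minimum at (a, a^2) = (1, 1)."""
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([-2 * (a - x) - 4 * b * x * (y - x**2),
                     2 * b * (y - x**2)])

theta = np.array([-1.5, 2.0])   # arbitrary starting point on the rim of the valley
eta = 1e-3                      # learning rate; much larger and the iterates diverge
for t in range(5000):
    theta = theta - eta * rosenbrock_grad(theta)

print(rosenbrock(theta), theta)  # creeps toward the minimum at (1, 1), but slowly
```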
Momentum
The physics analogy: a ball rolling downhill accumulates velocity. Momentum smooths out oscillations and accelerates through flat regions:
vₜ = βvₜ₋₁ + η∇L(θₜ)
θₜ₊₁ = θₜ - vₜ
Select "Momentum" above and observe how the trajectory becomes smoother, cutting through the ravine instead of bouncing between walls.
Adam: Adaptive Moments
Adam combines momentum with adaptive per-parameter learning rates. It maintains both first and second moment estimates of the gradient:
mₜ = β₁mₜ₋₁ + (1-β₁)gₜ
vₜ = β₂vₜ₋₁ + (1-β₂)gₜ²
θₜ₊₁ = θₜ - η·m̂ₜ/(√v̂ₜ + ε)
The hats denote bias-corrected estimates (m and v start at zero, so early steps are rescaled). The square root of the second moment acts as a per-parameter step-size divisor: dimensions with consistently large gradients get smaller steps, preventing overshooting.
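A sketch of one Adam step, including the bias correction that produces m̂ₜ and v̂ₜ; the default hyperparameters below are the commonly used ones, not anything specific to this demo:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)                # undo the bias toward zero at early steps
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Here m and v start as zero arrays shaped like θ; the bias correction matters most during the first few hundred steps.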
Muon: Orthogonal Updates
Muon takes a different approach: orthogonalize the update matrix. This prevents features from collapsing and maintains diversity in what neurons learn.
Instead of rescaling gradients element-wise, Muon replaces each update matrix with its nearest semi-orthogonal matrix (a projection onto the Stiefel manifold), so every direction in the step carries equal magnitude.
This is particularly powerful for large language models, where maintaining diverse feature representations is crucial.
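A rough sketch of the idea for a single weight matrix; real Muon implementations approximate the orthogonalization with a few Newton-Schulz iterations rather than an explicit SVD, and the learning rate and momentum values here are placeholders:

```python
import numpy as np

def orthogonalize(update):
    """Replace a matrix by its nearest semi-orthogonal matrix U·Vᵀ (drop the singular values)."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

def muon_like_step(W, grad, momentum, eta=0.02, beta=0.95):
    """Momentum accumulation followed by orthogonalization of the update matrix."""
    momentum = beta * momentum + grad
    W = W - eta * orthogonalize(momentum)   # every singular direction of the step has unit scale
    return W, momentum
```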
Edge of Stability
A surprising phenomenon: neural networks train at the "edge of stability", where the largest eigenvalue of the loss Hessian hovers around 2/η, with η the learning rate:
λmax(∇²L) ≈ 2/η during training
Loss decreases non-monotonically
The optimizer self-organizes to this critical point, where conventional stability analysis predicts divergence but training succeeds anyway.
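One way to observe this is to track λmax during training via power iteration on Hessian-vector products; a PyTorch sketch, where loss_fn is assumed to recompute the loss on a fixed batch:

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=20):
    """Estimate the largest eigenvalue of the loss Hessian by power iteration."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        hv = torch.autograd.grad(grads, params, grad_outputs=v)   # Hessian-vector product Hv
        norm = torch.sqrt(sum((h**2).sum() for h in hv))
        v = [h / norm for h in hv]                                # re-normalized direction
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    hv = torch.autograd.grad(grads, params, grad_outputs=v)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()       # Rayleigh quotient vᵀHv

# At the edge of stability, top_hessian_eigenvalue(...) * eta hovers around 2.
```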
Grokking
Sometimes models suddenly generalize long after perfectly fitting training data. This "grokking" reveals a phase transition in learning:
Phase 1: Memorization (training loss → 0, test loss high)
Phase 2: Comprehension (test loss suddenly drops)
The model first memorizes, then discovers the underlying pattern. Weight decay and longer training push models toward generalizing solutions.
DPO vs RLHF
Aligning language models to human preferences. RLHF trains a reward model on preference data and then optimizes the policy against it with reinforcement learning. DPO optimizes directly on the preference pairs:
L = -log σ( β·log(π(y_w)/π_ref(y_w)) - β·log(π(y_l)/π_ref(y_l)) )
DPO is simpler: no reward model, no RL. It implicitly defines a reward through the optimal policy, making alignment more stable and efficient.
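In code the loss is a few lines. The inputs are the summed per-token log-probabilities of each chosen (y_w) and rejected (y_l) response under the trained policy and the frozen reference model; β = 0.1 is a common but arbitrary choice:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (all inputs are 1-D tensors of log-probs)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log π(y_w)/π_ref(y_w)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log π(y_l)/π_ref(y_l)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```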
3D Loss Landscape
Visualizing the high-dimensional loss surface by projecting onto random directions. Sharp minima may generalize poorly; flat minima are more robust:
L(θ + δ) ≈ L(θ) + ½δᵀHδ   (near a minimum, where the gradient term vanishes)
Large eigenvalues of H → sharp minimum
The landscape visualization helps understand why certain optimizers find better solutions and how regularization affects the geometry of minima.
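A sketch of the projection: choose two random directions, sweep a grid of offsets around θ, and evaluate the loss at every grid point (the filter-wise normalization used in published loss-landscape visualizations is omitted here for brevity):

```python
import numpy as np

def loss_surface_slice(loss_fn, theta, grid=25, radius=1.0, seed=0):
    """Loss evaluated on a 2-D slice through theta spanned by two random unit directions."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(theta.shape); d1 /= np.linalg.norm(d1)
    d2 = rng.standard_normal(theta.shape); d2 /= np.linalg.norm(d2)
    offsets = np.linspace(-radius, radius, grid)
    surface = np.array([[loss_fn(theta + a * d1 + b * d2) for b in offsets]
                        for a in offsets])
    return offsets, surface   # plot e.g. with matplotlib's plot_surface
```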
Backprop Through Attention
Gradients flow through the attention mechanism in complex patterns. Understanding this flow reveals why certain architectures train better:
∂L/∂Q = ∂L/∂A · ∂A/∂Q where A = softmax(QKᵀ/√d)
Gradients must flow through softmax Jacobian
Skip connections provide direct gradient highways, explaining why residual architectures are essential for deep transformers.
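The softmax step in that chain has a closed-form vector-Jacobian product; the sketch below checks it against PyTorch autograd (the names and shapes are illustrative stand-ins, not a full attention layer):

```python
import torch

def softmax_vjp(scores, upstream):
    """dL/d(scores) given dL/dA, using the softmax Jacobian:
    dL/ds = A ⊙ (dL/dA - sum(dL/dA ⊙ A, dim=-1, keepdim=True))."""
    A = torch.softmax(scores, dim=-1)
    return A * (upstream - (upstream * A).sum(dim=-1, keepdim=True))

scores = torch.randn(4, 4, requires_grad=True)   # stand-in for QKᵀ/√d
upstream = torch.randn(4, 4)                     # stand-in for dL/dA
torch.softmax(scores, dim=-1).backward(upstream)
print(torch.allclose(scores.grad, softmax_vjp(scores.detach(), upstream), atol=1e-6))  # True
```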
Task Vectors
Fine-tuned models differ from base models by a "task vector" in weight space. These vectors support arithmetic operations:
τ = θ_fine - θ_base
θ_new = θ_base + α·τ₁ + β·τ₂
Add task vectors to combine capabilities. Negate to remove behaviors. This enables editing model knowledge without retraining.
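Because a task vector is just a parameter-wise difference, the arithmetic reduces to a few dictionary operations over state dicts; a sketch assuming all checkpoints share the same architecture and floating-point parameters:

```python
def task_vector(finetuned_state, base_state):
    """τ = θ_fine - θ_base, computed key by key over the state dicts."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vectors(base_state, taus, coeffs):
    """θ_new = θ_base + Σ coeff·τ; negative coefficients subtract a behavior."""
    new_state = {k: p.clone() for k, p in base_state.items()}
    for tau, c in zip(taus, coeffs):
        for k in new_state:
            new_state[k] = new_state[k] + c * tau[k]
    return new_state

# e.g. combine two skills and partially remove a third (coefficients are illustrative):
# model.load_state_dict(apply_task_vectors(base, [tau_a, tau_b, tau_c], [1.0, 1.0, -0.5]))
```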
Neural Scaling Laws
Model performance follows predictable power laws with compute, data, and parameters:
L(N, D) ≈ A·N^(-α) + B·D^(-β) + L∞
Compute-optimal: N ∝ C^0.5, D ∝ C^0.5
These laws enable predicting performance of larger models and optimal allocation of compute budget between model size and training data.
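A sketch of how such a fit gets used. The functional form follows the equation above; the constants are placeholders in the rough range of published fits and should be fit to your own training runs, and C ≈ 6·N·D is the usual FLOP estimate for dense transformers:

```python
import numpy as np

def scaling_law_loss(N, D, A=400.0, B=400.0, alpha=0.34, beta=0.28, L_inf=1.7):
    """L(N, D) ≈ A·N^(-α) + B·D^(-β) + L∞ with placeholder constants."""
    return A * N**-alpha + B * D**-beta + L_inf

# Allocate a fixed budget C ≈ 6·N·D by sweeping model sizes and picking the loss minimizer.
C = 1e21                                  # FLOPs budget (illustrative)
Ns = np.logspace(8, 11, 400)              # candidate parameter counts
Ds = C / (6 * Ns)                         # tokens implied by the budget
best = np.argmin(scaling_law_loss(Ns, Ds))
print(f"optimal N ≈ {Ns[best]:.2e} params, D ≈ {Ds[best]:.2e} tokens")
```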