Theory

Neural Tangent Kernel & Infinite-Width Limits

Canonical Papers

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Jacot et al., NeurIPS 2018

Core Mathematics

Define a network f_\theta(x) with parameters \theta. The neural tangent kernel (NTK) is:

\Theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')
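
As a concrete finite-width check, here is a minimal sketch assuming JAX; the helper names init_mlp, mlp_apply, and empirical_ntk are illustrative, not from the paper. It takes the parameter gradient of the scalar output at two inputs and contracts the two gradient pytrees, which is exactly the inner product \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x').

import jax
import jax.numpy as jnp

def init_mlp(key, widths):
    # Bias-free MLP weights with 1/sqrt(fan_in) scaling (a simplification of the
    # paper's NTK parameterization; sufficient for an empirical-kernel demo).
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in))
    return params

def mlp_apply(params, x):
    # Scalar-output network f_theta(x) with ReLU hidden layers.
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(h @ W)
    return (h @ params[-1]).squeeze()

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>, summed over all weight matrices.
    g1 = jax.grad(mlp_apply)(params, x1)
    g2 = jax.grad(mlp_apply)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

key = jax.random.PRNGKey(0)
params = init_mlp(key, widths=[4, 512, 512, 1])
print(empirical_ntk(params, jnp.ones(4), -jnp.ones(4)))

At finite width this kernel fluctuates with the random initialization; rerunning with larger widths and different seeds shows it concentrating toward its deterministic infinite-width value.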

In the infinite-width limit, this kernel becomes deterministic (independent of the random initialization) and remains constant during training. Under gradient flow, the network's output function then evolves as:

\partial_t f_t(x) = -\sum_{i} \Theta(x, x_i)\, \frac{\partial \ell(f_t(x_i), y_i)}{\partial f}

a kernel gradient flow in function space. For the squared loss this is a linear ODE, with exactly the dynamics of kernel regression under the kernel \Theta.
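
As a worked special case (a standard derivation in the NTK literature, stated here for full-batch gradient flow with unit learning rate), take the squared loss \ell(f, y) = \tfrac{1}{2}(f - y)^2. On the training inputs X with labels y, the dynamics integrate in closed form, and the prediction at a test point x follows along:

\partial_t f_t(X) = -\,\Theta(X, X)\bigl(f_t(X) - y\bigr)
\quad\Longrightarrow\quad
f_t(X) = y + e^{-\Theta(X, X)\, t}\bigl(f_0(X) - y\bigr)

f_t(x) = f_0(x) + \Theta(x, X)\,\Theta(X, X)^{-1}\bigl(I - e^{-\Theta(X, X)\, t}\bigr)\bigl(y - f_0(X)\bigr)

As t \to \infty the training predictions interpolate the labels, and the test prediction converges to kernel regression with kernel \Theta applied to the residual y - f_0(X).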

Key Equation
\Theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')


Why It Matters for Modern Models

  • NTK provides a mathematically clean limit where we can predict learning dynamics and generalization
  • Many mechanistic-interpretability arguments assume behavior "somewhere between" kernel-like and feature-learning regimes

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Most expositions are purely algebraic; what is missing is a geometric picture or animation of how function-space trajectories under the NTK differ from those produced by genuine feature learning

