Theory

Neural Tangent Kernel & Infinite-Width Limits

Canonical Papers

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Jacot et al., NeurIPS 2018

Core Mathematics

Define a network f_\theta(x) with parameters \theta. The neural tangent kernel (NTK) is:

\Theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')
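
As a concrete finite-width check, here is a minimal sketch assuming JAX; the helper names init_mlp, mlp_apply, and empirical_ntk are illustrative, not from the paper. It takes the parameter gradient of the scalar output at two inputs and contracts the two gradient pytrees, which is exactly the inner product \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x').

import jax
import jax.numpy as jnp

def init_mlp(key, widths):
    # Bias-free MLP weights with 1/sqrt(fan_in) scaling (a simplification of the
    # paper's NTK parameterization; sufficient for an empirical-kernel demo).
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in))
    return params

def mlp_apply(params, x):
    # Scalar-output network f_theta(x) with ReLU hidden layers.
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(h @ W)
    return (h @ params[-1]).squeeze()

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>, summed over all weight matrices.
    g1 = jax.grad(mlp_apply)(params, x1)
    g2 = jax.grad(mlp_apply)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

key = jax.random.PRNGKey(0)
params = init_mlp(key, widths=[4, 512, 512, 1])
print(empirical_ntk(params, jnp.ones(4), -jnp.ones(4)))

At finite width this kernel fluctuates with the random initialization; rerunning with larger widths and different seeds shows it concentrating toward its deterministic infinite-width value.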

In the infinite-width limit, this kernel becomes deterministic (independent of the random initialization) and remains constant during training. Under gradient flow, the network's output function then evolves as:

\partial_t f_t(x) = -\sum_{i} \Theta(x, x_i)\, \frac{\partial \ell(f_t(x_i), y_i)}{\partial f}

a kernel gradient flow in function space. For the squared loss this is a linear ODE, with exactly the dynamics of kernel regression under the kernel \Theta.
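
As a worked special case (a standard derivation in the NTK literature, stated here for full-batch gradient flow with unit learning rate), take the squared loss \ell(f, y) = \tfrac{1}{2}(f - y)^2. On the training inputs X with labels y, the dynamics integrate in closed form, and the prediction at a test point x follows along:

\partial_t f_t(X) = -\,\Theta(X, X)\bigl(f_t(X) - y\bigr)
\quad\Longrightarrow\quad
f_t(X) = y + e^{-\Theta(X, X)\, t}\bigl(f_0(X) - y\bigr)

f_t(x) = f_0(x) + \Theta(x, X)\,\Theta(X, X)^{-1}\bigl(I - e^{-\Theta(X, X)\, t}\bigr)\bigl(y - f_0(X)\bigr)

As t \to \infty the training predictions interpolate the labels, and the test prediction converges to kernel regression with kernel \Theta applied to the residual y - f_0(X).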

Key Equation
\Theta(x, x') = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x')


Why It Matters for Modern Models

  • NTK provides a mathematically clean limit where we can predict learning dynamics and generalization
  • Many mechanistic-interpretability arguments assume behavior "somewhere between" kernel-like and feature-learning regimes

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Most expositions are purely algebraic; what is missing is a geometric picture or animation of how function-space trajectories under the NTK differ from those produced by genuine feature learning

