2. Core Training

Scaled Dot-Product Attention & Transformer Layers

Canonical Papers

Attention Is All You Need

Vaswani et al., 2017, NeurIPS

Core Mathematics

Single attention head:

\text{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where Q = XW_Q, K = XW_K, V = XW_V. Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates their outputs.
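
Below is a minimal sketch of the single-head equation above in plain NumPy; the sequence length, dimensions, and random seed are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)               # each query's row sums to 1
    return weights @ V                               # weighted average of values

# Toy usage: 4 tokens, model dim 8, one head with d_k = d_v = 8 (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```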

A standard transformer block:

\begin{aligned}
H' &= \mathrm{MHA}(\mathrm{LN}(H)) + H \\
H^{\text{out}} &= \mathrm{MLP}(\mathrm{LN}(H')) + H'
\end{aligned}
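
A compact sketch of this pre-LN block, assuming PyTorch and its built-in nn.MultiheadAttention; the model width, head count, and MLP expansion are illustrative defaults rather than values fixed by the text.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN transformer block: H' = MHA(LN(H)) + H;  H_out = MLP(LN(H')) + H'."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h):
        x = self.ln1(h)
        h = h + self.attn(x, x, x, need_weights=False)[0]  # self-attention + residual
        h = h + self.mlp(self.ln2(h))                       # feed-forward + residual
        return h

h = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
print(Block()(h).shape)       # torch.Size([2, 16, 512])
```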

Why It Matters for Modern Models

  • GPT-4, Claude, Gemini, Llama: giant stacks of decoder-only transformer blocks with causal self-attention (see the causal-mask sketch after this list)
  • Stable Diffusion: U-Net with self- and cross-attention between image latents and text embeddings
  • Sora: diffusion transformer operating on spacetime patches (video tokens)
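
The causal self-attention in the first bullet only changes the score matrix: entries above the diagonal are set to -inf before the softmax, so token i can attend only to positions <= i. A minimal sketch with placeholder scores (the values are random, purely for illustration):

```python
import numpy as np

n = 5
scores = np.random.default_rng(1).normal(size=(n, n))   # stand-in for Q K^T / sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)        # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)                # future positions get -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # lower-triangular rows: token i ignores tokens after i
```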

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Geometric picture of Q–K dot products as measuring angles between feature directions, and how softmax turns those into a distribution of "who to copy from" (see the toy sketch after this list)
  • How multi-head attention effectively builds a set of learned kernels over positions/features, and why this is strictly more flexible than fixed kernels
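
As a toy illustration of the first point: keys are unit directions in 2-D, the query nearly parallels key 0, and the softmax over scaled dot products becomes a "who to copy from" distribution concentrated on position 0. All numbers here are made up for the example.

```python
import numpy as np

# Three keys pointing in different directions; the query almost parallels key 0.
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])   # hypothetical "content" stored at each position
q = np.array([0.95, 0.05])

scores = K @ q / np.sqrt(q.shape[0])     # dot products ~ |q||k| cos(angle), scaled by sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()
print(np.round(weights, 2))              # roughly [0.56 0.30 0.15]: mostly "copy" position 0
print((weights @ V).round(1))            # roughly [15.9]: pulled toward V[0] = 10
```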

Connections