Scaled Dot-Product Attention & Transformer Layers
Canonical Papers
Attention Is All You Need
Core Mathematics
Single attention head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$ and $d_k$ is the key dimension. Multi-head attention runs several such heads in parallel, concatenates their outputs, and applies an output projection $W_O$.
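A minimal NumPy sketch of these formulas (the shapes, weight initialization, and per-head split below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)     # each row is a distribution over positions
    return weights @ V                     # weighted average of value vectors

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    # X: (seq_len, d_model); all projection matrices: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_O   # concatenate heads, then project

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)  # (4, 8)
```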
A standard transformer block (post-norm, as in the original paper) wraps two sub-layers in residual connections and layer normalization:

$$h = \mathrm{LayerNorm}\big(x + \mathrm{MultiHead}(x)\big), \qquad y = \mathrm{LayerNorm}\big(h + \mathrm{FFN}(h)\big)$$

where $\mathrm{FFN}(h) = \max(0,\, hW_1 + b_1)\,W_2 + b_2$ is a position-wise feed-forward network. Most modern decoder-only models instead place the LayerNorm at the start of each residual branch (pre-norm), but the block structure is otherwise the same.
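A matching sketch of this block, reusing the `multi_head_attention` helper from the previous snippet; the feed-forward width of 4 × `d_model` and the omission of LayerNorm's learnable gain/bias are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance
    # (learnable gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: expand, ReLU, project back down.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_params, ffn_params):
    # Post-norm residual structure: h = LN(x + MHA(x)); y = LN(h + FFN(h)).
    h = layer_norm(x + multi_head_attention(x, *attn_params))
    return layer_norm(h + ffn(h, *ffn_params))

# Continuing the example above (d_model = 8, num_heads = 2):
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
y = transformer_block(X, (W_Q, W_K, W_V, W_O, num_heads), (W1, b1, W2, b2))
print(y.shape)  # (4, 8): same shape as the input, so blocks stack
```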
Why It Matters for Modern Models
- GPT-4, Claude, Gemini, Llama: giant stacks of decoder-only transformer blocks with causal self-attention (masking sketched after this list)
- Stable Diffusion: U-Net with self- and cross-attention between image latents and text embeddings
- Sora: diffusion transformer operating on spacetime patches (video tokens)
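The "causal" part of the first bullet is just a mask applied before the softmax, so each token attends only to itself and earlier positions. A minimal, self-contained NumPy sketch (shapes are illustrative):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper-triangular (future) positions get -inf, so softmax assigns them weight 0.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(causal_self_attention(Q, K, V).shape)  # (4, 8); row i ignores positions > i
```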
Missing Intuition
What is still poorly explained in textbooks and papers:
- Geometric picture of Q–K dot products as measuring angles between feature directions, and how softmax turns those into a distribution of "who to copy from" (a tiny numeric demo follows this list)
- How multi-head attention effectively builds a set of learned kernels over positions/features, and why this is strictly more flexible than fixed kernels
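To make the first point concrete, here is a tiny made-up example in NumPy: the query's dot product with each key measures how aligned their directions are, and the softmax turns those similarities into weights over whose value vector gets copied.

```python
import numpy as np

# Hypothetical 2-D query and keys: k0 points the same way as q, k1 is orthogonal.
q  = np.array([1.0, 0.0])
k0 = np.array([2.0, 0.0])        # aligned with q  -> large dot product
k1 = np.array([0.0, 2.0])        # orthogonal to q -> zero dot product
v  = np.array([[10.0], [20.0]])  # value "content" attached to each key

scores = np.array([q @ k0, q @ k1]) / np.sqrt(2)   # scaled by sqrt(d_k), d_k = 2
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> "who to copy from"
print(weights)       # ~[0.80, 0.20]: mostly copies from the aligned key
print(weights @ v)   # weighted average of the values, pulled toward 10
```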