2. Core Training

Scaled Dot-Product Attention & Transformer Layers

Canonical Papers

Attention Is All You Need

Vaswani et al., 2017, NeurIPS

Core Mathematics

Single attention head:

\text{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where Q = XW_Q, K = XW_K, V = XW_V. Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates their outputs.
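
Below is a minimal sketch of the single-head equation above in plain NumPy; the sequence length, dimensions, and random seed are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)               # each query's row sums to 1
    return weights @ V                               # weighted average of values

# Toy usage: 4 tokens, model dim 8, one head with d_k = d_v = 8 (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```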

A standard transformer block:

\begin{aligned}
H' &= \mathrm{MHA}(\mathrm{LN}(H)) + H \\
H^{\text{out}} &= \mathrm{MLP}(\mathrm{LN}(H')) + H'
\end{aligned}
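
A compact sketch of this pre-LN block, assuming PyTorch and its built-in nn.MultiheadAttention; the model width, head count, and MLP expansion are illustrative defaults rather than values fixed by the text.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN transformer block: H' = MHA(LN(H)) + H;  H_out = MLP(LN(H')) + H'."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h):
        x = self.ln1(h)
        h = h + self.attn(x, x, x, need_weights=False)[0]  # self-attention + residual
        h = h + self.mlp(self.ln2(h))                       # feed-forward + residual
        return h

h = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
print(Block()(h).shape)       # torch.Size([2, 16, 512])
```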

Why It Matters for Modern Models

  • GPT-4, Claude, Gemini, Llama: giant stacks of decoder-only transformer blocks with causal self-attention (see the causal-mask sketch after this list)
  • Stable Diffusion: U-Net with self- and cross-attention between image latents and text embeddings
  • Sora: diffusion transformer operating on spacetime patches (video tokens)
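
The causal self-attention in the first bullet only changes the score matrix: entries above the diagonal are set to -inf before the softmax, so token i can attend only to positions <= i. A minimal sketch with placeholder scores (the values are random, purely for illustration):

```python
import numpy as np

n = 5
scores = np.random.default_rng(1).normal(size=(n, n))   # stand-in for Q K^T / sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)        # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)                # future positions get -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # lower-triangular rows: token i ignores tokens after i
```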

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Geometric picture of Q–K dot products as measuring angles between feature directions, and how softmax turns those into a distribution of "who to copy from" (see the toy sketch after this list)
  • How multi-head attention effectively builds a set of learned kernels over positions/features, and why this is strictly more flexible than fixed kernels
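
As a toy illustration of the first point: keys are unit directions in 2-D, the query nearly parallels key 0, and the softmax over scaled dot products becomes a "who to copy from" distribution concentrated on position 0. All numbers here are made up for the example.

```python
import numpy as np

# Three keys pointing in different directions; the query almost parallels key 0.
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])   # hypothetical "content" stored at each position
q = np.array([0.95, 0.05])

scores = K @ q / np.sqrt(q.shape[0])     # dot products ~ |q||k| cos(angle), scaled by sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()
print(np.round(weights, 2))              # roughly [0.56 0.30 0.15]: mostly "copy" position 0
print((weights @ V).round(1))            # roughly [15.9]: pulled toward V[0] = 10
```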

Connections