Domain Neighborhood
Attention & Transformers
The sequence model backbone: tokenization, self-attention, positional encodings, and the transformer block that powers modern LLMs.
Recommended Route
Start here, then follow the prerequisites forward.
This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.
- 01 Positional Encoding
How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.
12 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers. Check Scaled Dot-Product Attention & Transformer Layers first if the symbols feel slippery.
- 02 Rotary Position Embeddings (RoPE)
A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.
14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers. Why this follows: both pages keep the attention transformers / rope thread active.
- 03 Scaled Dot-Product Attention & Transformer Layers
The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.
18 min · code · demo · after Maximum Likelihood, Dot Product. Why this follows: both pages keep the attention transformers thread active.
- 04 Layer Normalization & RMSNorm
Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.
14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Residual Connections & Skip Connections. Why this follows: Layer Normalization & RMSNorm uses Scaled Dot-Product Attention & Transformer Layers directly.
- 05 Tokenization & Vocabulary Design
How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.
14 min · code · demo · after Maximum Likelihood, Representation Learning & Embedding Geometry. Why this follows: both pages keep the attention transformers thread active.
- 06 FlashAttention: IO-Aware Attention
A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.
18 min · code · demo · after Efficient Attention at Scale: KV Cache, GQA & FlashAttention, Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization. Why this follows: both pages keep the attention transformers thread active.
- 07 Grouped-Query Attention: Sharing KV Heads
How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to a fraction H_kv / H_q of its multi-head size.
15 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Efficient Attention at Scale: KV Cache, GQA & FlashAttention. Why this follows: it shifts toward Scaled Dot-Product Attention & Transformer Layers while staying in the same neighborhood.
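
The KV-cache ratio mentioned in the Grouped-Query Attention entry can be checked with quick arithmetic. The sketch below is not taken from the notebooks: the helper `kv_cache_bytes` and the model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values while decoding (illustrative helper)."""
    # Factor of 2: each layer stores one tensor of keys and one of values.
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration, assumed for illustration only.
H_q = 32
common = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=H_q, **common)  # multi-head: H_kv == H_q
gqa = kv_cache_bytes(n_kv_heads=8, **common)    # grouped-query: H_kv == 8
mqa = kv_cache_bytes(n_kv_heads=1, **common)    # multi-query: H_kv == 1

print(mha)        # 2147483648 bytes (2 GiB)
print(gqa / mha)  # 0.25, i.e. H_kv / H_q = 8 / 32
print(mqa / mha)  # 0.03125, i.e. 1 / 32
```

Only the number of key/value heads changes between the three calls, so the cache size relative to multi-head attention is exactly H_kv / H_q.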
All Published Notebooks
Browse the territory.
Positional Encoding
How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.
Rotary Position Embeddings (RoPE)
A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.
Scaled Dot-Product Attention & Transformer Layers
The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.
Layer Normalization & RMSNorm
Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.
Tokenization & Vocabulary Design
How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.
FlashAttention: IO-Aware Attention
A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.
Grouped-Query Attention: Sharing KV Heads
How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to a fraction H_kv / H_q of its multi-head size.
Efficient Attention at Scale: KV Cache, GQA & FlashAttention
How attention becomes practical at long context: KV caching for decoding, grouped-query attention, and IO-aware kernels like FlashAttention.
Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization
How frontier LLMs stretch context windows: positional extrapolation (RoPE scaling) plus KV cache memory tricks (GQA, paging, quantization, compression).
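
For orientation across the notebooks listed above, the operation named in the Scaled Dot-Product Attention entry (weights from query-key dot products, then a weighted mix of values) can be sketched in a few lines. This is a minimal, unbatched, single-head illustration in plain Python, not the notebooks' implementation; the function names and the toy inputs are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors (no masking, one head)."""
    d = len(K[0])  # key dimension used for the 1 / sqrt(d) scaling
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# Toy inputs, assumed for illustration: a zero query scores every key equally.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention([[0.0, 0.0]], K, V)
print(out)  # [[0.5, 0.5]]
```

A query strongly aligned with the first key, such as `[100.0, 0.0]`, would instead return a vector very close to the first value row.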
Advanced Bridges
Use these after the core path.
In Progress