Domain Neighborhood

Attention & Transformers

The sequence model backbone: tokenization, self-attention, positional encodings, and the transformer block that powers modern LLMs.

12 concepts · 9 published · 11 demos

Start with Positional Encoding

Recommended Route

Start here, then follow the prerequisites forward.

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

01
Positional Encoding
How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.
12 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers
Check Scaled Dot-Product Attention & Transformer Layers first if the symbols feel slippery.
02
Rotary Position Embeddings (RoPE)
A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.
14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers
Why this follows: both pages keep the attention-transformers / RoPE thread active.
03
Scaled Dot-Product Attention & Transformer Layers
The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.
18 min · code · demo · after Maximum Likelihood, Dot Product
Why this follows: both pages keep the attention transformers thread active.
04
Layer Normalization & RMSNorm
Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.
14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Residual Connections & Skip Connections
Why this follows: Layer Normalization & RMSNorm uses Scaled Dot-Product Attention & Transformer Layers directly.
05
Tokenization & Vocabulary Design
How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.
14 min · code · demo · after Maximum Likelihood, Representation Learning & Embedding Geometry
Why this follows: both pages keep the attention transformers thread active.
06
FlashAttention: IO-Aware Attention
A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.
18 min · code · demo · after Efficient Attention at Scale: KV Cache, GQA & FlashAttention, Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization
Why this follows: both pages keep the attention transformers thread active.
07
Grouped-Query Attention: Sharing KV Heads
How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to a fraction H_kv / H_q of its multi-head size.
15 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Efficient Attention at Scale: KV Cache, GQA & FlashAttention
Why this follows: it shifts toward Scaled Dot-Product Attention & Transformer Layers while staying in the same neighborhood.
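
The Grouped-Query Attention entry above ties KV cache size to the ratio H_kv / H_q. As a rough illustration, the arithmetic can be sketched as below. This is not taken from the notebook itself: the function name `kv_cache_bytes`, the model dimensions, and the 2-byte (fp16/bf16) element size are all assumptions chosen for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values during decoding (illustrative estimate)."""
    # The leading 2 counts one cached tensor for keys and one for values.
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical configuration, not tied to any particular model.
H_q = 32
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=H_q, **cfg)  # multi-head: H_kv == H_q
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)    # grouped-query: H_kv == 8
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)    # multi-query: H_kv == 1

print(gqa / mha)  # 0.25, i.e. H_kv / H_q = 8 / 32
print(mqa / mha)  # 0.03125, i.e. 1 / 32
```

Only the number of key/value heads changes between the three calls, so the cache size relative to multi-head attention is exactly H_kv / H_q.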

All Published Notebooks

Browse the territory.

Positional Encoding

How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.

Level 2 · 12 min · demo

Rotary Position Embeddings (RoPE)

A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.

Level 3 · 14 min · demo

Scaled Dot-Product Attention & Transformer Layers

The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.

Level 3 · 18 min · demo

Layer Normalization & RMSNorm

Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.

Level 3 · 14 min · demo

Tokenization & Vocabulary Design

How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.

Level 3 · 14 min · demo

FlashAttention: IO-Aware Attention

A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.

Level 4 · 18 min · demo

Grouped-Query Attention: Sharing KV Heads

How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to a fraction H_kv / H_q of its multi-head size.

Level 4 · 15 min · demo

Efficient Attention at Scale: KV Cache, GQA & FlashAttention

How attention becomes practical at long context: KV caching for decoding, grouped-query attention, and IO-aware kernels like FlashAttention.

Level 4 · 20 min · demo

Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization

How frontier LLMs stretch context windows: positional extrapolation (RoPE scaling) plus KV cache memory tricks (GQA, paging, quantization, compression).

Level 4 · 22 min · demo
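
Several entries above build on the RoPE claim that attention scores depend on relative position via phase differences. A minimal pure-Python sketch of that property follows. It is an illustration under assumptions, not the notebooks' code: the helper names `rope` and `dot`, the toy 4-dimensional vectors, and the base of 10000 are chosen for the example.

```python
import cmath

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive feature pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        # Treat the pair as a complex number and multiply by a unit phase.
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

s1 = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
s2 = dot(rope(q, 105), rope(k, 102))  # shifted by 100, offset still 3
s3 = dot(rope(q, 5), rope(k, 4))      # offset 1

print(abs(s1 - s2) < 1e-9)  # True: same offset, same score
print(abs(s1 - s3) > 1e-6)  # True: different offset, different score
```

Shifting both positions by the same amount leaves the score unchanged because each pair's phases cancel down to the difference of the two positions.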

Advanced Bridges

Use these after the core path.

FlashAttention: IO-Aware Attention
Grouped-Query Attention: Sharing KV Heads
Efficient Attention at Scale: KV Cache, GQA & FlashAttention
Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization

In Progress

Notebooks still below the publish bar.

Residual Connections & Skip Connections
SwiGLU: Gated MLP Blocks in Transformers
SSM Hybrids: Fixed-State Sequence Models for Long Context