Domain Neighborhood

Attention & Transformers

The sequence model backbone: tokenization, self-attention, positional encodings, and the transformer block that powers modern LLMs.

12 concepts · 9 published · 11 demos

Recommended Route

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

  1. Positional Encoding

    How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.

    12 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers

    Check Scaled Dot-Product Attention & Transformer Layers first if the symbols feel slippery.

  2. Rotary Position Embeddings (RoPE)

    A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.

    14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers

    Why this follows: both pages keep the attention transformers / rope thread active.

  3. Scaled Dot-Product Attention & Transformer Layers

    The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.

    18 min · code · demo · after Maximum Likelihood; Dot Product

    Why this follows: both pages keep the attention transformers thread active.

  4. Layer Normalization & RMSNorm

    Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.

    14 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers; Residual Connections & Skip Connections

    Why this follows: Layer Normalization & RMSNorm uses Scaled Dot-Product Attention & Transformer Layers directly.

  5. Tokenization & Vocabulary Design

    How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.

    14 min · code · demo · after Maximum Likelihood; Representation Learning & Embedding Geometry

    Why this follows: both pages keep the attention transformers thread active.

  6. FlashAttention: IO-Aware Attention

    A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.

    18 min · code · demo · after Efficient Attention at Scale: KV Cache, GQA & FlashAttention; Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization

    Why this follows: both pages keep the attention transformers thread active.

  7. Grouped-Query Attention: Sharing KV Heads

    How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to H_kv / H_q of its multi-head size.

    15 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers; Efficient Attention at Scale: KV Cache, GQA & FlashAttention

    Why this follows: it shifts toward Scaled Dot-Product Attention & Transformer Layers while staying in the same neighborhood.

All Published Notebooks

Browse the territory.

Positional Encoding

How transformers represent order: sinusoidal encodings, learned embeddings, and relative-position methods like RoPE.

Level 2 · 12 min · demo

Rotary Position Embeddings (RoPE)

A positional encoding that rotates queries and keys so attention depends on relative position via phase differences.

Level 3 · 14 min · demo
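The relative-position claim in the RoPE blurb can be checked numerically. The sketch below is a toy illustration, not the notebook's code: it assumes a single 2-D feature pair and a hypothetical frequency `theta`, whereas real RoPE applies many such rotations at different frequencies across the head dimension.

```python
import math

def rope_rotate(pair, position, theta=0.1):
    """Rotate one 2-D feature pair by the angle position * theta.

    theta is a hypothetical frequency chosen for illustration.
    """
    angle = position * theta
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset (2 positions apart) at very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 103))

# A different offset (5 positions apart) for comparison.
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 0))
```

Here `score_near` and `score_far` agree up to floating-point error because the two rotations compose into a single rotation by the position difference, while `score_other` differs because the offset changed.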

Scaled Dot-Product Attention & Transformer Layers

The core transformer operation: compute attention weights from query-key dot products, then mix values to copy information across a sequence.

Level 3 · 18 min · demo
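As a rough illustration of the operation described above, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It assumes no masking, no batching, and no learned projections; the function names are illustrative and not taken from the notebook.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same length.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Query-key dot products, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(mixed)
    return outputs

# Two identical keys receive equal weight, so the output averages the values.
averaged = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 4.0], [4.0, 8.0]])

# A query that strongly matches the first key effectively copies its value.
copied = attention([[10.0, 0.0]], [[10.0, 0.0], [-10.0, 0.0]], [[2.0, 4.0], [4.0, 8.0]])
```

The second call shows the "copy information" behavior: when one score dominates, the softmax weight on that position approaches 1 and the output approaches that position's value vector.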

Layer Normalization & RMSNorm

Normalize one token/example vector across features: LayerNorm centers and scales, while RMSNorm keeps RMS-based scaling without mean-centering.

Level 3 · 14 min · demo
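The difference between the two normalizers is easiest to see side by side. This is a minimal sketch over a single feature vector; it deliberately omits the learned gain (and LayerNorm's bias) that real layers apply afterward, and the `eps` default is an assumed value.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the feature mean, then divide by the feature standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square of the features; no mean-centering."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)

ln_mean = sum(ln_out) / len(ln_out)        # close to 0: LayerNorm centers
rms_mean = sum(rms_out) / len(rms_out)     # stays positive: RMSNorm does not center
rms_of_rms_out = math.sqrt(sum(v * v for v in rms_out) / len(rms_out))  # close to 1
```

RMSNorm skips the mean computation, which is the main source of its lower cost; the output keeps the input's offset but has unit RMS.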

Tokenization & Vocabulary Design

How text becomes token IDs: segmentation, BPE/unigram tokenizers, and the tradeoffs that shape cost and capability.

Level 3 · 14 min · demo

FlashAttention: IO-Aware Attention

A fused, tiled attention implementation that avoids materializing the full T x T matrix by using an online softmax, reducing memory traffic and speeding up long-context training/inference.

Level 4 · 18 min · demo

Grouped-Query Attention: Sharing KV Heads

How multi-head, grouped-query, and multi-query attention differ by the number of key/value heads, and why reducing H_kv shrinks the decoding KV cache to H_kv / H_q of its multi-head size.

Level 4 · 15 min · demo
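The H_kv / H_q ratio falls out of simple arithmetic on the cache shape. The sketch below uses a hypothetical model configuration (32 layers, 32 query heads, head dimension 128, 16-bit cache entries) chosen only for illustration; none of these numbers come from the notebook.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position.

    The leading factor of 2 counts one K tensor plus one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical configuration, assumed for illustration only.
N_LAYERS, N_Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(N_LAYERS, n_kv_heads=N_Q_HEADS, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mqa = kv_cache_bytes(N_LAYERS, n_kv_heads=1, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"MHA (H_kv=32): {mha / 2**20:.0f} MiB")  # 2048 MiB
print(f"GQA (H_kv=8):  {gqa / 2**20:.0f} MiB")  # 512 MiB
print(f"MQA (H_kv=1):  {mqa / 2**20:.0f} MiB")  # 64 MiB
```

The number of query heads never appears in the formula, so the cache size relative to multi-head attention is exactly H_kv / H_q: 8 / 32 for the grouped-query case here.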

Efficient Attention at Scale: KV Cache, GQA & FlashAttention

How attention becomes practical at long context: KV caching for decoding, grouped-query attention, and IO-aware kernels like FlashAttention.

Level 4 · 20 min · demo

Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization

How frontier LLMs stretch context windows: positional extrapolation (RoPE scaling) plus KV cache memory tricks (GQA, paging, quantization, compression).

Level 4 · 22 min · demo

Advanced Bridges

Use these after the core path.

In Progress

Notebooks still below the publish bar.

Residual Connections & Skip Connections

SwiGLU: Gated MLP Blocks in Transformers

SSM Hybrids: Fixed-State Sequence Models for Long Context