Mathematical Foundations

34 core mathematical concepts that explain how modern AI systems work. From maximum likelihood to multimodal vision-language models, these ideas power GPT-4, Claude, Gemini, Llama, Stable Diffusion, and Sora.

Recommended Study Order

Work from fundamentals to frontier techniques; each phase builds on the previous one.

1. Core probabilistic training + transformers
2. Optimization & generalization
3. Generative modeling families
4. Representation & interpretability
5. Scaling & alignment
6. Efficiency & theory

All 34 Concepts

Click any concept to explore its canonical papers, core math, why it matters for modern models, and missing intuition.

1. Maximum Likelihood, Cross-Entropy & KL Divergence (Core Training, 1 paper)
2. Scaled Dot-Product Attention & Transformer Layers (Core Training, 1 paper)
3. Adam & Adaptive Gradient Methods (Optimization, 2 papers)
4. Loss Landscapes, Sharpness & Flat Minima (Optimization, 2 papers)
5. Overparameterization & Generalization, Double Descent (Optimization, 2 papers)
6. Neural Tangent Kernel & Infinite-Width Limits (Theory, 1 paper)
7. Variational Autoencoders & Variational Inference (Generative Models, 1 paper)
8. GANs & Adversarial Divergence Minimization (Generative Models, 2 papers)
9. Diffusion, Score-Based Models & Flow Matching (Generative Models, 4 papers)
10. Representation Learning & Embedding Geometry (Representations, 1 paper)
11. Superposition, Sparse Features & Monosemanticity (Representations, 2 papers)
12. Probing, Linear Classifier Probes & Activation Analysis (Representations, 2 papers)
13. Transformer Circuits, Induction Heads & Mechanistic Interpretability (Representations, 2 papers)
14. Scaling Laws & Emergent Abilities (Scaling & Alignment, 3 papers)
15. Preference-Based Alignment: RLHF, Reward Modeling, Constitutional AI (Scaling & Alignment, 3 papers)
16. Efficiency: Quantization, Distillation, LoRA & Sparse MoE (Efficiency, 3 papers)
17. Theoretical Foundations: PAC Learning, MDL & Information Bottleneck (Theory, 3 papers)
18. Rotary Position Embeddings (RoPE) (Representations, 3 papers)
19. Efficient Attention at Scale: KV Cache, GQA & FlashAttention (Efficiency, 3 papers)
20. Speculative Decoding: Lossless Multi-Token Generation (Efficiency, 3 papers)
21. LLM Serving at Scale: Prefill, Decode & Continuous Batching (Efficiency, 3 papers)
22. Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism (Efficiency, 3 papers)
23. MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism (Efficiency, 3 papers)
24. Direct Preference Optimization: RL-Free Alignment from Human Preferences (Scaling & Alignment, 3 papers)
25. KTO: Alignment from Binary Feedback via Human-Aware Losses (Scaling & Alignment, 3 papers)
26. Reward Hacking & Overoptimization: Goodhart's Law in Preference Optimization (Scaling & Alignment, 3 papers)
27. Sparse Autoencoders at Scale: Feature Dictionaries for Mechanistic Interpretability (Representations, 3 papers)
28. Automated Circuit Discovery: Patching, Attribution & Decomposition at Scale (Representations, 3 papers)
29. Activation Steering: Feature-Guided Interventions for Inference-Time Control (Representations, 3 papers)
30. Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization (Efficiency, 3 papers)
31. State Space Models & Hybrid Architectures: Mamba-2, Jamba, Griffin (Core Training, 3 papers)
32. Multimodal Foundations: Vision Encoders, Contrastive Learning & Cross-Attention Fusion (Representations, 3 papers)
33. Tokenization & Vocabulary Design (Representations, 3 papers)
34. Decoding & Sampling: Temperature, Top-p & Inference-Time Control (Core Training, 3 papers)
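To give a flavor of the core math these entries cover, here is a minimal NumPy sketch of concept 1, using made-up toy distributions: cross-entropy decomposes as H(p, q) = H(p) + KL(p || q), so minimizing cross-entropy in the model q (equivalently, maximum-likelihood training) also minimizes the KL divergence to the data distribution p.

```python
import numpy as np

# Toy next-token distributions over a 4-symbol vocabulary (made-up values).
p = np.array([0.7, 0.1, 0.1, 0.1])  # "true" data distribution
q = np.array([0.6, 0.2, 0.1, 0.1])  # model distribution

entropy = -np.sum(p * np.log(p))        # H(p): fixed by the data
cross_entropy = -np.sum(p * np.log(q))  # H(p, q): the standard LM training loss
kl = np.sum(p * np.log(p / q))          # KL(p || q)

# H(p, q) = H(p) + KL(p || q), so minimizing cross-entropy over q
# is the same as minimizing KL(p || q): maximum likelihood in disguise.
assert np.isclose(cross_entropy, entropy + kl)
print(f"H(p,q)={cross_entropy:.4f}  H(p)={entropy:.4f}  KL={kl:.4f}")
```

Because H(p) does not depend on the model, the cross-entropy and KL objectives differ only by a constant, which is why language models trained on cross-entropy are performing maximum-likelihood estimation.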

Why These 34 Concepts?

Complete Coverage

Together, these concepts explain the core mechanisms behind language models, diffusion models, and multimodal systems.

Missing Intuition

Each concept includes what's still poorly explained in textbooks and papers: the intuition gaps we aim to fill.

Connected Knowledge

See how concepts build on each other. Understand prerequisites and what each idea unlocks.