Mathematical Foundations
34 core mathematical concepts that explain how modern AI systems work. From maximum likelihood to multimodal vision-language models, these ideas power GPT-4, Claude, Gemini, Llama, Stable Diffusion, and Sora.
Recommended Study Order
Work from fundamentals to frontier techniques; each phase builds on the previous one.
Optimization & generalization
Representation & interpretability
Efficiency & theory
All 34 Concepts
Click any concept to explore its canonical papers, core math, why it matters for modern models, and missing intuition.
ML/CE/KL
Maximum Likelihood, Cross-Entropy & KL Divergence
Attention
Scaled Dot-Product Attention & Transformer Layers
Adam
Adam & Adaptive Gradient Methods
Sharpness
Loss Landscapes, Sharpness & Flat Minima
Double Descent
Overparameterization, Generalization & Double Descent
NTK
Neural Tangent Kernel & Infinite-Width Limits
VAEs
Variational Autoencoders & Variational Inference
GANs
GANs & Adversarial Divergence Minimization
Diffusion
Diffusion, Score-Based Models & Flow Matching
Embeddings
Representation Learning & Embedding Geometry
Superposition
Superposition, Sparse Features & Monosemanticity
Probing
Probing, Linear Classifier Probes & Activation Analysis
Circuits
Transformer Circuits, Induction Heads & Mechanistic Interpretability
Scaling
Scaling Laws & Emergent Abilities
RLHF
Preference-Based Alignment: RLHF, Reward Modeling, Constitutional AI
Efficiency
Efficiency: Quantization, Distillation, LoRA & Sparse MoE
Theory
Theoretical Foundations: PAC Learning, MDL & Information Bottleneck
Efficient Attention
Efficient Attention at Scale: KV Cache, GQA & FlashAttention
RoPE
Rotary Position Embeddings (RoPE)
Speculative Decoding
Speculative Decoding: Lossless Multi-Token Generation
LLM Serving
LLM Serving at Scale: Prefill, Decode & Continuous Batching
MoE
Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism
MoE Serving
MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism
DPO
Direct Preference Optimization: RL-Free Alignment from Human Preferences
KTO
KTO: Alignment from Binary Feedback via Human-Aware Losses
Reward Hacking
Reward Hacking & Overoptimization: Goodhart's Law in Preference Optimization
Sparse Autoencoders
Sparse Autoencoders at Scale: Feature Dictionaries for Mechanistic Interpretability
Circuit Discovery
Automated Circuit Discovery: Patching, Attribution & Decomposition at Scale
Activation Steering
Activation Steering: Feature-Guided Interventions for Inference-Time Control
Long Context
Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization
SSMs & Hybrids
State Space Models & Hybrid Architectures: Mamba-2, Jamba, Griffin
Multimodal VLP
Multimodal Foundations: Vision Encoders, Contrastive Learning & Cross-Attention Fusion
Tokens
Tokenization & Vocabulary Design
Decoding
Decoding & Sampling: Temperature, Top-p & Inference-Time Control
Why These 34 Concepts?
Complete Coverage
Together, these concepts explain the core mechanisms behind language models, diffusion models, and multimodal systems.
Missing Intuition
Each concept includes what's still poorly explained in textbooks and papers: the intuition gaps we aim to fill.
Connected Knowledge
See how concepts build on each other. Understand prerequisites and what each idea unlocks.