Legacy Concept Lab

Activation Checkpointing & Memory Efficiency

Essential for training large models: without it, models at the 100B-parameter scale often cannot be trained within GPU memory

Concept 79 of 100 | Efficiency | Phase 6




Why It Matters for Modern Models

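Stored activations, not only parameters, limit what fits on a GPU; at the 100B-parameter scale, checkpointing is often what makes training feasible
Major training frameworks support this technique, for example PyTorch (torch.utils.checkpoint) and JAX (jax.checkpoint)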
Memory-compute trade-off is fundamental: pay with one to save the other

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$\text{Memory: } O(\sqrt{L}) \text{ vs } O(L)$

Memory problem: Storing activations for backprop requires $O(L \cdot B \cdot d)$ memory, where $L$ is the number of layers, $B$ the batch size, and $d$ the hidden width.
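To see the linear growth in $L$, here is a minimal arithmetic sketch. The function name `activation_bytes`, the choice of 2 bytes per value, and the assumption of one stored tensor per layer are illustrative only; real transformer layers store many more tensors per layer, so treat the result as a lower bound. Here `batch` counts tokens per step.

```python
def activation_bytes(num_layers, batch, width, bytes_per_value=2):
    """Bytes needed to keep one (batch x width) activation tensor per layer.

    Assumes a single stored tensor per layer (a simplification).
    """
    return num_layers * batch * width * bytes_per_value


# Hypothetical configuration: 16384 tokens per step, hidden width 12288.
for num_layers in (12, 24, 48, 96):
    gib = activation_bytes(num_layers, batch=16384, width=12288) / 2**30
    print(f"L={num_layers:3d}: {gib:.1f} GiB of stored activations")
```

Doubling the number of layers doubles the stored activations, which is the $O(L)$ cost that checkpointing targets.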

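Gradient checkpointing: Store only a subset of activations (the checkpoints) and recompute the rest during the backward pass:

Memory: $O(\sqrt{L})$ stored activations instead of $O(L)$, with a checkpoint roughly every $\sqrt{L}$ layers
Compute: about 33% overhead (one extra forward pass, assuming the backward pass costs roughly twice the forward pass)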
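The store-versus-recompute idea can be sketched on a toy chain of scalar layers. This is a simplified illustration, not how any framework implements it: the layer $y = \sin(wx)$, the function names, and the bookkeeping of "stored values" are all assumptions made for the example. It keeps one checkpoint per segment and recomputes the inputs inside a segment when the backward pass reaches it.

```python
import math


def layer(w, x):
    """One toy layer: y = sin(w * x)."""
    return math.sin(w * x)


def layer_grad(w, x):
    """dy/dx for the toy layer; it needs the layer's input x."""
    return w * math.cos(w * x)


def backprop_full(ws, x0):
    """Baseline: store every layer input during the forward pass."""
    inputs, x = [], x0
    for w in ws:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for i in reversed(range(len(ws))):
        grad *= layer_grad(ws[i], inputs[i])
    return grad, len(inputs), len(ws)  # gradient, stored values, forward calls


def backprop_checkpointed(ws, x0, segment):
    """Store one input per segment; recompute the others segment by segment."""
    num_layers = len(ws)
    checkpoints, x, forward_calls = {}, x0, 0
    for i, w in enumerate(ws):
        if i % segment == 0:
            checkpoints[i] = x  # keep only the input to each segment
        x = layer(w, x)
        forward_calls += 1

    grad, peak = 1.0, 0
    for start in reversed(range(0, num_layers, segment)):
        end = min(start + segment, num_layers)
        # Recompute the inputs of layers start+1 .. end-1 from the checkpoint.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(ws[i], inputs[-1]))
            forward_calls += 1
        # Conservative count: all checkpoints plus this segment's recomputed inputs.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad *= layer_grad(ws[i], inputs[i - start])
    return grad, peak, forward_calls


ws = [0.9 + 0.01 * i for i in range(16)]
g_full, mem_full, fwd_full = backprop_full(ws, 0.5)
g_ckpt, mem_ckpt, fwd_ckpt = backprop_checkpointed(ws, 0.5, segment=math.isqrt(len(ws)))
print("full:        ", mem_full, "stored values,", fwd_full, "forward calls")
print("checkpointed:", mem_ckpt, "stored values,", fwd_ckpt, "forward calls")
```

With 16 layers and segments of 4, the sketch stores 7 values instead of 16 and makes 28 layer evaluations instead of 16, while both versions return the same gradient. The extra forward work stays below one additional full forward pass.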

Selective checkpointing: Recompute only the activations that are expensive to store but cheap to recompute, chiefly those inside attention.
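A rough estimate shows why attention is the natural target. The sketch below assumes the per-layer activation estimate of $sbh\,(34 + 5as/h)$ bytes given by Korthikanti et al. for 16-bit activations without tensor parallelism, where $s$ is sequence length, $b$ micro-batch size, $h$ hidden size, and $a$ the number of attention heads. The constants are taken on that assumption and should be checked against the paper; the function name and the configuration are hypothetical.

```python
def per_layer_activation_bytes(s, b, h, a, recompute_attention=False):
    """Estimated activation bytes for one transformer layer.

    Assumed formula: 34*s*b*h for the linear parts plus 5*a*s*s*b for the
    attention score, softmax, and dropout tensors. With selective
    recomputation the attention term is recomputed instead of stored.
    """
    linear_part = 34 * s * b * h
    attention_part = 5 * a * s * s * b
    return linear_part if recompute_attention else linear_part + attention_part


# Hypothetical configuration: sequence 2048, micro-batch 1, hidden 12288, 96 heads.
s, b, h, a = 2048, 1, 12288, 96
full = per_layer_activation_bytes(s, b, h, a)
selective = per_layer_activation_bytes(s, b, h, a, recompute_attention=True)
saved_fraction = 1 - selective / full
print(f"stored per layer: {full / 2**20:.0f} MiB -> {selective / 2**20:.0f} MiB")
print(f"fraction of activation memory saved: {saved_fraction:.2f}")
```

Under these assumptions the attention term dominates (the ratio $5as/h$ is 80 against the constant 34), so recomputing only that part removes most of the stored activations for a small recompute cost.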

Trade-off (rule of thumb): $\text{Memory} \times \text{Compute} \geq \text{constant}$, so lowering one raises the other.
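A short worked derivation makes the trade-off concrete. It assumes $L$ identical layers split into $k$ equal segments, one stored checkpoint per segment, and a backward pass that costs about twice the forward pass $F$; these are simplifying assumptions, not the exact scheme of any particular framework.

Stored activations: $k$ checkpoints plus the $L/k$ activations of the segment currently being recomputed:

$$M(k) \approx k + \frac{L}{k}, \qquad \frac{dM}{dk} = 1 - \frac{L}{k^{2}} = 0 \;\Rightarrow\; k = \sqrt{L}, \qquad M_{\min} \approx 2\sqrt{L} = O(\sqrt{L})$$

Compute: each segment is recomputed once, so at most one extra forward pass is added:

$$\frac{\text{Compute with checkpointing}}{\text{Compute without}} \approx \frac{F + F + 2F}{F + 2F} = \frac{4}{3} \approx 1.33$$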

Chen et al., 2016, arXiv

Korthikanti et al., 2022, MLSys
