Legacy Concept Lab

Activation Checkpointing & Memory Efficiency

Essential for training large models—without it, you can't fit 100B models in GPU memory

Concept 79 of 100 · Phase 6: Modern efficiency & inference

Why It Matters for Modern Models

  • Essential for training large models: without it, the activations of models at the 100B-parameter scale typically exceed GPU memory
  • Major training frameworks provide it, for example PyTorch (torch.utils.checkpoint) and JAX (jax.checkpoint)
  • Memory-compute trade-off is fundamental: pay with one to save the other

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Backprop needs activations from forward pass; normally we store all of them
  • Checkpointing says: "just save some, recompute the rest when needed"
  • With a single recomputation pass, a checkpoint spacing of about √L layers minimizes peak activation memory, at the cost of roughly one extra forward pass
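
The √L spacing can be checked numerically with a small cost model. This is a sketch under stated assumptions, not a profiler: every layer's activation is taken to have the same size, and the function names peak_activations and best_spacing are hypothetical helpers introduced here.

```python
import math

def peak_activations(num_layers: int, spacing: int) -> int:
    """Peak number of layer activations held at once in a simple cost model.

    Assumptions: all activations have equal size, a checkpoint is kept every
    `spacing` layers, and one segment at a time is recomputed during backward.
    """
    num_checkpoints = math.ceil(num_layers / spacing)  # kept for the whole step
    live_segment = spacing                              # rebuilt during backward
    return num_checkpoints + live_segment

def best_spacing(num_layers: int) -> int:
    """Spacing that minimizes peak_activations, found by brute force."""
    return min(range(1, num_layers + 1),
               key=lambda k: peak_activations(num_layers, k))

if __name__ == "__main__":
    for num_layers in (16, 64, 100, 400):
        k = best_spacing(num_layers)
        print(num_layers, k, peak_activations(num_layers, k), math.isqrt(num_layers))
```

For each depth tried, the brute-force optimum coincides with √L in this model, and the peak count is about 2√L rather than L.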

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$\text{Memory: } O(\sqrt{L}) \text{ vs } O(L)$

Memory problem: Storing activations for backprop requires $O(L \cdot B \cdot d)$ memory, where L is the number of layers, B the batch size, and d the activation width per example.
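
To get a feel for the scale, the estimate can be turned into bytes. This is a rough sketch: it assumes a single stored tensor of shape (B, d) per layer in 16-bit precision, whereas real networks keep several tensors per layer, so treat the result as a lower bound. The configuration values are hypothetical and chosen only for illustration.

```python
def activation_bytes(num_layers: int, batch_size: int, width: int,
                     bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one (batch_size, width) activation per layer."""
    return num_layers * batch_size * width * bytes_per_value

# Hypothetical configuration: 96 layers, 2048 rows in the batch, width 12288.
total = activation_bytes(num_layers=96, batch_size=2048, width=12288)
print(f"{total / 2**30:.1f} GiB")  # prints "4.5 GiB"
```

Doubling the depth, the batch, or the width doubles the total, which is the linear growth that checkpointing targets.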

Gradient checkpointing: Recompute instead of store:

  • Forward: Save only checkpoint activations (every $\sqrt{L}$ layers)
  • Backward: Recompute activations from nearest checkpoint
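
The two steps above can be sketched on a toy network. This is a minimal illustration in pure Python, assuming a chain of scalar layers y = tanh(w * x) and a loss equal to the final output. The function names are hypothetical, and real frameworks implement the same idea on tensors.

```python
import math

def layer_forward(w, x):
    return math.tanh(w * x)

def layer_backward(w, x, grad_out):
    """Return (grad wrt x, grad wrt w) for y = tanh(w * x)."""
    y = math.tanh(w * x)
    dpre = grad_out * (1.0 - y * y)
    return dpre * w, dpre * x

def backprop_full(weights, x0):
    """Baseline: store the input of every layer."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad, grads_w = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads_w[i] = layer_backward(weights[i], inputs[i], grad)
    return grads_w, len(inputs)

def backprop_checkpointed(weights, x0, spacing):
    """Store only every `spacing`-th layer input; recompute the rest."""
    n = len(weights)
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):          # forward: save checkpoints only
        if i % spacing == 0:
            checkpoints[i] = x
        x = layer_forward(w, x)
    grad, grads_w, peak = 1.0, [0.0] * n, 0
    for start in reversed(range(0, n, spacing)):   # backward: last segment first
        end = min(start + spacing, n)
        seg_inputs = [checkpoints[start]]          # rebuild this segment's inputs
        for i in range(start, end - 1):
            seg_inputs.append(layer_forward(weights[i], seg_inputs[-1]))
        # The segment's first input is the checkpoint itself, so count it once.
        peak = max(peak, len(checkpoints) + len(seg_inputs) - 1)
        for i in reversed(range(start, end)):
            grad, grads_w[i] = layer_backward(weights[i], seg_inputs[i - start], grad)
    return grads_w, peak

if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    g_full, stored_full = backprop_full(weights, 0.3)
    g_ckpt, stored_ckpt = backprop_checkpointed(weights, 0.3, spacing=4)
    print(stored_full, stored_ckpt)
```

Both functions return the same weight gradients. With 16 layers and a spacing of 4, the baseline holds 16 layer inputs while the checkpointed version holds at most 7 at any time.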

Memory: $O(\sqrt{L})$ instead of $O(L)$
Compute: about 33% overhead (the forward pass is recomputed once, assuming a backward pass costs roughly twice a forward pass)
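
Both figures follow from a simple cost model, sketched here under the assumptions that all L layers have equal activation size and cost, that a checkpoint is kept every k layers, and that a backward pass costs about twice a forward pass. The symbols k, M, and F (cost of one forward pass) are introduced for illustration only.

```latex
\begin{aligned}
M(k) &= \underbrace{\frac{L}{k}}_{\text{checkpoints}}
      + \underbrace{k}_{\text{one recomputed segment}}, \\
\frac{dM}{dk} &= -\frac{L}{k^{2}} + 1 = 0
  \;\Rightarrow\; k = \sqrt{L},
  \qquad M(\sqrt{L}) = 2\sqrt{L} = O(\sqrt{L}), \\
\frac{\text{cost with checkpointing}}{\text{cost without}}
  &= \frac{F + F + 2F}{F + 2F} = \frac{4}{3} \approx 1.33 .
\end{aligned}
```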

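Selective checkpointing: Recompute only the operations whose activations are large but cheap to regenerate (in Transformers, the attention scores, softmax, and attention dropout) and store everything else.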
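
The potential saving can be estimated with a short calculation. This sketch assumes the per-layer activation estimate s·b·h·(34 + 5·a·s/h) bytes given in Korthikanti et al. (2022) for 16-bit activations without tensor parallelism, where s is sequence length, b micro-batch size, h hidden size, and a the number of attention heads. The function name and the GPT-3-like shape are illustrative choices, not measurements.

```python
def per_layer_activation_bytes(s: int, b: int, h: int, a: int,
                               selective: bool = False) -> int:
    """Per-layer activation memory estimate in bytes.

    Assumption: the s*b*h*(34 + 5*a*s/h) estimate from Korthikanti et al.
    (2022). The 5*a*s*s*b term comes from the attention core (scores,
    softmax, dropout); selective recomputation avoids storing it.
    """
    base = 34 * s * b * h
    attention_core = 5 * a * s * s * b
    return base if selective else base + attention_core

# Illustrative GPT-3-like shape: sequence 2048, micro-batch 1, hidden 12288, 96 heads.
full = per_layer_activation_bytes(2048, 1, 12288, 96)
sel = per_layer_activation_bytes(2048, 1, 12288, 96, selective=True)
print(f"saved fraction: {1 - sel / full:.2f}")  # prints "saved fraction: 0.70"
```

Under these assumptions the attention core accounts for most of the per-layer activation memory, which is why recomputing only that part recovers much of the benefit of full checkpointing.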

Trade-off: $\text{Memory} \times \text{Compute} \geq \text{constant}$

Canonical Papers

Training Deep Nets with Sublinear Memory Cost

Chen et al., 2016, arXiv

Reducing Activation Recomputation in Large Transformer Models

Korthikanti et al., 2022, MLSys
