Legacy Concept Lab
Activation Checkpointing & Memory Efficiency
Essential for training large models—without it, you can't fit 100B models in GPU memory
#79 · Checkpointing · Efficiency
Key equation
\text{Memory: } O(\sqrt{L}) \text{ vs } O(L)
Phase 6: Modern efficiency & inference
Concept 79 of 100
Why It Matters for Modern Models
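- Training memory includes every layer's stored activations, not just the weights; checkpointing cuts that activation cost so that very large models (around 100B parameters) can fit in GPU memory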
- Every major training framework provides it: PyTorch through torch.utils.checkpoint, JAX through jax.checkpoint (also exposed as jax.remat)
- Memory-compute trade-off is fundamental: pay with one to save the other
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Backprop needs activations from forward pass; normally we store all of them
- Checkpointing says: "just save some, recompute the rest when needed"
- With a checkpoint every k layers, peak activation memory is roughly L/k + k layer activations, which is smallest at a spacing of k = √L layers
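To see where the √L spacing comes from, here is a minimal sketch under a simplified cost model. The model is an assumption for illustration, not something this page specifies: every layer's activation has the same size, checkpoints sit every k layers, and the backward pass recomputes and holds one segment of k layers at a time. The names peak_activations and best_spacing are hypothetical.

```python
import math


def peak_activations(num_layers: int, spacing: int) -> int:
    """Peak number of layer activations held at once (simplified model).

    Assumes equal-sized activations: ceil(L / k) stored checkpoints
    plus up to k activations for the one segment being recomputed.
    """
    num_checkpoints = math.ceil(num_layers / spacing)
    return num_checkpoints + spacing


def best_spacing(num_layers: int) -> int:
    """Brute-force the spacing k that minimizes peak_activations."""
    return min(
        range(1, num_layers + 1),
        key=lambda k: peak_activations(num_layers, k),
    )


if __name__ == "__main__":
    L = 100
    k = best_spacing(L)
    print(k, peak_activations(L, k))  # prints "10 20"
```

For L = 100 the brute-force search lands on a spacing of 10 = √100 and a peak of 20 activations, compared with 100 when every activation is stored.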
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Memory problem: Storing every layer's activations for backprop requires O(L) memory for a network of L layers.
Gradient checkpointing: Recompute instead of store:
- Forward: Save only checkpoint activations (every √L layers)
- Backward: Recompute activations from nearest checkpoint
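The forward and backward steps above can be sketched on a toy example. This is a pure-Python illustration on a chain of scalar tanh layers, not how any framework implements it; the names LAYERS, grad_full, and grad_checkpointed are hypothetical. It checks that recomputing from checkpoints gives the same gradient as storing everything.

```python
import math

# A toy "network": a chain of scalar layers, each given as a pair
# (forward function, derivative with respect to its input).
LAYERS = [(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2)] * 16


def grad_full(x, layers):
    """Baseline: store every layer input, then backprop through all of them."""
    acts = [x]
    for f, _ in layers:
        acts.append(f(acts[-1]))
    grad = 1.0
    # acts[:-1] holds the input to each layer; walk them last to first.
    for (_, df), a_in in zip(reversed(layers), reversed(acts[:-1])):
        grad *= df(a_in)
    return grad


def grad_checkpointed(x, layers, spacing):
    """Store only every `spacing`-th layer input; recompute the rest."""
    # Forward: keep the input to each segment, discard everything else.
    checkpoints = {}
    a = x
    for i, (f, _) in enumerate(layers):
        if i % spacing == 0:
            checkpoints[i] = a
        a = f(a)

    # Backward: process segments from last to first.
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        segment = layers[start:start + spacing]
        # Recompute the inputs of this segment from its checkpoint.
        acts = [checkpoints[start]]
        for f, _ in segment[:-1]:
            acts.append(f(acts[-1]))
        # Backprop through the segment, then drop its activations.
        for (_, df), a_in in zip(reversed(segment), reversed(acts)):
            grad *= df(a_in)
    return grad


if __name__ == "__main__":
    print(math.isclose(grad_full(0.5, LAYERS),
                       grad_checkpointed(0.5, LAYERS, spacing=4)))
```

At any moment the checkpointed version holds the stored checkpoints plus the inputs of a single segment, which is where the memory saving comes from.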
Memory: O(√L) instead of O(L)
Compute: roughly 33% overhead (the forward pass is recomputed once)
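The 33% figure can be reproduced with simple arithmetic. The sketch below assumes the common rule of thumb that a backward pass costs about twice a forward pass; that ratio is an assumption, not a measurement, and the function name recompute_overhead is hypothetical.

```python
def recompute_overhead(backward_to_forward_ratio: float = 2.0) -> float:
    """Fractional extra compute from redoing the forward pass once.

    Assumes the backward pass costs `backward_to_forward_ratio` times
    the forward pass (2x is a rule of thumb, not a measured value).
    """
    forward = 1.0
    backward = backward_to_forward_ratio * forward
    baseline = forward + backward        # ordinary training step
    with_recompute = baseline + forward  # one extra forward pass
    return with_recompute / baseline - 1.0


if __name__ == "__main__":
    print(f"{recompute_overhead():.0%}")  # prints "33%"
```

With a different backward-to-forward ratio the overhead shifts accordingly, so the 33% should be read as an estimate.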
Selective checkpointing: Apply recomputation only to the layers whose activations are expensive to store (attention), and keep the other activations in memory.
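One way to picture the selection step is as a budgeted choice: recompute the layers that free the most memory per unit of extra compute. The sketch below is one possible greedy policy, not a method this page describes; LayerCost, select_for_recompute, and all numbers in the profile are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    name: str
    activation_mb: float  # memory needed to keep its activations
    recompute_ms: float   # time to rerun its forward pass


def select_for_recompute(layers, time_budget_ms):
    """Greedy policy: pick layers that free the most memory per ms."""
    ranked = sorted(
        layers,
        key=lambda c: c.activation_mb / c.recompute_ms,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for layer in ranked:
        if spent + layer.recompute_ms <= time_budget_ms:
            chosen.append(layer.name)
            spent += layer.recompute_ms
    return chosen


# Hypothetical profile; real numbers would come from measuring a model.
profile = [
    LayerCost("attention", activation_mb=400.0, recompute_ms=2.0),
    LayerCost("mlp", activation_mb=150.0, recompute_ms=3.0),
    LayerCost("layernorm", activation_mb=20.0, recompute_ms=0.5),
]

if __name__ == "__main__":
    print(select_for_recompute(profile, time_budget_ms=2.5))
```

In this made-up profile the attention layer is picked first because it frees the most memory per millisecond of recomputation.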
Trade-off: activation memory drops from O(L) to O(√L) at the cost of about one extra forward pass.