Legacy Concept Lab
FlashAttention: IO-Aware Attention
key equation
\text{Memory: } O(N) \text{ vs } O(N^2)
Phase 6: Modern efficiency & inference
Concept 68 of 100
Why It Matters for Modern Models
- Made training with 100K+ token context windows practical: without it, attention memory grows quadratically with sequence length
- Foundational for modern LLMs: adopted in open model stacks such as LLaMA and reportedly in closed models such as GPT-4 and Claude, whose internals are not public
- Shows that an algorithm designed around the memory hierarchy can deliver speedups on the same hardware, without reducing the FLOP count
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- GPU memory hierarchy matters: SRAM (fast, small) vs HBM (slow, large)
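- Materializing the N×N attention matrix in HBM wastes memory bandwidth, which is the real bottleneck, not FLOPs
- Online softmax: the softmax normalizer can be computed in one pass by tracking a running max and a running sum, rescaling the sum whenever the max changes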
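The online softmax idea from the last bullet can be sketched in a few lines. This is a minimal illustration in plain Python, not the FlashAttention kernel itself; the function name `online_softmax` is hypothetical, and inputs are assumed to be finite floats. The normalizer is built in one pass over the scores; a second loop is used here only to emit the probabilities (in attention, that step is fused into the output accumulation instead).

```python
import math

def online_softmax(xs):
    """Softmax whose max and normalizer are computed in a single pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new reference max,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Emit normalized probabilities using the final statistics.
    return [math.exp(x - running_max) / running_sum for x in xs]

probs = online_softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

Because every exponent is taken relative to the current running max, the result stays numerically stable even for large scores such as 1000.0, which would overflow a naive `exp(x)`.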
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Standard attention materializes the intermediate N×N matrices S and P:
S = QK^\top / \sqrt{d}, \quad P = \text{softmax}(S), \quad O = PV
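For a sense of scale, the size of one materialized N×N score matrix can be worked out directly. The sketch below is back-of-the-envelope arithmetic only; the helper names, the fp16 element size (2 bytes), and the head dimension of 128 are assumptions chosen for illustration, not values from any particular model.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes for one seq_len x seq_len score matrix (fp16 assumed)."""
    return seq_len * seq_len * bytes_per_element

def qkvo_bytes(seq_len, head_dim=128, bytes_per_element=2):
    """Bytes for Q, K, V and the output O of one head (each seq_len x head_dim)."""
    return 4 * seq_len * head_dim * bytes_per_element

GIB = 2**30
for n in (1_024, 8_192, 131_072):
    scores = score_matrix_bytes(n) / GIB
    io = qkvo_bytes(n) / GIB
    print(f"N={n:>7}: scores {scores:10.4f} GiB, Q/K/V/O {io:8.4f} GiB")

# For N = 131072 in fp16, one score matrix is 32 GiB per head,
# while Q, K, V and O together take 128 MiB.
```

The quadratic term dominates quickly: the inputs and outputs grow linearly in N, while the intermediate matrix grows with N squared.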
FlashAttention fuses operations, keeping data in SRAM:
- Tile Q, K, V into blocks that fit in SRAM
- Compute a local softmax for each block and combine blocks with the online softmax trick
- Never materialize the full N×N attention matrix in HBM
Memory: O(N) instead of O(N²)
Speed: roughly 2-4× faster than standard attention, while still computing exact attention
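The tiling steps above can be sketched in plain Python to check that block-wise accumulation reproduces standard attention. This is an illustration of the arithmetic only, under simplifying assumptions: the function names are hypothetical, matrices are lists of lists, queries are handled one row at a time (the real kernel also tiles Q), and nothing here models SRAM or HBM. Only `block_size` scores exist at any moment in the tiled version.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(Q, K, V):
    """Reference: builds a full row of N scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)
        p = [math.exp(s - m) for s in scores]
        z = sum(p)
        out.append([sum(pj * v[c] for pj, v in zip(p, V)) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Same result, visiting K and V one block at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(V[0])  # unnormalized output row for this query
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [dot(q, k) * scale for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new reference max.
            correction = math.exp(running_max - new_max)
            p = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(p)
            acc = [a * correction + sum(pj * v[c] for pj, v in zip(p, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

Q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(naive_attention(Q, K, V))
print(tiled_attention(Q, K, V, block_size=2))
```

The two printed results should agree up to floating-point rounding, whatever block size is chosen, because the correction factor re-expresses every earlier partial sum relative to the latest running max.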