Legacy Concept Lab

FlashAttention: IO-Aware Attention

Enabled training with 100K+ context windows—without this, long context is impractical

Concept 68 of 100 · Efficiency · Phase 6: Modern efficiency & inference

Why It Matters for Modern Models

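Made long-context training practical: without it, the $O(N^2)$ attention matrix makes 100K+ token context windows prohibitively expensive in memory
Foundational for modern LLMs: adopted in open model stacks such as LLaMA, and widely assumed to underlie proprietary systems such as GPT-4 and Claude
Shows that an algorithm designed around the GPU memory hierarchy can deliver speedups that extra hardware FLOPs alone would not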

What is still poorly explained in textbooks and papers:

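GPU memory hierarchy matters: on-chip SRAM is fast but small, while off-chip HBM is large but much slower
Materializing the N×N attention matrix means writing it to HBM and reading it back; that memory traffic, not FLOPs, is the bottleneck
Online softmax: the softmax normalizer can be computed in one pass by tracking a running max and a running sum, rescaling the sum whenever the max changes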
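The online softmax idea can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the kernel from the paper; the function names `softmax_two_pass` and `online_softmax_stats` are made up for this example, and inputs are assumed to be finite floats.

```python
import math

def softmax_two_pass(xs):
    """Reference softmax: the global max must be known before summing."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def online_softmax_stats(xs):
    """One pass over xs, tracking the running max m and the running
    sum s of exp(x - m). Assumes xs is non-empty with finite values."""
    m = float("-inf")
    s = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s

xs = [1.0, 3.0, -2.0, 1000.0, 999.0]
m, s = online_softmax_stats(xs)
# With (m, s) in hand, any softmax entry is exp(x - m) / s.
probs = [math.exp(x - m) / s for x in xs]
reference = softmax_two_pass(xs)
max_diff = max(abs(a - b) for a, b in zip(probs, reference))
print(max_diff)
```

The single pass yields only the statistics `(m, s)`. In attention this is enough, because the output accumulator can be rescaled by the same factor as the running sum, so the full row of probabilities never has to be stored.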

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\text{Memory: } O(N) \text{ vs } O(N^2)$$

Standard attention materializes the $N \times N$ score and probability matrices, which takes $O(N^2)$ memory for sequence length $N$ and head dimension $d$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$
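To make the $O(N^2)$ versus $O(N)$ gap concrete, here is a back-of-the-envelope calculation. It is only arithmetic under stated assumptions: 2-byte (fp16-style) elements for the attention matrix, 4-byte floats for the per-row softmax statistics, and a single head on a single sequence. The helper names are hypothetical.

```python
def score_matrix_bytes(n, bytes_per_elem=2):
    """Bytes to store one N x N attention matrix (one head, one sequence)."""
    return n * n * bytes_per_elem

def softmax_stats_bytes(n, bytes_per_elem=4):
    """Bytes for a running max and a running sum per query row."""
    return 2 * n * bytes_per_elem

for n in (1_000, 10_000, 100_000):
    full = score_matrix_bytes(n)
    stats = softmax_stats_bytes(n)
    print(f"N={n:>7}: N x N matrix = {full / 1e9:8.3f} GB, "
          f"row statistics = {stats / 1e6:6.3f} MB")
```

Under these assumptions, a single head at $N = 100{,}000$ needs 20 GB for the matrix alone, compared with 0.8 MB for the row statistics. Multiplying by the number of heads and the batch size shows why materializing the matrix becomes impractical at long context.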

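FlashAttention splits $Q$, $K$, and $V$ into blocks that fit in SRAM and fuses the score computation, softmax, and multiplication by $V$ into a single kernel, so the $N \times N$ matrix is never written to HBM:

Memory: $O(N)$ extra memory instead of $O(N^2)$
Speed: roughly 2-4× faster than standard attention in wall-clock time, while still computing exact (not approximate) attention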
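The blockwise computation can be sketched in plain Python. This is a simplified illustration, not the actual CUDA kernel: it walks over one query row at a time and tiles only $K$ and $V$, whereas the real algorithm also tiles $Q$ and keeps its blocks in SRAM. The function names and the block size of 4 are assumptions made for this example.

```python
import math
import random

def standard_attention(Q, K, V):
    """Reference: computes a full row of N scores for every query."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(x - m) for x in scores]
        total = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / total
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=4):
    """Visits K/V in blocks, keeping only a running max, a running sum,
    and a running output per query row."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running max of the scores seen so far
        s = 0.0             # running sum of exp(score - m)
        acc = [0.0] * dv    # running unnormalized output
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in Kb]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescales old state to the new max
            p = [math.exp(x - m_new) for x in scores]
            s = s * scale + sum(p)
            acc = [a * scale + sum(pj * v[c] for pj, v in zip(p, Vb))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / s for a in acc])
    return out

random.seed(0)
N, d = 10, 3
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

ref = standard_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=4)
max_err = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print(max_err)
```

The two functions agree up to floating-point rounding, which illustrates why the tiled form is exact rather than an approximation. The per-row state (`m`, `s`, `acc`) is independent of $N$, which is where the $O(N)$ total extra memory comes from.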

Dao et al., 2022, NeurIPS

Dao, 2023, arXiv
