
FlashAttention: IO-Aware Attention

Enabled training with 100K+ token context windows; without it, attention's quadratic memory cost makes long context impractical


Why It Matters for Modern Models

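  • Enabled training with 100K+ token context windows; without it, attention's quadratic memory cost makes long context impractical
  • Foundational for modern LLMs: FlashAttention-style kernels ship in mainstream frameworks (for example, as a backend of PyTorch's scaled_dot_product_attention) and are used in open models such as LLaMA; use in closed models such as GPT-4 and Claude is widely assumed but not publicly documented
  • Shows that an algorithm designed around the GPU memory hierarchy can deliver large speedups on the same hardware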

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • GPU memory hierarchy matters: SRAM (fast, small) vs HBM (slow, large)
  • Materializing the N×N attention matrix means writing it to HBM and reading it back; that memory traffic, not FLOPs, is the bottleneck
  • Online softmax: the softmax normalizer can be computed in one pass by tracking a running max and a rescaled running sum
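The online softmax idea in the last bullet can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the FlashAttention papers; the function name `online_softmax` is hypothetical, and the scores are assumed to be finite floats. The point it shows is the rescaling step: whenever a larger maximum appears, the sum accumulated so far is multiplied by `exp(old_max - new_max)` so that it stays consistent with the new maximum.

```python
import math

def online_softmax(scores):
    """Softmax whose max and normalizer are gathered in a single pass.

    Hypothetical helper for illustration; assumes finite input scores.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the old sum to the new max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    # Normalizing the outputs is a second loop here; in attention, the
    # weighted sum of V is accumulated and rescaled in the same pass instead.
    return [math.exp(s - running_max) / running_sum for s in scores]

weights = online_softmax([1.0, 2.0, 3.0])
```

Because every exponent is taken relative to the running max, large scores such as 1000.0 do not overflow, which is the same reason the standard two-pass softmax subtracts the max first.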

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$\text{Memory: } O(N) \text{ vs } O(N^2)$

Standard attention materializes intermediate matrices of size $O(N^2)$ (the scores $QK^T$ and their softmax):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$
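As a point of reference, the formula can be written out directly. The sketch below is a plain-Python illustration (the name `naive_attention` and the list-of-rows layout are assumptions made for this example, not part of any library). It deliberately builds the whole score matrix first, which is the $O(N^2)$ intermediate that the rest of this section avoids.

```python
import math

def naive_attention(Q, K, V):
    """Reference attention on lists of row vectors (illustrative only)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # scores[i][j] = (q_i . k_j) / sqrt(d); all N x N entries exist at once.
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    output = []
    for row in scores:
        row_max = max(row)  # subtract the max for numerical stability
        weights = [math.exp(s - row_max) for s in row]
        total = sum(weights)
        output.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                       for c in range(len(V[0]))])
    return output

# With all-zero keys every score is 0, so each output row is the mean of V's rows.
out = naive_attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```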

FlashAttention fuses these operations into a single kernel, keeping the working blocks in on-chip SRAM:

  • Tile Q, K, V into blocks that fit in SRAM
  • Compute the softmax one block at a time and merge blocks with the online softmax rescaling
  • Never materialize the full $N \times N$ attention matrix
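The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual kernel: it tiles only K and V and walks one query row at a time, whereas the real CUDA kernels also tile Q and choose block sizes to fit SRAM. The name `tiled_attention` and the `block_size` parameter are hypothetical. What it does show faithfully is that each row needs only a running max, a running sum, and an unnormalized output accumulator, so no N×N matrix is ever stored.

```python
import math

def tiled_attention(Q, K, V, block_size=2):
    """Attention computed over K/V blocks with online softmax (illustrative)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    output = []
    for q in Q:  # simplification: one query row at a time
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(V[0])  # unnormalized weighted sum of V rows
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            # Scores for this block only: at most block_size values, never N.
            s_blk = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(s_blk))
            correction = math.exp(running_max - new_max)  # rescales old state
            p_blk = [math.exp(s - new_max) for s in s_blk]
            running_sum = running_sum * correction + sum(p_blk)
            acc = [a * correction + sum(p * v[c] for p, v in zip(p_blk, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        output.append([a / running_sum for a in acc])
    return output
```

The result is the same exact attention output for any block size, up to floating-point rounding; only the order of the arithmetic changes.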

Memory: $O(N)$ extra memory instead of $O(N^2)$
Speed: roughly 2-4× faster than a standard attention implementation, depending on sequence length and hardware
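To make the memory gap concrete, here is a back-of-the-envelope calculation. The helper names are hypothetical, and the element sizes are assumptions: 2 bytes per score (16-bit) for the full matrix, and two 4-byte values per row (running max and running sum) for the tiled version. Both figures are per attention head and per sequence.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes to store one full N x N score matrix (assumes 16-bit scores)."""
    return seq_len * seq_len * bytes_per_element

def softmax_stats_bytes(seq_len, bytes_per_element=4):
    """Bytes for a per-row running max and running sum (assumes 32-bit values)."""
    return 2 * seq_len * bytes_per_element

for n in (1_000, 10_000, 100_000):
    full_gb = attention_matrix_bytes(n) / 1e9
    stats_mb = softmax_stats_bytes(n) / 1e6
    print(f"N={n:>7,}: full matrix {full_gb:8.3f} GB, row statistics {stats_mb:6.3f} MB")
```

At N = 100,000 the full matrix comes to 20 GB under these assumptions, while the per-row statistics take 0.8 MB, which is why the quadratic term dominates long before compute does.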

Canonical Papers

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao et al., 2022, NeurIPS

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, 2023, arXiv
