Legacy Concept Lab
Grouped Query Attention (GQA)
Used in LLaMA 2, Mistral—critical for efficient long-context inference
#80 | GQA | Efficiency
key equation
\text{KV cache} = \frac{g}{h} \times \text{MHA cache}
Phase 2: Architecture fundamentals
Concept 80 of 100
Why It Matters for Modern Models
- Adopted by LLaMA 2 (the 34B and 70B models) and Mistral 7B, where it keeps long-context inference affordable
- Reduces KV cache without significant quality loss
- Enables running larger context windows on consumer hardware
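To make the memory claim concrete, here is a minimal sketch of the KV cache arithmetic. The helper name `kv_cache_bytes` and the model shape (80 layers, 64 query heads, head dimension 128, fp16 cache, 4096 tokens) are illustrative assumptions chosen for this example, not values taken from this page.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every cached position."""
    # The leading factor 2 covers the two separate tensors, K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed, illustrative model shape (hypothetical values for this sketch).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 128, 4096

mha_bytes = kv_cache_bytes(N_LAYERS, n_kv_heads=N_HEADS, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mqa_bytes = kv_cache_bytes(N_LAYERS, n_kv_heads=1, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

GIB = 1024 ** 3
print(f"MHA (g=64): {mha_bytes / GIB:.2f} GiB")  # 10.00 GiB
print(f"GQA (g=8):  {gqa_bytes / GIB:.2f} GiB")  # 1.25 GiB
print(f"MQA (g=1):  {mqa_bytes / GIB:.2f} GiB")  # 0.16 GiB
```

Under these assumptions, going from 64 KV heads to 8 shrinks the per-sequence cache from 10 GiB to 1.25 GiB. That is the kind of gap that decides whether a long context fits on a single consumer GPU.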
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Not all heads need unique K, V—sharing works surprisingly well
- GQA with g=h is MHA; g=1 is MQA; g in between is the sweet spot
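- The decoding speedup comes from reading a smaller KV cache from memory at each step, not from fewer FLOPs: every query head still attends over the full sequence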
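The last two points can be checked with a rough cost model for one decoding step in a single layer. This is a back-of-the-envelope sketch under stated assumptions: the function `decode_step_cost` is hypothetical, it counts only the two attention matrix products (softmax and the projections are ignored), and it assumes each group's K and V are read from memory once per step.

```python
def decode_step_cost(n_heads, n_groups, head_dim, seq_len, bytes_per_elem=2):
    """Rough per-layer cost of attending one new token over seq_len cached positions."""
    if n_heads % n_groups != 0:
        raise ValueError("n_heads must be divisible by n_groups")
    # Two products (scores = q.K^T, output = weights.V), each costing one
    # multiply-add (2 FLOPs) per head, cached position, and channel.
    flops = 2 * 2 * n_heads * seq_len * head_dim
    # K and V are stored once per group, so bytes read scale with n_groups.
    kv_bytes_read = 2 * n_groups * seq_len * head_dim * bytes_per_elem
    return flops, kv_bytes_read


H, D, T = 32, 128, 8192  # assumed head count, head dimension, cached tokens
for name, g in [("MHA", H), ("GQA", 8), ("MQA", 1)]:
    flops, kv_bytes = decode_step_cost(H, g, D, T)
    print(f"{name} (g={g:>2}): {flops / 1e6:.1f} MFLOPs, "
          f"{kv_bytes / 2**20:.1f} MiB of KV cache read")
```

In this model the FLOP count is identical for all three settings, while the KV bytes read drop by 4x for g=8 and by 32x for g=1. Decoding is typically limited by memory bandwidth, so the reduction in bytes read is what shortens the step.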
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Multi-Head Attention: Each head has its own Q, K, V.
Multi-Query Attention: All heads share a single K, V; each head keeps its own Q.
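Grouped Query Attention: The h query heads are split into g groups, and each group shares one K, V:
\text{head}_i = \text{Attention}\left(Q_i,\; K_{\lfloor i g / h \rfloor},\; V_{\lfloor i g / h \rfloor}\right), \quad i = 0, \dots, h-1
where the h/g heads in a group share the same K, V projections (h is assumed divisible by g).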
KV cache savings: Memory is reduced by a factor of h/g (h heads, g groups).
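The grouping can be sketched as a single-token decode step in plain Python. This is a minimal illustration, not a production kernel: the names `attend` and `gqa_decode_step` are hypothetical, there is no batching, and vectors are plain lists so the head-to-group lookup stays visible.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def gqa_decode_step(queries, k_cache, v_cache):
    """queries: h query vectors, one per head.
    k_cache, v_cache: g entries, each the cached vectors shared by one group."""
    h, g = len(queries), len(k_cache)
    if len(v_cache) != g or h % g != 0:
        raise ValueError("need matching K/V groups and h divisible by g")
    heads_per_group = h // g
    outputs = []
    for i, q in enumerate(queries):
        group = i // heads_per_group  # consecutive heads share a group
        outputs.append(attend(q, k_cache[group], v_cache[group]))
    return outputs


# Tiny example: h = 4 heads, g = 2 groups, 3 cached positions, head_dim = 2.
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
k_cache = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # keys shared by heads 0 and 1
    [[0.0, 1.0], [0.0, 1.0], [2.0, 0.0]],  # keys shared by heads 2 and 3
]
v_cache = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # values shared by heads 0 and 1
    [[6.0, 5.0], [4.0, 3.0], [2.0, 1.0]],  # values shared by heads 2 and 3
]
out = gqa_decode_step(queries, k_cache, v_cache)
```

Heads 0 and 1 use the same query here and sit in the same group, so their outputs match. Head 2 uses that query too but reads a different group's cache, so its output differs. Passing h cache entries reproduces MHA, and passing one reproduces MQA.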