
Grouped Query Attention (GQA)

Used in LLaMA 2, Mistral—critical for efficient long-context inference




Key equation (h query heads, g KV groups)

\text{KV cache} = \frac{g}{h} \times \text{MHA cache}

Phase 2: Architecture fundamentals, Concept 80 of 100

Why It Matters for Modern Models

Used in LLaMA 2 (the 34B and 70B models) and Mistral 7B; critical for efficient long-context inference
Reduces KV cache without significant quality loss
Enables running larger context windows on consumer hardware

What Tutorials Skip

What is still poorly explained in textbooks and papers:

Not all heads need unique K, V—sharing works surprisingly well
GQA with g=h is MHA; g=1 is MQA; g in between is the sweet spot
Speedup comes from smaller memory reads, not fewer FLOPs
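
The last point can be checked with back-of-the-envelope arithmetic. The sketch below is a rough cost model, not a profiler: the function name `decode_step_cost` and its parameters are hypothetical, it counts only the score and weighted-sum matmuls for one layer and one generated token, and it assumes the kernel reads each group's cached K and V once.

```python
def decode_step_cost(n_heads, n_groups, head_dim, seq_len, bytes_per_elem=2):
    """Rough per-layer attention cost for one new token against seq_len cached tokens."""
    # Q.K^T scores plus the attention-weighted sum of V: each is one
    # multiply-add (2 FLOPs) per query head, cached position, and head dim.
    flops = 2 * (2 * n_heads * seq_len * head_dim)
    # K and V are cached once per group, so cache reads scale with n_groups.
    kv_bytes_read = 2 * n_groups * seq_len * head_dim * bytes_per_elem
    return flops, kv_bytes_read


mha_flops, mha_bytes = decode_step_cost(n_heads=32, n_groups=32, head_dim=128, seq_len=4096)
gqa_flops, gqa_bytes = decode_step_cost(n_heads=32, n_groups=8, head_dim=128, seq_len=4096)
print(mha_flops == gqa_flops)   # FLOPs depend on the number of query heads only
print(mha_bytes // gqa_bytes)   # cache traffic shrinks by h / g
```

Under these assumptions the FLOP count is identical for both settings, while the bytes read from the cache drop by h/g. Decoding is typically limited by memory bandwidth, which is why the smaller read translates into speed.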

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\text{KV cache} = \frac{g}{h} \times \text{MHA cache}

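Multi-Head Attention (MHA): Each head has its own Q, K, and V projections.
Multi-Query Attention (MQA): All heads share a single K and V; each head keeps its own Q.
Grouped Query Attention (GQA): Heads are split into groups, and the heads within a group share K and V: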

\text{GQA}: \text{head}_i = \text{Attention}(XW^Q_i, XW^K_{g(i)}, XW^V_{g(i)})

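where $g(i)$ is the index of the group that head $i$ belongs to (not the group count $g$ used in the cache formula), so all heads in the same group share the same K, V projections.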
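
A minimal NumPy sketch of the formula above, under stated assumptions: heads are assigned to groups contiguously (head $i$ goes to group $\lfloor i / (h/g) \rfloor$), there is no causal mask and no output projection, and the names `grouped_query_attention`, `Wq`, `Wk`, `Wv` are illustrative rather than taken from any library.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(X, Wq, Wk, Wv):
    """X: (T, d_model); Wq: (h, d_model, d_head); Wk, Wv: (g, d_model, d_head)."""
    h, g = Wq.shape[0], Wk.shape[0]
    assert h % g == 0, "number of heads must be divisible by number of groups"
    heads_per_group = h // g

    Q = np.einsum("td,hdk->htk", X, Wq)  # one query projection per head
    K = np.einsum("td,gdk->gtk", X, Wk)  # one key projection per group
    V = np.einsum("td,gdk->gtk", X, Wv)  # one value projection per group

    # Only the g-sized K and V would be cached; here they are expanded so that
    # head i reads the K, V of group i // heads_per_group.
    K = np.repeat(K, heads_per_group, axis=0)
    V = np.repeat(V, heads_per_group, axis=0)

    scores = np.einsum("htk,hsk->hts", Q, K) / np.sqrt(Q.shape[-1])
    return np.einsum("hts,hsk->htk", softmax(scores), V)  # (h, T, d_head)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq = rng.normal(size=(8, 16, 4))   # h = 8 query heads
Wk = rng.normal(size=(2, 16, 4))   # g = 2 KV groups
Wv = rng.normal(size=(2, 16, 4))
print(grouped_query_attention(X, Wq, Wk, Wv).shape)  # (8, 5, 4)
```

Passing `Wk` and `Wv` with `h` groups recovers standard multi-head attention, and passing a single group gives multi-query attention, which matches the g=h and g=1 limits noted earlier.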

KV cache savings: Memory shrinks by a factor of $\frac{h}{g}$ ($h$ query heads, $g$ KV groups), because only $g$ K/V heads are cached per layer instead of $h$.
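
To make the factor concrete, here is a small worked calculation. The helper `kv_cache_bytes` and the configuration (80 layers, 64 query heads, 8 KV groups, head dimension 128, 16-bit values, 4096 tokens, batch size 1) are illustrative assumptions chosen to resemble a large model, not measurements.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # The leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


h, g = 64, 8  # query heads, KV groups
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=h, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=g, head_dim=128, seq_len=4096)

print(mha_bytes / 2**30)       # 10.0  (GiB with one K/V head per query head)
print(gqa_bytes / 2**30)       # 1.25  (GiB with 8 shared K/V groups)
print(mha_bytes // gqa_bytes)  # 8     (= h / g)
```

The same ratio holds at any sequence length, since every other term in the product is shared between the two settings.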

Canonical Papers

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Ainslie et al., 2023, EMNLP


Connections

Prerequisites

Efficient Attention, Long Context
