Representations

Rotary Position Embeddings (RoPE)

Canonical Papers

RoFormer: Enhanced Transformer with Rotary Position Embedding

Su et al., 2021, arXiv

Extending Context Window of Large Language Models via Positional Interpolation

Chen et al., 2023, arXiv

YaRN: Efficient Context Window Extension of Large Language Models

Peng et al., 2024, ICLR

Core Mathematics

Attention is permutation-equivariant by design—without position encoding, transformers can't distinguish token order.

RoPE encodes position as a rotation applied to queries and keys. In a single 2D subspace, a query at position m and a key at position n are rotated by position-dependent angles \theta_m and \theta_n:

\tilde q_m = R(\theta_m)\, q, \qquad \tilde k_n = R(\theta_n)\, k

where R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

The key property: rotations compose via relative position:

\tilde q_m^\top \tilde k_n = q^\top R(\theta_m)^\top R(\theta_n)\, k = q^\top R(\theta_n - \theta_m)\, k
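
As a quick numerical sanity check, the sketch below (NumPy) builds R(θ) explicitly and confirms that the rotated dot product depends only on the offset n − m. The single frequency ω = 0.3 and the random test vectors are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: verify q_m^T k_n depends only on n - m for one 2D pair.
import numpy as np

def rot(theta):
    """2x2 rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
omega = 0.3                                   # arbitrary frequency for the demo

for m, n in [(2, 7), (12, 17), (100, 105)]:   # same offset n - m = 5 each time
    score = (rot(m * omega) @ q) @ (rot(n * omega) @ k)   # rotated dot product
    direct = q @ rot((n - m) * omega) @ k                 # q^T R(theta_n - theta_m) k
    print(f"m={m:3d} n={n:3d}  score={score:.6f}  direct={direct:.6f}")  # identical
```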

Full RoPE applies this to d/2 two-dimensional pairs at different frequencies \omega_i = \text{base}^{-2i/d} (base = 10000 in the original paper), so pair i at position m is rotated by angle \theta = m\,\omega_i. The result is a multi-scale positional ruler.
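
A minimal sketch of the full multi-frequency version, assuming the common convention of pairing dimensions (2i, 2i+1) and base = 10000; the function name apply_rope and the test sizes are illustrative, not any particular library's API.

```python
# Sketch: multi-frequency RoPE applied to a d-dimensional vector (d even).
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate each dimension pair (2i, 2i+1) of x by angle m * omega_i."""
    d = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d // 2) / d)   # omega_i = base^(-2i/d)
    theta = m * omega                                # per-pair angle at position m
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin        # R(theta) applied pair-wise
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Attention scores depend only on the relative offset:
rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)
print(np.allclose(apply_rope(q, 3) @ apply_rope(k, 10),     # offset 7
                  apply_rope(q, 50) @ apply_rope(k, 57)))   # offset 7 -> True
```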

In complex notation, each 2D pair is treated as a single complex number and \tilde q_m = q \cdot e^{i\theta_m}, making relative position a phase difference.
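
The same rotation in complex form, as a hedged sketch: each pair (x_{2i}, x_{2i+1}) becomes the complex number x_{2i} + i x_{2i+1} and is multiplied by e^{i m ω_i}. The function name is illustrative; it should produce the same result as the real-valued sketch above.

```python
# Sketch: RoPE as a complex phase shift (equivalent to the 2x2 rotations above).
import numpy as np

def apply_rope_complex(x, m, base=10000.0):
    d = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d // 2) / d)
    z = x[0::2] + 1j * x[1::2]            # one complex number per 2D pair
    z = z * np.exp(1j * m * omega)        # rotation by theta = m * omega_i
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

# For q at position m and k at position n, q_m * conj(k_n) picks up a phase
# factor exp(i * (m - n) * omega_i): relative position appears as a phase difference.
```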

Key Equation
\tilde q_m^\top \tilde k_n = q^\top R(\theta_n - \theta_m)\, k

Why It Matters for Modern Models

  • GPT-NeoX, Llama 1/2/3, PaLM, Gemini, Claude 3: RoPE is the dominant position encoding for modern LLMs
  • Extends to longer contexts more gracefully than learned absolute positions, especially when combined with scaling methods such as Position Interpolation or YaRN
  • Multi-frequency structure naturally represents both local patterns (high ω) and long-range dependencies (low ω)

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why rotation specifically? Because group composition R(\theta_m)^\top R(\theta_n) = R(\theta_n - \theta_m) automatically produces relative position
  • How multi-frequency pairs work like clock hands: fast hands resolve nearby tokens, slow hands track distant ones
  • Why long-context methods (Position Interpolation, YaRN) rescale positions or frequencies: it keeps rotation angles within the range seen during training instead of extrapolating to unseen phases (see the sketch after this list)
  • Geometric picture: RoPE makes attention depend only on relative offsets, a form of translation equivariance for 1D sequences analogous to CNNs, realized here through the rotation group
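
To make the scaling bullet concrete, here is a hedged sketch of Position Interpolation's core move: compress positions by train_length / target_length so no frequency rotates further than it did during training (YaRN additionally rescales different frequencies by different amounts, which is not shown). The function names and the example lengths are assumptions for illustration.

```python
# Sketch: Position Interpolation keeps RoPE angles inside the trained range.
import numpy as np

def rope_angles(m, d, base=10000.0):
    """Per-pair rotation angles m * omega_i for position m."""
    return m * base ** (-2.0 * np.arange(d // 2) / d)

def interpolated_angles(m, d, train_len, target_len, base=10000.0):
    """Position Interpolation: rescale position m by train_len / target_len."""
    return rope_angles(m * train_len / target_len, d, base)

d, train_len, target_len = 64, 2048, 8192
m = 6000                                    # position far beyond the training range
raw = rope_angles(m, d).max()               # largest angle without interpolation
pi  = interpolated_angles(m, d, train_len, target_len).max()
trained = rope_angles(train_len, d).max()   # largest angle ever seen in training
print(raw, pi, trained)                     # 6000.0  1500.0  2048.0
```

With the scale factor 2048/8192, position 6000 behaves like position 1500, which lies inside the trained range; the tradeoff is that nearby positions become harder to distinguish, which is part of what YaRN's frequency-dependent scaling addresses.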
