Representations

Rotary Position Embeddings (RoPE)

Canonical Papers

RoFormer: Enhanced Transformer with Rotary Position Embedding

Su et al., 2021, arXiv

Extending Context Window of Large Language Models via Positional Interpolation

Chen et al., 2023, arXiv

YaRN: Efficient Context Window Extension of Large Language Models

Peng et al., 2024, ICLR

Core Mathematics

Attention is permutation-equivariant by design—without position encoding, transformers can't distinguish token order.

RoPE encodes position as a rotation applied to queries and keys. In a single 2D subspace, a query at position m and a key at position n are rotated by position-dependent angles \theta_m and \theta_n:

\tilde q_m = R(\theta_m)\, q, \qquad \tilde k_n = R(\theta_n)\, k

where R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

The key property: rotations compose via relative position:

\tilde q_m^\top \tilde k_n = q^\top R(\theta_m)^\top R(\theta_n)\, k = q^\top R(\theta_n - \theta_m)\, k
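
As a quick numerical sanity check, the sketch below (NumPy) builds R(θ) explicitly and confirms that the rotated dot product depends only on the offset n − m. The single frequency ω = 0.3 and the random test vectors are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: verify q_m^T k_n depends only on n - m for one 2D pair.
import numpy as np

def rot(theta):
    """2x2 rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
omega = 0.3                                   # arbitrary frequency for the demo

for m, n in [(2, 7), (12, 17), (100, 105)]:   # same offset n - m = 5 each time
    score = (rot(m * omega) @ q) @ (rot(n * omega) @ k)   # rotated dot product
    direct = q @ rot((n - m) * omega) @ k                 # q^T R(theta_n - theta_m) k
    print(f"m={m:3d} n={n:3d}  score={score:.6f}  direct={direct:.6f}")  # identical
```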

Full RoPE applies this to d/2 two-dimensional pairs at different frequencies \omega_i = \text{base}^{-2i/d} (base = 10000 in the original paper), so pair i at position m is rotated by angle \theta = m\,\omega_i. The result is a multi-scale positional ruler.
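
A minimal sketch of the full multi-frequency version, assuming the common convention of pairing dimensions (2i, 2i+1) and base = 10000; the function name apply_rope and the test sizes are illustrative, not any particular library's API.

```python
# Sketch: multi-frequency RoPE applied to a d-dimensional vector (d even).
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate each dimension pair (2i, 2i+1) of x by angle m * omega_i."""
    d = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d // 2) / d)   # omega_i = base^(-2i/d)
    theta = m * omega                                # per-pair angle at position m
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin        # R(theta) applied pair-wise
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Attention scores depend only on the relative offset:
rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)
print(np.allclose(apply_rope(q, 3) @ apply_rope(k, 10),     # offset 7
                  apply_rope(q, 50) @ apply_rope(k, 57)))   # offset 7 -> True
```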

In complex notation, each 2D pair is treated as a single complex number and \tilde q_m = q \cdot e^{i\theta_m}, making relative position a phase difference.
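
The same rotation in complex form, as a hedged sketch: each pair (x_{2i}, x_{2i+1}) becomes the complex number x_{2i} + i x_{2i+1} and is multiplied by e^{i m ω_i}. The function name is illustrative; it should produce the same result as the real-valued sketch above.

```python
# Sketch: RoPE as a complex phase shift (equivalent to the 2x2 rotations above).
import numpy as np

def apply_rope_complex(x, m, base=10000.0):
    d = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d // 2) / d)
    z = x[0::2] + 1j * x[1::2]            # one complex number per 2D pair
    z = z * np.exp(1j * m * omega)        # rotation by theta = m * omega_i
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

# For q at position m and k at position n, q_m * conj(k_n) picks up a phase
# factor exp(i * (m - n) * omega_i): relative position appears as a phase difference.
```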

Key Equation
\tilde q_m^\top \tilde k_n = q^\top R(\theta_n - \theta_m)\, k

Why It Matters for Modern Models

  • GPT-NeoX, Llama 1/2/3, PaLM, Gemini, Claude 3: RoPE is the dominant position encoding for modern LLMs
  • Extends to longer contexts more gracefully than learned absolute positions, especially when combined with scaling methods such as Position Interpolation or YaRN
  • Multi-frequency structure naturally represents both local patterns (high ω) and long-range dependencies (low ω)

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why rotation specifically? Because group composition R(\theta_m)^\top R(\theta_n) = R(\theta_n - \theta_m) automatically produces relative position
  • How multi-frequency pairs work like clock hands: fast hands resolve nearby tokens, slow hands track distant ones
  • Why long-context methods (Position Interpolation, YaRN) rescale positions or frequencies: it keeps rotation angles within the range seen during training instead of extrapolating to unseen phases (see the sketch after this list)
  • Geometric picture: RoPE makes attention depend only on relative offsets, a form of translation equivariance for 1D sequences analogous to CNNs, realized here through the rotation group
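
To make the scaling bullet concrete, here is a hedged sketch of Position Interpolation's core move: compress positions by train_length / target_length so no frequency rotates further than it did during training (YaRN additionally rescales different frequencies by different amounts, which is not shown). The function names and the example lengths are assumptions for illustration.

```python
# Sketch: Position Interpolation keeps RoPE angles inside the trained range.
import numpy as np

def rope_angles(m, d, base=10000.0):
    """Per-pair rotation angles m * omega_i for position m."""
    return m * base ** (-2.0 * np.arange(d // 2) / d)

def interpolated_angles(m, d, train_len, target_len, base=10000.0):
    """Position Interpolation: rescale position m by train_len / target_len."""
    return rope_angles(m * train_len / target_len, d, base)

d, train_len, target_len = 64, 2048, 8192
m = 6000                                    # position far beyond the training range
raw = rope_angles(m, d).max()               # largest angle without interpolation
pi  = interpolated_angles(m, d, train_len, target_len).max()
trained = rope_angles(train_len, d).max()   # largest angle ever seen in training
print(raw, pi, trained)                     # 6000.0  1500.0  2048.0
```

With the scale factor 2048/8192, position 6000 behaves like position 1500, which lies inside the trained range; the tradeoff is that nearby positions become harder to distinguish, which is part of what YaRN's frequency-dependent scaling addresses.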
