Sequence Modeling
From recurrence to attention to selective state spaces
The Challenge of Sequences
Sequential data is everywhere: language, audio, time series, code. The fundamental challenge is how to process variable-length inputs while maintaining computational efficiency and capturing long-range dependencies.
Three paradigms have emerged, each with distinct tradeoffs between expressiveness, parallelism, and memory complexity.
Self-Attention
The Transformer's key innovation: every token can directly attend to every other token. No recurrence needed. The attention weight from token i to token j is:
α_ij = softmax_j(Q_i · K_j / √d)
This enables parallel training but costs O(n²) time and memory in the sequence length. For long sequences, this becomes prohibitive.
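As a minimal sketch, here is single-head scaled dot-product attention in NumPy; the shapes, toy inputs, and function names are illustrative rather than taken from any particular library.

```python
# Minimal sketch of scaled dot-product attention (single head, no masking).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (n, d) arrays; returns (n, d) outputs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (n, n) -- the O(n^2) term
    alpha = softmax(scores, axis=-1)   # row i holds alpha_ij over all j
    return alpha @ V

n, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)               # shape (8, 16)
```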
RoPE: Rotary Position Embeddings
How do transformers know token order? RoPE encodes position by rotating query and key vectors in 2D subspaces. Each position gets a unique rotation angle:
R(mθ) = [cos(mθ), -sin(mθ); sin(mθ), cos(mθ)]
The key insight: relative positions are encoded in the angle between rotated vectors, enabling length generalization beyond training context.
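A minimal NumPy sketch of the idea, assuming the common 10000^(−2i/d) frequency schedule; the function name and shapes are illustrative.

```python
# Rotate consecutive (even, odd) feature pairs of a query/key vector
# by position-dependent angles m * theta_i.
import numpy as np

def rope(x, position, base=10000.0):
    """x: (d,) vector with even d; returns the rotated vector."""
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)  # theta_i for each 2D subspace
    angles = position * freqs                  # m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
q_rot = rope(q, position=3)
```

Because each pair is rotated by m·θ_i, the inner product between a rotated query at position m and a rotated key at position n depends only on the offset m − n.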
KV Cache
During autoregressive generation, we only need to compute attention for the new token. The KV cache stores previously computed key and value vectors:
At step t: K_cache = [K_1, ..., K_{t-1}], V_cache = [V_1, ..., V_{t-1}]
Only Q_t, K_t, V_t are computed for the new token
This reduces per-token compute from O(n²) to O(n), but requires O(n·d) memory per layer. For long contexts, this memory cost dominates.
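A sketch of a single decoding step with a growing cache (single head, NumPy); the weight names and toy dimensions are assumptions for illustration.

```python
# One autoregressive step: append the new K/V to the cache, attend with q_t only.
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, K_cache, V_cache):
    """x_t: (d_model,) embedding of the new token; caches are Python lists."""
    q_t, k_t, v_t = x_t @ W_q, x_t @ W_k, x_t @ W_v
    K_cache.append(k_t)                        # cache grows: O(n * d) memory
    V_cache.append(v_t)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q_t / np.sqrt(q_t.shape[0])   # O(t) work for the new token
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ V, K_cache, V_cache

d_model, d_head = 16, 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
K_cache, V_cache = [], []
for step in range(4):                          # decode four tokens
    x_t = rng.standard_normal(d_model)
    out, K_cache, V_cache = decode_step(x_t, W_q, W_k, W_v, K_cache, V_cache)
```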
Grouped Query Attention
Multi-Head Attention (MHA) uses separate K, V heads per query head. GQA shares K, V across groups of query heads, reducing KV cache size:
MHA: n_heads × d_head for K and V each
GQA: n_groups × d_head for K and V each (where n_groups < n_heads)
MQA (Multi-Query Attention) is the extreme case: all query heads share a single K, V pair. GQA balances memory efficiency with model quality.
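A small sketch of the bookkeeping, with illustrative head counts, showing how the per-token KV cache size shrinks from MHA to GQA to MQA.

```python
# Per-token KV cache entries (keys + values) under each scheme.
n_heads, n_groups, d_head = 32, 8, 128         # illustrative sizes

kv_per_token_mha = 2 * n_heads  * d_head       # separate K and V per query head
kv_per_token_gqa = 2 * n_groups * d_head       # K and V shared within each group
kv_per_token_mqa = 2 * 1        * d_head       # a single K/V pair for all heads

# Query head h reads the K/V of group h // (n_heads // n_groups).
group_of_head = [h // (n_heads // n_groups) for h in range(n_heads)]

print(kv_per_token_mha, kv_per_token_gqa, kv_per_token_mqa)  # 8192 2048 256
```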
SwiGLU Activation
The feed-forward network in each transformer block uses gated activations. SwiGLU combines Swish activation with a gating mechanism:
SwiGLU(x) = Swish(xW_1) ⊙ (xW_2)
where Swish(x) = x · σ(x)
The gating allows the network to selectively pass or block information, improving training stability and final performance over ReLU.
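A minimal sketch of a SwiGLU feed-forward layer in NumPy; the down-projection W3 back to the model dimension is the usual third matrix in practice, and all weight names here are illustrative.

```python
# Gated feed-forward: Swish(x W1) elementwise-multiplied by (x W2), then projected down.
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))             # x * sigmoid(x)

def swiglu_ffn(x, W1, W2, W3):
    """x: (d_model,); W1, W2: (d_model, d_ff); W3: (d_ff, d_model)."""
    gate = swish(x @ W1)                       # Swish(x W1)
    value = x @ W2                             # x W2
    return (gate * value) @ W3                 # elementwise gate, then project back
```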
Mixture of Experts
Scale model capacity without scaling compute. MoE replaces the FFN with multiple "expert" networks, routing each token to a subset:
y = Σ_i G(x)_i · Expert_i(x)
where G(x) = TopK(softmax(x · W_gate))
Typically only 1-2 experts are active per token (sparse routing), achieving massive parameter counts while keeping inference cost manageable.
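A sketch of top-k routing in NumPy; renormalizing the gate weights over the selected experts is one common convention, and all names are illustrative.

```python
# Sparse MoE layer: route the token to its k highest-scoring experts.
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """x: (d,), W_gate: (d, n_experts), experts: list of callables."""
    logits = x @ W_gate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    top = np.argsort(probs)[-k:]               # indices of the k best experts
    weights = probs[top] / probs[top].sum()    # renormalize over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 4
W_gate = rng.standard_normal((d, n_experts))
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
y = moe_layer(rng.standard_normal(d), W_gate, experts, k=2)
```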
The Full Transformer
Putting it all together: each transformer block consists of:
1. Multi-Head Self-Attention (with RoPE)
2. Residual Connection + Layer Norm
3. Feed-Forward Network (SwiGLU)
4. Residual Connection + Layer Norm
Stack N of these blocks, add embeddings at input and unembedding at output, and you have a modern large language model.
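As a sketch, the block above reduces to a few lines, with attn, ffn, norm1, and norm2 standing in for the sublayers described earlier (the ordering follows the post-norm list above; many recent models place the norm before each sublayer instead).

```python
# Skeleton of one block; attn, ffn, norm1, norm2 are passed in as callables.
def transformer_block(x, attn, ffn, norm1, norm2):
    x = norm1(x + attn(x))   # steps 1-2: self-attention, residual, norm
    x = norm2(x + ffn(x))    # steps 3-4: feed-forward, residual, norm
    return x
```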
LayerNorm vs RMSNorm
Normalization is critical for training stability. LayerNorm normalizes across features:
LayerNorm: (x - μ) / σ · γ + β
RMSNorm: x / RMS(x) · γ (no mean centering)
RMSNorm is simpler (no mean computation) and often works as well, making it popular in modern architectures like LLaMA.
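Both norms in a few lines of NumPy, normalizing over the feature axis; eps is the usual small constant for numerical stability.

```python
# LayerNorm centers and scales; RMSNorm only scales by the root-mean-square.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma                     # no mean centering, no bias
```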
Sliding Window Attention
For very long sequences, even with a KV cache the total attention work is O(n²) and the cache grows without bound. Sliding window attention limits each token to attending only within a local window:
Token i attends to [i-w, i] where w = window size
Per-token attention cost and KV cache size: O(w) instead of O(n)
Information propagates across layers: with L layers and window w, effective context is L × w tokens. Used in Mistral and other efficient models.
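A sketch of the corresponding causal mask in NumPy; real implementations avoid materializing the full n × n mask, so this is purely illustrative.

```python
# Boolean mask: position i may attend to positions j in [i - w, i].
import numpy as np

def sliding_window_mask(n, w):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j >= i - w)             # (n, n) boolean mask

mask = sliding_window_mask(n=6, w=2)
# Row 5 is True only at columns 3, 4, 5.
```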
State Space Models
SSMs model sequences as linear dynamical systems. At each step, a hidden state is updated based on the input and a learned transition matrix:
h_t = A·h_{t-1} + B·x_t
y_t = C·h_t + D·x_t
The beauty is that these can be computed as a convolution during training (O(n log n)) while maintaining O(1) memory per step during inference.
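The recurrent (inference-time) view in NumPy; shapes are illustrative, with a scalar input channel and state size m.

```python
# Sequential SSM scan: constant-size state h carried across steps.
import numpy as np

def ssm_scan(A, B, C, D, x):
    """A: (m, m), B: (m,), C: (m,), D: scalar, x: (n,) input sequence."""
    h = np.zeros(A.shape[0])
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A @ h + B * x[t]       # h_t = A h_{t-1} + B x_t
        y[t] = C @ h + D * x[t]    # y_t = C h_t + D x_t
    return y

rng = np.random.default_rng(0)
m = 4
A = 0.9 * np.eye(m)                # a stable (decaying) transition
B, C = rng.standard_normal((2, m))
y = ssm_scan(A, B, C, D=0.5, x=rng.standard_normal(32))
```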
Mamba: Selective State Spaces
Mamba's key insight: make the SSM parameters input-dependent. The projections B and C and the step size Δ become functions of the current input (A is modulated indirectly through Δ during discretization), enabling the model to selectively remember or forget information.
B_t = Linear(x_t), C_t = Linear(x_t)
Δ_t = softplus(Linear(x_t))
This breaks the convolution structure but enables hardware-efficient selective scanning that achieves linear scaling with sequence length.
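A toy, sequential version of the selective scan for a single channel, assuming a diagonal A and the simplified Δ·B discretization; this is a readable reference loop, not the fused parallel scan, and every weight name is illustrative.

```python
# Selective scan: B_t, C_t, and Delta_t are computed from the current input,
# so the state update can keep or discard information per step.
import numpy as np

def selective_scan(x, A, w_B, w_C, w_delta):
    """x: (n,) inputs; A, w_B, w_C: (m,) per-state parameters; w_delta: scalar."""
    h = np.zeros(A.shape[0])
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        delta = np.log1p(np.exp(w_delta * x_t))  # Delta_t = softplus(Linear(x_t))
        B_t = w_B * x_t                          # B_t = Linear(x_t)
        C_t = w_C * x_t                          # C_t = Linear(x_t)
        A_bar = np.exp(delta * A)                # discretized A: Delta_t gates memory
        h = A_bar * h + delta * B_t * x_t        # selective state update
        y[t] = C_t @ h
    return y

rng = np.random.default_rng(0)
m = 4
A = -np.abs(rng.standard_normal(m))              # negative diagonal: decaying state
w_B, w_C = rng.standard_normal((2, m))
y = selective_scan(rng.standard_normal(64), A, w_B, w_C, w_delta=1.0)
```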
Tradeoffs
| Method | Training | Inference | Long-range |
|---|---|---|---|
| Transformer | O(n²) | O(n)/step | Direct |
| Linear SSM | O(n log n) | O(1)/step | Decay |
| Mamba | O(n) | O(1)/step | Selective |