Sequence Modeling

From recurrence to attention to selective state spaces

The Challenge of Sequences

Sequential data is everywhere: language, audio, time series, code. The fundamental challenge is how to process variable-length inputs while maintaining computational efficiency and capturing long-range dependencies.

Three paradigms have emerged, each with distinct tradeoffs between expressiveness, parallelism, and memory complexity.

Self-Attention

The Transformer's key innovation: every token can directly attend to every other token. No recurrence needed. The attention weight from token i to token j is:

Attention Mechanism

α_ij = softmax_j(Q_i · K_jᵀ / √d)

This enables parallel training but comes with O(n²) memory complexity in sequence length. For long sequences, this becomes prohibitive.
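A minimal single-head sketch of this computation in NumPy; the causal mask and the (n, d) shapes are assumptions of the example, not part of the formula above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    """Q, K, V: (n, d) for one head. Returns the (n, d) attended values."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # (n, n) score matrix -- the O(n²) cost
    if causal:
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # block attention to future tokens
    alpha = softmax(scores, axis=-1)          # attention weights α_ij
    return alpha @ V
```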


RoPE: Rotary Position Embeddings

How do transformers know token order? RoPE encodes position by rotating query and key vectors in 2D subspaces. Each position gets a unique rotation angle:

Rotary Embedding

R(mθ) = [cos(mθ), -sin(mθ); sin(mθ), cos(mθ)]

The key insight: relative positions are encoded in the angle between rotated vectors, enabling length generalization beyond training context.
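A sketch of applying rotary embeddings in NumPy, assuming the interleaved pairing of adjacent dimensions (other layouts exist) and the conventional base of 10000.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each adjacent pair of dimensions of x: (n, d), d even, by a position-dependent angle."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                   # token positions m
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # one frequency θ_k per 2D subspace
    angles = pos * inv_freq                       # (n, d/2) rotation angles m·θ_k
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # paired coordinates
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin            # apply the 2D rotation R(mθ)
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The rotation is applied to both queries and keys before the dot product, so the score between positions m and n depends only on the offset m − n.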

KV Cache

During autoregressive generation, we only need to compute attention for the new token. The KV cache stores previously computed key and value vectors:

Incremental Decoding

At step t: K_cache = [K_1, ..., K_{t-1}]
Only compute Q_t, K_t, V_t for the new token

This reduces per-token compute from O(n²) to O(n), but requires O(n·d) memory per layer. For long contexts, this memory cost dominates.
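An illustrative decoding step with a growing cache; the cache layout (Python lists of per-token vectors) and the weight names are hypothetical choices for this sketch.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One generation step. x_t: (d,) new-token activation; cache holds lists of past K, V."""
    q_t = x_t @ W_q                       # projections only for the new token
    cache['K'].append(x_t @ W_k)
    cache['V'].append(x_t @ W_v)
    K = np.stack(cache['K'])              # (t, d) keys seen so far, reused from the cache
    V = np.stack(cache['V'])
    scores = K @ q_t / np.sqrt(q_t.shape[-1])   # (t,) scores -- O(t) work per token
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ V                      # (d,) attention output for the new token

# Usage: cache = {'K': [], 'V': []}; call decode_step once per generated token.
```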

Grouped Query Attention

Multi-Head Attention (MHA) uses separate K, V heads per query head. GQA shares K, V across groups of query heads, reducing KV cache size:

GQA Memory Savings

MHA: n_heads × d_head per token for K and V each
GQA: n_groups × d_head per token (where n_groups < n_heads)

MQA (Multi-Query Attention) is the extreme case: all query heads share a single K, V pair. GQA balances memory efficiency with model quality.
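A sketch of GQA as an explicit loop over query heads, each reading from its shared K/V head; causal masking and the input/output projections are omitted to keep it short.

```python
import numpy as np

def gqa(Q, K, V):
    """Q: (n_heads, n, d_h); K, V: (n_kv_heads, n, d_h), with n_heads a multiple of n_kv_heads."""
    n_heads, n, d_h = Q.shape
    n_kv_heads = K.shape[0]
    group = n_heads // n_kv_heads              # query heads per shared K/V head
    out = np.empty_like(Q)
    for h in range(n_heads):
        kv = h // group                        # K/V head this query head maps to
        scores = Q[h] @ K[kv].T / np.sqrt(d_h)
        alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)
        out[h] = alpha @ V[kv]
    return out                                 # (n_heads, n, d_h)
```

Setting n_kv_heads = 1 recovers MQA; n_kv_heads = n_heads recovers standard MHA.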

SwiGLU Activation

The feed-forward network in each transformer block uses gated activations. SwiGLU combines Swish activation with a gating mechanism:

SwiGLU

SwiGLU(x) = Swish(x·W_1) ⊙ (x·W_2)
where Swish(z) = z · σ(z)

The gating allows the network to selectively pass or block information, improving training stability and final performance over ReLU.
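A sketch of the SwiGLU feed-forward layer. The output projection W3 is the standard third matrix of a transformer FFN and is an addition to the two-matrix formula above.

```python
import numpy as np

def swiglu_ffn(x, W1, W2, W3):
    """x: (n, d). W1, W2: (d, d_ff) value/gate projections; W3: (d_ff, d) output projection."""
    swish = lambda z: z / (1.0 + np.exp(-z))     # Swish(z) = z·σ(z)
    hidden = swish(x @ W1) * (x @ W2)            # elementwise gating (⊙)
    return hidden @ W3
```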

Mixture of Experts

Scale model capacity without scaling compute. MoE replaces the FFN with multiple "expert" networks, routing each token to a subset:

Expert Routing

y = Σ_i G(x)_i · Expert_i(x)
where G(x) = TopK(softmax(x · W_gate))

Typically only 1-2 experts are active per token (sparse routing), achieving massive parameter counts while keeping inference cost manageable.
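A sketch of top-k routing for a single token, keeping the raw softmax weights of the selected experts as in the formula above (some implementations renormalize over the top-k instead). The function and parameter names are hypothetical.

```python
import numpy as np

def moe_forward(x, W_gate, experts, k=2):
    """Route one token x: (d,) to the top-k of `experts` (a list of callables)."""
    logits = x @ W_gate                          # (n_experts,) router scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top = np.argsort(probs)[-k:]                 # indices of the k highest-scoring experts
    return sum(probs[i] * experts[i](x) for i in top)   # weighted sum of expert outputs
```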

The Full Transformer

Putting it all together: each transformer block consists of:

  1. Multi-Head Self-Attention (with RoPE)
  2. Residual Connection + Layer Norm
  3. Feed-Forward Network (SwiGLU)
  4. Residual Connection + Layer Norm

Stack N of these blocks, add embeddings at input and unembedding at output, and you have a modern large language model.
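A sketch of one such block in the post-norm order listed above, composing the `attention` and `swiglu_ffn` sketches from the earlier sections; the parameter dictionary `p` is hypothetical, and RoPE, multiple heads, and output projections are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalization without the learned γ, β, kept short for the sketch."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(x, p):
    """One block: x is (n, d); `attention` and `swiglu_ffn` are the earlier sketches."""
    # 1-2. self-attention sub-layer, then residual connection + norm
    h = layer_norm(x + attention(x @ p['Wq'], x @ p['Wk'], x @ p['Wv']))
    # 3-4. SwiGLU feed-forward sub-layer, then residual connection + norm
    return layer_norm(h + swiglu_ffn(h, p['W1'], p['W2'], p['W3']))
```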

LayerNorm vs RMSNorm

Normalization is critical for training stability. LayerNorm normalizes across features:

Normalization

LayerNorm: (x - μ) / σ · γ + β
RMSNorm: x / RMS(x) · γ (no mean centering)

RMSNorm is simpler (no mean computation) and often works as well, making it popular in modern architectures like LLaMA.
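Both norms side by side in NumPy; the eps values and the trailing affine parameters follow common defaults and are assumptions of this sketch.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale-only normalization: no mean subtraction and no bias term."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def layer_norm(x, gamma, beta, eps=1e-6):
    """Full LayerNorm for comparison: center, rescale, then apply the affine γ, β."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta
```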

Sliding Window Attention

For very long sequences, even attention with a KV cache becomes expensive: per-token compute and cache size both grow with n, and training cost still grows as O(n²). Sliding window attention limits each token to attending only within a local window:

Local Attention

Token i attends to [i-w, i] where w = window size
Memory: O(w) per token instead of O(n)

Information propagates across layers: with L layers and window w, effective context is L × w tokens. Used in Mistral and other efficient models.
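A sketch of the banded causal mask; it plugs into the earlier attention sketch (an assumption of this example) by masking the scores before the softmax.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask, True = blocked: token i may attend only to j in [i-w, i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j > i) | (j < i - w)    # block future tokens and tokens beyond the window

# Usage with the earlier attention sketch: mask scores before the softmax, e.g.
#   scores = np.where(sliding_window_mask(n, w), -np.inf, scores)
```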

State Space Models

SSMs model sequences as linear dynamical systems. At each step, a hidden state is updated based on the input and a learned transition matrix:

SSM Recurrence

h_t = A·h_{t-1} + B·x_t
y_t = C·h_t + D·x_t

The beauty is that these can be computed as a convolution during training (O(n log n)) while maintaining O(1) memory per step during inference.
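The recurrent (inference-time) form written as a simple scan; a dense A and a scalar input sequence are simplifying assumptions of this sketch.

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """x: (n,) float input sequence; A: (d, d); B: (d,); C: (d,); D: scalar."""
    h = np.zeros(A.shape[0])
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A @ h + B * x[t]        # h_t = A·h_{t-1} + B·x_t  (O(1) state per step)
        y[t] = C @ h + D * x[t]     # y_t = C·h_t + D·x_t
    return y
```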


Mamba: Selective State Spaces

Mamba's key insight: make the SSM parameters input-dependent. The projections B, C and the step size Δ become functions of the current input (A itself stays fixed, but its discretization exp(Δ·A) varies through Δ), enabling the model to selectively remember or forget information.

Selective SSM

B_t = Linear(x_t)
C_t = Linear(x_t)
Δ_t = softplus(Linear(x_t))

This breaks the convolution structure but enables hardware-efficient selective scanning that achieves linear scaling with sequence length.
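A heavily simplified selective scan with a diagonal, per-channel state: B_t, C_t, and Δ_t are computed from the input, and the discretized decay exp(Δ_t·A) varies through Δ_t. All shapes and the Euler-style discretization of B are assumptions of this sketch, not Mamba's exact parameterization or fused kernel.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_B, W_C, W_dt):
    """x: (n, d); A: (d, d_state) per-channel decay rates (negative);
    W_B, W_C: (d, d_state); W_dt: (d, d)."""
    n, d = x.shape
    h = np.zeros_like(A)                        # (d, d_state): one small state per channel
    y = np.empty_like(x)
    for t in range(n):
        B_t = x[t] @ W_B                        # input-dependent B_t: (d_state,)
        C_t = x[t] @ W_C                        # input-dependent C_t: (d_state,)
        dt = softplus(x[t] @ W_dt)[:, None]     # per-channel step size Δ_t: (d, 1)
        A_bar = np.exp(dt * A)                  # discretized decay, varies through Δ_t
        h = A_bar * h + dt * B_t * x[t][:, None]    # selectively write the new input
        y[t] = h @ C_t                          # read out with input-dependent C_t
    return y
```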

Tradeoffs

Method         Training       Inference (per step)    Long-range
Transformer    O(n²)          O(n) with KV cache      Direct
Linear SSM     O(n log n)     O(1)                    Decay
Mamba          O(n)           O(1)                    Selective