Legacy Concept Lab

In-Context Learning: Learning Without Weight Updates

ICL is arguably the signature capability of large language models: task adaptation without fine-tuning.

Concept 37 of 100 · Phase 5: Representation & interpretability

Why It Matters for Modern Models

  • Enables rapid prototyping and deployment: just change the prompt, not the model
  • Creates the "prompt engineering" paradigm and explains why few-shot examples matter

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • ICL in pretrained language models improves sharply with scale: small models show little few-shot benefit, and a rough threshold around 1B+ parameters is often cited, though it varies with task and training data
  • Induction heads (copy-from-context circuits) appear to be necessary but not sufficient for sophisticated ICL
  • ICL is not the same as memorization: models can generalize from demonstrations to genuinely new tasks
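
The copy-from-context behaviour attributed to induction heads can be illustrated with a toy rule: look back for the most recent earlier occurrence of the current token and predict whatever followed it. The sketch below is a hand-written algorithm, not an attention head, and the function name `induction_predict` is hypothetical; it only shows the input/output pattern such a circuit is thought to produce.

```python
def induction_predict(tokens):
    """Toy induction rule (illustrative only, not a real attention head).

    Find the most recent earlier occurrence of the last token and return
    the token that followed it. Returns None when there is no earlier match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["A", "B", "C", "D", "A"]
print(induction_predict(context))  # B
```

A rule like this can only repeat continuations already present in the context, which is one way to see why copying alone would not be enough for tasks that require applying a demonstrated mapping to an unseen input.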

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\hat{y} = \arg\max_y p_\theta(y \mid \text{examples}, x_{\text{query}})$$

In-context learning performs task adaptation through the prompt alone:

Given examples $(x_1, y_1), \ldots, (x_k, y_k)$ and query $x_{k+1}$:

$$\hat{y}_{k+1} = \arg\max_y p_\theta(y \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1})$$

No gradient updates are applied to $\theta$; the model "learns" by conditioning on demonstrations.
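
As a rough sketch of the prediction rule above, the code below serializes the demonstrations and the query into one prompt and takes an argmax over candidate outputs. Everything here is an assumption made for illustration: the `Input:`/`Output:` prompt format, the helper names, and especially `toy_log_prob`, a hand-written word-overlap scorer standing in for $\log p_\theta(y \mid \text{prompt})$. A real language model would supply that score.

```python
import math

def build_prompt(examples, query):
    """Serialize k demonstrations followed by the unanswered query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def icl_predict(log_prob, examples, query, candidates):
    """Return argmax over candidates of log_prob(prompt, y).

    The scoring function is treated as frozen: nothing is trained here.
    """
    prompt = build_prompt(examples, query)
    return max(candidates, key=lambda y: log_prob(prompt, y))

def toy_log_prob(prompt, y):
    """Hypothetical stand-in for log p_theta(y | prompt).

    Scores y by how many words the query shares with demonstrations
    labelled y. It reads only the prompt, like a frozen model would.
    """
    *demos, query_block = prompt.split("\n\n")
    query_words = set(query_block.split("\n")[0][len("Input: "):].split())
    overlap = 0
    for block in demos:
        x_line, y_line = block.split("\n")
        if y_line[len("Output: "):] == y:
            overlap += len(query_words & set(x_line[len("Input: "):].split()))
    return math.log(1 + overlap)

labels = ["positive", "negative"]
query = "really great fun"
demos = [("great fun movie", "positive"), ("dull slow movie", "negative")]
swapped = [("great fun movie", "negative"), ("dull slow movie", "positive")]

print(icl_predict(toy_log_prob, demos, query, labels))    # positive
print(icl_predict(toy_log_prob, swapped, query, labels))  # negative
```

The scorer is identical in both calls; only the demonstrations change, and the prediction changes with them. That is the sense in which the task is specified by the prompt rather than by the parameters.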

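Mechanistic hypothesis: attention heads can implement an approximate gradient descent step on an implicit linear model $W$ fit to the demonstrations, with step size $\eta$:

$$W_{\text{updated}} \approx W + \eta \sum_i (y_i - W x_i) x_i^T$$

Each term is a prediction error $(y_i - W x_i)$ times an input $x_i$, summed over the demonstrations. This is the kind of computation attention is suited to, since it retrieves and aggregates relevant examples from the context.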
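
To make the hypothesis concrete in its simplest case, the sketch below compares one gradient descent step with a softmax-free ("linear") attention readout. It assumes scalar targets, a starting point of $W = 0$, and unnormalized attention with keys $x_i$, values $\eta y_i$, and the query input as the attention query. These simplifications are chosen for illustration and are not a claim about what trained transformers compute. Under them, the prediction after one step is $\eta \sum_i y_i (x_i \cdot x_{\text{query}})$, which is exactly the linear attention output.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gd_step(w, xs, ys, eta):
    """One step w + eta * sum_i (y_i - w . x_i) * x_i for scalar targets."""
    new_w = list(w)
    for x, y in zip(xs, ys):
        residual = y - dot(w, x)  # prediction error on demonstration i
        for j, xj in enumerate(x):
            new_w[j] += eta * residual * xj
    return new_w

def linear_attention(query, keys, values):
    """Unnormalized attention without softmax: sum_i (query . key_i) * value_i."""
    return sum(dot(query, k) * v for k, v in zip(keys, values))

# Illustrative demonstrations (made-up numbers) and a query input.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]
eta = 0.1

# Prediction after one gradient step starting from w = 0.
w1 = gd_step([0.0, 0.0], xs, ys, eta)
pred_gd = dot(w1, x_query)

# Same prediction read out by linear attention over the demonstrations.
pred_attn = linear_attention(x_query, xs, [eta * y for y in ys])

print(pred_gd, pred_attn)  # the two values agree up to floating-point rounding
```

With a nonzero starting $W$ or with softmax attention the correspondence is no longer exact, which is why the update in the text is written with $\approx$.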

Canonical Papers

Language Models are Few-Shot Learners
Brown et al., 2020, NeurIPS

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Garg et al., 2022, NeurIPS
