Legacy Concept Lab

In-Context Learning: Learning Without Weight Updates



Phase 5: Representation & interpretability, Concept 37 of 100

Why It Matters for Modern Models

In-context learning (ICL) is arguably the signature capability of large language models: task adaptation without fine-tuning.
It enables rapid prototyping and deployment: change the prompt, not the model.
It underlies the "prompt engineering" paradigm and explains why few-shot examples matter.

What is still poorly explained in textbooks and papers:

In pretrained language models, ICL emerges with scale: small models show little of it, and there appears to be a threshold around 1B+ parameters, though the exact point likely depends on the task and the training data.
Induction heads (copy-from-context circuits) appear to be necessary but not sufficient for sophisticated ICL.
ICL is not the same as memorization: models can generalize from demonstrations to tasks they were never explicitly trained on.
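The copy-from-context behavior attributed to induction heads can be illustrated with a toy rule: find the most recent earlier occurrence of the current token and predict whatever followed it. The sketch below is a plain Python stand-in, not a transformer circuit; the function name `induction_copy` and the example token sequences are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def induction_copy(tokens: Sequence[str]) -> Optional[str]:
    """Toy stand-in for an induction head (an illustrative assumption).

    Finds the most recent earlier occurrence of the final token and
    returns the token that followed it, or None if there is no match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# A repeated sequence: the rule completes the pattern [A][B] ... [A] -> [B].
repeated = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_copy(repeated))  # Dursley

# A few-shot style prompt: copying picks the token after the latest "->".
few_shot = ["cat", "->", "chat", ";", "dog", "->", "chien", ";", "cat", "->"]
print(induction_copy(few_shot))  # chien
```

On the repeated sequence the rule completes the pattern, but on the few-shot prompt it copies "chien" (the token after the latest "->") rather than the demonstrated answer for "cat". That is one simple way to see why copying alone would not be sufficient for sophisticated ICL.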


Key Equation

$$\hat{y} = \arg\max_y p_\theta(y \mid \text{examples}, x_{\text{query}})$$

In-context learning performs task adaptation through the prompt alone:

Given examples $(x_1, y_1), \ldots, (x_k, y_k)$ and query $x_{k+1}$:

$$\hat{y}_{k+1} = \arg\max_y p_\theta(y \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1})$$

No gradient updates are applied to $\theta$; the model "learns" only by conditioning on the demonstrations.
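The prediction rule above can be sketched as a prompt-building step followed by an argmax over candidate outputs. This is a minimal sketch under stated assumptions: the "Input:/Output:" template, the names `build_prompt` and `icl_predict`, and especially `toy_log_prob` are hypothetical. `toy_log_prob` is not a language model; it is a placeholder for $\log p_\theta(y \mid \text{prompt})$ that simply favors labels seen more often in the demonstrations, so that the example runs without any external library.

```python
import math
from typing import Callable, Sequence, Tuple


def build_prompt(examples: Sequence[Tuple[str, str]], query: str) -> str:
    """Concatenate k demonstrations and the query into one prompt string."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def icl_predict(
    log_prob: Callable[[str, str], float],
    examples: Sequence[Tuple[str, str]],
    query: str,
    candidates: Sequence[str],
) -> str:
    """Return argmax_y log_prob(prompt, y). No parameters are updated."""
    prompt = build_prompt(examples, query)
    return max(candidates, key=lambda y: log_prob(prompt, y))


def toy_log_prob(prompt: str, y: str) -> float:
    """Hypothetical placeholder scorer, NOT a real model.

    Scores a label by how often it appears as a demonstrated output.
    """
    return math.log(1 + prompt.count(f"Output: {y}"))


examples = [
    ("great movie", "positive"),
    ("terrible plot", "negative"),
    ("loved it", "positive"),
]
print(icl_predict(toy_log_prob, examples, "what a delight", ["positive", "negative"]))
# positive
```

With a real model, `log_prob` would be replaced by a call that scores each candidate continuation given the prompt; the surrounding logic would stay the same, which is the sense in which the task is specified by the prompt rather than by the weights.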

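Mechanistic hypothesis: attention heads implement an approximate gradient-descent step on a least-squares loss over the in-context examples, acting as if an implicit linear map $W$ were updated: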

$$W_{\text{updated}} \approx W + \eta \sum_{i=1}^{k} (y_i - W x_i)\, x_i^\top$$

This behavior is thought to arise from the attention mechanism's ability to retrieve and aggregate relevant examples.
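The update above is exactly one gradient-descent step on the loss $\tfrac{1}{2}\sum_i \lVert y_i - W x_i \rVert^2$. In the simplified case $W = 0$, the resulting prediction for a query is $\eta \sum_i y_i \,(x_i^\top x_{\text{query}})$, which has the form of unnormalized linear attention with keys $x_i$, values $y_i$, and the query as the attention query. The NumPy sketch below checks that identity numerically. It assumes a purely linear setting with no softmax; the dimensions, the seed, and the step size `eta` are arbitrary illustrative choices, and nothing here claims that trained transformers compute exactly this.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k = 4, 2, 8                 # illustrative sizes (assumptions)
W_true = rng.normal(size=(d_out, d_in))  # hidden task the examples come from
X = rng.normal(size=(k, d_in))           # rows are the in-context inputs x_i
Y = X @ W_true.T                         # rows are the targets y_i = W_true x_i
x_query = rng.normal(size=d_in)
eta = 0.01                               # small step size (assumption)


def loss(W: np.ndarray) -> float:
    """Least-squares loss 0.5 * sum_i ||y_i - W x_i||^2."""
    return 0.5 * float(np.sum((Y - X @ W.T) ** 2))


def gd_step(W: np.ndarray) -> np.ndarray:
    """One step: W + eta * sum_i (y_i - W x_i) x_i^T."""
    residual = Y - X @ W.T               # shape (k, d_out)
    return W + eta * residual.T @ X      # shape (d_out, d_in)


W0 = np.zeros((d_out, d_in))
W1 = gd_step(W0)
pred_gd = W1 @ x_query                   # prediction after one GD step

# Linear attention view: scores are dot products between keys and the query,
# and the output is the score-weighted sum of the values.
scores = X @ x_query                     # shape (k,)
pred_attn = eta * scores @ Y             # shape (d_out,)

print(np.allclose(pred_gd, pred_attn))   # True
print(loss(W1) < loss(W0))               # True
```

The first check confirms that, in this linear toy setting, one gradient step from zero and a linear attention readout give the same prediction. The second confirms that the step reduces the loss on the in-context examples.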

Brown et al., 2020, NeurIPS

Garg et al., 2022, NeurIPS
