Legacy Concept Lab
In-Context Learning: Learning Without Weight Updates
ICL is arguably THE signature capability of large language models—task adaptation without fine-tuning
#37 ICL Representations
Key equation
$$\hat{y} = \arg\max_y p_\theta(y \mid \text{examples}, x_{\text{query}})$$
Phase 5: Representation & interpretability
Concept 37 of 100
Why It Matters for Modern Models
- ICL is arguably THE signature capability of large language models—task adaptation without fine-tuning
- Enables rapid prototyping and deployment: just change the prompt, not the model
- Creates the "prompt engineering" paradigm and explains why few-shot examples matter
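To make "just change the prompt, not the model" concrete, here is a minimal sketch of few-shot prompt construction. The function name `build_few_shot_prompt` and the `Input:`/`Output:` template are illustrative assumptions, not a format prescribed by this page or by any particular model.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) demonstrations followed by an unanswered query.

    The template below is a hypothetical choice for illustration only.
    """
    blocks = [instruction] if instruction else []
    for x, y in examples:
        blocks.append(f"Input: {x}\nOutput: {y}")
    # The final block leaves the output blank for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Switching tasks means switching demonstrations; no weights are touched.
sentiment_prompt = build_few_shot_prompt(
    [("great movie", "positive"), ("dull plot", "negative")],
    "loved the soundtrack",
)
translation_prompt = build_few_shot_prompt(
    [("cat", "chat"), ("dog", "chien")],
    "house",
)
print(sentiment_prompt)
```

The same function serves both tasks; only the demonstrations differ, which is the sense in which the prompt carries the task specification.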
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- ICL appears to emerge with scale: small models show little of it, and there seems to be a threshold around 1B+ parameters
- Induction heads (copy-from-context circuits) are necessary but not sufficient for sophisticated ICL
- ICL is not the same as memorization: models can interpolate to genuinely new tasks from demonstrations
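The copy-from-context behavior attributed to induction heads can be illustrated with a hard-coded rule. This is a toy sketch under strong assumptions: it operates on exact token matches rather than learned attention patterns, and the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict whatever followed it. Returns None if no match."""
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the last token itself).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# The pattern "The cat" appeared earlier, so the rule copies "cat".
print(induction_predict(["The", "cat", "sat", ".", "The"]))
# No earlier occurrence of "z", so the rule has nothing to copy.
print(induction_predict(["x", "y", "z"]))
```

A rule like this can only repeat continuations it has already seen in the context, which is one way to read the point above that such circuits alone do not account for sophisticated ICL.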
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
In-context learning performs task adaptation through the prompt alone:
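Given examples $\{(x_i, y_i)\}_{i=1}^{k}$ and query $x_{\text{query}}$:
$$\hat{y} = \arg\max_y p_\theta(y \mid \text{examples}, x_{\text{query}})$$
No gradient updates to $\theta$: the model "learns" by conditioning on demonstrations.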
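The key equation can be sketched as an argmax over candidate outputs, with the scoring function held fixed while the demonstrations change. Everything here is an illustrative assumption: `toy_log_score` is a hypothetical word-overlap stand-in for a real model's log-probability, and it is unnormalized (which does not affect the argmax).

```python
import math


def predict_in_context(log_score, examples, query, candidates):
    """Return the candidate y that maximizes log_score(y, examples, query)."""
    return max(candidates, key=lambda y: log_score(y, examples, query))


def toy_log_score(y, examples, query):
    """Hypothetical stand-in for log p_theta(y | examples, query).

    Scores a label by how many words the query shares with demonstrations
    carrying that label. Unnormalized; argmax is unaffected.
    """
    query_words = set(query.split())
    overlap = sum(
        len(query_words & set(x.split()))
        for x, label in examples
        if label == y
    )
    return math.log(1 + overlap)


labels = ["positive", "negative"]
examples = [("great fun movie", "positive"), ("boring slow movie", "negative")]
prediction = predict_in_context(toy_log_score, examples, "fun and great", labels)

# Flip the labels in the demonstrations: the scorer itself is unchanged,
# yet the prediction follows the new demonstrations.
flipped = [(x, "negative" if y == "positive" else "positive") for x, y in examples]
flipped_prediction = predict_in_context(toy_log_score, flipped, "fun and great", labels)

print(prediction, flipped_prediction)
```

The scorer plays the role of the frozen parameters: the only thing that differs between the two calls is the context it is conditioned on.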
Mechanistic hypothesis: attention heads implement an approximate gradient descent step on the in-context examples.
Under this hypothesis, the behavior emerges from the attention mechanism's ability to retrieve and aggregate relevant examples.
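One simple special case where the gradient descent reading holds exactly can be checked numerically. The sketch below assumes linear regression with squared loss, zero-initialized weights, and unnormalized linear attention (no softmax) with keys $x_i$, values $y_i$, and the query as $x_{\text{query}}$. These are assumptions chosen for illustration, not the page's own formula; softmax attention in real models would match this only approximately.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gd_step_prediction(xs, ys, x_query, lr):
    """Predict with w1 = w0 - lr * grad L(w0), where w0 = 0 and
    L(w) = 0.5 * sum_i (w . x_i - y_i)^2."""
    dim = len(x_query)
    w0 = [0.0] * dim
    # grad L(w) = sum_i (w . x_i - y_i) * x_i, evaluated at w0
    grad = [
        sum((dot(w0, x) - y) * x[d] for x, y in zip(xs, ys))
        for d in range(dim)
    ]
    w1 = [w0[d] - lr * grad[d] for d in range(dim)]
    return dot(w1, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: sum_i (x_query . x_i) * y_i, scaled by lr."""
    return lr * sum(dot(x_query, x) * y for x, y in zip(xs, ys))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # in-context inputs
ys = [2.0, -1.0, 1.0]                      # in-context targets
x_query = [0.5, 2.0]
lr = 0.1

gd_pred = gd_step_prediction(xs, ys, x_query, lr)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr)
print(gd_pred, attn_pred)
```

The two predictions agree because, from zero initialization, one gradient step gives $w_1 = \eta \sum_i y_i x_i$, so $w_1 \cdot x_{\text{query}} = \eta \sum_i y_i \,(x_i \cdot x_{\text{query}})$, which is the attention-style weighted sum over the demonstrations.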