Legacy Concept Lab
Logit Lens: Probing Intermediate Representations
First simple tool for "seeing inside" transformers—reveals layer-by-layer computation
#46 · Logit Lens · Representations
Key equation
$$\text{logits}^{(l)} = W_U \cdot h^{(l)}$$
Phase 5: Representation & interpretability · Concept 46 of 100
Why It Matters for Modern Models
- One of the earliest and simplest tools for inspecting transformer internals: it shows how a prediction forms layer by layer
- Shows that early layers often predict related tokens, later layers refine to the final answer
- Foundation for activation patching and circuit analysis techniques
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- The unembedding matrix acts as a "universal probe"—no training required, just matrix multiply
- Not all layers show sensible tokens: some layers store information in forms that do not decode to meaningful tokens
- The tuned lens (a learned affine map per layer, applied before unembedding) often gives better predictions than the raw logit lens
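The difference between the raw logit lens and a tuned lens can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the translator is assumed to have the form `A @ h + b`, the names `tuned_lens_logits`, `A`, and `b` are hypothetical, the weights are random toy values, and the training of the translators (typically against the model's final-layer output) is not shown.

```python
import numpy as np

def tuned_lens_logits(h, W_U, A, b):
    """Apply a per-layer affine translator, then unembed: W_U @ (A @ h + b).

    h:   (d_model,) residual stream at some layer.
    W_U: (vocab_size, d_model) unembedding matrix (assumed shape convention).
    A:   (d_model, d_model) translator weight for that layer (hypothetical name).
    b:   (d_model,) translator bias for that layer (hypothetical name).
    """
    return W_U @ (A @ h + b)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20            # toy sizes, chosen for illustration
W_U = rng.normal(size=(vocab_size, d_model))
h = rng.normal(size=d_model)

# Raw logit lens: no training, just a matrix multiply.
raw_logits = W_U @ h

# A translator initialized to the identity reproduces the raw logit lens exactly.
A_init = np.eye(d_model)
b_init = np.zeros(d_model)
tuned_logits_at_init = tuned_lens_logits(h, W_U, A_init, b_init)

# Any other translator (here an arbitrary, untrained one) changes the readout.
A_other = A_init + 0.1 * rng.normal(size=(d_model, d_model))
tuned_logits_other = tuned_lens_logits(h, W_U, A_other, b_init)
```

Starting the translator at the identity means the tuned lens begins as the raw logit lens, so any learned change is a per-layer correction to it.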
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Logit lens applies the unembedding matrix to intermediate residual stream states:
$$\text{logits}^{(l)} = W_U \cdot h^{(l)}$$
where $W_U$ is the unembedding matrix and $h^{(l)}$ is the residual stream at layer $l$.
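For a single layer, the equation is one matrix-vector product. The sketch below assumes `W_U` has shape `(vocab_size, d_model)` and uses random toy values. It also includes an optional normalization step, because implementations commonly apply the model's final layer norm to the residual stream before unembedding; the simplified `layer_norm` here (no learned gain or bias) and the function name `logit_lens_single` are assumptions made for illustration.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def logit_lens_single(h, W_U, apply_final_norm=True):
    """Compute logits^(l) = W_U . h^(l) for one residual-stream vector.

    h:   (d_model,) residual stream at layer l.
    W_U: (vocab_size, d_model) unembedding matrix (assumed shape convention).
    """
    if apply_final_norm:
        h = layer_norm(h)
    return W_U @ h

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50           # toy sizes, chosen for illustration
W_U = rng.normal(size=(vocab_size, d_model))
h_l = rng.normal(size=d_model)

# The equation exactly as written: a plain matrix-vector product.
raw_logits = logit_lens_single(h_l, W_U, apply_final_norm=False)

# With normalization, rescaling the residual stream barely changes the logits.
normed_logits = logit_lens_single(h_l, W_U)
rescaled_logits = logit_lens_single(3.0 * h_l, W_U)
```

The normalized variant makes readouts from different layers more comparable, since it removes the overall scale of the residual vector.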
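This reveals what token the model would predict if it "stopped" at layer $l$:
$$\hat{y}^{(l)} = \arg\max_{v} \, \text{logits}^{(l)}_{v}$$
where $v$ ranges over the vocabulary. The progression of $\hat{y}^{(l)}$ across layers shows how the prediction evolves through the network.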
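The layer-by-layer readout can be sketched on a hand-built toy example. Everything here is an assumption made for illustration: the four-word vocabulary, the identity unembedding matrix, and the residual states, which are constructed to drift from a related token ("France") toward the final answer ("Paris"). A real model would supply `hidden_states` and `W_U` from its own activations and weights.

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    """Project every layer's residual state through the unembedding matrix.

    hidden_states: (n_layers, d_model) residual stream at one token position.
    W_U:           (vocab_size, d_model) unembedding matrix.
    Returns logits of shape (n_layers, vocab_size).
    """
    return hidden_states @ W_U.T

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: each token gets its own orthogonal direction (identity unembedding).
vocab = ["the", "city", "Paris", "France"]
W_U = np.eye(len(vocab))
related, final = vocab.index("France"), vocab.index("Paris")

# Hand-built residual states that interpolate from the related token to the answer.
n_layers = 4
t = np.linspace(0.0, 1.0, n_layers)[:, None]
hidden_states = (1 - t) * W_U[related] + t * W_U[final]

logits = logit_lens(hidden_states, W_U)      # (n_layers, vocab_size)
probs = softmax(logits)                      # per-layer token distribution
top_ids = logits.argmax(axis=-1)             # predicted token id at each layer
top_tokens = [vocab[i] for i in top_ids]

for layer, (token, p) in enumerate(zip(top_tokens, probs.max(axis=-1))):
    print(f"layer {layer}: {token!r} (p={p:.2f})")
```

In this toy, the top token is "France" at the first two layers and "Paris" at the last two, which mirrors the pattern of early layers predicting related tokens and later layers refining to the final answer.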