Legacy Concept Lab

Logit Lens: Probing Intermediate Representations

First simple tool for "seeing inside" transformers—reveals layer-by-layer computation

Concept 46 of 100 · Phase 5: Representation & interpretability


Why It Matters for Modern Models

An early, simple tool for "seeing inside" transformers: it reveals layer-by-layer computation
Shows that early layers often predict related tokens, while later layers refine toward the final answer
A foundation for activation patching and circuit analysis techniques

What is still poorly explained in textbooks and papers:

The unembedding matrix acts as a "universal probe": no training is required, just a matrix multiply
Not every layer decodes to sensible tokens: some layers store information in ways that are not interpretable as tokens
The tuned lens (a learned affine map per layer) often works better than the raw logit lens
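
To make the contrast in the last point concrete, here is a minimal sketch of how a per-layer affine map sits in front of the unembedding. The function name `tuned_lens_logits` and the parameters `A` and `b` are hypothetical, and the values are placeholders rather than trained weights. In an actual tuned lens the affine map is learned per layer; none of that training is shown here.

```python
import numpy as np

def tuned_lens_logits(h, A, b, W_U):
    """Apply a per-layer affine map to h, then unembed.

    h:   (d_model,) residual stream at one layer
    A:   (d_model, d_model) affine weight (hypothetical, would be learned)
    b:   (d_model,) affine bias (hypothetical, would be learned)
    W_U: (vocab, d_model) unembedding matrix
    """
    return W_U @ (A @ h + b)

# Toy sizes and random values, purely illustrative.
rng = np.random.default_rng(0)
d_model, vocab = 8, 10
W_U = rng.normal(size=(vocab, d_model))
h = rng.normal(size=d_model)

# The raw logit lens is the special case A = I, b = 0.
raw_logits = W_U @ h
tuned_logits = tuned_lens_logits(h, np.eye(d_model), np.zeros(d_model), W_U)
print(np.allclose(raw_logits, tuned_logits))
```

With the identity map and zero bias the two sets of logits coincide, so the raw logit lens can be read as the untrained starting point of the tuned version.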


Key Equation

Logit lens applies the unembedding matrix to intermediate residual stream states:

$$\text{logits}^{(l)} = W_U \cdot h^{(l)}$$

where $W_U$ is the unembedding matrix and $h^{(l)}$ is the residual stream at layer $l$.

This reveals what token the model would predict if it "stopped" at layer $l$:

$$p^{(l)}(\text{token}) = \text{softmax}\big(\text{logits}^{(l)}\big)_{\text{token}}$$

The progression $p^{(0)} \to p^{(L)}$, where $L$ is the final layer, shows how the prediction evolves through the network.
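
The two equations above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the hidden states and $W_U$ are random placeholders rather than activations from a real model, and the names `logit_lens` and `hidden_states` are hypothetical. Practical implementations usually also apply the model's final layer norm to $h^{(l)}$ before unembedding; that step is left out here to match the equation as written.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def logit_lens(hidden_states, W_U):
    """Decode every layer's residual stream with the unembedding matrix.

    hidden_states: (L + 1, d_model), row l is h^(l) for one token position
    W_U:           (vocab, d_model) unembedding matrix
    Returns p^(l) for every layer, shape (L + 1, vocab).
    """
    logits = hidden_states @ W_U.T  # row l equals W_U @ h^(l)
    return softmax(logits, axis=-1)

# Toy sizes and random values standing in for real activations.
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 4, 8, 10
W_U = rng.normal(size=(vocab, d_model))
hidden_states = rng.normal(size=(n_layers + 1, d_model))

probs = logit_lens(hidden_states, W_U)
top_tokens = probs.argmax(axis=-1)

# Walk through p^(0) ... p^(L) and report the top token at each layer.
for layer, token_id in enumerate(top_tokens):
    print(f"layer {layer}: top token id {token_id}, p = {probs[layer, token_id]:.3f}")
```

With a real model, `hidden_states` would hold the residual stream captured after each block, and mapping `top_tokens` through the tokenizer would show the layer-by-layer progression described above.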

nostalgebraist (2020). LessWrong.
