Legacy Concept Lab

Logit Lens: Probing Intermediate Representations

First simple tool for "seeing inside" transformers—reveals layer-by-layer computation

Concept 46 of 100 | Phase 5: Representation & interpretability

Why It Matters for Modern Models

  • One of the first simple tools for "seeing inside" transformers: it shows how a prediction forms layer by layer
  • Shows that early layers often predict tokens related to the answer, while later layers refine toward the final prediction
  • A stepping stone toward activation patching and circuit analysis techniques

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • The unembedding matrix acts as a "universal probe": no training is required, just a matrix multiply
  • Not all layers show sensible tokens: some layers store information in ways that do not map cleanly onto tokens
  • The tuned lens (a learned affine map per layer) often works better than the raw logit lens
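The contrast between the raw logit lens and a tuned-lens-style readout can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation from any particular library: the function names (`matvec`, `logit_lens`, `tuned_lens`), the toy matrices, and the hand-picked affine parameters `A` and `b` are all assumptions made for the example. In a real tuned lens, `A` and `b` would be learned separately for each layer.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def logit_lens(W_U, h):
    """Raw logit lens: project h straight through the unembedding."""
    return matvec(W_U, h)


def tuned_lens(W_U, A, b, h):
    """Tuned-lens-style readout: apply the affine map A h + b, then unembed.

    A and b stand in for per-layer parameters that would normally be
    learned; here they are supplied by hand purely for illustration.
    """
    translated = [y + bias for y, bias in zip(matvec(A, h), b)]
    return matvec(W_U, translated)


# Toy setup (made-up values): 2-token vocabulary, d_model = 2.
W_U = [[1.0, 0.0], [0.0, 1.0]]
h = [2.0, -1.0]  # hypothetical intermediate residual-stream state

identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
# With A = I and b = 0 the tuned readout reduces to the raw logit lens.
assert tuned_lens(W_U, identity, zero, h) == logit_lens(W_U, h)

# A hand-picked (not learned) affine map that swaps the two coordinates,
# mimicking a layer whose basis differs from the final layer's basis.
A_swap = [[0.0, 1.0], [1.0, 0.0]]
raw = logit_lens(W_U, h)
tuned = tuned_lens(W_U, A_swap, zero, h)
print(raw)    # [2.0, -1.0]
print(tuned)  # [-1.0, 2.0]
```

The swap example is artificial, but it hints at why a per-layer affine map can help: if an intermediate layer encodes information in a different basis from the final layer, the raw unembedding reads it incorrectly, while a suitable affine map can translate it first.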

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\text{logits}^{(l)} = W_U \cdot h^{(l)}$$

Logit lens applies the unembedding matrix to intermediate residual stream states:

$$\text{logits}^{(l)} = W_U \cdot h^{(l)}$$

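where $W_U$ is the unembedding matrix and $h^{(l)}$ is the residual stream at layer $l$. In models with a final layer norm, $h^{(l)}$ is typically passed through that norm before being multiplied by $W_U$.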
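As a concrete instance of the key equation, here is a minimal sketch in plain Python. The function name `logit_lens` and the toy values are assumptions for illustration; a real model would use tensors with thousands of dimensions, and this sketch omits any final layer norm.

```python
def logit_lens(W_U, h):
    """Return logits = W_U · h for one residual-stream vector h.

    W_U: list of vocab_size rows, each of length d_model.
    h:   list of length d_model (the residual stream at some layer l).
    """
    if any(len(row) != len(h) for row in W_U):
        raise ValueError("every row of W_U must have the same length as h")
    return [sum(w * x for w, x in zip(row, h)) for row in W_U]


# Toy setup (made-up values): 3-token vocabulary, d_model = 2.
W_U = [
    [1.0, 0.0],  # unembedding row for token 0
    [0.0, 1.0],  # unembedding row for token 1
    [1.0, 1.0],  # unembedding row for token 2
]
h_layer = [0.5, -1.0]  # hypothetical residual stream at some layer l

logits = logit_lens(W_U, h_layer)
print(logits)  # [0.5, -1.0, -0.5]
```

Each logit is the dot product of one token's unembedding row with the hidden state, so the same function applies unchanged to any layer's residual stream.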

This reveals what token the model would predict if it "stopped" at layer $l$:

$$p^{(l)}(\text{token}) = \text{softmax}(\text{logits}^{(l)})$$

The progression $p^{(0)} \to p^{(L)}$ shows how the prediction evolves through the network.
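The layer-by-layer progression can be sketched by applying the unembedding and a softmax to each layer's hidden state and recording the top token. This is a toy illustration in plain Python: the helper names (`softmax`, `lens_trajectory`), the three-word vocabulary, and every numeric value are invented for the example and are not drawn from any real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lens_trajectory(W_U, hidden_states, vocab):
    """For each layer, return (layer index, top token, its probability)."""
    trajectory = []
    for layer, h in enumerate(hidden_states):
        # logits^(l) = W_U · h^(l)
        logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        trajectory.append((layer, vocab[best], probs[best]))
    return trajectory


# Toy setup (made-up values): 3-token vocabulary, d_model = 2, 3 layers.
vocab = ["cat", "dog", "car"]
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [
    [0.1, 0.0],  # layer 0: weak preference for a related token
    [0.5, 1.0],  # layer 1: the final answer starts to dominate
    [0.0, 3.0],  # layer 2: confident final prediction
]

trajectory = lens_trajectory(W_U, hidden_states, vocab)
for layer, token, p in trajectory:
    print(f"layer {layer}: top token = {token!r}, p = {p:.2f}")
```

In this toy trajectory the top token changes from "cat" at layer 0 to "dog" at layers 1 and 2, with the probability of "dog" rising as depth increases. That is the kind of refinement pattern the logit lens is used to look for.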

Canonical Papers

nostalgebraist (2020). interpreting GPT: the logit lens. LessWrong.
