Legacy Concept Lab

Self-Supervised Learning: Labels from Structure

Foundation of modern NLP: BERT, GPT, LLaMA all use self-supervised pretraining

Concept 65 of 100 (Phase 5: Representation & interpretability)

Why It Matters for Modern Models

  • Foundation of modern NLP: BERT, GPT, and LLaMA are all pretrained with self-supervised objectives (masked-token or next-token prediction)
  • Enables learning from internet-scale unlabeled data, a precondition for the behavior described by scaling laws
  • Self-supervised vision (MAE, DINO) is closing the gap with supervised ImageNet pretraining

What Tutorials Skip

What is still poorly explained in textbooks and papers:

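  • The pretext task (predict the missing parts) can only be solved well by capturing structure and semantics, and it needs no human labels
  • SSL works because predicting held-out tokens or pixels well requires modeling how the data is distributed, not just surface statistics
  • Transfer is not magic: SSL features tend to generalize because a hard pretext task forces the model to encode information that many downstream tasks also need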

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\mathcal{L} = -\mathbb{E}\left[\log p(x_{\text{masked}} \mid x_{\text{visible}})\right]$$

Self-supervised learning creates the supervision signal from the data itself. Three common objectives:

Masked Language Modeling (BERT):

$$\mathcal{L}_{MLM} = -\mathbb{E}_{x, M}\left[\sum_{i \in M} \log p(x_i \mid x_{\backslash M})\right]$$

where $M$ is the set of masked positions.
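As a rough illustration of the masked objective above, here is a minimal sketch in plain Python. The helper names (`make_mlm_example`, `mlm_loss`), the `mask_id` value, and the flat masking probability are assumptions made for this example. BERT's actual recipe is more involved: some selected tokens are replaced by random tokens or left unchanged.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_mlm_example(tokens, mask_id, mask_prob, rng):
    """Build (corrupted input, targets) from an unlabeled token sequence.

    targets maps each masked position i in M to the original token x_i,
    so the labels come from the data itself.
    """
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = mask_id  # hide the token from the model
            targets[i] = tok        # remember what it should predict
    return corrupted, targets


def mlm_loss(logits_per_position, targets):
    """Sum over masked positions of -log p(x_i | sequence with M masked)."""
    loss = 0.0
    for i, tok in targets.items():
        probs = softmax(logits_per_position[i])
        loss -= math.log(probs[tok])
    return loss


# Toy usage: a hypothetical "model" that outputs uniform scores everywhere.
rng = random.Random(0)
tokens = [1, 2, 3, 4, 1, 2, 3, 4]
corrupted, targets = make_mlm_example(tokens, mask_id=0, mask_prob=0.5, rng=rng)
uniform_logits = [[0.0] * 5 for _ in tokens]  # vocabulary of 5 ids, 0 = mask
loss = mlm_loss(uniform_logits, targets)
```

With uniform scores, every masked position contributes log(vocabulary size) to the loss. This is the baseline a trained model has to beat by using the visible context.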

Next Token Prediction (GPT):

$$\mathcal{L}_{NTP} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$$
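The next-token loss above can be sketched the same way. This is an illustrative toy, not any real model's code: `next_token_loss`, `uniform_logits`, and `repeat_logits` are hypothetical names, and the "model" is a hand-written scoring function standing in for a network.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def next_token_loss(tokens, logits_fn):
    """-sum_t log p(x_t | x_<t).

    logits_fn(prefix) returns one score per vocabulary id given the
    tokens seen so far; the target at step t is simply the next token,
    so no external labels are needed.
    """
    loss = 0.0
    for t in range(len(tokens)):
        log_probs = log_softmax(logits_fn(tokens[:t]))
        loss -= log_probs[tokens[t]]
    return loss


VOCAB_SIZE = 4  # assumption for this toy example


def uniform_logits(prefix):
    """Hypothetical model that knows nothing: equal scores for every token."""
    return [0.0] * VOCAB_SIZE


def repeat_logits(prefix):
    """Hypothetical model that expects the previous token to repeat."""
    logits = [0.0] * VOCAB_SIZE
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits


loss_uniform = next_token_loss([1, 1, 1], uniform_logits)
loss_repeat = next_token_loss([1, 1, 1], repeat_logits)
```

On the sequence `[1, 1, 1]` the repeat-biased toy model gets a lower loss than the uniform one, because its predictions match the structure of the data.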

Contrastive (SimCLR, CLIP):

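$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_k \exp(\text{sim}(z_i, z_k)/\tau)}$$

where $z_i$ is the embedding of an anchor example, $z_i^+$ the embedding of its positive (another view of the same example, or the paired caption in CLIP), $\text{sim}$ a similarity function such as cosine similarity, $\tau$ a temperature, and the sum over $k$ runs over the positive and the negatives.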
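The contrastive objective above can be sketched for a single anchor as follows. The function names, the choice of cosine similarity, and the default temperature are assumptions for this example. Real SimCLR and CLIP implementations compute the loss over whole batches of embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.1):
    """-log( exp(sim(z_i, z_i+)/tau) / sum_k exp(sim(z_i, z_k)/tau) ).

    The sum over k covers the positive and every negative.
    """
    scores = [cosine(anchor, positive) / tau]
    scores += [cosine(anchor, n) / tau for n in negatives]
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]


# Positive aligned with the anchor, negatives pointing elsewhere: low loss.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])

# Positive orthogonal to the anchor, a negative aligned with it: high loss.
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is a softmax cross-entropy in which the "correct class" is the positive pair. A lower temperature sharpens the penalty for negatives that score close to the positive.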

Canonical Papers

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin et al., 2019, NAACL

Masked Autoencoders Are Scalable Vision Learners

He et al., 2022, CVPR
