Self-Supervised Learning: Labels from Structure

Why It Matters for Modern Models

Foundation of modern NLP: BERT, GPT, LLaMA all use self-supervised pretraining
Enables learning from internet-scale unlabeled data, which is what makes it practical to keep scaling data and model size together
Self-supervised vision (MAE, DINO) is closing the gap with supervised ImageNet pretraining

What is still poorly explained in textbooks and papers:

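The task (predict missing parts) forces the model to capture structure and semantics, because surface statistics alone cannot fill in most gaps
SSL works because predicting tokens or pixels well requires modeling the conditional distributions of the data, not just its marginal statistics
Transfer learning: SSL features generalize because a hard pretext task cannot be solved without representations that many downstream tasks can reuse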
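
For the autoregressive case, the second point can be made precise with the chain rule of probability. This is a short worked identity rather than a new result; it uses the symbols $x_t$ and $x_{<t}$ in the same sense as the next-token objective in the Key Equation section.

\log p(x_1, \dots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t})

So minimizing the summed per-token loss is the same as maximizing the likelihood of whole sequences. Masked prediction does not decompose this cleanly: it fits conditionals of the form $p(x_i \mid x_{\setminus M})$ rather than a single factorized joint.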

Key Equation

\mathcal{L} = -\mathbb{E}\left[\log p(x_{\text{masked}} \mid x_{\text{visible}})\right]

Self-supervised learning creates the supervision signal from the data itself. Three common objectives:

Masked Language Modeling (BERT):

\mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x, M}\left[\sum_{i \in M} \log p(x_i \mid x_{\setminus M})\right]

where $M$ is the set of masked positions and $x_{\setminus M}$ denotes the tokens at all other (visible) positions.
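
The masked objective can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not BERT's actual implementation: `make_mask` and `mlm_loss` are hypothetical helper names, the "model" is a uniform distribution standing in for a real network, and the 0.15 masking rate mirrors the rate commonly reported for BERT. The sketch also omits BERT's practice of sometimes substituting a random or unchanged token at a masked position.

```python
import math
import random

def make_mask(seq_len, mask_prob=0.15, seed=0):
    """Sample the set M of masked positions (never empty)."""
    rng = random.Random(seed)
    masked = [i for i in range(seq_len) if rng.random() < mask_prob]
    # Guarantee at least one masked position so the loss is defined.
    return masked or [rng.randrange(seq_len)]

def mlm_loss(log_probs, tokens, masked):
    """Sum of -log p(x_i | visible tokens) over masked positions i in M.

    log_probs[i][v] is the model's log-probability of vocabulary id v
    at position i (assumed to be computed from visible tokens only).
    """
    return -sum(log_probs[i][tokens[i]] for i in masked)

# Toy check: a stand-in "model" that is uniform over a 4-token vocabulary.
vocab_size = 4
tokens = [2, 0, 3, 1, 1, 2]
uniform = [[math.log(1 / vocab_size)] * vocab_size for _ in tokens]
masked = [1, 4]  # fixed mask for a reproducible example
loss = mlm_loss(uniform, tokens, masked)
print(loss)  # equals 2 * log(4) for the uniform stand-in model
```

Only the masked positions contribute to the loss; the visible positions act purely as context, which is why the labels come for free from the data.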

Next Token Prediction (GPT):

\mathcal{L}_{\text{NTP}} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})
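
A small sketch of how the targets for this loss are obtained by shifting the sequence by one position. The helper names (`log_softmax`, `shift_pairs`, `ntp_loss`) and the all-zero logits are assumptions made for illustration; a real model would produce the logits from the preceding tokens. The sketch also skips the first term $p(x_1)$, which in practice is usually covered by prepending a start-of-sequence token.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def shift_pairs(tokens):
    """Inputs are all tokens but the last; targets are the sequence shifted left by one."""
    return tokens[:-1], tokens[1:]

def ntp_loss(logits_per_step, targets):
    """-sum_t log p(x_t | x_<t), where logits_per_step[t] scores the t-th target."""
    return -sum(log_softmax(step)[y] for step, y in zip(logits_per_step, targets))

tokens = [3, 1, 2, 0]
inputs, targets = shift_pairs(tokens)
# Hypothetical model output: all-zero scores, i.e. a uniform guess over 4 tokens.
logits = [[0.0] * 4 for _ in targets]
loss = ntp_loss(logits, targets)
print(inputs, targets, loss)
```

Every position in the sequence supplies its own label, so a single unlabeled sequence of length $T$ yields on the order of $T$ training signals.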

Contrastive (SimCLR, CLIP):

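\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_{k} \exp(\text{sim}(z_i, z_k)/\tau)}

where $z_i^+$ is the embedding of the positive for $z_i$ (another augmented view of the same example in SimCLR, the paired caption or image in CLIP), $z_k$ ranges over the candidates in the batch, $\text{sim}$ is a similarity function such as cosine similarity, and $\tau$ is a temperature.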
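
The contrastive loss for a single anchor can be sketched as follows. This is an illustrative sketch, not the SimCLR or CLIP implementation: `cosine` and `info_nce` are hypothetical names, the denominator is assumed to contain the positive plus an explicit list of negatives, and `tau=0.1` is an arbitrary choice for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, tau=0.1):
    """-log of the softmax weight assigned to the positive among all candidates."""
    candidates = [positive] + negatives
    scores = [cosine(anchor, c) / tau for c in candidates]
    # Stable log-sum-exp for the denominator.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_z)

anchor = [1.0, 0.0]
# Positive aligned with the anchor, negatives pointing elsewhere: small loss.
loss_good = info_nce(anchor, [1.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
# Positive opposite to the anchor while a negative is aligned: large loss.
loss_bad = info_nce(anchor, [-1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(loss_good, loss_bad)
```

A lower temperature sharpens the softmax, so the loss penalizes hard negatives (those most similar to the anchor) more heavily.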

Devlin et al., 2019, NAACL (BERT)

He et al., 2022, CVPR (MAE)