Legacy Concept Lab
Self-Supervised Learning: Labels from Structure
Foundation of modern NLP: BERT, GPT, LLaMA all use self-supervised pretraining
#65 SSL Representations
key equation
\mathcal{L} = -\mathbb{E}[\log p(x_{\text{masked}} \mid x_{\text{visible}})]
Phase 5: Representation & interpretability
Concept 65 of 100
Why It Matters for Modern Models
- Foundation of modern NLP: BERT, GPT, LLaMA all use self-supervised pretraining
- Enables learning from internet-scale unlabeled data, which supplies the data volume that scaling laws call for
- Self-supervised vision (MAE, DINO) is closing the gap with supervised ImageNet pretraining
What Tutorials Skip
What is still poorly explained in textbooks and papers:
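- The pretext task (predicting missing parts) can only be solved well by a model that captures structure and semantics
- SSL works because predicting tokens or pixels accurately requires modeling how each part of the data depends on the rest, not just surface statistics
- Transfer works because a hard pretext task leaves few shortcuts, so the features that solve it tend to be general enough to reuse on other tasks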
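To make "labels from structure" concrete, here is a minimal sketch of how an unlabeled sentence becomes a supervised example. The names `MASK` and `make_masked_example` are hypothetical, and the masking rule is deliberately simplified: every selected token is replaced by the mask symbol, whereas BERT also sometimes keeps the token or swaps in a random one.

```python
import random

MASK = "[MASK]"  # placeholder symbol; chosen for this sketch

def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Turn an unlabeled token list into (inputs, targets).

    inputs  : the sequence with some tokens hidden
    targets : dict position -> original token (the "free" labels)
    """
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            targets[i] = tok      # the hidden token becomes the label
        else:
            inputs.append(tok)
    return inputs, targets

tokens = "the cat sat on the mat".split()
inputs, targets = make_masked_example(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(targets)
```

No human annotation is involved: the targets are just the parts of the input that were hidden.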
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Self-supervised learning creates its supervision from the data itself:
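Masked Language Modeling (BERT):
\mathcal{L}_{\text{MLM}} = -\mathbb{E}\left[\sum_{i \in M} \log p(x_i \mid x_{\text{visible}})\right]
where M is the set of masked positions and x_{\text{visible}} is the unmasked remainder of the sequence.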
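As a rough illustration of the masked language modeling loss, the sketch below sums the negative log-probability the model assigns to the true token at each masked position. The function `mlm_loss` and the dictionary-based `probs` format are assumptions made for this example. Real implementations work on logit tensors and usually average over masked positions rather than sum.

```python
import math

def mlm_loss(probs, tokens, masked_positions):
    """Sum of -log p(true token) over masked positions.

    probs[i] is a dict mapping each vocabulary token to the model's
    predicted probability at position i (hypothetical format).
    """
    return -sum(math.log(probs[i][tokens[i]]) for i in masked_positions)

tokens = ["the", "cat", "sat"]
masked_positions = [1]                      # only "cat" was hidden
probs = {1: {"cat": 0.5, "dog": 0.5}}       # toy model output at position 1
loss = mlm_loss(probs, tokens, masked_positions)
print(loss)
```

Positions that were not masked contribute nothing to the loss, which is why only a fraction of each sequence produces a training signal per pass.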
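Next Token Prediction (GPT):
\mathcal{L}_{\text{NTP}} = -\mathbb{E}\left[\sum_{t} \log p(x_t \mid x_{<t})\right]
where x_{<t} denotes all tokens before position t.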
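For next-token prediction, every prefix of a sequence yields one training pair, and the loss is the summed negative log-probability of each actual next token given its prefix. The sketch below assumes a hypothetical `predict(context)` callable that returns a token-to-probability dict. `uniform_model` is a stand-in, not a real language model, and the first token is skipped because no start-of-sequence symbol is used here.

```python
import math

def next_token_pairs(tokens):
    """Each prefix x_<t paired with the token x_t that follows it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def next_token_loss(tokens, predict):
    """Sum over t of -log p(x_t | x_<t)."""
    return -sum(math.log(predict(ctx)[tgt])
                for ctx, tgt in next_token_pairs(tokens))

def uniform_model(context, vocab=("a", "b", "c", "d")):
    """Toy model: ignores the context and spreads probability evenly."""
    return {tok: 1.0 / len(vocab) for tok in vocab}

seq = ["a", "b", "c"]
pairs = next_token_pairs(seq)
loss = next_token_loss(seq, uniform_model)
print(pairs)
print(loss)
```

Unlike masking, this setup gets a prediction target from almost every position in the sequence.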
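Contrastive (SimCLR, CLIP):
\mathcal{L}_{\text{contrastive}} = -\mathbb{E}\left[\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}\right]
where (z_i, z_j) is a positive pair (two views of the same example, or a matching image and caption), the sum runs over the positive and all negatives, \mathrm{sim} is a similarity such as cosine, and \tau is a temperature.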
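A contrastive objective can be sketched in its generic InfoNCE form: pull an anchor embedding toward its positive and away from the negatives by treating the positive as the correct class in a softmax over similarities. The helper names `cosine` and `info_nce` are hypothetical. SimCLR and CLIP use batched variants of this idea, and their details (in-batch negatives, symmetric terms) are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax probability of the positive among positive + negatives."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos

# Positive aligned with the anchor: low loss.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Positive orthogonal, negative aligned: high loss.
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good, bad)
```

The supervision again comes from the data: which pairs count as positive is determined by how the views were constructed, not by human labels.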