Legacy Concept Lab
Contrastive Learning & InfoNCE
#49 · Contrastive · Representations
key equation
$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}$$
Phase 5: Representation & interpretability
Concept 49 of 100
Why It Matters for Modern Models
- Powers CLIP, which enabled zero-shot image classification and whose embeddings are used to guide text-to-image generation
- Self-supervised learning breakthrough: learned features competitive with supervised ImageNet pretraining, without labels
- Same framework underlies sentence embeddings, audio-text models, and multimodal foundation models
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Temperature τ controls "hardness": low τ → focuses on hard negatives, high τ → uniform over negatives
- Batch size matters: more negatives = tighter InfoNCE bound on mutual information (the bound cannot exceed log N for N candidates) = better representations in practice
- False negatives (same-class samples treated as negatives) hurt less than expected; contrastive learning is fairly robust to them
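The temperature point above can be checked numerically. The sketch below is plain Python rather than a tensor library, and the similarity values are made up for illustration. It computes a softmax over the negatives only, which gives each negative's relative share of the gradient from the InfoNCE denominator.

```python
import math

def negative_weights(sims, tau):
    """Softmax over negative similarities at temperature tau.

    The result is each negative's relative share of the gradient
    coming from the InfoNCE denominator.
    """
    scaled = [s / tau for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and four negatives.
# The first entry is a "hard" negative: it looks a lot like the anchor.
neg_sims = [0.9, 0.5, 0.1, -0.3]

low_tau = negative_weights(neg_sims, tau=0.05)   # mass concentrates on the hard negative
high_tau = negative_weights(neg_sims, tau=5.0)   # mass spreads out almost evenly

print([round(w, 3) for w in low_tau])
print([round(w, 3) for w in high_tau])
```

With a small τ nearly all of the weight lands on the most similar negative, while a large τ treats the negatives almost interchangeably.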
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
InfoNCE loss maximizes agreement between positive pairs while pushing negatives apart:
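$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}$$
where $z_j^+$ is a positive (an augmented view of the same input as $z_i$), the other $z_k$ are negatives, $\text{sim}$ is a similarity function (typically cosine similarity), and $\tau$ is the temperature.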
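As a rough illustration of the loss for a single anchor, here is a minimal sketch in plain Python. It assumes cosine similarity for sim and a denominator that sums over the positive plus all negatives. The function names and the toy 2-D vectors are hypothetical and chosen only for readability.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE for one anchor: negative log-softmax score of the positive.

    Assumption: the denominator includes the positive and every negative.
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]

# Positive close to the anchor, negatives far away: loss is near zero.
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])

# Positive far from the anchor, one negative close to it: loss is large.
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])

print(aligned, misaligned)
```

When every candidate is equally similar to the anchor, the loss equals the log of the number of candidates, which is a handy sanity check for any implementation.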
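CLIP extends this to image-text pairs:
$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{image} \to \text{text}} + \mathcal{L}_{\text{text} \to \text{image}}\right)$$
where each term is an InfoNCE loss over the batch, matching images to their captions and vice versa.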
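The two-directional matching can be sketched as a symmetric loss over an image-by-text similarity matrix, where entry (i, j) compares image i with caption j and the diagonal holds the true pairs. This is a simplified illustration, not CLIP's actual code: the helper names are hypothetical, the similarity values are invented, and τ is fixed here even though CLIP learns its temperature during training.

```python
import math

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # log-sum-exp trick for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_style_loss(sim, tau=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    sim[i][j] is the similarity between image i and caption j.
    Assumption: tau is a fixed constant in this sketch.
    """
    logits = [[s / tau for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_rows(logits)      # each image picks its caption
    text_to_image = cross_entropy_rows(transposed)  # each caption picks its image
    return 0.5 * (image_to_text + text_to_image)

# Hypothetical similarities for a batch of three image-caption pairs.
matched = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
# Same scores with the rows rotated, so the diagonal no longer holds the true pairs.
shuffled = [matched[1], matched[2], matched[0]]

print(clip_style_loss(matched), clip_style_loss(shuffled))
```

The loss is low when the largest entry of every row and column sits on the diagonal, and rises sharply once the pairing is scrambled.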