Legacy Concept Lab

Contrastive Learning & InfoNCE

Phase 5: Representation & interpretability, Concept 49 of 100

Why It Matters for Modern Models

Powers CLIP, which enabled zero-shot image classification and supplies the text embeddings used to condition text-to-image models
Self-supervised learning breakthrough: SimCLR (Chen et al., 2020) learned features without labels that approach supervised accuracy on ImageNet under linear evaluation
The same framework underlies sentence embeddings, audio-text models, and multimodal foundation models

What is still poorly explained in textbooks and papers:

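Temperature τ controls "hardness": a low τ concentrates the gradient on hard negatives (those most similar to the anchor), while a high τ weights all negatives almost uniformly
Batch size matters: with N candidates the InfoNCE bound on mutual information is capped at log N, so more negatives tighten the bound and, in practice, tend to give better representations
False negatives (samples of the same class treated as negatives) hurt less than expected; contrastive training is fairly robust to them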
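
The temperature point above can be checked numerically. The sketch below is an illustration, not taken from any particular paper's code: `negative_weights` is a hypothetical helper, and the similarity values are made up. It computes a softmax over the negatives' similarities, which gives each negative's relative share of the gradient.

```python
import math

def negative_weights(sims, tau):
    """Softmax over negative similarities at temperature tau.

    Each weight is that negative's relative share of the gradient
    among the negatives (larger weight = pushed away harder).
    """
    scaled = [s / tau for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and four negatives,
# ordered from hardest (most similar) to easiest.
neg_sims = [0.9, 0.5, 0.1, -0.3]

low_tau = negative_weights(neg_sims, tau=0.05)
high_tau = negative_weights(neg_sims, tau=5.0)

print("tau=0.05:", [round(w, 3) for w in low_tau])
print("tau=5.0: ", [round(w, 3) for w in high_tau])
```

With the low temperature almost all of the weight lands on the hardest negative; with the high temperature the four weights are close to uniform.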

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

The InfoNCE loss pulls each anchor toward its positive and pushes it away from the negatives:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k) / \tau)}$$

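where $z_i$ is the anchor embedding, $z_j^+$ is its positive (for example, an augmented view of the same input), the remaining $z_k$ in the sum are negatives, $\text{sim}$ is a similarity function (cosine similarity in SimCLR), and $\tau$ is the temperature.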
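
As a minimal sketch of the loss for a single anchor, assuming cosine similarity and plain Python lists for embeddings: the function names `cosine` and `info_nce`, the default `tau=0.1`, and the toy vectors are all illustrative choices, not something the source prescribes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, candidates, positive_index, tau=0.1):
    """InfoNCE for one anchor: -log softmax probability of the positive.

    `candidates` holds the positive and all negatives; `positive_index`
    says which entry is the positive.
    """
    logits = [cosine(anchor, c) / tau for c in candidates]
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]

anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

# Positive is the candidate that actually resembles the anchor: small loss.
loss_good = info_nce(anchor, candidates, positive_index=0)
# Positive is the candidate pointing the opposite way: large loss.
loss_bad = info_nce(anchor, candidates, positive_index=2)
print(loss_good, loss_bad)
```

When every candidate is equally similar to the anchor, the loss equals $\log N$, which is the value an untrained encoder starts near.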

CLIP extends this to image-text pairs:

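$$\mathcal{L}_{CLIP} = \frac{1}{2}\left( \mathcal{L}_{i2t} + \mathcal{L}_{t2i} \right)$$

where $\mathcal{L}_{i2t}$ matches each image to its caption and $\mathcal{L}_{t2i}$ matches each caption to its image, with the other pairs in the batch serving as negatives.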
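
The symmetric loss can be sketched from a batch similarity matrix, where entry `sim[i][j]` is the similarity between image `i` and caption `j` and the matched pairs sit on the diagonal. This is an illustrative sketch, not CLIP's actual code: `clip_loss` and `log_sum_exp` are hypothetical names, and the temperature is held fixed here for simplicity, whereas CLIP learns it during training.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def clip_loss(sim, tau=0.07):
    """Symmetric contrastive loss over an n x n image-text similarity matrix.

    Rows index images, columns index captions; pair (i, i) is the match.
    """
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    # Image-to-text: each row is a softmax over captions.
    i2t = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n)) / n
    # Text-to-image: each column is a softmax over images.
    t2i = sum(
        log_sum_exp([logits[i][j] for i in range(n)]) - logits[j][j]
        for j in range(n)
    ) / n
    return 0.5 * (i2t + t2i)

# Toy batches: perfectly aligned pairs versus pairs matched to the wrong item.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(clip_loss(aligned), clip_loss(shuffled))
```

The aligned batch gives a loss near zero, while the shuffled batch is penalized in both directions, since each image prefers the wrong caption and each caption prefers the wrong image.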

Chen et al., 2020, ICML

Radford et al., 2021, ICML
