Legacy Concept Lab

Contrastive Learning & InfoNCE

Concept 49 of 100 · Phase 5: Representation & interpretability
Key equation: $\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}$

Why It Matters for Modern Models

  • Powers CLIP, which enabled zero-shot image classification and text-to-image generation conditioned on its embeddings
  • Self-supervised learning breakthrough: learns features competitive with supervised ImageNet pretraining, without labels
  • Same framework underlies sentence embeddings, audio-text models, and multimodal foundation models

What Tutorials Skip

What is still poorly explained in textbooks and papers:

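  • Temperature τ controls "hardness": low τ concentrates the loss on the hardest (most similar) negatives, while high τ weights all negatives almost uniformly
  • Batch size matters: with N candidates, InfoNCE lower-bounds mutual information by at most log N, so more negatives tighten the bound and usually yield better representations
  • False negatives (same-class samples treated as negatives) hurt less than expected; contrastive learning is fairly robust to them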
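
The temperature point can be made concrete with a small sketch. In InfoNCE, the push applied to each negative is proportional to its share of the softmax, so looking at that share across negatives shows where the loss concentrates. The function name `negative_weights` and the similarity values below are illustrative assumptions, not taken from any paper or library.

```python
import math

def negative_weights(similarities, tau):
    """Relative weight each negative receives at temperature tau.

    This is a softmax over the negatives' similarities divided by tau.
    Illustrative helper only (hypothetical name).
    """
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and four negatives,
# ordered from hardest (most similar) to easiest.
sims = [0.9, 0.5, 0.1, -0.3]

for tau in (0.05, 0.5, 10.0):
    weights = negative_weights(sims, tau)
    print(f"tau={tau}: " + ", ".join(f"{w:.3f}" for w in weights))
```

At the lowest temperature nearly all of the weight lands on the hardest negative, and at the highest temperature the four weights are close to equal.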

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}$$

InfoNCE loss maximizes agreement between positive pairs while pushing negatives apart:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k) / \tau)}$$

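where $z_j^+$ is the positive (an augmented view of the same input as $z_i$), the other $z_k$ are negatives, $\text{sim}$ is a similarity score (commonly cosine similarity), and $\tau$ is the temperature.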
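
As a minimal sketch of the loss for a single anchor, the snippet below computes the negative log-softmax of the positive among N candidates. It assumes cosine similarity for sim and a fixed temperature. The names `cosine` and `info_nce` and the toy vectors are hypothetical. It is written in plain Python so it runs without dependencies; real training code would use a tensor library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, candidates, positive_index, tau=0.1):
    """InfoNCE loss for one anchor.

    `candidates` holds all N vectors z_k (the positive plus the negatives);
    `positive_index` marks which one is z_j^+.
    """
    logits = [cosine(anchor, c) / tau for c in candidates]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # -log(exp(pos) / sum_k exp(logit_k)) = log_denominator - pos
    return log_denominator - logits[positive_index]

# Toy example: the first candidate is the positive, the rest are negatives.
anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(info_nce(anchor, candidates, positive_index=0))
```

When every candidate is equally similar to the anchor, the loss equals log N, which is the value an untrained encoder tends to start near.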

CLIP extends this to image-text pairs:

$$\mathcal{L}_{CLIP} = \frac{1}{2}\left( \mathcal{L}_{i2t} + \mathcal{L}_{t2i} \right)$$

matching images to their captions and vice versa in a batch.
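
The symmetric loss can be sketched for a small batch as follows, assuming `image_embs[i]` and `text_embs[i]` form a matched pair and that all embeddings are already L2-normalized, so a dot product equals cosine similarity. The function names are hypothetical, and the temperature is held fixed here for simplicity, whereas CLIP learns it during training.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def clip_loss(image_embs, text_embs, tau=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Assumes unit-norm embeddings and that index i in both lists is a pair.
    """
    n = len(image_embs)
    # logits[i][j] = sim(image_i, text_j) / tau
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / tau for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: in row i, the correct caption is column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: transpose so each row is one caption against all images.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of two pairs with orthogonal, unit-norm embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(clip_loss(images, texts))        # matched pairs
print(clip_loss(images, texts[::-1]))  # captions swapped
```

Swapping the captions makes the loss much larger, since each image is then most similar to the wrong text.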

Canonical Papers

A Simple Framework for Contrastive Learning of Visual Representations

Chen et al., 2020, ICML

Learning Transferable Visual Models From Natural Language Supervision

Radford et al., 2021, ICML
