32 · Representations

🖼️ Multimodal Foundations: Vision Encoders, Contrastive Learning & Cross-Attention Fusion

Canonical Papers

Learning Transferable Visual Models From Natural Language Supervision

Radford et al. · 2021 · ICML

SigLIP 2: Multilingual Vision-Language Encoders

Tschannen et al. · 2025 · arXiv

Contrastive Localized Language-Image Pre-Training

Chen et al. · 2024 · arXiv

Core Mathematics

Vision-Language Pretraining (VLP) trains an image encoder and a text encoder so that images and text land in a shared embedding space, which is what enables multimodal LLMs and diffusion conditioning.

Dual-encoder similarity (normalized embeddings):

u_i = \frac{f_I(I_i)}{\|f_I(I_i)\|}, \quad v_j = \frac{f_T(T_j)}{\|f_T(T_j)\|}, \quad s_{ij} = \frac{u_i^\top v_j}{\tau}

Temperature-scaled cosine similarity between image and text embeddings.
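
A minimal PyTorch sketch of this computation; the function name, tensor shapes, and the τ = 0.07 default are illustrative assumptions, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(image_feats: torch.Tensor, text_feats: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Temperature-scaled cosine similarity s_ij = u_i . v_j / tau."""
    u = F.normalize(image_feats, dim=-1)  # u_i = f_I(I_i) / ||f_I(I_i)||
    v = F.normalize(text_feats, dim=-1)   # v_j = f_T(T_j) / ||f_T(T_j)||
    return u @ v.T / tau                  # (N, N) matrix; row i compares image i to every text
```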

CLIP-style contrastive loss (InfoNCE / symmetric cross-entropy):

\mathcal{L} = \frac{1}{2}\left( -\log\frac{e^{s_{ii}}}{\sum_j e^{s_{ij}}} - \log\frac{e^{s_{ii}}}{\sum_j e^{s_{ji}}} \right)

Maximize diagonal (matched pairs), minimize off-diagonal (negatives)—contrastive learning.
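
A minimal sketch of the symmetric loss in PyTorch, building the similarity matrix as above; `clip_loss` and its defaults are illustrative, not CLIP's actual training code:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats: torch.Tensor, text_feats: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matched pairs sit on the diagonal of s."""
    u = F.normalize(image_feats, dim=-1)
    v = F.normalize(text_feats, dim=-1)
    s = u @ v.T / tau                                    # (N, N) logits
    targets = torch.arange(s.size(0), device=s.device)   # positive index for row i is i
    loss_i2t = F.cross_entropy(s, targets)               # image -> text (rows of s)
    loss_t2i = F.cross_entropy(s.T, targets)             # text -> image (columns of s)
    return 0.5 * (loss_i2t + loss_t2i)
```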

Cross-attention fusion (text attends to vision tokens):

\text{CrossAttn}(H_T, H_I) = \text{softmax}\!\left(\frac{(H_T W_Q)(H_I W_K)^\top}{\sqrt{d}}\right)(H_I W_V)

Enables grounding—text tokens can attend to image patches for dense understanding.
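
A single-head sketch of this fusion block in PyTorch; real systems use multi-head attention with normalization and residual connections, and `CrossAttentionFusion` is an illustrative name:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Single-head cross-attention: text hidden states query vision tokens."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q applied to text
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K applied to vision
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V applied to vision
        self.scale = d_model ** -0.5

    def forward(self, h_text: torch.Tensor, h_image: torch.Tensor) -> torch.Tensor:
        # h_text: (B, T_text, d), h_image: (B, T_image, d)
        q = self.w_q(h_text)
        k = self.w_k(h_image)
        v = self.w_v(h_image)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, T_text, T_image)
        return attn @ v  # each text token becomes a mixture of image-patch values
```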

Key Equation
\mathcal{L} = -\log\frac{e^{s_{ii}}}{\sum_j e^{s_{ij}}} - \log\frac{e^{s_{ii}}}{\sum_j e^{s_{ji}}}

Why It Matters for Modern Models

  • CLIP-style encoders are the front end for multimodal LLMs (how images become tokens) and the conditioning backbone for text-to-image diffusion
  • SigLIP 2 (2025): a unified training recipe that extends image-text learning with captioning, self-supervised losses, and online data curation, yielding better encoders for VLMs
  • CLOC (2024): adds region-level objectives for localization and dense features; contrastive training alone does not give dense understanding, so explicit losses are needed
  • After diffusion (#9) and representations (#10), this completes the bridge: how text embeddings condition generative vision models
  • Opens multimodal arc: sets up VLM safety/robustness, grounding evals, multimodal RLHF as natural next concepts

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Contrastive ≠ dense understanding—CLIP-like training yields global semantics, localization/dense features require extra losses or architecture
  • Negatives and batch size matter—contrastive training shaped by similarity matrix, diagonal dominance emerges from large batches
  • Fusion choice is capability choice—dual-encoder retrieval is cheap, cross-attention fusion enables richer grounding but costs compute
  • Temperature τ controls sharpness: low τ makes the softmax peaked (hard negatives dominate), high τ smooths it (easier learning but less discriminative); see the numeric sketch after this list
  • Shared semantic space is learned geometry—images and text don't naturally align, contrastive loss pulls matched pairs together, pushes mismatches apart
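
A small numeric illustration of the temperature point above; the similarity values are made up purely to show how τ reshapes the softmax:

```python
import torch

sims = torch.tensor([0.9, 0.7, 0.2])  # one matched pair and two negatives (made-up values)

for tau in (1.0, 0.1, 0.01):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")
# tau=1.0  -> probabilities stay spread out (weak gradient signal)
# tau=0.1  -> the matched pair clearly dominates
# tau=0.01 -> nearly one-hot; the loss concentrates on the hardest negatives
```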
