Multimodal Foundations: Vision Encoders, Contrastive Learning & Cross-Attention Fusion
Canonical Papers
Learning Transferable Visual Models From Natural Language Supervision
SigLIP 2: Multilingual Vision-Language Encoders
Contrastive Localized Language-Image Pre-Training
Core Mathematics
Vision-Language Pretraining (VLP) trains an image encoder and a text encoder so that images and text land in a shared embedding space, enabling multimodal LLMs and diffusion conditioning.
Dual-encoder similarity (normalized embeddings):
Temperature-scaled cosine similarity between image and text embeddings.
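In symbols (one standard form, assuming an image encoder f, a text encoder g, L2-normalized embeddings, and a temperature τ):

```latex
s_{ij} = \frac{v_i^{\top} u_j}{\tau},
\qquad
v_i = \frac{f(x_i)}{\lVert f(x_i) \rVert_2},
\quad
u_j = \frac{g(y_j)}{\lVert g(y_j) \rVert_2}
```

where x_i is the i-th image and y_j the j-th caption in the batch.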
CLIP-style contrastive loss (InfoNCE / symmetric cross-entropy):
Maximize diagonal (matched pairs), minimize off-diagonal (negatives)—contrastive learning.
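Concretely, the standard symmetric form over a batch of N matched pairs, using the similarities s_ij defined above:

```latex
\mathcal{L}
= \frac{1}{2N} \sum_{i=1}^{N} \left[
    -\log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ij}}}
    \;-\; \log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ji}}}
  \right]
```

A minimal NumPy sketch of the same computation (a sketch only, assuming pre-computed, row-wise L2-normalized embeddings; names like `clip_symmetric_loss` are illustrative):

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)                    # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_symmetric_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of N matched (image, text) pairs.
    img_emb, txt_emb: [N, d] arrays, L2-normalized along the last axis."""
    n = img_emb.shape[0]
    logits = img_emb @ txt_emb.T / tau                         # [N, N]; diagonal holds matched pairs
    idx = np.arange(n)
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # each image vs. all captions in the batch
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # each caption vs. all images in the batch
    return 0.5 * (loss_i2t + loss_t2i)
```

In CLIP itself the temperature is learned jointly with the encoders (as a logit-scale parameter) rather than fixed; it is hardcoded here only to keep the sketch small.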
Cross-attention fusion (text attends to vision tokens):
Enables grounding—text tokens can attend to image patches for dense understanding.
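A minimal single-head sketch of this fusion step; the projection matrices Wq, Wk, Wv stand in for the learned weights inside a cross-attention layer and are passed in explicitly only for illustration:

```python
import numpy as np

def cross_attention(text_h, img_tokens, Wq, Wk, Wv):
    """Text tokens (queries) attend to image patch tokens (keys/values).
    text_h:     [T, d_txt] text hidden states
    img_tokens: [P, d_img] vision-encoder patch embeddings
    Wq: [d_txt, d_k]; Wk, Wv: [d_img, d_k] projection matrices."""
    Q = text_h @ Wq                                  # [T, d_k] queries from text
    K = img_tokens @ Wk                              # [P, d_k] keys from image patches
    V = img_tokens @ Wv                              # [P, d_k] values from image patches
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # [T, P] scaled text-to-patch logits
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)             # softmax over image patches
    return w @ V                                     # [T, d_k] image-conditioned text features
```

Each text token receives its own weighted mixture of patch features, which is what lets a word attend to the specific region it refers to.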
Why It Matters for Modern Models
- CLIP-style encoders are the front end for multimodal LLMs (how images become tokens) and the conditioning backbone for text-to-image diffusion
- SigLIP 2 (2025): unified training recipe extending image-text contrastive training with captioning, self-supervised losses, and online data curation, yielding better encoders for VLMs
- CLOC (2024): adds region-level objectives for localization/dense features—contrastive ≠ dense understanding, need explicit losses
- After diffusion (#9) and representations (#10), this completes the bridge: how text embeddings condition generative vision models
- Opens the multimodal arc: sets up VLM safety/robustness, grounding evals, and multimodal RLHF as natural next concepts
Missing Intuition
What is still poorly explained in textbooks and papers:
- Contrastive ≠ dense understanding—CLIP-like training yields global semantics, localization/dense features require extra losses or architecture
- Negatives and batch size matter—contrastive training shaped by similarity matrix, diagonal dominance emerges from large batches
- Fusion choice is capability choice—dual-encoder retrieval is cheap, cross-attention fusion enables richer grounding but costs compute
- Temperature τ controls sharpness—low τ makes the softmax peaked (emphasizing hard negatives), high τ smooths it (easier learning but less discriminative); see the numeric sketch after this list
- Shared semantic space is learned geometry—images and text don't naturally align, contrastive loss pulls matched pairs together, pushes mismatches apart
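A tiny numeric sketch of the temperature bullet above, using made-up cosine similarities (one positive, three negatives):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                     # stable softmax
    return e / e.sum()

sims = np.array([0.9, 0.7, 0.2, -0.1])          # made-up similarities: index 0 is the matched pair
for tau in (0.05, 0.5):
    print(f"tau={tau}: {np.round(softmax(sims / tau), 3)}")
# Low tau concentrates nearly all probability on the top match (sharp, hard-negative-sensitive);
# high tau spreads it across candidates (smoother gradients, less discriminative).
```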