Vision-Language · 2021

CLIP

Contrastive Language-Image Pre-training

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Paper Overview

CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision, enabling zero-shot transfer to downstream tasks without any task-specific training data. Published by OpenAI in January 2021, it fundamentally changed how the field thinks about connecting vision and language.

The core idea replaces hand-labeled datasets (like ImageNet's 1.4M images across 1,000 classes, which took over 2.5 years of human annotation) with 400 million image-text pairs collected from the internet into a dataset called WebImageText (WIT). A dual-encoder architecture learns to align images and their captions in a shared 512-dimensional embedding space via a symmetric contrastive objective called InfoNCE. The image encoder (ViT-L/14 with 428M parameters) and the text encoder (a 63M-parameter Transformer) are trained jointly but process their inputs independently, producing embeddings that are compared via cosine similarity.

Key advances:

Contrastive pre-training: Match images to captions in an NxN similarity matrix — N² training signal from just N examples. With batch size 32,768, each example is compared against 32,767 negatives per step
Dual encoder architecture: Separate image encoder (ViT-L/14: 24 layers, 1024-d, 16 heads) and text encoder (12 layers, 512-d, 8 heads) mapping to a shared 512-d space via learned linear projections
Zero-shot transfer: Classify any image using text prompts like "a photo of a {class}" — no labeled training data needed. Matches a fully supervised ResNet-50 on ImageNet at 76.2% top-1 accuracy
Prompt engineering: 80 text templates per class, ensembled by averaging embeddings, improve accuracy by +3.5% (68.7% to 72.2%) through disambiguation and distribution coverage
Scaling: Performance improves log-linearly with both compute and data. ViT-L/14 achieves 75.5% zero-shot on ImageNet; with 336px resolution, 76.2%. No saturation observed at 400M pairs
Robust representations: Strong transfer across 27 diverse datasets spanning fine-grained recognition, scene classification, action recognition, satellite imagery, OCR, and texture classification
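The zero-shot and prompt-ensembling ideas in the list above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not CLIP's code: `fake_text_encoder` is a hypothetical stand-in that returns a deterministic pseudo-random unit vector for each prompt instead of running a trained text encoder, and the templates and class names are made up for the example.

```python
import zlib
import numpy as np

EMBED_DIM = 512  # shared embedding size as stated in the text above

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fake_text_encoder(prompt):
    """Hypothetical stand-in for a trained text encoder: returns a
    deterministic pseudo-random unit vector for each prompt string."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode("utf-8")))
    return l2_normalize(rng.standard_normal(EMBED_DIM))

def build_zero_shot_classifier(class_names, templates):
    """Average the prompt embeddings of each class and re-normalize.
    The result has one unit-norm row per class."""
    rows = []
    for name in class_names:
        embs = np.stack([fake_text_encoder(t.format(name)) for t in templates])
        rows.append(l2_normalize(embs.mean(axis=0)))
    return np.stack(rows)  # [num_classes, EMBED_DIM]

def classify(image_embedding, class_weights):
    """Return the index of the class with the highest cosine similarity."""
    sims = class_weights @ l2_normalize(image_embedding)
    return int(np.argmax(sims)), sims

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]
class_names = ["dog", "cat", "car"]
W = build_zero_shot_classifier(class_names, templates)

# Pretend image embedding: the "cat" centroid plus a little noise.
noise = np.random.default_rng(0).standard_normal(EMBED_DIM)
pred, sims = classify(W[1] + 0.01 * noise, W)
print(class_names[pred], np.round(sims, 3))
```

Adding a class only requires appending one more row to `W`; no gradient step is involved, which is the sense in which the text embeddings act as a classifier.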

CLIP's zero-shot classifier matches a fully supervised ResNet-50 on ImageNet (76.2% top-1 accuracy) without training on any of ImageNet's labeled examples. On 16 of 27 evaluation datasets, zero-shot CLIP outperforms a supervised linear probe trained on ResNet-50 features. The learned embeddings became the foundation for DALL-E 2, Stable Diffusion, and many subsequent vision-language models. Training the largest ViT took approximately 12 days on 256 V100 GPUs, processing 12.8 billion image-text pairs (400M pairs seen over 32 epochs).

Chapter Roadmap

Contrastive Pre-training

Symmetric InfoNCE over an N×N image-text similarity matrix — diagonals pulled together, off-diagonals pushed apart.

Dual Encoder Architecture

Separate image and text towers project into a unified L2-normalized embedding space — cosine similarity becomes the classifier.

Contrastive loss needs an architecture to embed both modalities

Zero-Shot Transfer

Text embeddings of class prompts act as a zero-gradient linear classifier — new classes added for free.

Prompt Engineering & Ensembling

Averaging multiple prompt templates per class estimates a denoised class centroid — a free ~3.5% accuracy boost.

Shared embedding space unlocks zero-shot and prompt tricks

Representation Quality

Frozen CLIP features outperform supervised features on 20+ downstream tasks under linear probing.

Data & Scaling

Error follows a power law in compute — 400M pairs + large batches are what unlock zero-shot generalization.

CLIP learns visual concepts by training on 400M image-text pairs from the web, using a symmetric contrastive loss (InfoNCE) that maximizes cosine similarity between matched image-text pairs and minimizes it for all mismatched combinations within each batch of 32,768 examples.

The Problem

The labeling bottleneck:

Supervised learning requires manually labeled datasets. Building ImageNet took roughly 2.5 years and about 49,000 workers on Amazon Mechanical Turk; the full dataset covers about 14M images, and the widely used 1,000-class benchmark subset contains about 1.4M of them. Scaling to more classes or domains requires proportionally more labels, and the label taxonomy itself is a bottleneck. ImageNet's 1,000 classes are a specific, somewhat arbitrary slice of visual concepts (roughly 120 dog breeds but no "sunset" or "protest").

Why existing approaches fall short:

Supervised classification: Fixed set of classes defined at training time. Adding a new class means collecting and labeling more data, redefining the softmax head, and retraining. Each dataset encodes a narrow taxonomy that limits what the model can recognize.
Captioning models (VirTex): Predict exact captions word-by-word using autoregressive decoding. This is expensive per example (sequential token generation) and fragile — small vocabulary changes or paraphrasing break the training signal. VirTex trained on only 118K COCO images and showed limited transfer.
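ICMLM (Image-Conditioned Masked Language Model): Used a BERT-style masked language modeling objective conditioned on images. Trained on Conceptual Captions (3.3M pairs), it was evaluated mainly through transfer of learned features rather than zero-shot classification. The strongest prior zero-shot result cited in the CLIP paper, Visual N-Grams, reached only 11.5% accuracy on ImageNet, far below usable performance.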
Bag-of-words approaches: Predict which words appear in the caption (multi-label classification over a fixed vocabulary). Better data efficiency than captioning but limited to a fixed vocabulary and can't capture word order or compositional meaning.

The data efficiency problem:

With N image-text pairs, a generative captioning model gets exactly N training examples — one gradient signal per pair. A bag-of-words model gets slightly more by predicting multiple words, but still N examples. For contrastive learning, every image is compared against every text in the batch, giving N² pairwise comparisons from just N examples. With a batch size of 32,768, that is over 1 billion comparisons per batch — an extraordinary increase in learning signal per data point.
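The arithmetic behind the N² claim is easy to check. The helper below is a small illustrative function (its name is an assumption, not something from the paper) that counts total, positive, and negative pairs for a given batch size.

```python
def comparison_counts(batch_size):
    """Count pairwise comparisons in one N x N similarity matrix."""
    total = batch_size * batch_size   # every image against every text
    positives = batch_size            # the diagonal (matched pairs)
    negatives = total - positives     # all off-diagonal entries
    return total, positives, negatives

for n in (256, 4096, 32768):
    total, pos, neg = comparison_counts(n)
    print(f"N={n:>6}: {total:>13,} pairs, {pos:>6,} positives, {neg:>13,} negatives")
```

Note that the number of negatives seen by each individual example grows linearly (N - 1), while the total number of comparisons in the batch grows quadratically.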

The natural language advantage:

The internet contains billions of image-text pairs (alt text, captions, titles, descriptions). This supervision is free, diverse, and naturally describes visual concepts in open-ended language rather than a fixed label set. A single caption like "a golden retriever playing fetch on the beach at sunset" simultaneously teaches the model about dogs, beaches, activities, lighting conditions, and their co-occurrence patterns. The OpenAI team collected 400 million such pairs from publicly available sources to build WIT.

The Solution

Contrastive pre-training learns by matching images to their captions within a batch, treating all non-matching pairs as negatives:

Step 1: Sample a batch of N = 32,768 image-text pairs $(x_1, t_1), ..., (x_N, t_N)$

Each pair consists of an image and its associated caption from the WIT dataset. The large batch size is critical: it provides 32,767 negative examples per positive pair, forcing the model to make fine-grained distinctions.

Step 2: Encode each image and text independently through their respective encoders and linear projections:

Image path: $224 \times 224 \times 3$ image $\rightarrow$ ViT-L/14 (24 blocks, 1024-d) $\rightarrow$ [CLS] token $\in \mathbb{R}^{1024}$ $\rightarrow$ linear projection $W_i \in \mathbb{R}^{512 \times 1024}$ $\rightarrow$ L2-normalize: $z_i^{\text{img}} = \frac{W_i \cdot \text{ViT}(x_i)_{[\text{CLS}]}}{\|W_i \cdot \text{ViT}(x_i)_{[\text{CLS}]}\|} \in \mathbb{R}^{512}$

Text path: tokenized caption (max 76 BPE tokens) $\rightarrow$ Transformer (12 blocks, 512-d) $\rightarrow$ [EOS] token $\in \mathbb{R}^{512}$ $\rightarrow$ linear projection $W_t \in \mathbb{R}^{512 \times 512}$ $\rightarrow$ L2-normalize: $z_i^{\text{txt}} = \frac{W_t \cdot \text{Transformer}(t_i)_{[\text{EOS}]}}{\|W_t \cdot \text{Transformer}(t_i)_{[\text{EOS}]}\|} \in \mathbb{R}^{512}$

This produces two matrices of embeddings: $Z^{\text{img}} \in \mathbb{R}^{N \times 512}$ and $Z^{\text{txt}} \in \mathbb{R}^{N \times 512}$.
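The projection and normalization in Step 2 can be sketched as follows. The encoder outputs and the projection matrices here are random stand-ins (assumptions for illustration only); in the real model they come from the trained ViT, the trained Transformer, and the learned $W_i$ and $W_t$. The batch size is shrunk to 8 so the example runs instantly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                   # toy batch size (the text uses 32,768)
IMG_WIDTH, TXT_WIDTH, EMBED_DIM = 1024, 512, 512

# Random stand-ins for the encoder outputs ([CLS] and [EOS] features).
img_feats = rng.standard_normal((N, IMG_WIDTH))
txt_feats = rng.standard_normal((N, TXT_WIDTH))

# Random stand-ins for the learned linear projections W_i and W_t.
W_i = rng.standard_normal((EMBED_DIM, IMG_WIDTH)) / np.sqrt(IMG_WIDTH)
W_t = rng.standard_normal((EMBED_DIM, TXT_WIDTH)) / np.sqrt(TXT_WIDTH)

def project_and_normalize(feats, W):
    """Apply a linear projection, then scale each row to unit L2 norm."""
    z = feats @ W.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

Z_img = project_and_normalize(img_feats, W_i)   # [N, 512]
Z_txt = project_and_normalize(txt_feats, W_t)   # [N, 512]
print(Z_img.shape, Z_txt.shape)
print(np.round(np.linalg.norm(Z_img, axis=1), 6))  # every row has norm 1
```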

Step 3: Compute the NxN cosine similarity matrix scaled by a learned temperature: $s_{ij} = \frac{z_i^{\text{img}} \cdot z_j^{\text{txt}}}{\|z_i^{\text{img}}\| \, \|z_j^{\text{txt}}\|} \cdot \exp(\tau)$

Since embeddings are already L2-normalized, the dot product equals cosine similarity. The temperature $\tau$ is a learned log-parameterized scalar (initialized at $\log(1/0.07) \approx 2.66$) that controls the sharpness of the softmax distribution. It is clamped to prevent $\exp(\tau)$ from exceeding 100, which would cause training instability. The resulting matrix $S \in \mathbb{R}^{N \times N}$ has $N^2 = 1{,}073{,}741{,}824$ entries for $N = 32{,}768$.
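A minimal sketch of Step 3, assuming unit-norm embeddings as inputs. The function name and the choice to apply the clamp at the point of use are assumptions made for this example; the text only states that $\exp(\tau)$ is kept from exceeding 100.

```python
import numpy as np

MAX_SCALE = 100.0  # upper bound on exp(tau) described in the text

def similarity_logits(Z_img, Z_txt, tau):
    """Scaled cosine similarities: rows index images, columns index texts.
    Assumes both inputs already have unit-norm rows."""
    scale = min(float(np.exp(tau)), MAX_SCALE)
    return scale * (Z_img @ Z_txt.T)

def random_unit_rows(rng, n, d):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
Z_img = random_unit_rows(rng, 4, 512)
Z_txt = random_unit_rows(rng, 4, 512)

tau_init = np.log(1 / 0.07)             # initial value from the text
S = similarity_logits(Z_img, Z_txt, tau_init)
print(S.shape, round(float(np.exp(tau_init)), 2))

# With a very large tau the scale is capped at 100, so comparing an
# embedding with itself (cosine 1) gives a logit of exactly 100.
S_clamped = similarity_logits(Z_img, Z_img, tau=10.0)
print(np.round(np.diag(S_clamped), 1))
```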

Step 4: Symmetric contrastive loss (InfoNCE):

The diagonal entries $s_{ii}$ are the positive (matched) pairs. All off-diagonal entries are negatives. The loss treats each row as an N-way classification problem (which text matches this image?) and each column as the reverse (which image matches this text?):

$\mathcal{L}_{\text{img}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$

$\mathcal{L}_{\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$

$\mathcal{L} = \frac{1}{2}(\mathcal{L}_{\text{img}} + \mathcal{L}_{\text{txt}})$
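The three formulas above translate directly into NumPy. This is a sketch of the loss computation only (no gradients, no encoders); `clip_loss` is a name chosen for the example. Row-wise softmax gives the image-to-text term and column-wise softmax gives the text-to-image term.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(S):
    """Symmetric InfoNCE on an N x N logit matrix whose diagonal
    holds the matched (positive) pairs."""
    idx = np.arange(S.shape[0])
    loss_img = -log_softmax(S, axis=1)[idx, idx].mean()  # rows: image -> text
    loss_txt = -log_softmax(S, axis=0)[idx, idx].mean()  # columns: text -> image
    return 0.5 * (loss_img + loss_txt)

n = 16
# If every logit is equal, each direction is an N-way guess at chance,
# so the loss equals log(N).
print(clip_loss(np.zeros((n, n))), np.log(n))

# If matched pairs dominate their row and column, the loss approaches 0.
print(clip_loss(100.0 * np.eye(n)))
```

The two sanity checks bracket the loss: roughly $\log N$ for an untrained model and close to 0 for a perfectly aligned one.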

Why contrastive beats generative and bag-of-words:

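The authors compared three objectives in the paper's efficiency study: a Transformer language model that predicts the caption, a baseline that predicts a bag-of-words encoding of the caption, and the contrastive objective. The captioning model learned to recognize ImageNet classes about 3x slower than the bag-of-words baseline, and swapping the predictive objective for the contrastive one gave a further 4x improvement in the rate of zero-shot transfer to ImageNet. Predicting exact words is a harder task than necessary, because many different captions fit the same image, whereas the contrastive objective only asks which caption as a whole goes with which image. A pure bag-of-words target also discards word order, so it cannot tell "a dog biting a man" from "a man biting a dog".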

Why batch size matters:

With a batch size of 32,768, each example has 32,767 negatives. This massive negative set forces the model to learn fine-grained visual-semantic distinctions — it cannot succeed by learning coarse categories alone. The model must distinguish "a golden retriever on a beach" from "a labrador on a beach," "a golden retriever in a park," and 32,765 other captions. Smaller batch sizes (e.g., 256) dramatically reduce performance because the contrastive signal becomes too easy.

Key Points

400M image-text pairs from WebImageText (WIT) dataset — no manual labels needed, scraped from internet alt-text and captions

NxN batch yields N² comparisons: with batch size 32,768, that is over 1 billion pairwise comparisons per step

Symmetric InfoNCE loss: image-to-text + text-to-image cross-entropy, averaged. Each direction treats it as N-way classification

Learned temperature $\tau$ initialized at $\log(1/0.07)$, clamped to prevent $\exp(\tau) > 100$. Controls softmax sharpness: higher means more peaked distributions

Contrastive objective is about 4x more efficient at zero-shot ImageNet transfer than bag-of-words prediction, which is itself about 3x more efficient than generative captioning

Embeddings are L2-normalized before comparison, so dot product equals cosine similarity. The shared 512-d space holds both modalities

Mathematical Formulation

Symmetric Contrastive Loss (InfoNCE)

$\mathcal{L} = -\frac{1}{2N}\sum_i \left[\log \frac{\exp(s_{ii})}{\sum_j \exp(s_{ij})} + \log \frac{\exp(s_{ii})}{\sum_j \exp(s_{ji})}\right]$

For each of the N image-text pairs, the diagonal entry (matched pair) competes against all N-1 mismatched pairs in its row (image-to-text) and column (text-to-image). The loss is the average of both directions, so neither modality dominates. Minimizing it pushes matched pairs toward cosine similarity 1 and drives mismatched pairs toward lower similarity. With thousands of embeddings in a batch the mismatched pairs cannot all reach -1, so what matters is the margin between each matched pair and its negatives.

Temperature-Scaled Cosine Similarity

$s_{ij} = \cos(z_i^{\text{img}}, z_j^{\text{txt}}) \cdot \exp(\tau)$

Cosine similarity in [-1, 1] is scaled by $\exp(\tau)$ where $\tau$ is a learned scalar. At initialization ($\tau = \log(1/0.07)$), the scale factor is ~14.3, amplifying small cosine differences into large logit differences. This makes the softmax distribution sharper and the contrastive signal stronger. The temperature is learned end-to-end and clamped to prevent numerical instability.
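The effect of the scale factor is easy to see numerically. In the sketch below the cosine values are made up for illustration, with the matched pair at index 0; only the scale changes between rows of output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up cosine similarities for one image against four captions;
# index 0 is the matched caption.
cosines = np.array([0.30, 0.25, 0.20, 0.10])

for scale in (1.0, 1 / 0.07, 100.0):
    p = softmax(scale * cosines)
    print(f"scale={scale:6.2f}  p(match)={p[0]:.3f}  p(others)={np.round(p[1:], 3)}")
```

At scale 1 the four probabilities are nearly uniform even though the matched caption has the highest cosine; at the initial scale of about 14.3 the matched caption clearly leads; at the clamp value of 100 it takes almost all of the probability mass.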

Mathematical Intuition

Given a batch of $N$ image-text pairs, CLIP forms an $N \times N$ similarity matrix $S_{ij} = \cos(z_i^{\text{img}}, z_j^{\text{txt}}) \cdot \exp(\tau)$ and applies a symmetric InfoNCE loss. The diagonal (matched pairs) is pulled up; the $N^2 - N$ off-diagonal entries are pushed down. The scale $\exp(\tau)$ is the inverse of the softmax temperature: a larger scale (lower temperature) makes the model more discriminative but also less stable, which is why it is clamped at 100. When all logits are equal the loss is $\log N$, the chance level for $N$-way classification, so a larger batch makes the task harder: negatives per example grow linearly with $N$ and total comparisons grow as $N^2$.

Architecture

Data flow through contrastive pre-training (one step):

Batch of 32,768 image-text pairs

$\downarrow$

Image branch: $[32768, 224, 224, 3]$ $\xrightarrow{\text{patchify}}$ $[32768, 256, 1024]$ $\xrightarrow{\text{ViT-L/14 (24 blocks)}}$ $[32768, 1024]$ (CLS) $\xrightarrow{W_i}$ $[32768, 512]$ $\xrightarrow{\text{L2-norm}}$ $Z^{\text{img}}$

Text branch: $[32768, 76]$ tokens $\xrightarrow{\text{Transformer (12 blocks)}}$ $[32768, 512]$ (EOS) $\xrightarrow{W_t}$ $[32768, 512]$ $\xrightarrow{\text{L2-norm}}$ $Z^{\text{txt}}$

$\downarrow$

Similarity matrix: $S = Z^{\text{img}} \cdot (Z^{\text{txt}})^T \cdot \exp(\tau) \in \mathbb{R}^{32768 \times 32768}$

$\downarrow$

InfoNCE loss on rows (image-to-text) + columns (text-to-image)

$\downarrow$

Gradients flow back through both encoders and projections

The patch count follows from the input size: 224 / 14 = 16 patches per side, so 256 patch tokens, each embedded at the ViT-L width of 1024.
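The tensor shapes in this data flow can be derived from a handful of hyperparameters. The function below is a bookkeeping sketch (its name and dictionary keys are assumptions for the example); it computes the patch count from the image and patch size and also reports how large the full similarity matrix would be if stored in float32.

```python
def trace_shapes(batch=32768, image_size=224, patch_size=14,
                 vit_width=1024, text_len=76, text_width=512, embed_dim=512):
    """Return the shape of each intermediate tensor in one training step."""
    patches = (image_size // patch_size) ** 2   # patches per image
    return {
        "image_input":     (batch, image_size, image_size, 3),
        "patch_tokens":    (batch, patches, vit_width),
        "image_cls":       (batch, vit_width),
        "image_embedding": (batch, embed_dim),
        "text_tokens":     (batch, text_len),
        "text_eos":        (batch, text_width),
        "text_embedding":  (batch, embed_dim),
        "similarity":      (batch, batch),
    }

shapes = trace_shapes()
for name, shape in shapes.items():
    print(f"{name:16s} {shape}")

# Memory for the full similarity matrix at 4 bytes per float32 entry.
rows, cols = shapes["similarity"]
logits_gib = rows * cols * 4 / 2**30
print(f"similarity matrix: {logits_gib:.1f} GiB in float32")
```

The similarity matrix alone is several gigabytes at this batch size, which is one reason large-batch contrastive training is usually spread across many devices.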
