Contrastive Language-Image Pre-training
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision, enabling zero-shot transfer to downstream tasks without task-specific training data. Published by OpenAI in January 2021, it changed how the field approaches connecting vision and language.
The core idea replaces hand-labeled datasets (such as the ImageNet-1k benchmark, with roughly 1.28M training images across 1,000 classes, drawn from an ImageNet annotation effort that took over 2.5 years) with 400 million image-text pairs collected from the internet into a dataset called WebImageText (WIT). A dual-encoder architecture learns to align images and their captions in a shared embedding space (512-dimensional for ViT-B/32, 768-dimensional for ViT-L/14) via a symmetric contrastive objective in the InfoNCE family. The image encoder (a ResNet or Vision Transformer; the ViT-L/14 model has about 428M parameters in total across both towers) and the text encoder (a Transformer with 63M parameters in its base configuration) are trained jointly but process their inputs independently, producing embeddings that are compared via cosine similarity.
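The symmetric contrastive objective described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the embeddings are toy 2-dimensional vectors, and the temperature is held fixed at 0.07 (the paper's reported initial value), whereas CLIP learns it during training.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    image_embs[i] and text_embs[i] are assumed to form a matched pair.
    The fixed temperature is an assumption of this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should put its mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: the same criterion applied to columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: correctly matched versus deliberately swapped captions.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_contrastive_loss(aligned, aligned)
loss_mismatched = clip_contrastive_loss(aligned, swapped)
```

In this toy batch the matched pairing yields a loss near zero while the swapped pairing is penalized heavily, which is the behavior the objective is meant to enforce. When every embedding is identical, the loss reduces to log N, the value for a uniform guess over the batch.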
Key advances:
CLIP's best zero-shot classifier matches the original fully supervised ResNet-50 on ImageNet (76.2% top-1 accuracy) without using any of ImageNet's labeled training examples. On 16 of 27 evaluation datasets, zero-shot CLIP outperforms a supervised linear probe trained on ResNet-50 features. The learned embeddings became a building block for DALL-E 2, Stable Diffusion, and many subsequent vision-language models. Training the largest Vision Transformer took approximately 12 days on 256 V100 GPUs; every model was trained for 32 epochs, which over 400 million pairs amounts to 12.8 billion image-text pairs seen.
Topics covered:
Symmetric InfoNCE over an N×N image-text similarity matrix: matched pairs on the diagonal are pulled together, mismatched off-diagonal pairs are pushed apart.
Separate image and text towers project into a shared L2-normalized embedding space, where cosine similarity becomes the classifier.
Text embeddings of class prompts act as the weights of a linear classifier that needs no gradient updates, so new classes can be added simply by writing new prompts.
Averaging the embeddings of multiple prompt templates per class estimates a denoised class centroid, worth roughly 3.5 percentage points of ImageNet accuracy at no extra training cost.
Under linear probing, frozen CLIP features outperform the best supervised ImageNet features on 21 of the 27 evaluation datasets.
Error follows a power law in compute; 400M pairs and large batches are what unlock zero-shot generalization.
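The zero-shot classifier and prompt-ensembling ideas in the list above can be illustrated with a small sketch. It assumes the prompt and image embeddings have already been produced by the two encoders; the 3-dimensional vectors, the class names, the function names, and the fixed logit scale of 100 are all hypothetical choices for illustration, not values taken from a trained model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_centroid(prompt_embs):
    """Prompt ensembling: average the normalized prompt embeddings, then re-normalize."""
    normed = [l2_normalize(e) for e in prompt_embs]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return l2_normalize(mean)

def zero_shot_classify(image_emb, class_weights, logit_scale=100.0):
    """Score an image against per-class text centroids and softmax the scores.

    class_weights maps a class name to a unit-length text embedding, which plays
    the role of one row of a linear classifier. No gradient step is involved.
    """
    image = l2_normalize(image_emb)
    logits = {name: logit_scale * dot(image, w) for name, w in class_weights.items()}
    peak = max(logits.values())
    exps = {name: math.exp(v - peak) for name, v in logits.items()}
    total = sum(exps.values())
    probs = {name: v / total for name, v in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical text embeddings for two prompt templates per class,
# e.g. "a photo of a {class}" and "a blurry photo of a {class}".
prompt_embeddings = {
    "dog": [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    "cat": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.2]],
}
class_weights = {name: class_centroid(embs) for name, embs in prompt_embeddings.items()}

# Hypothetical image embedding that lies close to the "dog" prompts.
predicted, probabilities = zero_shot_classify([0.85, 0.2, 0.05], class_weights)
```

Adding a new class in this setup only requires computing one more centroid and adding it to `class_weights`; the image side is untouched, which is why the list describes new classes as essentially free.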