In 2021, OpenAI's CLIP paper showed something powerful: train an image encoder and a text encoder jointly so that matching (image, caption) pairs land near each other in the same embedding space. Now you can search images with text queries, or vice versa, without per-class labels.
Many modern image generators, vision-aware chat models, and image search engines build on some descendant of this idea.
The trick
Two encoders. One eats images, one eats text. Both produce vectors of the same dimension. The training objective is contrastive: within each batch, vectors for matching (image, caption) pairs should have high cosine similarity, and every non-matching combination should have low similarity. Train on hundreds of millions of (image, caption) pairs scraped from the web; CLIP used roughly 400 million.
The result: a shared semantic space where you can compare images and text directly.
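To make the objective concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over one batch. It is an illustration under stated assumptions, not CLIP's actual training code: the function names are hypothetical, the random arrays stand in for encoder outputs, and the temperature is fixed at an illustrative 0.07 rather than learned.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy; row i of each array is assumed to be a matching pair."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # (n, n) matrix: entry [i, j] compares image i with caption j.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: same idea along columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy stand-ins for encoder outputs (assumption: 4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = contrastive_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = contrastive_loss(text[::-1], text)  # every pair mismatched
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is low when each image embedding sits next to its own caption and high when the pairing is scrambled, which is the pressure that pulls the two encoders into one shared space.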
What this unlocks
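- Zero-shot classification. Want to know if an image contains a cat? Embed the image, embed a few candidate labels such as "cat" and "dog," and pick the label whose embedding is closest to the image's. No task-specific training data needed.
- Text-conditioned generation. Diffusion models such as Stable Diffusion use CLIP text embeddings as conditioning signals.
- Vision in chat models. GPT-4V, Claude with vision, and Gemini accept images alongside text. Their architectures are not fully public, but open models such as LLaVA feed image-token sequences (produced by a CLIP-style encoder) into a transformer alongside text tokens.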
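The zero-shot classification item above can be sketched in a few lines. This is a toy illustration, not a real model call: the three-dimensional vectors are hand-made stand-ins for encoder outputs, and `zero_shot_classify`, the label prompts, and the temperature value are all assumptions chosen for the example.

```python
import numpy as np

def cosine_similarity(query, candidates):
    # Cosine similarity between one query vector and each row of candidates.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def zero_shot_classify(image_emb, label_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image embedding."""
    logits = cosine_similarity(image_emb, label_embs) / temperature
    # Softmax turns similarities into a distribution over the candidate labels.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Hypothetical embeddings; a real system would get these from the two encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
label_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.2, 0.05])  # pretend this came from a cat photo

predicted, probs = zero_shot_classify(image_emb, label_embs, labels)
print(predicted)
```

Swapping in a different set of label strings changes the classifier without any retraining, which is the practical meaning of "zero-shot" here. Text-to-image search is the same comparison run the other way: one text embedding scored against many image embeddings.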
What it doesn't fix
- Counting. "How many people are in this image?" remains hard.
- Spatial reasoning. "What's to the left of the cat?" gets patchy answers.
- Text in images. Contrastive training alone reads text inside images unreliably; dependable reading generally has to be added as a separate, OCR-style capability.
What to read next
Multimodal is the broader umbrella. Embeddings is the underlying primitive. Diffusion is the generative half of the vision-language ecosystem.