Vision-Language Models

In 2021, OpenAI's CLIP paper showed something powerful: train an image encoder and a text encoder jointly so that matching (image, caption) pairs land near each other in the same embedding space. Now you can search images with text queries, or vice versa, without per-class labels.

Most modern text-to-image generators, vision-aware chat models, and image search engines build on some descendant of this idea.

Image: quarterly revenue bar chart (not shown)

User question

What trend do you notice in this revenue chart?

Model response

The bar chart shows quarterly revenue increasing from Q1 to Q2, then dropping sharply in Q3 — the shortest bar, roughly half the Q2 height. Q4 recovers but does not reach Q2 levels. The Q3 drop is the most notable signal here.

The model processes image patches like tokens. The attention overlay shows which regions it weighted most heavily when forming its response.
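To make "patches like tokens" concrete, here is a minimal sketch of the splitting step, assuming a ViT-style encoder that cuts the image into non-overlapping square patches. The function name `patchify` and the toy 4x4 image are hypothetical and chosen for illustration. In a real encoder each flattened patch would then go through a learned projection to become a token embedding, which is omitted here.

```python
def patchify(image, patch_size):
    """Split a 2-D grid of pixel values into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A toy 4x4 grayscale "image" whose pixel values are 0..15 (illustrative only).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(toy_image, patch_size=2)
print(len(patches))  # number of patch "tokens"
print(patches[0])    # the top-left patch, flattened
```

Each entry of `patches` plays the role of one token in the sequence the transformer attends over.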

The trick

Two encoders. One eats images, one eats text. Both produce vectors of the same dimension. The training objective is contrastive: vectors for matching (image, caption) pairs should be close (high cosine similarity); non-matching pairs should be far apart. Train on hundreds of millions of (image, caption) pairs scraped from the web; CLIP used 400 million.

The result: a shared semantic space where you can compare images and text directly.
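The objective described above can be sketched as a symmetric contrastive loss over one batch. This is a minimal pure-Python illustration, not the CLIP authors' implementation: the helper names, the toy 3-dimensional vectors, and the fixed `temperature=0.07` are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; image_vecs[i] and text_vecs[i] form a matching pair."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy 3-d "embeddings", made up for illustration (not real encoder outputs).
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned_captions = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_captions = [aligned_captions[1], aligned_captions[2], aligned_captions[0]]

loss_aligned = contrastive_loss(images, aligned_captions)
loss_shuffled = contrastive_loss(images, shuffled_captions)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the captions are shuffled, which is the signal that pulls matching pairs together during training.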

What this unlocks

Zero-shot classification. Want to know if an image contains a cat? Embed the image, embed candidate captions such as "a photo of a cat" and "a photo of a dog," and pick the caption whose embedding is closest. No task-specific training data needed.
Text-conditioned image generation. Diffusion models such as Stable Diffusion use CLIP text embeddings as steering signals.
Vision in chat models. GPT-4V, Claude with vision, and Gemini are transformers that take sequences of image tokens alongside text tokens. In openly documented models, those image tokens typically come from a CLIP-style encoder.
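As a rough illustration of the zero-shot classification item above, the sketch below scores one image embedding against several caption embeddings and picks the closest. The embeddings are hypothetical 3-dimensional stand-ins for encoder outputs, and `zero_shot_classify` and its `temperature` parameter are names assumed for this example rather than any library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, label_vecs, temperature=0.07):
    """Return a probability per candidate label based on embedding similarity."""
    labels = list(label_vecs)
    scores = [cosine(image_vec, label_vecs[label]) / temperature for label in labels]
    return dict(zip(labels, softmax(scores)))

# Hypothetical precomputed embeddings; a real system would get these
# from the text encoder and the image encoder respectively.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.2],
    "a photo of a dog": [0.2, 0.9, 0.1],
    "a photo of a car": [0.1, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Adding a new class means adding one more caption to `label_embeddings`; nothing is retrained.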

What it doesn't fix

Counting. "How many people are in this image?" remains hard.
Spatial reasoning. "What's to the left of the cat?" gets patchy results.
Text in images. Reading text within images was added as a separate capability (OCR-style).

What to read next

Multimodal is the broader umbrella. Embeddings is the underlying primitive. Diffusion is the generative half of the vision-language ecosystem.
