Diffusion Models

Nearly every AI-generated image you've seen (Midjourney portraits, Stable Diffusion landscapes, that thing your cousin keeps posting) was made by a model that learned to start with random noise and gradually find signal.

This is the core idea behind diffusion models: rather than learning to draw images directly, they learn to denoise them. The training process is deceptively simple. The interesting thing is what the model learns along the way.

[Interactive demo: step 1 of 8. The model starts from pure random static, at a noise level of 100%.]

Each step removes a little noise. In this 8-step demo, structure emerges by the final step; a real model typically takes 50–1000 steps.

How the training works

The insight comes from asking a seemingly backward question: instead of learning to create images, what if we learn to destroy them — and then reverse the process?

1. Start: take a real photograph.
2. Add noise: gradually corrupt it with random Gaussian noise over ~1000 steps.
3. Pure noise: after enough steps, only static remains and the original image is unrecoverable.
4. Train reversal: the model learns to predict and remove the noise one step at a time, working backward.
5. Generate: start from pure noise and reverse the 1000 steps until an image emerges.

At training time, the model sees images at every stage of corruption and learns to predict: "given this noisy image at step T, what does the slightly less noisy version at step T−1 look like?"

Repeat that 1000 times, and you have a path from pure noise to a coherent image.
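To make the "repeat that 1000 times" loop concrete, here is a minimal sketch of a DDPM-style reverse process on a toy four-number "image". Everything named here is an assumption for illustration: the linear noise schedule, the helper names, and above all `oracle_noise`, which stands in for the trained network by cheating (it knows the single target the toy dataset contains). A real model replaces that function with a learned neural network.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule: per-step noise (beta) and remaining signal (alpha_bar)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise(x_t, alpha_bar, target):
    """Stand-in for the trained network: exact noise when all data equals `target`."""
    return [(x - math.sqrt(alpha_bar) * m) / math.sqrt(1.0 - alpha_bar)
            for x, m in zip(x_t, target)]

def sample(target, num_steps=1000, seed=0):
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]        # start from pure noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], target)  # "what noise is in here?"
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - coef * e) for xi, e in zip(x, eps)]  # remove a little noise
        if t > 0:                                     # no fresh noise on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x

toy_image = [0.8, -0.3, 0.1, 0.5]
result = sample(toy_image)
print([round(v, 4) for v in result])
```

Each pass through the loop only makes the sample slightly less noisy; the "image" is whatever is left once the loop runs out of steps.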

INSIGHT

The model never sees the generation process during training — only the denoising step. This is what makes the math tractable. Training a model to generate images directly is extremely hard. Training one to predict noise is much simpler.
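A rough sketch of what a single training example looks like under the noise-prediction framing: pick a random step, corrupt a clean image to that step in one jump, and score the model on how well it guesses the noise that was added. The linear schedule, the function names, and the all-zeros "prediction" are assumptions for illustration; in practice the prediction comes from a neural network and the loss drives gradient descent.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance left at step t (assumed linear schedule)."""
    product = 1.0
    for i in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        product *= 1.0 - beta
    return product

def add_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between the predicted and the actual noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

rng = random.Random(0)
x0 = [0.8, -0.3, 0.1, 0.5]           # toy stand-in for a clean image
t = rng.randrange(1000)              # each training example gets a random step
x_t, eps = add_noise(x0, t, rng)
untrained_guess = [0.0] * len(x0)    # placeholder for the network's output
loss = noise_prediction_loss(untrained_guess, eps)
print(f"step {t}: loss for an untrained guess = {loss:.4f}")
```

Note that nothing here runs the generation loop: the training signal is a plain regression target, which is why this setup is so much easier to optimise than direct image generation.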

How text conditioning works

A diffusion model trained only on images can generate images — but random ones. To steer the generation toward a specific subject, style, or composition, the model needs to understand text.

This is where CLIP (Contrastive Language-Image Pretraining) comes in. CLIP is trained separately to map images and text descriptions into the same mathematical space, so that the text "a photograph of a golden retriever in autumn leaves" ends up close to the vector representations of images fitting that description.
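"Close" in a shared space usually means high cosine similarity between vectors. The sketch below shows the idea with hand-written three-dimensional vectors; these numbers and the `best_caption` helper are invented for illustration, whereas a real CLIP model produces much higher-dimensional embeddings from learned image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 unrelated, -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings; a real encoder would compute these.
text_embeddings = {
    "a golden retriever in autumn leaves": [0.9, 0.1, 0.3],
    "a city skyline at night": [-0.2, 0.8, -0.5],
}
image_embeddings = {
    "dog_photo": [0.8, 0.2, 0.4],
    "skyline_photo": [-0.1, 0.9, -0.4],
}

def best_caption(image_vec):
    """Return the caption whose embedding points most nearly the same way."""
    return max(text_embeddings,
               key=lambda caption: cosine_similarity(text_embeddings[caption], image_vec))

for name, vec in image_embeddings.items():
    print(name, "->", best_caption(vec))
```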

During diffusion model training, text embeddings from CLIP are fed into each denoising step as conditioning information. The model learns: "when the text says X, steer the denoising toward images that look like X."

At generation time: type a prompt → encode with CLIP → use that encoding to guide 50–1000 denoising steps from random noise → image.
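The prompt-to-image pipeline can be sketched end to end with toy pieces. Every component below is hypothetical: `encode_prompt` is a lookup table standing in for CLIP, `conditional_noise` stands in for the trained network (it simply treats the embedding as the image to steer toward), and the sampler is a deterministic DDIM-style update that visits only 50 of the 1000 training steps. The shape of the loop is the point, not the components.

```python
import math
import random

TOY_TEXT_ENCODER = {                      # hypothetical stand-in for CLIP
    "warm sunset": [0.9, 0.4, 0.1],
    "cold ocean": [0.1, 0.3, 0.9],
}

def encode_prompt(prompt):
    return TOY_TEXT_ENCODER[prompt]

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        alpha_bars.append(running)
    return alpha_bars

def conditional_noise(x_t, alpha_bar, cond):
    """Stand-in for the network: the noise, assuming the clean image equals `cond`."""
    return [(x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
            for x, c in zip(x_t, cond)]

def generate(prompt, num_inference_steps=50, seed=0):
    rng = random.Random(seed)
    cond = encode_prompt(prompt)                          # prompt -> embedding
    alpha_bars = make_alpha_bars()
    stride = len(alpha_bars) // num_inference_steps
    timesteps = list(range(len(alpha_bars) - 1, -1, -stride))
    x = [rng.gauss(0.0, 1.0) for _ in cond]               # random noise
    for i, t in enumerate(timesteps):
        eps = conditional_noise(x, alpha_bars[t], cond)   # guided by the embedding
        x0_guess = [(xi - math.sqrt(1.0 - alpha_bars[t]) * e) / math.sqrt(alpha_bars[t])
                    for xi, e in zip(x, eps)]
        is_last = i + 1 == len(timesteps)
        prev = 1.0 if is_last else alpha_bars[timesteps[i + 1]]
        x = [math.sqrt(prev) * g + math.sqrt(1.0 - prev) * e
             for g, e in zip(x0_guess, eps)]              # jump to the next, cleaner step
    return x

image = generate("warm sunset")
print([round(v, 3) for v in image])
```

Skipping steps like this is one reason the step count is a range rather than a fixed 1000: the schedule used for training and the number of steps taken at generation time do not have to match.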

Why it produces coherent detail

One counterintuitive property of diffusion models: they produce remarkably coherent images even at fine detail, despite never having a global "plan" for the image.

The reason is that the denoising process operates hierarchically in practice. Early steps (high noise) establish large-scale structure — composition, rough shapes, colour relationships. Later steps (low noise) fill in fine details within that established structure.
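One way to put numbers on "high noise" versus "low noise" is the signal-to-noise ratio at each step. The sketch below assumes a linear noise schedule with commonly quoted endpoints; the schedule and the function name are assumptions, and the link from low signal-to-noise to "only coarse structure is decidable" is an intuition rather than something the code proves.

```python
def signal_to_noise(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Ratio of signal variance to noise variance at every step of the schedule."""
    snr, alpha_bar = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        snr.append(alpha_bar / (1.0 - alpha_bar))
    return snr

snr = signal_to_noise()
# Generation runs from t=999 (first denoising step) down to t=0 (last one).
for t in (999, 500, 100, 0):
    print(f"t={t:4d}  signal-to-noise ratio = {snr[t]:.5f}")
```

At the start of generation the ratio is tiny, so only the broadest features of an image could possibly be distinguished from static; by the last steps the signal dominates and what remains to settle is fine detail.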

| | Early denoising steps | Late denoising steps |
|---|---|---|
| Noise level | Very high: most of the image is static | Very low: mostly committed image with small noise |
| What changes | Large-scale composition, dominant colours, rough shapes | Fine texture, sharp edges, lighting details |
| Impact of error | High: a wrong early step affects everything downstream | Low: recoverable in subsequent steps |
| Text influence | Strong: high-level semantics guide structure | Subtle: detail refinement within established structure |

This is also why the guidance scale (classifier-free guidance, or CFG) matters. Higher guidance pushes the generation more strongly toward the text prompt at every step, which produces more "on-prompt" images but can cause oversaturation and artefacts.
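In its usual formulation, classifier-free guidance runs the noise predictor twice per step, once with the prompt and once without, then extrapolates from the unconditional prediction toward the conditional one. A minimal sketch, with the two predictions hard-coded as made-up numbers where a real pipeline would call the network:

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend two noise predictions; a scale above 1 extrapolates past the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, 0.5]   # made-up prediction with an empty prompt
eps_cond = [1.0, 1.0, 0.25]    # made-up prediction with the real prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

A scale of 0 ignores the prompt, 1 uses the conditional prediction as is, and larger values exaggerate whatever the prompt changed. That exaggeration is a plausible source of the oversaturation: the guided prediction can land well outside the range the model produced during training.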

The model zoo

Diffusion models come in several flavours, and the differences matter for what you're building:

Stable Diffusion (and its variants — SDXL, SD3) is open-source and can be run locally. Enormous community ecosystem, extensive fine-tuned variants for specific styles. Slower than API options, but the only one where you truly control the weights.

Midjourney runs proprietary diffusion models tuned for aesthetic quality. The results often have a distinct look — high coherence, painterly quality — but you can't self-host it.

DALL-E 3 and GPT-Image-1 integrate tightly with GPT-4 for prompt understanding before generation, which means they handle complex compositional prompts better than raw CLIP conditioning.

TIP

For production use: DALL-E 3 handles complex compositions, Stable Diffusion gives you control and privacy, Midjourney produces the most immediately impressive aesthetics. Pick based on what your use case actually needs.

The underlying math is the same. The training data, fine-tuning, and guidance techniques are what differentiate them.

The path from pure noise to a coherent image is one of the strangest computations humans have built. Nothing in the model "imagines" the picture in advance — it just gets a little less wrong, a thousand times in a row, guided by what your prompt embedded into the space of all possible images.

The math is unromantic. The output, often, isn't.
