
Concept

Diffusion Models

How AI learned to make images by starting with pure noise and finding the signal

Mankaran Singh·Updated May 17, 2026

Where this idea lives

Prerequisites and related concepts:
How AI Models Are Trained: From random noise to a model that can reason, the actual pipeline
Multimodal Models: When AI learned to see, listen, and read, at the same time, in the same head
Embeddings: The coordinates that give language a sense of direction

Tools that show it: Midjourney, Adobe Firefly, Leonardo AI, Runway ML

Shows up in: Graphic Design & Visual Arts, Film & Video Production, Architecture & Urban Design, Pharmacy & Drug Discovery

You might think:
Diffusion models paint the way humans do.
More inference steps always means a better image.
Diffusion only generates images.

Every AI-generated image you've seen — Midjourney portraits, Stable Diffusion landscapes, that thing your cousin keeps posting — was made by a model that learned to start with random noise and gradually find signal.

This is the core idea behind diffusion models: rather than learning to draw images directly, they learn to denoise them. The training process is deceptively simple. The interesting thing is what the model learns along the way.

Diffusion models borrowed their mathematical foundation from non-equilibrium thermodynamics — the physics of how systems move toward and away from equilibrium. Yes, really.
§

How the training works

The insight comes from asking a seemingly backward question: instead of learning to create images, what if we learn to destroy them — and then reverse the process?

  1. Start: Take a real photograph
  2. Add noise: Gradually corrupt it with random Gaussian noise over ~1000 steps
  3. Pure noise: After enough steps, just static; the original image is unrecoverable
  4. Train reversal: Model learns to predict and remove each noise step backward
  5. Generate: Start from pure noise, reverse 1000 steps, and an image emerges
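
The "add noise" half of this pipeline can be sketched in a few lines. The linear schedule and its endpoint values (1e-4 to 0.02) are assumptions borrowed from the common DDPM setup, not something this article specifies, and the four-number "image" is a stand-in for real pixel data. The sketch shows that a noisy version at any step can be computed in one jump, and that almost none of the original signal survives by step 1000.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (endpoint values are assumed)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: fraction of the original signal variance left after step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a_bar) * p + math.sqrt(1.0 - a_bar) * e
             for p, e in zip(pixels, noise)]
    return noisy, noise

betas = linear_beta_schedule()
alpha_bars = cumulative_signal(betas)
rng = random.Random(0)
image = [0.8, -0.3, 0.5, 0.1]  # hypothetical pixel values scaled to [-1, 1]
slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)
pure_static, _ = add_noise(image, 999, alpha_bars, rng)
print(f"signal left after step 1:    {alpha_bars[0]:.4f}")
print(f"signal left after step 1000: {alpha_bars[-1]:.6f}")
```

The returned `noise` is what a denoising model would be asked to predict during training; here it is simply discarded.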

At training time, the model sees images at every stage of corruption and learns to predict: "given this noisy image at step T, what does the slightly less noisy version at step T−1 look like?"[·]

Repeat that 1000 times, and you have a path from pure noise to a coherent image.
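
To make "repeat that 1000 times" concrete, here is a toy reverse loop on a single number instead of an image. It assumes DDPM-style update rules and a linear noise schedule, neither of which the article specifies, and it replaces the trained network with a hypothetical `oracle_noise_prediction` that is only exact when every training example equals one known value. It is a sketch of the loop's shape, not of a real sampler.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule and cumulative signal fractions (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise_prediction(x_t, a_bar, target):
    """Stand-in for the trained network: exact noise when all data equals `target`."""
    return (x_t - math.sqrt(a_bar) * target) / math.sqrt(1.0 - a_bar)

def sample(target, num_steps=1000, seed=0):
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_prediction(x, alpha_bars[t], target)
        alpha_t = 1.0 - betas[t]
        # Remove the predicted noise for this step (DDPM posterior mean).
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alpha_t)
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
    return x

result = sample(target=0.7)
print(result)
```

With a perfect noise predictor the loop lands on the target value regardless of the random starting point; a real model only approximates that prediction, which is where the variety in generated images comes from.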

INSIGHT

The model never sees the generation process during training — only the denoising step. This is what makes the math tractable. Training a model to generate images directly is extremely hard. Training one to predict noise is much simpler.

If "predict the noise that was added" sounds weirdly indirect, it is. It's also a large part of why diffusion overtook the older generation of generative models (GANs) commercially: the objective is easier to train, more stable, and scales better.
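
Written out, the noise-prediction objective is short. The notation below follows the DDPM formulation, which is one common choice and an assumption here: $x_0$ is a clean image, $\epsilon$ is the Gaussian noise that was added, $\bar{\alpha}_t$ is the fraction of signal remaining at step $t$, and $\epsilon_\theta$ is the network being trained.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}
\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]
```

The loss is an ordinary mean squared error between the noise that was actually added and the noise the network guesses, which is why training behaves like a plain regression problem.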
§

How text conditioning works

A diffusion model trained only on images can generate images — but random ones. To steer the generation toward a specific subject, style, or composition, the model needs to understand text.

This is where CLIP (Contrastive Language-Image Pretraining)[·] comes in. CLIP is trained separately to map images and text descriptions into the same mathematical space — so "a photograph of a golden retriever in autumn leaves" ends up close to the vector representation of images fitting that description.
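
A minimal sketch of what "close in the same mathematical space" means, using cosine similarity. The 4-dimensional vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions and come from trained encoders, which are not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
text_embedding = [0.9, 0.1, 0.3, 0.0]   # "a golden retriever in autumn leaves"
matching_image = [0.8, 0.2, 0.4, 0.1]   # photo of a retriever in leaves
unrelated_image = [0.0, 0.9, 0.1, 0.7]  # photo of a city skyline at night

match_score = cosine_similarity(text_embedding, matching_image)
mismatch_score = cosine_similarity(text_embedding, unrelated_image)
print(f"matching image:  {match_score:.2f}")
print(f"unrelated image: {mismatch_score:.2f}")
```

Contrastive training pushes matching text-image pairs toward high similarity and mismatched pairs toward low similarity, which is the property the diffusion model later relies on.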

The original Stable Diffusion uses CLIP's text encoder. When you type a prompt, it's first converted to CLIP embeddings, which are then used to guide every denoising step, nudging each step toward images that match the description.

During diffusion model training, text embeddings from CLIP are fed into each denoising step as conditioning information. The model learns: "when the text says X, steer the denoising toward images that look like X."

At generation time: type a prompt → encode with CLIP → use that encoding to guide 50–1000 denoising steps from random noise → image.

§

Why it produces coherent detail

One counterintuitive property of diffusion models: they produce remarkably coherent images even at fine detail, despite never having a global "plan" for the image.

The reason is that the denoising process operates hierarchically in practice. Early steps (high noise) establish large-scale structure — composition, rough shapes, colour relationships. Later steps (low noise) fill in fine details within that established structure.

| | Early denoising steps | Late denoising steps |
| --- | --- | --- |
| Noise level | Very high: most of the image is static | Very low: mostly committed image with small noise |
| What changes | Large-scale composition, dominant colours, rough shapes | Fine texture, sharp edges, lighting details |
| Impact of error | High: a wrong early step affects everything downstream | Low: recoverable in subsequent steps |
| Text influence | Strong: high-level semantics guide structure | Subtle: detail refinement within established structure |

“The model never plans the whole image. It just gets a little less wrong, a thousand times in a row.”

This is also why classifier-free guidance (CFG) matters. A higher guidance scale pushes the generation more strongly toward the text prompt at every step, which produces more "on-prompt" images but can cause oversaturation and artefacts.[·]
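
The guidance step itself is a one-line extrapolation. In this sketch the two noise predictions are hard-coded lists standing in for what a model would output with and without the prompt, and the scale of 7.5 is only an illustrative value, not a recommendation from this article.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward (and past) the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

# Hypothetical per-pixel noise predictions for one denoising step.
noise_uncond = [0.10, -0.20, 0.05]  # prompt ignored
noise_cond = [0.30, -0.10, 0.00]    # prompt used

no_guidance = apply_guidance(noise_uncond, noise_cond, 1.0)  # same as noise_cond
strong = apply_guidance(noise_uncond, noise_cond, 7.5)       # amplified difference
print(no_guidance)
print(strong)
```

At a scale of 1.0 the result is just the conditional prediction. Larger scales amplify whatever the prompt changed, which is the mechanism behind both the stronger prompt adherence and the oversaturation.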

§

The model zoo

Diffusion models come in several flavours, and the differences matter for what you're building:

Stable Diffusion (and its variants, such as SDXL and SD3) has openly released weights and can be run locally. Enormous community ecosystem, extensive fine-tuned variants for specific styles. Often slower than hosted API options on consumer hardware, but the only one of the three where you truly control the weights.

Midjourney runs proprietary diffusion models tuned for aesthetic quality. The results often have a distinct look — high coherence, painterly quality — but you can't self-host it.

DALL-E 3 and GPT-Image-1 integrate tightly with GPT-4 for prompt understanding before generation, which means they handle complex compositional prompts better than raw CLIP conditioning.

If you've ever wondered why DALL-E "understands" a complex prompt like "a confused-looking accountant standing next to a robot holding a cup of coffee" — it's because there's a language model preprocessing the prompt before the diffusion model ever sees it.

TIP

For production use: DALL-E 3 handles complex compositions, Stable Diffusion gives you control and privacy, Midjourney produces the most immediately impressive aesthetics. Pick based on what your use case actually needs.

The underlying math is the same. The training data, fine-tuning, and guidance techniques are what differentiate them.


The path from pure noise to a coherent image is one of the strangest computations humans have built. Nothing in the model "imagines" the picture in advance — it just gets a little less wrong, a thousand times in a row, guided by what your prompt embedded into the space of all possible images.

The math is unromantic. The output, often, isn't.

