Every AI-generated image you've seen — Midjourney portraits, Stable Diffusion landscapes, that thing your cousin keeps posting — was made by a model that learned to start with random noise and gradually find signal.
This is the core idea behind diffusion models: rather than learning to draw images directly, they learn to denoise them. The training process is deceptively simple. The interesting thing is what the model learns along the way.
How the training works
The insight comes from asking a seemingly backward question: instead of learning to create images, what if we learn to destroy them — and then reverse the process?
1. Start: take a real photograph.
2. Add noise: gradually corrupt it with random Gaussian noise over ~1000 steps.
3. Pure noise: after enough steps, only static remains and the original image is unrecoverable.
4. Train reversal: the model learns to predict and remove the noise, one step backward at a time.
5. Generate: start from pure noise and reverse the ~1000 steps until an image emerges.
At training time, the model sees images at every stage of corruption and learns to predict: "given this noisy image at step T, what does the slightly less noisy version at step T−1 look like?"
Repeat that 1000 times, and you have a path from pure noise to a coherent image.
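To make the corruption side concrete, here is a minimal NumPy sketch of the noising process. It assumes a linear noise schedule with commonly used values (1e-4 to 0.02 over 1000 steps), which the text above does not specify, and it uses a random 8×8 array as a stand-in for a photograph. The names `make_alpha_bar` and `add_noise` are illustrative, not from any library.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal variance left after each step.

    Assumes a linear schedule; these endpoint values are an assumption.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to corruption step t (0-indexed) of a clean image x0."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
image = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for a real photograph

slightly_noisy, _ = add_noise(image, 10, alpha_bar, rng)    # still mostly image
almost_static, _ = add_noise(image, 999, alpha_bar, rng)    # almost all noise
```

Because Gaussian noise added step by step is still Gaussian, the sketch can jump to any step in one calculation instead of looping. Under this schedule almost none of the original signal survives by the final step, which is the "pure noise" stage in the list above.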
INSIGHT
The model never sees the generation process during training — only the denoising step. This is what makes the math tractable. Training a model to generate images directly is extremely hard. Training one to predict noise is much simpler.
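A single training example under the noise-prediction framing might look like the sketch below. The `model` argument is a hypothetical callable standing in for a neural network, `zero_model` is a deliberately useless placeholder, and the linear schedule is again an assumption.

```python
import numpy as np

def training_loss(model, x0, alpha_bar, rng):
    """Corrupt x0 to a random step, ask the model for the noise, score the guess."""
    t = int(rng.integers(0, len(alpha_bar)))    # random corruption level
    noise = rng.standard_normal(x0.shape)       # the noise the model should recover
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    predicted = model(x_t, t)                   # hypothetical network call
    return float(np.mean((predicted - noise) ** 2))

def zero_model(x_t, t):
    """Placeholder 'network' that always predicts no noise at all."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(16, 16))                   # stand-in image
loss = training_loss(zero_model, x0, alpha_bar, rng)
```

Note that nothing here runs the full generation chain. Each example is one corrupted image, one prediction, and one squared error, which is why training stays simple. A real setup would replace `zero_model` with a trainable network and minimise this loss over many images and steps.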
How text conditioning works
A diffusion model trained only on images can generate images — but random ones. To steer the generation toward a specific subject, style, or composition, the model needs to understand text.
This is where CLIP (Contrastive Language-Image Pretraining) comes in. CLIP is trained separately to map images and text descriptions into the same mathematical space, so the vector for "a photograph of a golden retriever in autumn leaves" ends up close to the vectors of images fitting that description.
During diffusion model training, text embeddings from CLIP are fed into each denoising step as conditioning information. The model learns: "when the text says X, steer the denoising toward images that look like X."
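"Close" in a shared embedding space is usually measured with cosine similarity. The sketch below uses made-up four-dimensional vectors purely for illustration. Real CLIP embeddings have hundreds of dimensions and come from trained encoders, which are not shown here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented embeddings, for illustration only.
text_embedding = np.array([0.9, 0.1, 0.3, 0.0])     # "golden retriever in autumn leaves"
matching_image = np.array([0.8, 0.2, 0.4, 0.1])     # a photo fitting the caption
unrelated_image = np.array([0.0, 0.9, -0.2, 0.7])   # a photo of something else

match_score = cosine_similarity(text_embedding, matching_image)
mismatch_score = cosine_similarity(text_embedding, unrelated_image)
```

With these invented numbers the matching pair scores close to 1 and the unrelated pair close to 0. Contrastive training pushes real caption and image pairs toward the same pattern.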
At generation time: type a prompt → encode with CLIP → use that encoding to guide 50–1000 denoising steps from random noise → image.
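The generation loop itself can be sketched in a few lines. This follows the standard DDPM-style sampling update, which is one of several samplers in use and an assumption here. The `toy_model` stand-in is not a neural network: it simply pretends the "prompt embedding" is the clean image the prompt describes, so that the loop has something exact to recover. The 50-step linear schedule is also assumed.

```python
import numpy as np

def sample(model, text_embedding, shape, betas, rng):
    """Run the reverse process: pure noise in, image out."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, t, text_embedding)           # predicted noise, given the prompt
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a little fresh noise.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

betas = np.linspace(1e-4, 0.02, 50)                 # assumed 50-step schedule
alpha_bar = np.cumprod(1.0 - betas)

def toy_model(x_t, t, text_embedding):
    """Stand-in denoiser that treats the 'embedding' as the target image itself."""
    return (x_t - np.sqrt(alpha_bar[t]) * text_embedding) / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
prompt = np.array([[1.0, -1.0], [0.5, 0.0]])        # pretend prompt embedding
image = sample(toy_model, prompt, prompt.shape, betas, rng)
```

Because the toy model's noise estimate is exact, the final step lands on the target. A trained network only approximates that estimate, which is why real outputs vary from seed to seed.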
Why it produces coherent detail
One counterintuitive property of diffusion models: they produce remarkably coherent images even at fine detail, despite never having a global "plan" for the image.
The reason is that the denoising process operates hierarchically in practice. Early steps (high noise) establish large-scale structure — composition, rough shapes, colour relationships. Later steps (low noise) fill in fine details within that established structure.
This is also why the guidance scale (classifier-free guidance) matters. Higher guidance pushes the generation more strongly toward the text prompt at every step, which produces more "on-prompt" images but can cause oversaturation and artefacts.
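Classifier-free guidance is usually written as a one-line blend of two noise predictions: one made with the prompt and one made without it. The sketch below uses small invented arrays in place of real model outputs, and the function name `guided_noise` is illustrative.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Push the prediction away from 'no prompt' and toward 'with prompt'."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Invented noise predictions for three pixels.
eps_uncond = np.array([0.2, -0.1, 0.0])   # model run without the prompt
eps_cond = np.array([0.5, 0.1, -0.3])     # model run with the prompt

no_guidance = guided_noise(eps_uncond, eps_cond, 0.0)        # ignores the prompt
plain_conditional = guided_noise(eps_uncond, eps_cond, 1.0)  # just the conditional output
strong = guided_noise(eps_uncond, eps_cond, 7.5)             # extrapolates past it
```

A scale above 1 extrapolates beyond what the conditional model actually predicted. That is one way to see why large values can overshoot into oversaturated or artefact-heavy images.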
The model zoo
Diffusion models come in several flavours, and the differences matter for what you're building:
Stable Diffusion (and its variants, such as SDXL and SD3) has openly downloadable weights and can be run locally. Enormous community ecosystem, extensive fine-tuned variants for specific styles. Often slower than API options, but the only one here where you truly control the weights.
Midjourney runs proprietary diffusion models tuned for aesthetic quality. The results often have a distinct look — high coherence, painterly quality — but you can't self-host it.
DALL-E 3 and GPT-Image-1 integrate tightly with GPT-4 for prompt understanding before generation, which means they handle complex compositional prompts better than raw CLIP conditioning.
TIP
For production use: DALL-E 3 handles complex compositions, Stable Diffusion gives you control and privacy, Midjourney produces the most immediately impressive aesthetics. Pick based on what your use case actually needs.
The underlying math is the same. The training data, fine-tuning, and guidance techniques are what differentiate them.
The path from pure noise to a coherent image is one of the strangest computations humans have built. Nothing in the model "imagines" the picture in advance — it just gets a little less wrong, a thousand times in a row, guided by what your prompt embedded into the space of all possible images.
The math is unromantic. The output, often, isn't.