
Diffusion Models Explained: How Image Generation Actually Works

What diffusion models actually do: the forward noising process, the reverse denoising network, why U-Net matters, and how text conditioning (CLIP + cross-attention) turns a text prompt into an image. No hand-waving.

Diffusion models are the architecture behind Stable Diffusion, DALL-E 2/3, Midjourney, and Imagen. They generate images by learning to reverse a gradual noise process. The intuition: if you can learn to remove noise from a noisy image, you can generate images from pure noise.

The Forward Process (Adding Noise)

Training starts with a real image. Over T timesteps (typically T=1000), Gaussian noise is incrementally added until the image is pure noise — statistically indistinguishable from a sample from N(0, I). This is the forward process, and it's not learned — it's fixed by a mathematical schedule (linear, cosine, or quadratic noise schedule).

At each timestep t, the noisy image x_t can be computed directly from the original image x_0 and the noise schedule — no need to iterate through all previous steps. This efficiency is critical: it means you can sample any noise level at random during training, not step through sequentially.
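The direct jump to any timestep can be sketched in a few lines. This is a minimal illustration, assuming a linear schedule with endpoints 1e-4 and 0.02 (common defaults, not stated in this article) and a toy four-value "image"; the function names are made up for the example. It uses the closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β_s).

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (endpoints are assumed defaults)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all steps s up to and including t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t without iterating through earlier steps."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule()
alpha_bars = cumulative_alpha_bar(betas)
rng = random.Random(0)

x0 = [0.5, -0.25, 1.0, 0.0]                      # toy 4-pixel "image"
x_mid, eps_mid = q_sample(x0, 500, alpha_bars, rng)
x_end, eps_end = q_sample(x0, 999, alpha_bars, rng)
```

By the last step the signal coefficient sqrt(ᾱ_t) is close to zero under this schedule, so x_end is almost entirely noise.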

The Reverse Process (Learning to Denoise)

The model's job is to learn the reverse: given a noisy image x_t at timestep t, predict the noise that was added (or, equivalently up to a reparameterization, predict x_0 directly). The model is trained to minimize the difference between predicted and actual noise. In the basic formulation this denoiser is the only learned component: a neural network trained on millions of (noisy image, added noise, timestep) examples derived from clean images.

The core training objective: predict the noise ε added to an image at timestep t. Loss = ||ε - ε_θ(x_t, t)||². This simple MSE loss on noise prediction is what the entire diffusion model trains on.
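One evaluation of this loss might look like the sketch below. It assumes the same linear schedule as is commonly used (endpoints 1e-4 and 0.02 are assumptions), averages the squared error over elements rather than summing it (a constant factor away from the squared norm), and uses a hypothetical `untrained` stand-in that always predicts zero noise in place of a real network.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions for an assumed linear noise schedule."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_prediction_loss(x0, alpha_bars, eps_model, rng):
    """Pick a random timestep, noise the image, and score the noise prediction."""
    t = rng.randrange(len(alpha_bars))            # random noise level
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # the noise the model must recover
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    eps_pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(x0)

alpha_bars = make_alpha_bars()
rng = random.Random(0)

def untrained(x_t, t):
    """Hypothetical placeholder for a network: always predicts zero noise."""
    return [0.0] * len(x_t)

loss = noise_prediction_loss([0.5, -0.25, 1.0, 0.0], alpha_bars, untrained, rng)
```

In real training the placeholder would be a network with parameters θ, and this loss would be averaged over a batch and backpropagated.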

The U-Net: Architecture of the Denoiser

The denoising network is a U-Net: an encoder-decoder architecture with skip connections between encoder and decoder at matching resolutions. Inputs: the noisy image and the timestep t (encoded with a sinusoidal positional embedding). Output: predicted noise. The U-Net operates either directly in pixel space or, in latent diffusion models (LDMs), in a compressed latent space.
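The timestep embedding is the same construction as transformer positional encodings. A minimal sketch follows, assuming an even embedding size and a base period of 10000 (the conventional choice; real models use much larger dimensions and usually pass the result through a small MLP).

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep; dim is assumed to be even."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = timestep_embedding(0)
emb_late = timestep_embedding(750)
```

Each timestep maps to a distinct vector, which lets a single network behave differently at high and low noise levels.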

Many recent diffusion models replace the U-Net with a transformer: the DiT (Diffusion Transformer) family of architectures, used in Stable Diffusion 3 and FLUX and reported for Sora. Transformers are generally reported to scale more predictably with compute than U-Nets and to handle higher resolutions better.

Text Conditioning: How Text Prompts Guide Generation

Unconditional diffusion models generate random images. Text-conditioned models (Stable Diffusion, DALL-E) steer the denoising process with a text embedding. The standard mechanism is cross-attention in the U-Net/DiT: the image features supply the queries, and the text embedding (from a CLIP or T5 text encoder) supplies the keys and values. Each position in the noisy image attends to the text tokens, letting the text guide what noise to remove.

Stable Diffusion 1.x uses CLIP ViT-L/14 as the text encoder (2.x switched to OpenCLIP ViT-H).
Stable Diffusion XL uses both CLIP ViT-L and OpenCLIP ViT-bigG, with their outputs concatenated.
Stable Diffusion 3 and FLUX use T5-XXL (a much larger text encoder) alongside CLIP, which helps with long, complex prompts.
DALL-E 3 is trained largely on detailed synthetic captions written by an image captioning model, and at inference time GPT-4 expands the user's prompt into a similarly detailed caption. Training on high-quality captions dramatically improves prompt adherence.
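Whichever encoder supplies the text embedding, the cross-attention step itself can be sketched as follows. This is a single-head illustration on plain Python lists, with tiny made-up token vectors and identity projection matrices as assumptions; real layers use learned projections, multiple heads, and far larger dimensions.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Image tokens form the queries; text tokens form the keys and values."""
    q = matmul(image_tokens, w_q)                  # (n_img, d)
    k = matmul(text_tokens, w_k)                   # (n_txt, d)
    v = matmul(text_tokens, w_v)                   # (n_txt, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose to (d, n_txt)
    scores = matmul(q, k_t)                        # (n_img, n_txt)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights             # output is (n_img, d)

identity = [[1.0, 0.0], [0.0, 1.0]]                # placeholder projections
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
out, weights = cross_attention(image_tokens, text_tokens, identity, identity, identity)
```

Every image position ends up as a weighted mix of text-token values, and the weights show which words each position is attending to.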

Latent Diffusion Models (LDMs)

Diffusion in pixel space is expensive. Latent Diffusion Models, the approach behind Stable Diffusion, address this: instead of denoising in pixel space (512×512×3 ≈ 786k dimensions), first compress the image to a small latent (64×64×4 ≈ 16k dimensions) using a VAE. Run diffusion in latent space, then decode back to pixels with the VAE decoder. This is a 48× reduction in the dimensionality of the denoising problem, which makes high-resolution generation tractable on consumer GPUs.
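The dimension counts are easy to check directly. The sketch below assumes the 8× spatial downsampling and 4 latent channels implied by the 512×512×3 to 64×64×4 example; the helper names are made up for illustration.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent shape, assuming 8x spatial downsampling and 4 latent channels."""
    return (height // downsample, width // downsample, channels)

def num_dims(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for size in shape:
        total *= size
    return total

pixel_dims = num_dims((512, 512, 3))               # 786432
latent_dims = num_dims(latent_shape(512, 512))     # 16384
reduction = pixel_dims / latent_dims               # 48.0
```

The ratio stays the same at other resolutions under these assumptions, since both counts scale with height × width.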

Inference: Sampling

At inference time: start from pure Gaussian noise. Iteratively apply the denoiser for T' steps (typically 20–50 in practice with fast samplers like DDIM, DPM++). Each step removes some noise guided by the text conditioning. After T' steps, decode the latent to a pixel image.
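The loop can be sketched with a deterministic DDIM update. This is an illustration under several assumptions: a linear schedule with endpoints 1e-4 and 0.02, no text conditioning, no VAE decode, and a hypothetical `oracle_eps` function standing in for the trained network. The oracle already knows the target image, which makes it possible to check that 20 steps land on that target.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions for an assumed linear noise schedule."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_sample(eps_model, alpha_bars, dim, num_steps=20, seed=0):
    """Deterministic DDIM sampling over an evenly spaced subset of timesteps."""
    rng = random.Random(seed)
    T = len(alpha_bars)
    timesteps = list(range(T - 1, -1, -(T // num_steps)))   # 999, 949, ..., 49
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]           # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the final step nothing is left to remove, so alpha_bar is 1.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        # Estimate the clean image implied by the predicted noise.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xi, e in zip(x, eps)]
        # Re-noise that estimate to the next, lower noise level.
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x

alpha_bars = make_alpha_bars()
target = [0.5, -0.25, 1.0, 0.0]

def oracle_eps(x_t, t):
    """Hypothetical perfect denoiser: returns the exact noise relative to target."""
    ab = alpha_bars[t]
    return [(xi - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab)
            for xi, x0 in zip(x_t, target)]

sample = ddim_sample(oracle_eps, alpha_bars, dim=len(target))
```

A real denoiser only approximates the noise, so each step carries some error; that is where the step count starts to affect quality.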

The quality-speed tradeoff in diffusion is driven largely by the number of inference steps, with the choice of sampler a close second. As a rough rule of thumb, 20 DDIM steps give about 80% of the quality of 1000 DDPM steps at 50× the speed. For production, SDXL + DPM++ 2M Karras at 20 steps is a common configuration.
