AI Engineering 10 min read

Diffusion Models Explained: How Image Generation Actually Works

What diffusion models actually do: the forward noising process, the reverse denoising network, why U-Net matters, and how text conditioning (CLIP + cross-attention) turns a text prompt into an image. No hand-waving.

Diffusion models are the architecture behind Stable Diffusion, DALL-E 2/3, Midjourney, and Imagen. They generate images by learning to reverse a gradual noise process. The intuition: if you can learn to remove noise from a noisy image, you can generate images from pure noise.

The Forward Process (Adding Noise)

Training starts with a real image. Over T timesteps (typically T=1000), Gaussian noise is incrementally added until the image is essentially pure noise, statistically almost indistinguishable from a sample from N(0, I). This is the forward process. It is not learned; it is fixed by a noise schedule (linear, cosine, or quadratic) that sets how much noise each step adds.
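As a rough illustration of what a fixed schedule looks like, here is a minimal NumPy sketch of a linear and a cosine schedule. The function names and the default values (beta from 1e-4 to 0.02, offset s = 0.008) are assumptions borrowed from commonly used settings, not something this article specifies.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, evenly spaced (assumed defaults)."""
    return np.linspace(beta_start, beta_end, T)

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal fraction alpha_bar_t for t = 1..T under a cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

betas = linear_beta_schedule()
# alpha_bar_t = product of (1 - beta_s) for s <= t: how much of the
# original image's variance survives after t noising steps.
alpha_bar_linear = np.cumprod(1.0 - betas)
alpha_bar_cosine = cosine_alpha_bar()

# Near 1 at the first step (almost all signal), near 0 at the last (almost all noise).
print(alpha_bar_linear[0], alpha_bar_linear[-1])
print(alpha_bar_cosine[0], alpha_bar_cosine[-1])
```

Both schedules end with almost no signal left, which is why the final noised image behaves like a sample from N(0, I).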

At each timestep t, the noisy image x_t can be computed directly from the original image x_0 and the noise schedule — no need to iterate through all previous steps. This efficiency is critical: it means you can sample any noise level at random during training, not step through sequentially.
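The direct jump to any timestep is usually written as x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, where ᾱ_t is the cumulative product of (1 − β) up to step t. The sketch below assumes a linear schedule and a random array standing in for an image; `q_sample` is an illustrative name, not an API from any particular library.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot, without iterating over steps 1..t."""
    eps = rng.standard_normal(x0.shape)      # the noise the model will later predict
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for an image scaled to [-1, 1]
t = int(rng.integers(0, 1000))                 # any noise level, chosen at random
x_t, eps = q_sample(x0, t, alpha_bar, rng)
print(t, x_t.shape)
```

Because `t` is just an index into `alpha_bar`, a training batch can mix arbitrary noise levels at the cost of one multiply-add per image.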

The Reverse Process (Learning to Denoise)

The model's job is to learn the reverse: given a noisy image x_t at timestep t, predict the noise that was added (or, equivalently up to a reparameterization, predict x_0 directly). The model is trained to minimize the difference between predicted and actual noise. This denoiser is the only component the diffusion objective trains: a neural network fit on millions of (noisy image, timestep, added noise) examples, each generated on the fly from a clean image.

The core training objective: predict the noise ε added to an image at timestep t. Loss = ||ε - ε_θ(x_t, t)||². This simple MSE loss on noise prediction is what the entire diffusion model trains on.
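To make the objective concrete, the following sketch computes the loss for one batch. It is a simplified illustration: `toy_eps_model` is a hypothetical placeholder that always predicts zero, whereas a real ε_θ is a trained network, and the loss is averaged over elements rather than summed.

```python
import numpy as np

def toy_eps_model(x_t, t):
    """Hypothetical stand-in for eps_theta(x_t, t); always predicts zero noise."""
    return np.zeros_like(x_t)

def training_loss(x0_batch, alpha_bar, eps_model, rng):
    """One evaluation of the noise-prediction MSE on a batch shaped (B, C, H, W)."""
    B = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bar), size=B)   # random timestep per image
    eps = rng.standard_normal(x0_batch.shape)     # the actual noise
    a = alpha_bar[t].reshape(B, 1, 1, 1)          # broadcast over C, H, W
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    eps_pred = eps_model(x_t, t)
    return float(np.mean((eps - eps_pred) ** 2))  # mean of ||eps - eps_theta||^2

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
x0_batch = rng.uniform(-1.0, 1.0, size=(8, 3, 16, 16))
loss = training_loss(x0_batch, alpha_bar, toy_eps_model, rng)
print(loss)  # a zero predictor scores about 1.0, the variance of eps
```

In real training this value would be backpropagated through the network; any predictor that does better than the zero baseline has learned something about the noise.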

The U-Net: Architecture of the Denoiser

The denoising network is a U-Net: an encoder-decoder architecture with skip connections between encoder and decoder at matching resolutions. Inputs: the noisy image + the timestep t (embedded as a sinusoidal positional embedding). Output: predicted noise. The U-Net operates in pixel space (for small models) or latent space (for LDMs).
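The sinusoidal timestep embedding can be sketched as follows. This is one common formulation (sines in the first half, cosines in the second, geometrically spaced frequencies); the function name, the `max_period` default, and the even `dim` requirement are assumptions of this sketch.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (B,) to vectors of shape (B, dim); dim must be even."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.array([0, 10, 999]), dim=128)
print(emb.shape)  # (3, 128)
```

The denoiser typically passes this vector through a small MLP and injects it into each block, so the same weights can behave differently at high and low noise levels.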

Many recent diffusion models replace the U-Net with a transformer: the DiT (Diffusion Transformer) family of architectures, used in Stable Diffusion 3 and FLUX and reported for Sora. Transformers tend to scale more predictably with compute than U-Nets and handle higher resolutions better.

Text Conditioning: How Text Prompts Guide Generation

Unconditional diffusion models generate random images. Text-conditioned models (Stable Diffusion, DALL-E) steer the denoising process with a text embedding. The standard mechanism: cross-attention in the U-Net/DiT. The text embedding (from a CLIP or T5 text encoder) is used as keys and values in cross-attention layers. The noisy image attends to the text, letting the text guide what noise to remove.
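A single-head version of this cross-attention can be sketched in NumPy. The queries come from image features and the keys and values come from the text embedding. The sizes below (256 image tokens, 77 text tokens of width 768) and the random projection matrices are illustrative assumptions; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, Wq, Wk, Wv):
    """Image tokens query the text: Q from the image, K and V from the text."""
    Q = image_tokens @ Wq                     # (N_img, d)
    K = text_tokens @ Wk                      # (N_txt, d)
    V = text_tokens @ Wv                      # (N_txt, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (N_img, N_txt)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_img, d_txt, d = 320, 768, 64                # assumed feature widths
image_tokens = rng.standard_normal((256, d_img))  # flattened 16x16 feature map
text_tokens = rng.standard_normal((77, d_txt))    # one embedding per prompt token
Wq = rng.standard_normal((d_img, d)) / np.sqrt(d_img)
Wk = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)
Wv = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)

out, weights = cross_attention(image_tokens, text_tokens, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (256, 64) (256, 77)
```

Each row of `weights` says how much one image location draws from each prompt token, which is the mechanism by which a word like "red" can influence specific regions.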

Latent Diffusion Models (LDMs)

Diffusion in pixel space is expensive. Stable Diffusion is built on Latent Diffusion Models: instead of denoising in pixel space (512×512×3 ≈ 786k dimensions), first compress the image to a small latent (64×64×4 ≈ 16k dimensions) using a VAE, run diffusion in latent space, then decode back to pixels with the VAE decoder. This is a roughly 48× reduction in the dimensionality of the denoising problem, which makes high-resolution generation tractable on consumer GPUs.
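The dimension counts above can be checked directly; the exact ratio works out to 48. The variable names here are only for illustration.

```python
pixel_dims = 512 * 512 * 3    # RGB image
latent_dims = 64 * 64 * 4     # VAE latent: 8x smaller per side, 4 channels

print(pixel_dims)                # 786432
print(latent_dims)               # 16384
print(pixel_dims / latent_dims)  # 48.0
```

The saving comes almost entirely from the 8× spatial downsampling per side (64× fewer positions), partly offset by the latent having 4 channels instead of 3.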

Inference: Sampling

At inference time: start from pure Gaussian noise. Iteratively apply the denoiser for T' steps (typically 20–50 in practice with fast samplers like DDIM, DPM++). Each step removes some noise guided by the text conditioning. After T' steps, decode the latent to a pixel image.
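The sampling loop can be sketched with a deterministic DDIM-style update (no extra noise injected per step). This is a minimal sketch under several assumptions: a linear schedule, evenly spaced timesteps, no text conditioning, and a hypothetical `oracle_eps` denoiser that already knows the target image so the loop can be checked end to end. A real system would call the trained network there.

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM-style sampling over a subset of the training timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)            # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # Signal level of the next, less noisy step; 1.0 means fully clean.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # implied clean image
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = np.full((4, 8, 8), 0.5)              # pretend "clean latent"

def oracle_eps(x, t):
    """Hypothetical perfect denoiser: returns the noise separating x from target."""
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(oracle_eps, target.shape, alpha_bar, num_steps=20, rng=rng)
print(np.abs(sample - target).max())
```

With a latent diffusion model, the returned array would be a latent that still needs to pass through the VAE decoder to become pixels.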

The main quality-speed knob in diffusion sampling is the number of inference steps, though the choice of sampler and model also matters. As a rough rule of thumb, 20 DDIM steps recover about 80% of the quality of 1000 DDPM steps while needing 50× fewer network evaluations. For production, SDXL with DPM++ 2M Karras at around 20 steps is a common configuration.
