Optimized Noise Maker

Having set the stage with the broader concept of AI, I want to turn to the specific class of generative models most akin to my practice: diffusion models. These models are at the core of Stable Diffusion, and understanding how they work not only illuminates why we got that purple image in the feedback experiment but also provides insight into generative AI's broader approach to creating images.

Stable Diffusion is an implementation of Latent Diffusion1, which improved efficiency by performing the diffusion in a compressed latent space (via the U-Net, which we will discuss later) rather than in pixel space. The process is guided by two things: the learned model (how to denoise in general) and the conditioning (the text prompt), which steers the model toward a particular kind of image. A text prompt like "sunset over ocean" subtly nudges each denoising step to favor features associated with sunsets and oceans. Technically, this is done by cross-attention layers that incorporate the text embeddings into the image-generation process, allowing the words to influence the image features at every step. In this section, I will walk through how diffusion models work step by step, which is crucial for understanding the hidden aesthetic qualities of generative models.
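To make the idea of conditioned denoising concrete, here is a deliberately toy sketch in NumPy. It is not the real U-Net or any actual Stable Diffusion scheduler: the latent is a tiny vector instead of an image tensor, and the stand-in "model" cheats by already knowing the clean latent the conditioning points to. What it does show is the core loop the paragraph describes: start from pure noise, and at each step let the conditioning nudge the latent a little further toward the kind of image the prompt asks for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "clean latent" that the conditioning (the prompt) points to,
# standing in for the features of "sunset over ocean".
target = np.array([1.0, -1.0, 0.5, 2.0])

def predict_noise(latent, conditioning):
    # Toy stand-in for the U-Net's noise prediction: the "noise" is simply
    # the gap between the current latent and the conditioned target.
    # The real model learns this from data; here we cheat for illustration.
    return latent - conditioning

latent = rng.normal(size=4)  # start from pure noise
for step in range(50):
    eps = predict_noise(latent, target)
    latent = latent - 0.1 * eps  # one small denoising step, steered by the prompt

# After many small steps the latent has drifted close to the conditioned target.
```

Each pass removes only a fraction of the predicted noise, which is why diffusion sampling takes many iterations rather than one jump: the conditioning gets to influence every step along the way, just as the cross-attention layers do in the real model.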
