Denoising the noise

All of the research finally converges at this point: the technical narrative of the diffusion model itself. A common misconception is that AI images are generated out of thin air, but this is far from the truth. Much like sculpting from raw stone, diffusion begins with random noise, pure entropy, and iteratively refines that noise into a coherent visual representation, or in human terms: an image. This transformation is guided by statistical patterns learned through convolution and cross-attention mechanisms, orchestrated by architectures like the U-Net, and steered by models like CLIP, which align visual outputs with textual prompts.

The basic principle of diffusion models was formalized as Denoising Diffusion Probabilistic Models (DDPM)1, which introduced two processes: the training phase and the generation phase. Forward diffusion (used in training) gradually adds Gaussian noise to an image over a series of steps until it becomes indistinguishable from random noise. Rather than learning to reverse this process directly, the model learns to predict the noise that was added at each step, conditioned on the noisy image and, in text-to-image systems, its caption.
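The forward process has a convenient closed form: the noisy image at any step can be sampled in a single jump from the original. A minimal NumPy sketch, using the common linear beta schedule (the exact values here are illustrative assumptions, not a faithful reproduction of any particular implementation):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance of the noise added at each step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative fraction of the signal kept

def forward_diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise   # training target: predict `noise` given (xt, t)
```

By the final step, `alpha_bars[T-1]` is close to zero, so almost none of the original signal remains and `x_T` is essentially pure Gaussian noise; training then reduces to a mean-squared error between the network's prediction and the returned `noise`.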

Once the model is trained, it performs reverse diffusion to generate a new image. Starting from pure noise, it iteratively removes noise at each step. In a typical text-to-image model, this denoising process is guided by text encodings from models such as CLIP, integrated through cross-attention. This allows the system to steer generation toward a semantically meaningful output that matches the text prompt.
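The sampling loop itself can be sketched as follows, in the style of the DDPM sampling algorithm. Here `predict_noise` is a placeholder for the trained network (a U-Net in practice, conditioned on the text embedding); it returns zeros only so that the loop is runnable, and the schedule values are the same illustrative assumptions as above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(xt, t):
    # stand-in for the trained U-Net; a real model would also take
    # the text embedding as input via cross-attention
    return np.zeros_like(xt)

def reverse_diffuse(shape, rng):
    x = rng.standard_normal(shape)            # start from pure noise
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        # subtract the predicted noise contribution, then rescale
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                             # re-inject sampling noise
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Each iteration removes a small amount of predicted noise and, except at the final step, injects a little fresh randomness, which is why different seeds yield different images from the same prompt.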

Typically, the architecture of the model is a U-Net, which allows the model to maintain spatial details while learning semantic features. However, performing diffusion in full pixel space is computationally demanding. This challenge is addressed by High-Resolution Image Synthesis with Latent Diffusion Models2 by Rombach et al., which introduced Latent Diffusion Models (LDMs). Instead of working with full-resolution images, LDMs operate in a compressed latent space, significantly improving efficiency without sacrificing image quality. The diffusion process happens in this latent representation, which is later decoded into a final image by a Variational Autoencoder (VAE)3. Notably, the authors of the LDM paper were instrumental in the development of Stable Diffusion, the text-to-image model that I heavily rely on.
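The efficiency gain comes from where the diffusion runs, not from the diffusion math itself. The sketch below uses naive block averaging as a toy stand-in for the VAE (a real LDM uses a learned autoencoder, compressing, for example, a 512x512x3 image into a much smaller latent tensor); it only illustrates how much smaller the space the U-Net operates on becomes:

```python
import numpy as np

# Toy stand-ins for the VAE encoder/decoder. Block averaging is an
# illustrative assumption, not how a learned autoencoder works.
def encode(image, f=8):
    h, w = image.shape
    return image.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(latent, f=8):
    return np.repeat(np.repeat(latent, f, axis=0), f, axis=1)

image = np.random.default_rng(0).random((512, 512))
z = encode(image)              # the diffusion U-Net operates on z, not pixels
ratio = z.size / image.size    # 1/64 of the pixel count
```

Every denoising step now touches a tensor 64 times smaller, which is what makes high-resolution synthesis tractable on consumer hardware; the decoder only runs once, at the very end.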

The diffusion process has both interesting and depressing implications. First, it relies on the notion of what counts as noise versus signal in an image. If certain visual features were rare or considered noise in the training set, the model might literally treat them as noise to be removed. For example, if very few training images showed nonbinary gender presentations, the model might "wash out" ambiguous gender attributes as it denoises, since binary gender presentations carry a higher learned probability4. In this way, the diffusion model is tinged by the statistical frequency of things in its training data. What it doesn't often see, it may not generate clearly.

Secondly, diffusion has a homogenizing effect. Each denoising step averages out possibilities based on learned probabilities. Unless prompted with a very specific style, most AI images can feel aesthetically similar, reflecting popular internet imagery. The process can homogenize creative output5, leading to the loss of individuality and reproducing biases and stereotypes. Technically, this happens because the model tends toward high-probability states during generation, and once again, truly low-probability features get "noised out."

In conclusion, diffusion models are now part of our cultural landscape. By looking into the technical aspects, it becomes clearer where the problems that are so often criticized actually come from. Diffusion models, like any other medium, reflect the values of their users. After all, every new tool: the camera, the computer, and now the algorithm, ultimately reshapes how we see the world and understand ourselves. This journey of making sense of how the technology works made me realize both its limitations and its potential.

And yet, the patterns a diffusion model amplifies are not just shaped by training data; they are also conditioned by the infrastructures and industries that make computation possible in the first place. The decisions embedded in model architecture are intertwined with physical limitations, and every aesthetic tendency is made possible by vast amounts of computation running on physical hardware. This is also where the materiality of artificial intelligence must be confronted. In the end, diffusion is not just a statistical process; it is a material one.