In contrast to processing textual data through neural networks, working with images at the level of raw pixels quickly becomes impractical. Processing a 128x128 greyscale image of a coffee pot already requires a computer to handle roughly 16,000 pixels; as resolution grows, the pixel count grows quadratically, and color channels multiply it further. To manage this, machine learning models process images through convolutions, which work much like squinting: the eye perceives the contours of an image rather than picking out details. The algorithmic principle of a convolution is the repeated application of filters, also known as kernels, which compute weighted sums over small neighborhoods of pixels. These convolutional passes highlight specific features such as textures, edges, and patterns, and detect shapes and objects more effectively (though reductively), yielding a meaningful understanding of the overall structure rather than of individual pixel values. Pooling layers, such as max pooling, then condense the image data by keeping only the most prominent features, which lends robustness against minor variations like shifts or distortions. Together these operations enable convolutional neural networks (CNNs), a class of AI algorithms inspired by the human visual cortex, to generalize well across different images of the same object.
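A minimal sketch of these two operations, written in plain NumPy, may make the mechanics concrete. The image here is random noise standing in for the 128x128 coffee-pot example, the kernel is a standard Sobel edge filter, and the function names (`convolve2d`, `max_pool2d`) are illustrative rather than taken from any particular library.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over the image, computing a weighted sum
    of each local neighborhood (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Condense the feature map by keeping only the strongest response
    in each non-overlapping size x size window."""
    h = (feature_map.shape[0] // size) * size
    w = (feature_map.shape[1] // size) * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Stand-in for the 128x128 greyscale coffee-pot image (random pixels here).
image = np.random.rand(128, 128)

# A classic vertical-edge (Sobel) kernel: it highlights contours, not details.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = convolve2d(image, sobel_x)       # 126x126 map of edge responses
pooled = max_pool2d(edges, size=2)       # 63x63 condensed summary
print(image.size, edges.shape, pooled.shape)  # 16384 raw pixels -> smaller feature maps
```

The point of the sketch is the shrinkage: the network never reasons about 16,384 independent pixel values, only about progressively smaller maps of local responses.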
In a multilayer neural network, a convolution layer acts as a set of texture and shape detectors that scan across an image. The first layers of a CNN might detect edges or blobs of color, while deeper layers detect more complex forms (eyes, wheels, and so on, depending on what the network is trained on). This hierarchical feature detection is why CNNs automatically learn the visual vocabulary their task requires.
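As a rough illustration of that hierarchy, the following sketch stacks three convolution-and-pooling stages in PyTorch. The layer widths, kernel sizes, and the 10-class output are arbitrary choices made for the example, not a reference architecture; the stacking itself is what matters, with early layers seeing small neighborhoods and deeper layers seeing increasingly large regions of the image.

```python
import torch
import torch.nn as nn

# A small CNN: early layers respond to simple patterns (edges, blobs of color),
# deeper layers combine them into more complex forms, and a final linear layer
# maps the learned features to class scores. All sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # first layer: low-level edge/blob detectors
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 128x128 -> 64x64
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # middle layer: textures, corners, simple shapes
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 64x64 -> 32x32
    nn.Conv2d(32, 64, kernel_size=3, padding=1),   # deeper layer: larger, task-specific forms
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 10),                   # hypothetical 10-class output
)

x = torch.randn(1, 1, 128, 128)   # one greyscale 128x128 image
print(model(x).shape)             # torch.Size([1, 10])
```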
Photography has long raised the question of the aura that erodes as a moment is captured and digitized: the transition from lived experience to digital still image reduces reality to a finite, compressed state. Yet even this reduced reality holds aesthetic value, remaining perceptible to and processable by the human eye. Once it passes through convolutions, however, even the human eye struggles to recognize the image in question. The image becomes a language of features and abstractions, no longer a surface to be seen but a structure to be interpreted.