Introduced with the transformer architecture by researchers at Google in their paper Attention Is All You Need1, attention represents a major shift from traditional convolutional methods. Attention allows the model to dynamically focus on the most relevant parts of the input, whether a specific pixel in an image or a word in a sentence. In essence, attention enables a model to consider the entirety of its input at once, almost magically allowing the model to read context by weighing the relevance of different parts of the data. Technically, attention operates by transforming each input element into three vectors: queries, keys, and values. The model computes attention scores by taking the dot product of queries and keys (scaled by the square root of the key dimension), normalizes these scores with the softmax2 function, and then uses them to weight the values.
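To make that mechanics a little more concrete, here is a minimal sketch of the query-key-value computation in NumPy. The names (scaled_dot_product_attention, W_q, W_k, W_v) are mine for illustration rather than taken from any library, and the random projections simply stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of query, key, and value vectors.
    d_k = K.shape[-1]
    # Attention scores: dot products of queries with keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output is a weighted sum of the value vectors.
    return weights @ V, weights

# Toy example: a "sequence" of 4 elements, each embedded in 8 dimensions,
# projected into queries, keys, and values by randomly initialized matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(weights.round(2))  # each row sums to 1: how much each element attends to the others
```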
In image models, attention does more than resolve details; it makes the image coherent. It compares and relates features and segmentations across the whole image, ensuring that visual patterns are not just present but purposeful. When combined with text, through what is known as cross-attention, the model begins to interpret language as a visual possibility. A tokenized sentence becomes a sequence of seeds (one that does not impose a strict linear order), and the model attends to each token as it sketches, imagines, and reads an image pixel by pixel.
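Cross-attention is the same computation with the roles split across modalities: queries come from the image side, keys and values from the text side. Continuing the NumPy sketch above (the shapes and weight matrices are again illustrative placeholders, not any specific model's architecture):

```python
# Cross-attention: queries from image features, keys and values from text tokens.
# Reuses rng and scaled_dot_product_attention from the sketch above.
num_patches, num_tokens, d = 16, 5, 8            # e.g. 16 image patches, a 5-token prompt
image_features = rng.normal(size=(num_patches, d))
text_embeddings = rng.normal(size=(num_tokens, d))

W_q = rng.normal(size=(d, d))   # projects image features into queries
W_k = rng.normal(size=(d, d))   # projects text tokens into keys
W_v = rng.normal(size=(d, d))   # projects text tokens into values

conditioned, weights = scaled_dot_product_attention(
    image_features @ W_q, text_embeddings @ W_k, text_embeddings @ W_v
)
# weights has shape (16, 5): for each image patch, a distribution over prompt tokens,
# i.e. which words the model "looks at" while shaping that part of the image.
print(weights.shape)
```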
Even though I cannot fully grasp the complexity of such models, it seems to me that attention mechanisms bring a degree of intention to machine behavior. Attention echoes how our perception works, by selection and contextualization in an often recursive manner. Instead of passively processing data against fixed conditions, the model begins to attend, to perceive relationships, to prioritize, and to interpret. With attention, the machine sees not just structure as data points but a degree of meaning, represented in a form that we may never fully comprehend. Some argue that attention could be counted as a primitive form of cognition. Yet our cognition is embedded in lived experience, shaped by embodiment, emotion, memory, and the use of language. While attention in machines may resemble a cognitive gesture, it remains fundamentally different from how we know and understand the world.