An attention mechanism is a machine learning technique that directs deep learning models to prioritize (or attend to) the most relevant parts of input data. Innovation in attention mechanisms enabled the transformer architecture that yielded the modern large language models (LLMs) that power popular applications like ChatGPT.
As their name suggests, attention mechanisms are inspired by the ability of humans (and other animals) to selectively pay more attention to salient details and ignore details that are less important in the moment. Having access to all information but focusing on only the most relevant information helps to ensure that no meaningful details are lost while enabling efficient use of limited memory and time.
Mathematically speaking, an attention mechanism computes attention weights that reflect the relative importance of each part of an input sequence to the task at hand. It then applies those attention weights to increase (or decrease) the influence of each part of the input, in accordance with its respective importance. An attention model—that is, an artificial intelligence model that employs an attention mechanism—is trained to assign accurate attention weights through supervised learning or self-supervised learning on a large dataset of examples.
Attention mechanisms were originally introduced by Bahdanau et al in 2014 as a technique to address the shortcomings of what were then state-of-the-art recurrent neural network (RNN) models used for machine translation. Subsequent research integrated attention mechanisms into the convolutional neural networks (CNNs) used for tasks such as image captioning and visual question answering.
In 2017, the seminal paper “Attention is All You Need” introduced the transformer model, which eschews recurrence and convolutions altogether in favor of only attention layers and standard feedforward layers. The transformer architecture has since become the backbone of the cutting-edge models powering the ongoing era of generative AI.
While attention mechanisms are primarily associated with LLMs used for natural language processing (NLP) tasks, such as summarization, question answering, text generation and sentiment analysis, attention-based models are also used widely in other domains. Leading diffusion models used for image generation often incorporate an attention mechanism. In the field of computer vision, vision transformers (ViTs) have achieved superior results on tasks including object detection,1 image segmentation2 and visual question answering.3
Transformer models and the attention mechanisms that power them have achieved state-of-the-art results across nearly every subdomain of deep learning. The nature of attention mechanisms gives them significant advantages over the convolution mechanisms used in convolutional neural networks (CNNs) and recurrent loops used in recurrent neural networks (RNNs).
To understand how attention mechanisms in deep learning work and why they helped spark a revolution in generative AI, it helps to first understand why attention was first introduced: to improve the RNN-based Seq2Seq models used for machine translation.
RNNs are neural networks with recurrent loops that provide an equivalent of “memory,” enabling them to process sequential data. RNNs intake an ordered sequence of input vectors and process them in timesteps. After each timestep, the resulting network state—called the hidden state—is provided back to the loop, along with the next input vector.
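To make the recurrence concrete, here is a minimal NumPy sketch of a single-layer RNN processing a sequence one timestep at a time. The dimensions and randomly initialized weights are hypothetical stand-ins for what a trained network would learn; the point is only that each new hidden state depends on the current input and the hidden state carried over from the previous timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only
input_dim, hidden_dim, seq_len = 4, 8, 5

# Randomly initialized weights stand in for learned parameters
W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

inputs = rng.normal(size=(seq_len, input_dim))   # one input vector per timestep
h = np.zeros(hidden_dim)                         # initial hidden state

# At each timestep, the new hidden state depends on the current input
# and on the hidden state carried over from the previous timestep.
for x_t in inputs:
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h.shape)  # (8,) -- the final hidden state summarizes the whole sequence
```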
RNNs quickly suffer from vanishing or exploding gradients in training, which made them impractical for many NLP tasks by greatly limiting the length of input sentences they could process.4 These limitations were somewhat mitigated by an improved RNN architecture called the long short-term memory (LSTM) network, which adds gating mechanisms to preserve “long term” memory.
Before attention was introduced, the Seq2Seq model was the state-of-the-art model for machine translation. Seq2Seq uses two LSTMs in an encoder-decoder architecture.
Encoding input sequences in a fixed number of dimensions allowed Seq2Seq to process sequences of varying length, but also introduced important flaws:
- Every detail of the source sentence, however long, must be compressed into a single fixed-length context vector, creating an information bottleneck between the encoder and decoder.
- The longer the input sentence, the more information is diluted or lost in that compression, which degrades translation quality on long sequences.
Bahdanau et al proposed an attention mechanism in their 2014 paper, “Neural Machine Translation by Jointly Learning to Align and Translate,” to improve communication between the encoder and decoder and remove that information bottleneck.
Instead of passing along only the final hidden state of the encoder—the context vector—to the decoder, their model passed every encoder hidden state to the decoder. The attention mechanism itself was used to determine which hidden state—that is, which word in the original sentence—was most relevant at each translation step performed by the decoder.
“This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word,” the paper explained. “This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences."5
Subsequent NLP research focused primarily on improving performance and expanding use cases for attention mechanisms in recurrent models. The 2017 invention of transformer models, powered solely by attention, eventually made RNNs all but obsolete for NLP.
An attention mechanism’s primary purpose is to determine the relative importance of different parts of the input sequence, then influence the model to attend to important parts and disregard unimportant parts.
Though there are many variants and categories of attention mechanisms, each suited to different use cases and priorities, all attention mechanisms feature three core processes:
1. Encoding each element of the input as a vector representation.
2. Calculating alignment scores that measure how relevant each element is to the task at hand.
3. Converting those scores into attention weights and applying them, so that the model's output reflects the most relevant information.
The seminal “Attention is All You Need” paper articulated its attention mechanism by using the terminology of a relational database: queries, keys and values. Relational databases are designed to simplify the storage and retrieval of relevant data: they assign a unique identifier (“key”) to each piece of data, and each key is associated with a corresponding value. In NLP, a model’s “database” is the vocabulary of tokens it has learned from its training dataset.
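The contrast between exact-match retrieval and attention's "soft" retrieval can be illustrated with a toy sketch. The dictionary, key vectors and query values below are arbitrary examples rather than part of any real model; they only show that a relational lookup returns a single value, while attention returns a blend of every value weighted by query-key similarity.

```python
import numpy as np

# Exact-match retrieval: a key either matches or it doesn't.
database = {"cat": 1.0, "dog": 2.0, "bird": 3.0}
print(database["dog"])  # 2.0 -- only the matching value is returned

# Attention-style "soft" retrieval: the query is compared against every key,
# and the result is a weighted blend of all the values.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy key vectors
values = np.array([1.0, 2.0, 3.0])                      # value paired with each key
query = np.array([0.9, 0.1])                            # toy query vector

scores = keys @ query                              # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
print(weights)            # approx. [0.39, 0.18, 0.43] -- every key contributes
print(weights @ values)   # approx. 2.04 -- a weighted blend of all three values
```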
The massive influence of the “Attention is All You Need” paper has resulted in even earlier attention mechanisms often being retroactively described in these terms. Generally speaking, this conception of attention entails interaction between three types of vector representations for each token in a sequence:
- The query vector, representing the information a token is looking for in other tokens.
- The key vector, representing what a token contains that queries can be matched against.
- The value vector, representing the actual content a token contributes to the output once its key has been matched to a query.
Specific attention mechanism variants are differentiated primarily by how vectors are encoded, how alignment scores are calculated and how attention weights are applied to provide the model with relevant information.
Bahdanau’s attention mechanism was designed specifically for machine translation. It uses a bidirectional RNN to encode each input token, processing the input sequence in both the forward direction and in reverse and concatenating the results together. This approach is particularly useful when, for example, the original and translated languages have different ordering conventions for nouns and adjectives.
Here, the decoder hidden state at each timestep of the translated sentence is the equivalent of a query vector and the encoder hidden state at each step in the source sentence is the equivalent of a key vector.
Alignment scores are then determined by a simple feedforward neural network, the attention layer, jointly trained with the rest of the model. This attention layer comprises up to three subsets of learnable model weights: query weights for the hidden decoder states (“Wq”), key weights for hidden encoder states (“Wk”) and value weights to scale the final output (“wv”). These weights are the model’s “knowledge”: by adjusting the specific values of those weights during training to minimize a loss function, the model learns to make accurate translations.
At each step, additive attention works as follows:
1. The current decoder hidden state (the query) and each encoder hidden state (a key) are passed through the attention layer, which outputs an alignment score for each position in the source sentence.
2. The alignment scores are normalized with a softmax function, producing attention weights that sum to 1.
3. Each attention weight is applied to its corresponding encoder hidden state, scaling its influence according to its relevance to the current translation step.
The context vector that the decoder uses to generate the translated sentence is then calculated as the attention-weighted sum of the encoder hidden states (the key vectors). One benefit of additive attention is that it does not require the query and key vectors to be the same length.
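A minimal NumPy sketch of this additive (Bahdanau-style) scoring is shown below. The dimensions, variable names and randomly initialized weights are illustrative assumptions; note that the decoder state and encoder states are deliberately given different lengths, which additive attention permits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: the decoder and encoder states need not match in size.
dec_dim, enc_dim, attn_dim, src_len = 6, 8, 5, 4

s = rng.normal(size=dec_dim)                 # current decoder hidden state (the "query")
H = rng.normal(size=(src_len, enc_dim))      # encoder hidden states (the "keys")

# Learnable parameters of the small feedforward attention layer
# (randomly initialized here; in a real model they are learned in training).
W_q = rng.normal(size=(attn_dim, dec_dim))   # query weights
W_k = rng.normal(size=(attn_dim, enc_dim))   # key weights
w_v = rng.normal(size=attn_dim)              # value weights scaling the output to a scalar

# 1. Alignment score for each source position: w_v . tanh(W_q s + W_k h_j)
scores = np.array([w_v @ np.tanh(W_q @ s + W_k @ h_j) for h_j in H])

# 2. Softmax turns scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()

# 3. Context vector: attention-weighted sum of the encoder hidden states
context = weights @ H
print(weights.round(2), context.shape)  # weights sum to 1; context has shape (8,)
```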
In 2015, Luong et al introduced several novel methodologies to simplify and enhance Bahdanau’s attention mechanism for machine translation. Perhaps their most notable contribution was a new alignment score function that used multiplication instead of addition. It also eschewed the feedforward attention layer and its tanh activation, instead calculating the similarity between hidden state vectors directly from their dot product. For that reason, it’s often called dot product attention or multiplicative attention.
The intuition behind using the dot product to compare query and key vectors is both mathematical and pragmatic:
- Mathematically, the dot product of two vectors is large when they point in similar directions, making it a natural measure of similarity between a query and a key.
- Pragmatically, dot products for an entire sequence can be computed as a single, highly optimized matrix multiplication, which is faster and cheaper than running a separate feedforward network for every query-key pairing.
One consequence of using dot product attention is that dot product calculations require both vectors to have the same number of dimensions, d_k.
Whereas additive attention proceeds to calculate the context vector as the weighted sum of key vectors, dot product attention computes the context vector as the weighted average of key vectors.
The authors of “Attention is All You Need” noted that while dot product attention is faster and more computationally efficient than additive attention, additive attention outperforms unscaled dot-product attention for larger values of d_k, that is, for longer query and key vectors.
They theorized that when d_k is very large, the resulting dot products are also very large. When the softmax function squishes all those very large values to fit between 0 and 1, backpropagation yields extremely small gradients that are difficult to optimize. Experimentation revealed that scaling the dot product of two vectors of length d_k by 1/√d_k before softmax normalization results in larger gradients and, therefore, smoother training.
The scaled dot-product attention function used in transformer models is written as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k)V
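As a rough illustration, the function can be implemented in a few lines of NumPy. The toy shapes below are arbitrary; the function simply scores queries against keys, scales by 1/√d_k, applies softmax and returns the weighted average of the value vectors.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed with NumPy."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # alignment scores, scaled by 1/sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability only
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> attention weights
    return weights @ V                                # weighted average of the value vectors

# Toy example: 3 queries attending over 4 key/value pairs, with d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```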
The earliest types of attention mechanisms all performed what is now categorized as cross-attention. In cross-attention, queries and keys come from different data sources. For instance, in machine translation tasks the keys come from a text corpus in one language and the queries from another language; in speech recognition tasks, the queries come from the text sequence being generated and the keys from the audio data being transcribed.
In self-attention, queries, keys and values are all drawn from the same source. Whereas both Bahdanau and Luong’s attention mechanisms were explicitly designed for machine translation, Cheng et al proposed self-attention—which they called “intra-attention”—as a method to improve machine reading in general. Their attention mechanism, outlined in a 2016 paper, explored not how input elements contribute to an overall sequence, but how different input tokens relate to each other.
Consider a language model interpreting the English text "on Friday, the judge issued a sentence." On its own, the word "sentence" is ambiguous: it could refer to a grammatical unit of text or to a judicial punishment. Self-attention lets the model weigh the surrounding tokens, such as "judge" and "issued," to resolve that ambiguity in favor of the legal meaning.
Cheng et al’s paper focused solely on self-attention’s capacity to read and understand text, but it soon followed that modeling intrasequence relationships could also be a powerful tool for writing text. Further development of self-attention, along with the transformer models it enabled, led directly to the advent of modern generative AI and autoregressive LLMs that can generate original text.
Autoregressive LLMs can also perform machine translation by using self-attention, but must approach the task differently. Whereas cross-attention treats the original source sentence and the translated sentence as two distinct sequences, self-attention treats the original text and the translated text as one sequence.
For an autoregressive, self-attention-based LLM to be capable of translating text, all of the words the model encounters in training—across every language—are learned as part of one large multilingual token vocabulary. The model simply realizes that when a sequence contains instructions like “translate [words in Language 1] into Language 2,” the next words in the sequence should be tokens from Language 2.
In essence, an autoregressive LLM doesn’t necessarily understand that there are different languages by itself. Instead, it simply understands how certain groupings of tokens—in this case, tokens corresponding to words from the same language—attend to one another. This contextual understanding is further reinforced through techniques such as instruction tuning.
The “Attention is All You Need” paper, authored by Vaswani et al, took inspiration from self-attention to introduce a new neural network architecture: the transformer. Their transformer model eschewed convolutions and recurrence altogether, and instead used only attention layers and standard linear feedforward layers.
The authors’ own model followed an encoder-decoder structure, similar to that of its RNN-based predecessors. Later transformer-based models departed from that encoder-decoder framework. One of the first landmark models released in the wake of the transformers paper, BERT (short for bidirectional encoder representations from transformers), is an encoder-only model. The autoregressive LLMs that have revolutionized text generation, such as GPT (Generative Pretrained Transformer) models, are decoder-only.
“Attention is All You Need” proposed several innovations to the attention mechanism—one of which was scaled dot product attention—to improve performance and adapt attention to an entirely new model structure.
The relative order and position of words can have an important influence on their meanings. Whereas RNNs inherently preserve information about the position of each token by computing hidden states serially, one word after the other, transformer models must explicitly encode positional information.
With positional encoding, the model adds a vector of values, derived from the token's position in the sequence, to each token's embedding before the input enters the attention mechanism. This positional vector has the same number of dimensions as the token embedding itself, so the two can simply be summed element by element. The math is somewhat complex, but the logic is simple:
- Every position in the sequence gets a unique pattern of values, so the model can distinguish between otherwise identical tokens that appear in different places.
- The pattern varies smoothly from one position to the next, so nearby tokens receive similar positional vectors and the model can reason about the relative distance between tokens.
Vaswani et al designed a simple algorithm that uses a sine function for the even-numbered dimensions of the positional vector and a cosine function for the odd-numbered dimensions. Later algorithms, such as rotary positional encoding (RoPE), improved the ability to effectively encode positional information for very long sequences—which, in turn, has helped enable LLMs with larger context windows.
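A short NumPy sketch of the sinusoidal scheme is shown below, assuming the standard formulation from the paper (sine on even-numbered dimensions, cosine on odd-numbered ones, with a base of 10,000). The toy sequence length and embedding width are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of "Attention is All You Need"."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even-numbered dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd-numbered dimensions
    return pe

# The positional vectors have the same width as the token embeddings,
# so they can simply be added to them before the first attention layer.
token_embeddings = np.random.default_rng(0).normal(size=(10, 64))  # toy embeddings
inputs = token_embeddings + sinusoidal_positional_encoding(10, 64)
print(inputs.shape)  # (10, 64)
```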
Once each token embedding has been updated with positional information, it is used to generate three new vectors: a query, a key and a value. These vectors are produced by passing the position-encoded embedding through three parallel linear (feedforward) neural network layers that precede the first attention layer. Each parallel layer has a unique matrix of weights whose specific values are learned through self-supervised pretraining on a massive dataset of text.
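The sketch below illustrates these three parallel projections with NumPy. The matrices are randomly initialized stand-ins for learned weights, and the sizes are hypothetical; the point is only that the same position-encoded embedding is transformed three different ways to yield its query, key and value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 10, 64          # hypothetical sizes for illustration

# Position-encoded token embeddings entering the first attention block
X = rng.normal(size=(seq_len, d_model))

# Three parallel weight matrices; in a trained model these values are learned
# during self-supervised pretraining rather than drawn at random.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # query vectors: what each token is "looking for"
K = X @ W_k   # key vectors: what each token offers to be matched against
V = X @ W_v   # value vectors: the content each token contributes
print(Q.shape, K.shape, V.shape)  # (10, 64) each
```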
The attention mechanism’s primary function is to weight the importance of the query-key pairings between tokens. For each token x in an input sequence, the transformer model computes (and then applies) attention weights as follows:
1. Token x's query vector is compared against the key vector of every token in the sequence, including x itself, by taking their dot product. Each dot product yields an alignment score for that query-key pairing.
2. The alignment scores are scaled by 1/√d_k and passed through a softmax function, producing attention weights that sum to 1.
3. Each token's value vector is multiplied by its attention weight, and the weighted values are averaged together to produce an updated representation of token x that reflects its context.
Averaging the attention-weighted contributions from other tokens instead of accounting for each attention-weighted contribution individually is mathematically efficient, but it results in a loss of detail. The transformer architecture addresses this by implementing multihead attention.
To enjoy the efficiency of averaging while still accounting for multifaceted relationships between tokens, transformer models compute self-attention operations multiple times in parallel at each attention layer in the network. Each original input token embedding is split into h evenly sized subsets. Each piece of the embedding is fed into one of h parallel matrices of Q, K and V weights, called a query head, key head or value head, respectively. The vectors output by each of these parallel triplets of query, key and value heads are then fed into a corresponding attention head.
In the final layers of each attention block, the outputs of these h parallel circuits are eventually concatenated back together. In practice, model training results in each circuit learning different weights that capture a separate aspect of semantic meanings. This, in turn, lets the model process different ways that context from other words can influence a word’s meaning. For instance, one attention head might specialize in changes in tense, while another specializes in how nearby words influence tone.
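A simplified NumPy sketch of multihead self-attention is shown below, with randomly initialized per-head weights standing in for learned parameters. (A full transformer layer also applies a final learned projection after the concatenation, which is omitted here for brevity.)

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 64, 8
d_head = d_model // n_heads          # each head works on a 64/8 = 8-dimensional slice

X = rng.normal(size=(seq_len, d_model))   # position-encoded token embeddings

# One set of query/key/value weights per head (randomly initialized stand-ins).
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))

head_outputs = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]       # (seq_len, d_head) each
    weights = softmax(Q @ K.T / np.sqrt(d_head))        # per-head attention weights
    head_outputs.append(weights @ V)                    # per-head output

# The h parallel outputs are concatenated back to the model dimension.
output = np.concatenate(head_outputs, axis=-1)
print(output.shape)  # (10, 64)
```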
The attention block of a standard transformer chains these matrix multiplications together into a single circuit. It's worth noting that later evolutions of the transformer's attention mechanism, such as multiquery attention and grouped query attention, simplify or combine some elements of the process to reduce computational demands.
In the final few layers of transformer models, attention heads are often trained to make specific predictions. For instance, one attention head in the final layer of an LLM might specialize in named entity recognition, while another specializes in sentiment analysis, and so on.
In autoregressive LLMs, the penultimate layer is a linear layer that receives the fully transformed vector and projects it to a vector with one entry for each token in the model's vocabulary. Each entry is a score representing how closely the resulting vector matches that token. The final layer is a softmax layer, which converts those scores into probabilities that sum to 1; the model then outputs what it determines to be the most likely next word, based on the words that preceded it.
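The sketch below illustrates those last two steps with NumPy, using an arbitrary hidden size and vocabulary size and a randomly initialized projection matrix in place of learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000      # hypothetical sizes for illustration

hidden = rng.normal(size=d_model)                 # fully transformed vector for the last position
W_out = rng.normal(size=(d_model, vocab_size))    # linear projection to vocabulary-sized scores

logits = hidden @ W_out                           # one score per token in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: scores -> probabilities summing to 1

next_token_id = int(np.argmax(probs))             # greedy choice of the most likely next token
print(next_token_id, probs[next_token_id])
```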
1. "Leaderboard: Object Detection on COCO test-dev," Papers With Code, accessed 18 November 2024
2. "Leaderboards: Image Segmentation" Papers With Code, accessed 18 November 2024
3. "Leaderboard: Visual Question Answering (VQA) on VQA v2 test-dev," Papers With Code, accessed 18 November 2024
4. "Learning long-term dependencies with gradient descent is difficult," IEE Transactions on Neural Networks 5(2): 157-66, February 1994
5. "Neural Machine Translation by Jointly Learning to Align and Translate," arXiv, 1 September 2014
6. "Multiplicative Attention," Papers With Code