Self-attention is a type of attention mechanism used in machine learning models. It weighs the importance of each token or word in an input sequence relative to the others, helping the model capture the relationships between them. Self-attention is a crucial part of transformer models, a powerful artificial intelligence architecture that is essential for natural language processing (NLP) tasks. The transformer architecture is the foundation for most modern large language models (LLMs).
The self-attention mechanism was introduced with the transformer, a neural network architecture proposed by researchers in 2017. The proposed architecture aimed to address the limitations of traditional machine learning models built on convolutional neural networks (CNNs) and recurrent neural networks (RNNs).1
Traditional sequential models follow the same encoder-decoder structure as transformer models, but they process data step by step, or sequence to sequence (seq2seq). This sequential processing poses a challenge for parallelization: because each step depends on the previous one, these models cannot reduce computation time by calculating attention weights across all parts of the input sequence simultaneously.
Self-attention played a key role in the advancement of LLMs by enabling parallelization within training examples. This capability matters because, as sequence lengths grow, memory constraints limit how many examples can be batched together.1 With self-attention, LLM training data can be split into batches and processed concurrently on multiple GPUs. Efficient batching processed in parallel reduces the computational power needed to train machine learning models.
Not only does self-attention distribute the computational load efficiently, it also processes attention weights for all positions simultaneously. This ability allows the model to focus on the relevant parts of an input sequence and dynamically weigh the importance of each element within it. Self-attention is well suited to NLP tasks such as machine translation, sentiment analysis and summarization.
Self-attention in machine learning models is similar to the human behavioral concept in that both involve focusing on relevant elements within a larger context to process information accurately. In psychology, self-attention means focusing on your own thoughts or behaviors; in deep learning, it means focusing on the relevant parts of an input sequence.
The transformer architecture integrates the attention process in a self-attention layer. The steps that follow are explained as presented in “Attention Is All You Need,” the paper by Ashish Vaswani et al. that introduced the self-attention layer.
An input sequence is a series of data points vectorized into embeddings, or numerical representations, that the machine learning algorithm can use to calculate attention scores needed to produce an output sequence.
In machine translation, a sentence would be considered an input sequence, where each part of the sentence is considered a data point or input token. Tokens are converted into embeddings that act as semantic units that the model can process.2 The embeddings are used to calculate the attention weights that help the model prioritize (or attend to) the most relevant input data.
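As a minimal sketch of this step, the following Python example maps a short sentence to embedding vectors. The tiny vocabulary, the embedding dimension and the randomly initialized embedding table are illustrative assumptions; real models learn these embeddings during training over much larger vocabularies.

import numpy as np

# Toy vocabulary and embedding table; both are illustrative placeholders.
# Real models learn embeddings over far larger vocabularies and dimensions.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d_model = 8                                        # assumed embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
X = embedding_table[[vocab[t] for t in tokens]]    # one embedding per token
print(X.shape)                                     # (6, 8)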
The model uses these embeddings to generate three key vectors for each token: query (Q), key (K) and value (V). These vectors help the model make the strongest semantic matches within the input sentence.
Matrix multiplications are performed to obtain the query, key and value vectors: the embedded inputs are multiplied by the respective query, key and value weight matrices.1 This projection is known as a linear transformation. The attention mechanism later uses these vectors to calculate a weighted sum of the values.
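To make the linear transformation concrete, here is a minimal NumPy sketch. The sequence length, dimensions and randomly initialized weight matrices are stand-ins for the learned parameters of a real model.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 8           # illustrative sizes, not from the paper

X = rng.normal(size=(seq_len, d_model))   # embedded input sequence

# Learned weight matrices (random placeholders in this sketch).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Linear transformation: one query, key and value vector per token.
Q = X @ W_q
K = X @ W_k
V = X @ W_v
print(Q.shape, K.shape, V.shape)          # (6, 8) (6, 8) (6, 8)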
After the embeddings are transformed, attention scores for each element in the sequence are calculated. The scores are obtained by taking the dot product between the query vectors and the key vectors. Once scaled and normalized, these scores become attention weights that represent how much focus (or attention) a specific token should give to the other tokens in a sequence.
Next, the attention scores are scaled by dividing them by the square root of the dimensionality of the key vectors. This scaling prevents the dot products from growing too large as the dimensionality of the vectors increases, which helps stabilize the gradients during training.
The scaled attention scores are then transformed into probabilities by using the softmax function. This process is called normalization.
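Putting these steps together, the original paper defines scaled dot-product attention as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimensionality of the key vectors.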
With these normalized probabilities, the softmax attention block gives the transformer architecture the ability to evaluate the importance of individual input elements during output generation.3 The model uses the normalized weights to decide which parts of the input to focus on.
Finally, the attention weights are used to compute a weighted sum of the value vectors, which becomes the output for each token. The higher a token’s attention weight, the more influence its value vector has on that weighted sum and, in turn, on the final output.
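The full process can be sketched in a short NumPy function. The shapes and random inputs below are illustrative; in a real transformer, Q, K and V come from the learned linear transformations described above.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Dot products between queries and keys, scaled by the square root of d_k.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into probabilities that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The weighted sum of the value vectors is the attention output.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))   # stand-ins for the projected queries, keys, values
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, attention_weights.shape)   # (6, 8) (6, 6)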
Attention models are effective at capturing long-range dependencies regardless of the distance between elements, or tokens, within a sequence. Multi-head attention is a crucial extension of self-attention that enhances this capability by attending to different parts of the input simultaneously. Each head can focus on a distinct aspect or relationship in the data, allowing the model to draw more context between dependencies or tokens.
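The sketch below illustrates the idea under simplified assumptions (random, untrained weights, no bias terms, a head count chosen for illustration): the input is projected separately for each head, each head runs scaled dot-product attention on its own projections, and the heads’ outputs are concatenated and projected once more.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=2, seed=0):
    # Simplified multi-head self-attention with random, untrained weights.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Each head has its own query, key and value projections.
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)              # each head attends independently
    # Concatenate the heads and apply a final output projection.
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.default_rng(1).normal(size=(6, 8))   # embedded input sequence
print(multi_head_attention(X).shape)               # (6, 8)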
Early bidirectional models such as bidirectional encoder representations from transformers (BERT) improved context understanding by allowing the model to consider information from both directions of a sequence. In bidirectional attention, the model aims to understand the meaning of a word based on the words surrounding it.4
GPT models popularized self-attention, highlighting the benefits of an expanded context window for generative tasks. The ability to process more information at once leads to improved accuracy and understanding.
AI models use self-attention to process long input sequences efficiently, sharing attention information at scale while keeping memory usage manageable.5 Self-attention gives the model a deeper contextual understanding across its context window: the larger the context window, the more tokens the model can pay attention to at one time.
NLP tasks: The self-attention mechanism enhances the linguistic capabilities of machine learning models by allowing efficient analysis of an entire text. Research has shown advances in sentiment classification.6 Models perform NLP tasks well because the attention layer lets them compute the relationship between words regardless of the distance between them.7
Computer vision: Self-attention mechanisms are not exclusive to NLP tasks. They can also be used to focus on specific parts of an image. Developments in image-recognition models suggest that self-attention is a crucial component for increasing their robustness and generalization.8
1. “Attention Is All You Need,” Ashish Vaswani et al., Proceedings of the 31st International Conference on Neural Information Processing Systems, arXiv:1706.03762v7, revised on 2 August 2023.
2. “Tokenization,” chapter in Introduction to Information Retrieval, Christopher Manning, Prabhakar Raghavan and Hinrich Schütze, 2008.
3. “Rethinking Softmax: Self-Attention with Polynomial Activations,” Hemanth Saratchandran et al., Australian Institute of Machine Learning, University of Adelaide, arXiv:2410.18613v1, 24 October 2024.
4. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Jacob Devlin et al., arXiv:1810.04805v2, revised on 24 May 2019.
5. “Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective,” Zhiyuan Zeng et al., arXiv:2412.14135, 18 December 2024.
6. “Bidirectional LSTM with self-attention mechanism and multi-channel features for sentiment classification,” Weijiang Li et al., Neurocomputing, Vol. 387, 28 April 2020.
7. “Parallel Scheduling Self-attention Mechanism: Generalization and Optimization,” Mingfei Yu and Masahiro Fujita, arXiv:2012.01114v1, 2 December 2020.
8. “Exploring Self-attention for Image Recognition,” Hengshuang Zhao, Jiaya Jia and Vladlen Koltun, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.