The earliest types of attention mechanisms all performed what is now categorized as cross-attention. In cross-attention, queries and keys come from different data sources. For instance, in machine translation the queries come from the decoder generating target-language text while the keys come from the encoder’s representation of the source-language text; in speech recognition, the queries come from the text transcription being generated and the keys from the audio being transcribed.
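The query/key/value roles can be sketched in code. The minimal NumPy example below uses the scaled dot-product scoring later popularized by transformers rather than Bahdanau’s or Luong’s original scoring functions, and the sequence lengths, dimensions, and the `decoder_states`/`encoder_states` arrays are illustrative assumptions, not values from any particular model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: weight each value by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights

# Cross-attention: queries come from one sequence (e.g., target-language decoder
# states), while keys and values come from a different sequence (e.g.,
# source-language encoder states). All shapes and values are made up.
rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 8))   # 3 target tokens, dimension 8
encoder_states = rng.normal(size=(5, 8))   # 5 source tokens, dimension 8

output, weights = scaled_dot_product_attention(
    Q=decoder_states,   # queries: the sequence doing the "looking"
    K=encoder_states,   # keys:    the sequence being looked at
    V=encoder_states,   # values:  what gets mixed into the output
)
print(weights.shape)    # (3, 5): one weight per (target token, source token) pair
```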
In self-attention, queries, keys and values are all drawn from the same source. Whereas both Bahdanau’s and Luong’s attention mechanisms were explicitly designed for machine translation, Cheng et al. proposed self-attention, which they called “intra-attention,” as a method to improve machine reading in general. Their attention mechanism, outlined in a 2016 paper, explored not how input elements contribute to an overall sequence, but how different input tokens relate to each other.
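The same scoring machinery becomes self-attention when queries, keys and values are all projections of one input sequence. The sketch below again uses the scaled dot-product form rather than Cheng et al.’s LSTM-based formulation, and the projection matrices and dimensions are random placeholders standing in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Self-attention: queries, keys, and values are all projections of the SAME
# input sequence, so every token can attend to every other token in it.
rng = np.random.default_rng(0)
seq_len, d_model = 7, 16                    # e.g., 7 tokens, 16-dim embeddings
x = rng.normal(size=(seq_len, d_model))     # one input sequence

W_q = rng.normal(size=(d_model, d_model))   # learned projections (random here)
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v         # all three come from x itself
weights = softmax(Q @ K.T / np.sqrt(d_model))   # token-to-token attention
output = weights @ V                        # each output row mixes the whole sequence
print(weights.shape)    # (7, 7): how much each token attends to every other token
```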
Consider a language model interpreting the English text “On Friday, the judge issued a sentence.”
- The preceding word “the” suggests that “judge” is acting as a noun (a person presiding over a legal trial) rather than a verb meaning to appraise or form an opinion.
- That context for the word “judge” suggests that “sentence” probably refers to a legal penalty, rather than a grammatical “sentence.”
- The word “issued” further implies that “sentence” refers to the legal concept, not the grammatical concept.
- Therefore, when interpreting the word “sentence,” the model should pay close attention to “judge” and “issued.” It should also pay some attention to the word “the.” It can more or less ignore the other words. A well-trained self-attention mechanism would compute attention weights accordingly, as the toy sketch below illustrates.
Cheng et al.’s paper focused solely on self-attention’s capacity to read and understand text, but it soon became clear that modeling intra-sequence relationships could also be a powerful tool for writing text. Further development of self-attention, along with the transformer models it enabled, led directly to the advent of modern generative AI and the autoregressive LLMs that can generate original text.