In the attention layer, the query (Q), key (K) and value (V) vectors are used to calculate an alignment score between each pair of token positions in a sequence. Those alignment scores are then normalized into attention weights using a softmax function.
For each token x in a sequence, alignment scores are calculated by computing the dot product of that token's query vector Qx with the key vector of each of the other tokens: in other words, by multiplying the two vectors together element by element and summing the results. If a meaningful relationship between two tokens is reflected in similarities between their respective vectors, the dot product will yield a large value. If the two vectors aren't aligned, it will yield a small or negative value. Most transformer models use a variant called scaled dot product attention, in which each query-key dot product is scaled by dividing it by the square root of the dimension of the key vectors to improve training stability.
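As a rough sketch in NumPy (toy dimensions and random vectors, purely for illustration), the scaled alignment scores for a single token x might be computed like this:

```python
import numpy as np

np.random.seed(0)
d_k = 4                                   # dimensionality of each query/key vector (toy size)
q_x = np.random.randn(d_k)                # query vector for token x
K = np.random.randn(5, d_k)               # key vectors for the 5 tokens in the sequence

# Dot product of x's query with every key, divided by sqrt(d_k)
# to keep the scores in a numerically stable range.
scores = K @ q_x / np.sqrt(d_k)
print(scores)                             # one alignment score per token
```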
These query-key alignment scores are then passed into a softmax function. Softmax normalizes all of its inputs to values between 0 and 1 that together sum to 1. The outputs of the softmax function are the attention weights, each representing the share (out of 1) of token x's attention to be paid to each of the other tokens. If a token's attention weight is close to 0, it will be ignored. An attention weight of 1 would mean that a token receives x's entire attention and all others will be ignored.
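Continuing the sketch above, the softmax step that turns those scores into attention weights could look like the following (subtracting the maximum before exponentiating is a common numerical convention, not something specific to any one model):

```python
def softmax(z):
    z = z - z.max()                       # shift by the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

weights = softmax(scores)                 # attention weights for token x
print(weights, weights.sum())             # values between 0 and 1 that sum to 1
```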
Finally, the value vector for each token is multiplied by its attention weight. These attention-weighted contributions from each previous token are summed together (a weighted average, since the attention weights sum to 1) and added to the original vector embedding for token x. With this, token x's embedding is now updated to reflect the context provided by the other tokens in the sequence that are relevant to it.
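Extending the same toy example, the attention-weighted combination of value vectors and the residual update to token x's embedding might be sketched as:

```python
d_model = 4
V = np.random.randn(5, d_model)           # value vectors for the 5 tokens
x = np.random.randn(d_model)              # original embedding for token x

# Weight each value vector by x's attention to that token, sum the
# contributions, then add the result back onto x's original embedding.
context = weights @ V
x_updated = x + context
```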
The updated vector embedding is then sent to another linear layer, with its own weight matrix WZ, where the context-updated vector is projected back to a consistent number of dimensions and then sent to the next attention layer. Each successive attention layer captures greater contextual nuance.
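A minimal sketch of that final projection, assuming a square weight matrix so the output keeps the same number of dimensions as the model's embeddings:

```python
W_Z = np.random.randn(d_model, d_model)   # output projection weights (the WZ above)

# Project the context-updated vector back to the model's working
# dimensionality before it flows into the next attention layer.
z = x_updated @ W_Z
```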