Much like the encoder, the decoder is composed of a self-attention layer and a feed-forward network. Between these, the decoder adds a second multi-head attention layer that attends over the encoder's output. The key difference between the encoder and the decoder, however, lies in how self-attention is applied: whereas the encoder generates contextualized embeddings for all tokens simultaneously, the decoder's multi-head self-attention layer applies autoregressive masking.
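To make this layout concrete, the following is a minimal sketch of a single decoder block in PyTorch. It is illustrative only: the dimensions, layer names, and the use of torch.nn.MultiheadAttention are assumptions made for the example, not details given in the text.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Simplified decoder block: masked self-attention, attention over the
    encoder's output, and a feed-forward network (hypothetical sizes)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):
        # Autoregressive (causal) mask: position i may attend only to positions <= i.
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()

        # Masked self-attention over the decoder's own inputs.
        sa, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + sa)

        # Multi-head attention over the encoder's contextualized output.
        ca, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + ca)

        # Position-wise feed-forward network.
        return self.norm3(x + self.ffn(x))
```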
First, the decoder receives the context vectors produced by the encoder. Using these, together with the embeddings of its own input tokens, the decoder calculates attention scores for each token. These scores determine to what degree each token in the sequence affects the tokens that follow it; in other words, they determine how much weight each token carries in the other tokens' representations as the output sequence is generated.
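As a rough illustration of how such scores can be computed, the sketch below uses the standard scaled dot-product formulation (an assumption here, since the text does not name the exact score function); the softmaxed scores become the weights applied to each token's value vector.

```python
import torch
import torch.nn.functional as F

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head (hypothetical shapes).

    queries, keys, values: tensors of shape (seq_len, d_k). Each row of the
    softmaxed score matrix says how much weight one token places on every
    other token when its output representation is computed."""
    d_k = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                      # rows sum to 1
    return weights @ values, weights
```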
One important feature of this design, however, is that the decoder never uses future tokens to compute the representations of earlier tokens in the same sequence: each token's generated output depends only on the tokens that precede it, never on those that come after it. As is the case with many artificial intelligence techniques, this aims to mimic conventional understandings of how humans process information, specifically language. This approach to information processing is called autoregressive.7
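A small, hypothetical example of this masking: the scores for all positions after the current token are set to negative infinity before the softmax, so future tokens receive zero weight.

```python
import torch
import torch.nn.functional as F

# Illustration only: mask the attention scores for a 4-token sequence so
# each token can weight only itself and the tokens that precede it.
scores = torch.randn(4, 4)                      # raw scores (hypothetical values)
causal = torch.tril(torch.ones(4, 4)).bool()    # lower triangle = allowed positions
masked = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
# Row i has zeros after column i: future tokens get no weight, so each
# output position depends only on the current and preceding tokens.
```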