1 October 2024
Encoder-decoder is a type of neural network architecture used for sequential data processing and generation.
In deep learning, the encoder-decoder architecture is a type of neural network most widely associated with the transformer architecture and used in sequence-to-sequence learning. The literature thus sometimes refers to encoder-decoders as a form of sequence-to-sequence model (seq2seq model). Much machine learning research focuses on encoder-decoder models for natural language processing (NLP) tasks involving large language models (LLMs).
Encoder-decoder models are used to handle sequential data, specifically mapping input sequences to output sequences of different lengths, such as neural machine translation, text summarization, image captioning and speech recognition. In such tasks, mapping a token in the input to one in the output is often indirect. For example, take machine translation: in some languages, the verb appears near the beginning of the sentence (as in English), in others at the end (such as German) and in some, the location of the verb may be more variable (for example, Latin). An encoder-decoder network generates variable length yet contextually appropriate output sequences to correspond to a given input sequence.[1]
As may be inferred from their respective names, the encoder encodes a given input into a vector representation, and the decoder decodes this vector representation into an output sequence, such as a translated sentence or a generated caption.
The encoder and decoder are two separate neural networks. They may be recurrent neural networks (RNNs) or their variants, long short-term memory (LSTM) networks and gated recurrent units (GRUs), as well as convolutional neural networks (CNNs) or transformer models. An encoder-decoder model typically stacks several encoder layers and several decoder layers.
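For illustration, the following is a minimal sketch of this two-network setup in PyTorch, using GRU-based encoder and decoder networks. The class names, vocabulary sizes and dimensions are arbitrary placeholders rather than part of any particular published model.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Reads an input token sequence and compresses it into a hidden state."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len)
        embedded = self.embed(src)          # (batch, src_len, hidden)
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden              # hidden acts as the context

class SimpleDecoder(nn.Module):
    """Generates output tokens conditioned on the encoder's context."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):         # tgt: (batch, tgt_len)
        embedded = self.embed(tgt)
        outputs, hidden = self.rnn(embedded, hidden)
        return self.out(outputs), hidden    # logits over the target vocabulary

# Toy usage: a length-5 input sequence mapped to a length-7 output sequence.
encoder = SimpleEncoder(vocab_size=1000, hidden_size=64)
decoder = SimpleDecoder(vocab_size=1200, hidden_size=64)
src = torch.randint(0, 1000, (2, 5))        # batch of 2 source sequences
tgt = torch.randint(0, 1200, (2, 7))        # batch of 2 target sequences
_, context = encoder(src)
logits, _ = decoder(tgt, context)
print(logits.shape)                         # torch.Size([2, 7, 1200])
```

Note how the input and output sequences have different lengths: the encoder condenses the source into a context, and the decoder unrolls that context into as many output steps as the target requires.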
Each encoder consists of two layers: the self-attention layer (or self-attention mechanism) and the feed-forward neural network. The self-attention layer lets the encoder survey and focus on other related words in a given input as it encodes each specific word therein. The feed-forward neural network further processes the encodings so they can be passed to subsequent encoder or decoder layers.
The decoder part also consists of a self-attention layer and feed-forward neural network, as well as an additional third layer: the encoder-decoder attention layer. This layer focuses the decoder's attention on specific parts of the encoder's output. This multi-head attention layer thereby relates tokens from two different sequences.[2]
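To make this layer composition concrete, the following brief sketch uses PyTorch's built-in transformer layers. The model dimensions and sequence lengths are arbitrary placeholders.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8

# Encoder layer: self-attention sublayer followed by a feed-forward sublayer.
encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# Decoder layer: masked self-attention, encoder-decoder (cross) attention,
# and a feed-forward sublayer.
decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)

src = torch.rand(2, 10, d_model)   # 2 sequences of 10 source embeddings
tgt = torch.rand(2, 7, d_model)    # 2 sequences of 7 target embeddings

memory = encoder_layer(src)        # contextualized encoder output
out = decoder_layer(tgt, memory)   # the decoder attends to the encoder output
print(memory.shape, out.shape)     # (2, 10, 512) (2, 7, 512)
```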
Literature widely presents encoder-decoder models as consisting of three components: the encoder, the context vector, and the decoder.[3]
Encoder
The principal component of the encoder is the self-attention mechanism. The self-attention mechanism determines token weights in a text input to reflect inter-token relationships. In contrast to a traditional word embedding that ignores word order, self-attention processes the whole input text sequence to compute, for each token, a weighted average embedding that takes into account that token's relationship to all of the other tokens in the text sequence. It computes this average embedding as a linear combination of all embeddings for the input sequence according to the following formula:

x_i' = \sum_{j=1}^{n} w_{ij} x_j
Here, x_j is the embedding of the token at the j-th position in the input sequence and x_i' is the resulting contextualized embedding for the token at the i-th position. The coefficient w_ij is the attention weight, which is computed using what is called the softmax function, so that the weights for each position sum to 1, and represents how relevant the token at position j is to the token at position i. In other words, this coefficient signals how much attention the encoder should give to each other token in the sequence when encoding a given token.[4]
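The following is a minimal NumPy sketch of this weighted average. It uses scaled dot products of the raw embeddings as attention scores, which is a simplification of full self-attention (real transformer layers apply learned query, key and value projections before this step).

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x):
    """x: (seq_len, d) matrix of token embeddings.
    Returns contextualized embeddings x', where x'_i = sum_j w_ij * x_j."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)       # pairwise similarity between tokens
    weights = softmax(scores)           # w_ij: each row sums to 1
    return weights @ x                  # weighted average of all embeddings

# Toy example: 4 tokens with 8-dimensional embeddings.
x = np.random.randn(4, 8)
contextualized = self_attention(x)
print(contextualized.shape)             # (4, 8)
```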
The encoder also combines each token embedding with a positional encoding (or positional embedding) before passing the result through the feed-forward layer. This positional encoding accounts for the order of tokens in a text, specifically the distance between tokens. Together, the token embedding and positional embedding comprise the hidden state passed on to the decoder.[5]
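One common choice is the sinusoidal positional encoding introduced in the original transformer paper. The sketch below builds such an encoding and adds it to a toy matrix of token embeddings; the sequence length and dimensionality are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of positional encodings,
    following the sine/cosine scheme of Vaswani et al. (2017)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions
    return encoding

token_embeddings = np.random.randn(10, 16)                   # 10 tokens, d_model = 16
inputs = token_embeddings + sinusoidal_positional_encoding(10, 16)
```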
Context vector
Literature widely calls the encoder’s final hidden state the context vector. It is a condensed, numerical representation of the encoder’s initial input text. More simply, it is the embedding and positional encoding produced by the encoder for every word in the input sequence.
Literature often defines the context vector X as the collection of the encoder's hidden states x_i, one for each position i in the input sequence:[6]

X = (x_1, x_2, \ldots, x_n)
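In code, this amounts to nothing more than collecting the per-token hidden states that the encoder produces. A toy sketch with arbitrary dimensions:

```python
import torch

# Toy "encoder output": one hidden state per input token (6 tokens, d_model = 512).
hidden_states = [torch.randn(512) for _ in range(6)]

# The context passed to the decoder is simply these states stacked together.
context = torch.stack(hidden_states)     # shape: (6, 512)
print(context.shape)
```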
Decoder
Much like the encoder, the decoder comprises a self-attention layer and a feed-forward network. Between these sits the multi-head encoder-decoder attention layer described above. The key difference between the encoder and the decoder lies in masking: whereas the encoder generates contextualized token embeddings for all positions simultaneously, the decoder's self-attention layer applies autoregressive masking.
First, the decoder receives the context vector from the encoder. The decoder uses these contextualized embeddings to calculate attention scores for each token. These attention scores determine to what degree each token in the sequence affects the tokens that follow it; in other words, the scores determine how much weight each token carries when the decoder generates subsequent tokens of the output sequence.
One important feature of this, however, is that the decoder will not use future tokens to determine preceding tokens in that same sequence. Each token's generated output depends only on the preceding tokens; in other words, when generating a token's output, the decoder does not consider the next words or tokens after the current one. As is the case with many artificial intelligence techniques, this aims to mimic conventional understandings of how humans process information, specifically language. This approach to information processing is called autoregressive.[7]
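The following sketch shows one common way to implement this autoregressive (causal) masking: positions that lie in the future receive a score of negative infinity, so the softmax assigns them zero attention weight.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)          # raw attention scores

# Upper-triangular mask: True marks future positions that must be hidden.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Masked positions get -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)   # each row i has nonzero weights only for positions j <= i
```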
One of the foremost advantages of encoder-decoder models for downstream NLP tasks like sentiment analysis or masked language modeling is their production of contextualized embeddings. These embeddings are distinct from the fixed word embeddings used in bag of words models.
First, fixed embeddings do not account for word order. They thereby ignore relationships between tokens in a text sequence. Contextualized embeddings, however, account for word order via positional encodings. Moreover, contextualized embeddings attempt to capture the relationship between tokens through the attention mechanism that considers the distance between tokens in a given sequence when producing the embeddings.
Fixed embeddings generate one embedding for a given token, conflating all instances of that token. Encoder-decoder models produce a contextualized embedding for each instance of a token. As a result, contextualized embeddings more adeptly handle polysemous words, that is, words with multiple meanings. For example, flies may signify an action or an insect. A fixed word embedding collapses this word's multiple significations by creating a single embedding for the token. But an encoder-decoder model generates an individual contextualized embedding for every occurrence of the word flies, and so captures its myriad significations through multiple distinct embeddings.[8]
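The example below illustrates this with the Hugging Face Transformers library, assuming it and PyTorch are installed. It compares the contextualized embeddings that the encoder-only model BERT produces for the word flies in two sentences with different senses; the helper function and sentences are illustrative, and the word flies is assumed to be a single token in the model's vocabulary.

```python
# Requires the Hugging Face transformers library and PyTorch.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_for(sentence, word):
    """Return the contextualized embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

a = embedding_for("time flies like an arrow", "flies")        # verb sense
b = embedding_for("the fruit flies ate the banana", "flies")  # noun sense
print(torch.cosine_similarity(a, b, dim=0))  # well below 1.0: distinct embeddings
```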
As may be expected, the encoder-decoder architecture has many variants, each with its own primary use cases in data science and machine learning.
Encoder-only. These models (also described as auto-encoders) use only the encoder stack, eschewing decoders. Such models thus lack autoregressive masking and have access to all the tokens in the initial input text. As such, these models are described as bidirectional: they use all the surrounding tokens, both preceding and succeeding, to make predictions for a given token. Well-known encoder models are the BERT family of models, such as BERT,[9] RoBERTa,[10] and ELECTRA,[11] as well as the IBM Slate models. Encoder-only models are often utilized for tasks that necessitate understanding a whole text input, such as text classification or named entity recognition.
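As a quick illustration of this bidirectional behavior, a fill-mask pipeline (again assuming the Hugging Face Transformers library is available) lets BERT use context on both sides of the masked token when making its prediction:

```python
from transformers import pipeline

# BERT can use tokens on both sides of the mask to make its prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```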
Decoder-only. These models (also called autoregressive models) use only the decoder stack, foregoing any encoders. Thus, when making token predictions, the model's attention layers can only access those tokens preceding the token under consideration. Decoder-only models are often used for text generation tasks like question answering, code writing, or chatbots such as ChatGPT. An example of a decoder-only model is the IBM Granite family of foundation models.[12]
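By contrast, a decoder-only model continues a prompt one token at a time. The sketch below uses GPT-2 through the same Hugging Face Transformers library, purely as a small, freely available stand-in for this class of models:

```python
from transformers import pipeline

# A decoder-only model predicts each next token from the preceding ones.
generator = pipeline("text-generation", model="gpt2")
result = generator("Encoder-decoder models are used for", max_new_tokens=20)
print(result[0]["generated_text"])
```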
[1] Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3/ (link resides outside ibm.com).
[2] Telmo Pires, António Vilarinho Lopes, Yannick Assogba, Hendra Setiawan, "One Wide Feedforward Is All You Need," Proceedings of the Eighth Conference on Machine Translation, 2023, https://aclanthology.org/2023.wmt-1.98/ (link resides outside ibm.com). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, "Attention Is All You Need," Advances in Neural Information Processing Systems, 2017, https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (link resides outside ibm.com). Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly, 2022.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016. Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3/ (link resides outside ibm.com). Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly, 2022.
[4] Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly, 2022. Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.
[5] Jay Alammar and Maarten Grootendorst, Hands-on Large Language Models, O’Reilly, 2024.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016. Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3/ (link resides outside ibm.com).
[7] David Foster, Generative Deep Learning, 2nd Edition, O’Reilly, 2023. Denis Rothman, Transformers for Natural Language Processing, 2nd Edition, Packt Publishing, 2022. Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3/ (link resides outside ibm.com).
[8] Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly, 2022.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019, https://aclanthology.org/N19-1423/ (link resides outside ibm.com).
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," 2019, https://arxiv.org/abs/1907.11692 (link resides outside ibm.com).
[11] Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning, "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators," 2020, https://arxiv.org/abs/2003.10555 (link resides outside ibm.com).
[12] Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda, "Granite Code Models: A Family of Open Foundation Models for Code Intelligence," 2024, https://arxiv.org/abs/2405.04324 (link resides outside ibm.com). Armand Ruiz, “IBM Granite Large Language Models Whitepaper,” 2024, https://community.ibm.com/community/user/watsonx/blogs/armand-ruiz-gabernet/2024/06/24/ibm-granite-large-language-models-whitepaper.