What is an encoder-decoder model?

Authors

Jacob Murel Ph.D.

Senior Technical Content Creator

Joshua Noble

Data Scientist

Encoder-decoder is a type of neural network architecture used for sequential data processing and generation.

In deep learning, the encoder-decoder architecture is a type of neural network most widely associated with the transformer architecture and used in sequence-to-sequence learning. The literature thus sometimes refers to encoder-decoders as a form of sequence-to-sequence model (seq2seq model). Much machine learning research focuses on encoder-decoder models for natural language processing (NLP) tasks involving large language models (LLMs).

Encoder-decoder models are used to handle sequential data, specifically mapping input sequences to output sequences of different lengths, in tasks such as neural machine translation, text summarization, image captioning and speech recognition. In such tasks, mapping a token in the input to one in the output is often indirect. Take machine translation, for example: in some languages, the verb appears near the beginning of the sentence (as in English), in others at the end (such as German) and in some, the location of the verb may be more variable (for example, Latin). An encoder-decoder network generates variable-length yet contextually appropriate output sequences to correspond to a given input sequence.1
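To make the idea concrete, here is a minimal sketch of running a pretrained encoder-decoder model for translation. The library (Hugging Face transformers) and the model choice (t5-small) are illustrative assumptions rather than tools named in this article:

# Minimal sketch: a pretrained encoder-decoder (seq2seq) model maps an input sequence
# to an output sequence whose length need not match the input.
# Library (transformers) and model (t5-small) are illustrative assumptions.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("The cat sat on the mat.")
print(result[0]["translation_text"])  # German output sequence of a different length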

The encoder-decoder architecture

As their names suggest, the encoder encodes a given input into a vector representation, and the decoder decodes this vector into an output of the same data type as the original input.

Both the encoder and the decoder are separate neural networks. They may be recurrent neural networks (RNNs) or their variants, such as long short-term memory (LSTM) networks and gated recurrent units (GRUs), convolutional neural networks (CNNs) or transformer models. An encoder-decoder model typically contains several stacked encoders and several stacked decoders.

Each encoder consists of two layers: the self-attention layer (or self-attention mechanism) and the feed-forward neural network. The first layer lets the encoder survey and focus on other related words in a given input as it encodes one specific word therein. The feed-forward neural network then further processes these encodings into a form that subsequent encoder or decoder layers can accept.

The decoder likewise consists of a self-attention layer and a feed-forward neural network, as well as an additional third layer: the encoder-decoder attention layer. This layer focuses the network's attention on specific parts of the encoder's output. This multi-head attention layer thereby maps tokens across two different sequences, the input sequence and the output sequence.2
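The sketch below shows how such a stack of encoder and decoder layers can be composed with PyTorch's built-in nn.Transformer module. The hyperparameter values (embedding size, number of heads and layer counts) are illustrative assumptions, not values prescribed by this article:

# Minimal sketch of an encoder-decoder stack using PyTorch's nn.Transformer.
# All hyperparameters (d_model, nhead, layer counts) are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,           # embedding dimension shared by encoder and decoder
    nhead=8,               # attention heads per layer
    num_encoder_layers=6,  # stack of encoder layers
    num_decoder_layers=6,  # stack of decoder layers
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source sequence length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target sequence length, d_model)
out = model(src, tgt)         # shape (2, 7, 512): one vector per target position
print(out.shape)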

How encoder-decoder models work

Literature widely presents encoder-decoder models as consisting of three components: the encoder, the context vector, and the decoder.3

Encoder

The principal component of the encoder is the self-attention mechanism. The self-attention mechanism determines token weights in a text input to reflect inter-token relationships. In contrast to a traditional word embedding that ignores word order, self-attention processes the whole input text sequence to compute, for each token, a weighted average embedding that takes into account that token's relationship to all of the other tokens in the text sequence. It computes this average embedding as a linear combination of all embeddings for the input sequence according to the following formula:

$x_i' = \sum_{j=1}^{n} w_{ij} x_j$

Here, $x_j$ is the embedding of the token at the j-th position in the input sequence and $x_i'$ is the resulting contextualized embedding for the token at the i-th position. The coefficient $w_{ij}$ is the attention weight, which is computed using what is called the softmax function so that the weights for each position sum to one. It represents how relevant the token at position j is to the token at position i. In other words, this coefficient signals how much attention the encoder should give to every other token in the sequence when encoding a given token.4
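A minimal numerical sketch of this weighted average follows. Using raw dot products as attention scores is a simplifying assumption; transformer layers compute the scores from learned query, key and value projections:

# Minimal sketch of self-attention as a weighted average of token embeddings.
# Plain dot-product scoring (no learned projections) is a simplifying assumption.
import numpy as np

def self_attention(x):
    """x: (sequence_length, embedding_dim) array of token embeddings."""
    scores = x @ x.T / np.sqrt(x.shape[-1])          # pairwise similarity between tokens
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ x                               # x'_i = sum_j w_ij * x_j

x = np.random.rand(5, 8)            # 5 tokens, 8-dimensional embeddings
contextualized = self_attention(x)
print(contextualized.shape)         # (5, 8): one contextualized embedding per token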

The encoder passes this token embedding to the feed-forward layer, which adds a positional encoding (or positional embedding) to the token embedding. This positional encoding accounts for the order of tokens in a text, specifically the distance between tokens. Together, this token embedding and positional embedding comprise the hidden state passed on to the decoder.5
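One common way to compute positional encodings is the sinusoidal scheme introduced with the original transformer; the sketch below assumes that scheme, although many models instead learn positional embeddings:

# Minimal sketch of sinusoidal positional encodings added to token embeddings.
# Choosing the sinusoidal scheme is an assumption; learned positional embeddings are also common.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions use cosine
    return pe

token_embeddings = np.random.rand(5, 8)                          # 5 tokens, 8 dimensions
hidden_state = token_embeddings + positional_encoding(5, 8)      # embedding plus position
print(hidden_state.shape)                                        # (5, 8)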

Context vector

Literature widely calls the encoder’s final hidden state the context vector. It is a condensed, numerical representation of the encoder’s initial input text. More simply, it is the embedding and positional encoding produced by the encoder for every word in the input sequence.

Literature often defines the context vector using the following expression, in which the context vector X collects the representation of each token x at the i-th position of the input sequence:6

$X = (x_1, x_2, \ldots, x_n)$
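To see this concretely, the sketch below pulls the encoder's final hidden states, one vector per input token, out of a pretrained encoder-decoder model. The Hugging Face transformers API and the t5-small model are illustrative assumptions:

# Minimal sketch: inspecting the encoder's final hidden states (the context passed to the decoder).
# Library (transformers) and model (t5-small) are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The cat sat on the mat.", return_tensors="pt")
encoder_outputs = model.get_encoder()(**inputs)

# One hidden-state vector per input token; this stack is what the decoder attends to.
print(encoder_outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)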

Decoder

Much like the encoder, the decoder comprises a self-attention layer and a feed-forward network. Between these, the decoder contains a masked multi-head attention layer. This masking marks the key difference between the encoder and the decoder: whereas the encoder generates contextualized token embeddings for all positions simultaneously, the decoder's multi-head attention layer utilizes autoregressive masking.

First, the decoder receives the context vector from the encoder. The decoder uses these embeddings to calculate attention scores for each token. These attention scores determine to what degree each token from the input sequence affects later tokens; in other words, the scores determine how much weight each token carries when the decoder generates other tokens in the output sequence.

One important feature of this, however, is that the decoder will not use future tokens to determine preceding tokens in that same sequence. Each token’s generated output depends only on the preceding tokens; in other words, when generating a token’s output, the decoder does not consider the next words or tokens after the current one. As is the case with many artificial intelligence techniques, this aims to mimic conventional understandings of how humans process information, specifically language. This approach to information processing is called autoregressive.7
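A minimal sketch of such an autoregressive (causal) mask applied to a matrix of attention scores follows; the upper-triangular masking pattern is standard, while the toy score matrix is an illustrative assumption:

# Minimal sketch of autoregressive (causal) masking of attention scores.
# The random score matrix is an illustrative assumption.
import numpy as np

def causal_mask_softmax(scores):
    """scores: (seq_len, seq_len) attention scores; row i attends to columns j."""
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)        # block attention to future positions
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.rand(4, 4)
weights = causal_mask_softmax(scores)
print(np.round(weights, 2))   # rows sum to 1; entries above the diagonal are 0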

Why use encoder-decoder models in NLP?

One of the foremost advantages of encoder-decoder models for downstream NLP tasks like sentiment analysis or masked language modeling is their production of contextualized embeddings. These embeddings are distinct from the fixed word embeddings used in bag-of-words models.

First, fixed embeddings do not account for word order. They thereby ignore relationships between tokens in a text sequence. Contextualized embeddings, however, account for word order via positional encodings. Moreover, contextualized embeddings attempt to capture the relationship between tokens through the attention mechanism that considers the distance between tokens in a given sequence when producing the embeddings.

Fixed embeddings generate one embedding for a given token, conflating all instances of that token. Encoder-decoder models, by contrast, produce a contextualized embedding for each instance of a token. As a result, contextualized embeddings more adeptly handle polysemous words, that is, words with multiple meanings. For example, flies may signify an action or an insect. A fixed word embedding collapses this word's multiple significations by creating a single embedding for the token. But an encoder-decoder model generates individual contextualized embeddings for every occurrence of the word flies, and so captures its myriad significations through multiple distinct embeddings.8
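The sketch below illustrates this behavior with an encoder model. The transformers library and the bert-base-uncased checkpoint are illustrative assumptions; the two occurrences of flies receive measurably different embeddings:

# Minimal sketch: the same word gets different contextualized embeddings in different sentences.
# Library (transformers) and model (bert-base-uncased) are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (sequence_length, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                        # embedding of that occurrence

verb = embedding_of("time flies when you are having fun", "flies")
noun = embedding_of("the fruit flies swarmed the kitchen", "flies")

# The cosine similarity is well below 1: the two occurrences get distinct embeddings.
print(torch.cosine_similarity(verb, noun, dim=0).item())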

Types of encoder-decoder variants

As may be expected, the encoder-decoder architecture has many variants, each with its own primary use cases in data science and machine learning.

Encoder-only. These models (also described as auto-encoding models) use only the encoder stack, eschewing decoders. Such models thus forgo autoregressive masking and have access to all the tokens in the initial input text. As such, these models are described as bidirectional, because they use all the surrounding tokens, both preceding and succeeding, to make predictions for a given token. Well-known encoder models are the BERT family of models, such as BERT,9 RoBERTa,10 and ELECTRA,11 as well as the IBM Slate models. Encoder-only models are often utilized for tasks that necessitate understanding a whole text input, such as text classification or named entity recognition.

Decoder-only. These models (also called autoregressive models) use only the decoder stack, foregoing any encoders. Thus, when making token predictions, the model's attention layers can only access those tokens preceding the token under consideration. Decoder-only models are often used for text generation tasks like question answering, code writing or chatbots such as ChatGPT. An example of a decoder-only model family is the IBM Granite family of foundation models.12
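The contrast between the two variants can be sketched briefly with pretrained checkpoints. The pipeline API and the model choices (bert-base-uncased for masked prediction, gpt2 for generation) are illustrative assumptions:

# Minimal sketch contrasting an encoder-only model (bidirectional masked prediction)
# with a decoder-only model (autoregressive generation).
# Model choices (bert-base-uncased, gpt2) are illustrative assumptions.
from transformers import pipeline

# Encoder-only: predict a masked token using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# Decoder-only: generate a continuation using only the preceding tokens.
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=5)[0]["generated_text"])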

Footnotes

1 Jurafsky, D. and Martin, J. “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Third Edition, 2023.

2 Pires, T. P., Lopes, A. V., Assogba, Y. and Setiawan, H. “One Wide Feedforward Is All You Need”, 2023.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. “Attention Is All You Need”, 2017.
Tunstall, L., von Werra, L. and Wolf, T. “Natural Language Processing with Transformers”, Revised Edition, O’Reilly, 2022.

3 Goodfellow, I., Bengio, Y. and Courville, A. “Deep Learning”, MIT Press, 2016.
Jurafsky, D. and Martin, J. “Speech and Language Processing”, Third Edition, 2023.
Tunstall, L., von Werra, L. and Wolf, T. “Natural Language Processing with Transformers”, Revised Edition, O’Reilly, 2022.

4 Tunstall, L., von Werra, L. and Wolf, T. “Natural Language Processing with Transformers”, Revised Edition, O’Reilly, 2022.
Goldberg, Y. “Neural Network Methods for Natural Language Processing”, Springer, 2022.

5 Alammar, J. and Grootendorst, M. “Hands-on Large Language Models”, O’Reilly, 2024.

6 Goodfellow, I., Bengio, Y. and Courville, A. “Deep Learning”, MIT Press, 2016.
Jurafsky, D. and Martin, J. “Speech and Language Processing”, Third Edition, 2023.

7 Foster, D. “Generative Deep Learning”, Second Edition, O’Reilly, 2023.
Rothman, D. “Transformers for Natural Language Processing”, Second Edition, 2022. 
Jurafsky, D. and Martin, J. “Speech and Language Processing”, Third Edition, 2023.

8 Tunstall, L., von Werra, L. and Wolf, T. “Natural Language Processing with Transformers”, Revised Edition, O’Reilly, 2022.

9 Devlin, J. et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2019.

10 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019.

11 Clark, K. et al. “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”, 2020.

12 Mishra, M. et al. “Granite Code Models: A Family of Open Foundation Models for Code Intelligence”, 2024.
Ruiz, A. “IBM Granite Large Language Models Whitepaper”, 2024.