A transformer model is a type of deep learning model that was introduced in 2017. These models have quickly become fundamental in natural language processing (NLP), and have been applied to a wide range of tasks in machine learning and artificial intelligence.
The model was first described in the 2017 paper "Attention Is All You Need" by Ashish Vaswani and co-authors from Google and the University of Toronto. The release of this paper is considered a watershed moment in the field, given how widely transformers are now used in applications such as training LLMs.
These models can translate text and speech in near-real-time. For example, there are apps that now allow tourists to communicate with locals on the street in their primary language. They help researchers better understand DNA and speed up drug design. They can help detect anomalies and prevent fraud in finance and security. Vision transformers are similarly used for computer vision tasks.
OpenAI’s popular ChatGPT text generation tool makes use of transformer architectures for prediction, summarization, question answering and more, because these architectures allow the model to focus on the most relevant segments of input text. The “GPT” seen in the tool’s various versions (e.g. GPT-2, GPT-3) stands for “generative pre-trained transformer.” Text-based generative AI tools such as ChatGPT benefit from transformer models because, after training on large, complex data sets, these models can more readily predict the next word in a sequence of text.
The BERT model, or Bidirectional Encoder Representations from Transformers, is based on the transformer architecture. As of 2019, BERT was used for nearly all English-language Google search results, and has been rolled out to over 70 other languages.1
The key innovation of the transformer model is not having to rely on recurrent neural networks (RNNs) or convolutional neural networks (CNNs), neural network approaches which have significant drawbacks. Transformers process all positions of an input sequence in parallel, which makes them highly efficient for training and inference. A recurrent model, by contrast, must finish one step before starting the next, so it cannot simply be sped up by adding more GPUs. As a result, transformer models need less training time than previous recurrent neural network architectures such as long short-term memory (LSTM).
RNNs and LSTM date back to the 1980s and 1990s, respectively. These techniques compute each component of an input in sequence (e.g. word by word), so computation can take a long time. What’s more, both approaches run into limitations in retaining context when the “distance” between pieces of information in an input is long.
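The difference between sequential and parallel computation can be sketched in a few lines of NumPy. This is a toy illustration, not a real RNN or transformer: the sizes, the random weights and the function names `rnn_forward` and `positionwise_forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4                       # toy sizes (assumed)
x = rng.normal(size=(seq_len, d))       # one input sequence, one row per token
W_x = rng.normal(size=(d, d)) * 0.1     # input weights
W_h = rng.normal(size=(d, d)) * 0.1     # recurrent weights


def rnn_forward(x, W_x, W_h):
    """Recurrent pass: step t cannot start until step t-1 has produced h."""
    h = np.zeros(x.shape[1])
    states = []
    for x_t in x:                       # inherently sequential loop
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)


def positionwise_forward(x, W_x):
    """Position-independent pass: every row is handled in one matrix product."""
    return np.tanh(x @ W_x)


rnn_states = rnn_forward(x, W_x, W_h)
parallel_out = positionwise_forward(x, W_x)
print(rnn_states.shape, parallel_out.shape)
```

The loop in `rnn_forward` has a data dependency between iterations, so extra hardware cannot run the steps side by side. The single matrix product in `positionwise_forward` has no such dependency, which is the property transformer layers exploit.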
There are two primary innovations that transformer models bring to the table. Consider these two innovations within the context of predicting text.
Positional encoding: Instead of reading the words one at a time in the order in which they appear, the model tags each token (a part of the input, such as a word or subword piece in NLP) with information about its position in the sequence. This allows the model to take word order into account even though it processes all tokens at once.
Self-attention: Attention is a mechanism that calculates weights for every word in a sentence as they relate to every other word in the sentence, so the model can predict words which are likely to be used in sequence. This understanding is learned over time as a model is trained on lots of data. The self-attention mechanism allows each word to attend to every other word in the sequence in parallel, weighing their importance for the current token. In this way, it can be said that machine learning models can “learn” the rules of grammar, based on statistical probabilities of how words are typically used in language.
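The two ideas above can be sketched together in NumPy. This is a minimal sketch under stated assumptions: it uses the sinusoidal position scheme and scaled dot-product attention from the original paper, a single attention head, random untrained weights, and illustrative names (`sinusoidal_positions`, `self_attention`) that do not come from any library.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """One position vector per row; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)              # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # every token vs. every token
    weights = softmax(scores)                          # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                                # toy sizes (assumed)
tokens = rng.normal(size=(seq_len, d_model))           # stand-in token embeddings
x = tokens + sinusoidal_positions(seq_len, d_model)    # inject position information
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, weights = self_attention(x, W_q, W_k, W_v)
print(weights.round(2))
```

Row `t` of `weights` shows how strongly token `t` attends to every token in the sequence. In a trained model the projection matrices are learned, so these weights reflect relationships in the data rather than random values.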
Transformer models work by processing input data, which can be sequences of tokens or other structured data, through a series of layers that contain self-attention mechanisms and feedforward neural networks. The core idea behind how transformer models work can be broken down into several key steps.
Let’s imagine that you need to convert an English sentence into French. These are the steps you’d need to take to accomplish this task with a transformer model.
Input embeddings: The input sentence is first transformed into numerical representations called embeddings. These capture the semantic meaning of the tokens in the input sequence. For sequences of words, these embeddings can be learned during training or obtained from pre-trained word embeddings.
Positional encoding: Positional encoding is typically introduced as a set of additional values or vectors that are added to the token embeddings before feeding them into the transformer model. These positional encodings have specific patterns that encode the position information.
Multi-head attention: Self-attention operates in multiple "attention heads" to capture different types of relationships between tokens. Softmax functions, a type of activation function, are used to calculate attention weights in the self-attention mechanism.
Layer normalization and residual connections: The model uses layer normalization and residual connections to stabilize and speed up training.
Feedforward neural networks: The output of the self-attention layer is passed through feedforward layers. These networks apply non-linear transformations to the token representations, allowing the model to capture complex patterns and relationships in the data.
Stacked layers: Transformers typically consist of multiple layers stacked on top of each other. Each layer processes the output of the previous layer, gradually refining the representations. Stacking multiple layers enables the model to capture hierarchical and abstract features in the data.
Output layer: In sequence-to-sequence tasks like neural machine translation, a separate decoder module can be added on top of the encoder to generate the output sequence.
Training: Transformer models are typically trained using supervised or self-supervised learning, where they learn to minimize a loss function that quantifies the difference between the model's predictions and the ground truth for the given task. Training typically involves optimization techniques like Adam or stochastic gradient descent (SGD).
Inference: After training, the model can be used for inference on new data. During inference, the input sequence is passed through the pre-trained model, and the model generates predictions or representations for the given task.
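The forward-pass steps above (embeddings, multi-head attention, residual connections with layer normalization, feedforward networks and stacked layers) can be sketched as a small NumPy encoder. Treat it as an illustration under assumptions rather than a reference implementation: weights are random and untrained, layer normalization has no learnable scale or shift, normalization is applied after each residual connection, positional encoding and the decoder are left out for brevity, and all names and sizes are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_head_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(m):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["W_q"]), split(x @ p["W_k"]), split(x @ p["W_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    heads = softmax(scores) @ v                           # (n_heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["W_o"]                              # mix the heads back together


def feed_forward(x, p):
    """Two linear maps with a ReLU in between, applied to every token."""
    return np.maximum(0.0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]


def encoder_layer(x, p, n_heads):
    x = layer_norm(x + multi_head_attention(x, p, n_heads))  # residual + norm
    return layer_norm(x + feed_forward(x, p))                # residual + norm


def init_layer(rng, d_model, d_ff, scale=0.1):
    def w(rows, cols):
        return rng.normal(size=(rows, cols)) * scale

    return {
        "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
        "W_v": w(d_model, d_model), "W_o": w(d_model, d_model),
        "W_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "W_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }


rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, n_heads, n_layers = 20, 8, 16, 2, 3  # toy sizes
embedding = rng.normal(size=(vocab_size, d_model))  # embedding table
token_ids = np.array([3, 7, 1, 12])                 # a made-up tokenized sentence

x = embedding[token_ids]                            # input embeddings
layers = [init_layer(rng, d_model, d_ff) for _ in range(n_layers)]
for p in layers:                                    # stacked layers
    x = encoder_layer(x, p, n_heads)
encoded = x
print(encoded.shape)
```

Each layer keeps the shape of its input, one vector per token, which is what makes stacking possible. For translation, a decoder would consume `encoded` to generate the output sequence, and training would adjust every matrix in `layers` and `embedding` to reduce the loss.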
Granite is IBM's flagship series of LLM foundation models based on decoder-only transformer architecture. Granite language models are trained on trusted enterprise data spanning internet, academic, code, legal and finance.
1 Google’s BERT Rolls Out Worldwide, Search Engine Journal, Dec 9, 2019