How BERT and GPT models change the game for NLP

By | 4 minute read | December 3, 2020

AI that can compose Shakespearean sonnets. AI that can design a webpage based on a simple user description. AI that can summarize a description of quantum computing for an eighth grader. Since the launch of the GPT-3 language model this year, natural language processing (NLP) and machine learning enthusiast communities have been abuzz with stories of the purported new capabilities of language-based AI.

Recent advancements in NLP have been a few years in the making, starting in 2018 with the launch of two massive deep learning models: GPT (Generative Pre-Training) by OpenAI, and BERT (Bidirectional Encoder Representations from Transformers) by Google, which was released in two sizes, BERT-Base and BERT-Large. Unlike previous NLP models, BERT is an open source, deeply bidirectional, unsupervised language representation that is pretrained solely on a plain text corpus. Since then we have seen the development of other massive deep learning language models: GPT-2, RoBERTa, ESIM+GloVe and now GPT-3, the model that launched a thousand tech articles.

Today’s post in the NLP blog series discusses the BERT and GPT models: what makes these models so powerful and how they can benefit your business.

How massive deep learning models work

Language models estimate the probability of words appearing in a sentence, or of the sentence itself existing. As such, they are useful building blocks in a lot of NLP applications. But they often require a burdensome amount of training data to be useful for specific tasks and domains.
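
To make the idea of "estimating the probability of words in a sentence" concrete, here is a minimal sketch of a bigram language model. It is a deliberately tiny, count-based stand-in, not how BERT or GPT work internally; the three-sentence corpus, the function names and the add-alpha smoothing are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1
    return context_counts, bigram_counts

def word_probability(prev, word, context_counts, bigram_counts, vocab_size, alpha=1.0):
    """Estimate P(word | prev) with add-alpha smoothing (an illustrative choice)."""
    numerator = bigram_counts[prev][word] + alpha
    denominator = context_counts[prev] + alpha * vocab_size
    return numerator / denominator

def sentence_probability(sentence, context_counts, bigram_counts, vocab_size):
    """Multiply the conditional word probabilities along the sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= word_probability(prev, word, context_counts, bigram_counts, vocab_size)
    return prob

# Toy corpus, invented for this example.
corpus = [
    "the bank is open",
    "the river bank is muddy",
    "the bank is sending a card",
]
context_counts, bigram_counts = train_bigram(corpus)
vocab = {w for s in corpus for w in s.lower().split()} | {"</s>"}
vocab_size = len(vocab)

likely = sentence_probability("the bank is open", context_counts, bigram_counts, vocab_size)
unlikely = sentence_probability("open is bank the", context_counts, bigram_counts, vocab_size)
```

With only three training sentences the estimates are crude, which illustrates the point above: a model like this needs far more data before its probabilities become useful for a specific task or domain.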

Massive deep learning language models are designed to tackle these pervasive training data issues. They are pretrained using an enormous amount of unannotated data to provide a general-purpose deep learning model. By fine-tuning these pretrained models, downstream users can create task-specific models with smaller annotated training datasets (a technique called transfer learning). These models represent a breakthrough in NLP: now state-of-the-art results can be achieved with smaller training datasets.

Until recently, the state of the art for NLP language models was the recurrent neural network (RNN). RNN models are useful for sequence tasks such as abstractive summarization, machine translation and general natural language generation. They process words sequentially, in the order they appear in context, one word at a time. As a result, these models are hard to parallelize and poor at retaining contextual relationships across long text inputs. As we’ve discussed in a previous post, in NLP context is key.
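
The sequential bottleneck can be sketched with a single-unit recurrent cell. This is a scalar simplification, assuming hypothetical weights `w_in` and `w_rec` that we picked by hand; real RNNs use weight matrices and learned parameters. It shows that each step must wait for the previous hidden state, and that the influence of an early word shrinks as more words are processed.

```python
import math

def rnn_forward(inputs, w_in, w_rec, h0=0.0):
    """Run a one-unit tanh RNN over a sequence, strictly one step at a time."""
    h = h0
    states = []
    for x in inputs:
        # Step t cannot start until step t-1 has produced h,
        # which is why this loop cannot be parallelized across positions.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# A signal at the first position followed by "empty" inputs.
sequence = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
states = rnn_forward(sequence, w_in=1.0, w_rec=0.5)
```

In this toy setting the hidden state after the last input retains only a small trace of the first one, a simplified picture of why long-range context is hard for sequential models.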

The Transformer, a model introduced in 2017, bypasses these issues. Transformers (such as BERT and GPT) use an attention mechanism, which “pays attention” to the words most useful in predicting the next word in a sentence. With these attention mechanisms, Transformers process an input sequence of words all at once, and they map relevant dependencies between words regardless of how far apart the words appear in the text. As a result, Transformers are highly parallelizable, make it possible to train much larger models at a faster rate, and use contextual clues to resolve many of the ambiguity issues that plague text.
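
The attention mechanism described above can be sketched as scaled dot-product self-attention. This is a simplified illustration, assuming hand-picked two-dimensional toy embeddings and using the same vectors as queries, keys and values; production Transformers add learned projections, multiple heads and positional information.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    # Each query only reads the fixed input vectors, so every position
    # could be computed independently (and therefore in parallel).
    for query in vectors:
        scores = [dot(query, key) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["river", "bank", "is", "muddy"]
# Toy embeddings invented for this example: "river" and "bank" point the same way.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = self_attention(embeddings)
bank_weights = dict(zip(tokens, weights[tokens.index("bank")]))
```

Here `bank_weights` shows how much "bank" attends to each word in the sentence. Distance between positions plays no role in the score, only the similarity of the vectors.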

Individual Transformers also have their own unique advantages. Until this year, BERT was the most popular deep learning NLP model, achieving state-of-the-art results across many NLP tasks.

Trained on roughly 3.3 billion words (2.5 billion from English Wikipedia and 800 million from BooksCorpus), BERT’s main advantage is bidirectional learning: it draws context for each word from both the text to its left and the text to its right simultaneously. This bidirectional training approach is optimized for predicting masked words (Masked LM) and outperforms left-to-right training after a small number of pre-training steps. During pre-training, Next Sentence Prediction (NSP) also teaches the model how sentences relate to each other, specifically whether sentence B actually follows sentence A in the original text. As a result, BERT is able to derive more context. For example, it can distinguish the meanings of bank in the following sentences: “Raise your oars when you get to the river bank” and “The bank is sending a new debit card.” To do so, it uses the clue to the left of bank in the first sentence (river) and the clue to its right in the second (debit card).
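
As a rough sketch of how training examples for the two objectives might be built, the code below masks tokens for Masked LM and pairs sentences for NSP. The helper names and the four-sentence document are hypothetical. The 15% masking rate follows the BERT paper, but this sketch always substitutes `[MASK]`, whereas the paper sometimes keeps the original token or swaps in a random one.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; labels keep the hidden originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model must predict this word
        else:
            masked.append(token)
            labels.append(None)    # no prediction needed at this position
    return masked, labels

def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction."""
    sentence_a = sentences[index]
    has_next = index + 1 < len(sentences)
    if has_next and rng.random() < 0.5:
        return sentence_a, sentences[index + 1], True
    # Otherwise pick a sentence that is neither A nor the true next sentence.
    # Assumes the document has enough sentences for a candidate to exist.
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False

# Toy document invented for this example.
document = [
    "Raise your oars when you get to the river bank",
    "The current slows down near the shore",
    "The bank is sending a new debit card",
    "It should arrive within a week",
]
masked, labels = mask_tokens(document[0].split(), mask_prob=0.15, seed=0)
pair = make_nsp_pair(document, 0, random.Random(0))
```

Because a masked position can sit anywhere in the sentence, the model has to use words on both sides of it to fill the gap, which is what makes the training bidirectional.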

Unlike BERT models, GPT models are unidirectional: each word is predicted from the words to its left only. The major advantage of GPT models is their sheer scale: GPT-3, the third-generation GPT model, has 175 billion parameters, about 10 times more than previous models. With a pretrained model this massive, users can adapt it to novel NLP tasks with very little data. While Transformers in general have reduced the amount of data needed to train models, GPT-3 has the distinct advantage over BERT of requiring far less task-specific data.
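
The difference between unidirectional and bidirectional models can be pictured as an attention mask. The sketch below is an illustration under our own naming (`attention_mask` and `visible_context` are hypothetical helpers), not code from either model: a causal mask lets a position see only itself and earlier positions, while a bidirectional mask lets it see everything.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def visible_context(tokens, position, causal):
    """List the other tokens that the given position is allowed to see."""
    mask = attention_mask(len(tokens), causal)
    return [
        token
        for j, token in enumerate(tokens)
        if mask[position][j] and j != position
    ]

tokens = ["the", "bank", "is", "sending", "a", "new", "debit", "card"]
bank = tokens.index("bank")

gpt_view = visible_context(tokens, bank, causal=True)
# gpt_view == ["the"]
bert_view = visible_context(tokens, bank, causal=False)
# bert_view == ["the", "is", "sending", "a", "new", "debit", "card"]
```

In this toy sentence the unidirectional view of "bank" never includes "debit card", while the bidirectional view does.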

GPT-3, for instance, has been taught with as few as 10 sentences to write an essay on why humans should not be afraid of AI. (Though, it should be noted, the variable quality of these freeform essays shows the limitations of the technology today.)
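
One common way to give a large generative model a handful of examples is to place them directly in the text prompt. The sketch below only assembles such a prompt as a string; the `Input:`/`Output:` layout and the function name are assumptions for illustration, and no model or API is called.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples and a new query into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Example pairs invented for this illustration.
examples = [
    ("The service was quick and friendly.", "positive"),
    ("My card never arrived.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each customer comment.",
    examples,
    "The new app is easy to use.",
)
```

The examples play the role that an annotated training set plays in fine-tuning, only at a much smaller scale.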

Tasks executed with BERT and GPT models:

  • Natural language inference is a task performed with NLP that enables models to determine whether a statement is true, false or undetermined based on a premise. For example, if the premise is “tomatoes are sweet” and the statement is “tomatoes are fruit” it might be labelled as undetermined.
  • Question answering enables developers and organizations to create and code question answering systems based on neural networks. In question-answering tasks, the model receives a question regarding text content and returns the answer in text, specifically marking the beginning and end of each answer.
  • Text classification is used for sentiment analysis, spam filtering and news categorization. BERT can be fine-tuned to detect content categories across any text-classification use case.
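
To illustrate the "marking the beginning and end of each answer" step in question answering, here is a minimal sketch of span selection. It assumes a model has already produced one start score and one end score per token; the scores below are made up by hand for the question "What is the bank sending?", and `best_answer_span` is a hypothetical helper rather than a library function.

```python
def best_answer_span(start_scores, end_scores, max_answer_len=10):
    """Pick the (start, end) pair with the highest combined score, start <= end."""
    best_span = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span

context = "The bank is sending a new debit card".split()
# Hand-written scores standing in for model output.
start_scores = [0.1, 0.2, 0.0, 0.1, 0.3, 2.5, 1.0, 0.2]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.1, 0.4, 0.9, 2.8]

start, end = best_answer_span(start_scores, end_scores)
answer = " ".join(context[start:end + 1])
# answer == "new debit card"
```

Restricting the search to spans where the end does not precede the start keeps the returned answer a contiguous piece of the original text.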

The future of massive deep learning models is quite exciting. Research in this area is advancing by leaps and bounds. We expect to see increased progress in the technology and major considerations raised here in the coming months and years. Here at IBM Watson, we will continue to develop, evaluate and incorporate the best of the technology suitable for business cases. Next in the NLP blog series, we’ll explore several key considerations to investigate before embarking on a new model for your business use case.
