
What are masked language models?

29 August 2024

 

 

Contributors

Jacob Murel Ph.D.

Senior Technical Content Creator

Eda Kavlakoglu

Program Manager

Masked language modeling trains models to predict missing words in text. It typically pretrains models for downstream NLP tasks.

Masked language models (MLMs) are a type of large language model (LLM) used to predict missing words in text for natural language processing (NLP) tasks. By extension, masked language modeling is a way of training transformer models—notably bidirectional encoder representations from transformers (BERT) and its derivative, the robustly optimized BERT pretraining approach (RoBERTa)—for NLP tasks: the model learns to fill in masked words within a text, and thereby to predict the most likely and coherent words to complete it.1

Masked language modeling aids many tasks—from sentiment analysis to text generation—by training a model to understand the contextual relationship between words. In fact, research developers often use masked language modeling to create pretrained models that undergo further supervised fine-tuning for downstream tasks, such as text classification or machine translation. Masked language models thereby undergird many current state-of-the-art language modeling algorithms. Although masked language modeling is a method for pretraining language models, online sources sometimes refer to it as a transfer learning method. This might not be unjustified as some research groups have begun to implement masked language modeling as an end-task in itself.

The Hugging Face Transformers and TensorFlow Text libraries contain functions designed to train and test masked language models in Python, both as end-tasks and for downstream tasks.
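For instance, the fill-mask pipeline in the Hugging Face Transformers library treats masked-word prediction as an end-task in its own right. A minimal sketch (the bert-base-uncased checkpoint is an illustrative choice):

```python
# Minimal sketch: masked-word prediction as an end-task with the Hugging Face
# Transformers fill-mask pipeline (bert-base-uncased is an illustrative choice).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the pipeline returns the highest-probability fills.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```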


How masked language models work

The general procedure characterizing masked language models is fairly straightforward. Being a form of unsupervised learning, masked language modeling begins with a large and unannotated text dataset. The algorithm replaces a random sample of words from this input text with masked tokens, which can consist of the token [MASK] or other word tokens from the input text vocabulary. For each masked token, the model then predicts which word tokens are the most likely to have appeared in the original input text.2
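The sketch below illustrates one common corruption scheme, the 80/10/10 convention from the original BERT paper: of the tokens selected for prediction, 80 percent become [MASK], 10 percent are swapped for a random vocabulary token and 10 percent are left unchanged. The function and its 15 percent selection rate are an illustrative sketch rather than a library API:

```python
import random

def mask_tokens(tokens, vocab, select_rate=0.15, mask_token="[MASK]"):
    """Corrupt a token sequence BERT-style (illustrative sketch, not a library API).

    Returns the corrupted sequence and the prediction targets: the original
    token at each selected position, None everywhere else.
    """
    corrupted, targets = [], []
    for token in tokens:
        if random.random() < select_rate:               # select ~15% of tokens
            targets.append(token)                        # model must recover this token
            roll = random.random()
            if roll < 0.8:
                corrupted.append(mask_token)             # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(random.choice(vocab))   # 10%: random vocabulary token
            else:
                corrupted.append(token)                  # 10%: leave the token unchanged
        else:
            corrupted.append(token)
            targets.append(None)                         # not a prediction target
    return corrupted, targets

sentence = "the model learns to fill in the blanks from unlabeled text".split()
print(mask_tokens(sentence, vocab=sentence))
```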

For example, in the following sentence from Shakespeare’s Othello, two words have been replaced with masked tokens while another word has been replaced with an entirely different word token:
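One possible corruption of the line “But I do think it is their husbands’ faults if wives do fall” might read: “But I do [MASK] it is their husbands’ apples if [MASK] do fall.” Here the words “think” and “wives” have been masked, while “faults” has been swapped for the unrelated vocabulary token “apples.”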

The model then trains a bidirectional encoder to predict the original input tokens that have been masked. How does it do this? Admittedly, a full account of the inner workings of masked language models requires a foundation in linear algebra and machine learning. Nevertheless, a cursory overview is possible.

For every word token in the input text, the model generates a word embedding, similar to a bag-of-words model. The model combines these word embeddings with positional encodings to create the transformer’s input. Positional encodings, in brief, represent the location of a given word token in a sequence using a unique vector value. Positional encodings (or positional embeddings) allow the model to capture semantic information about words from their positional relationships with other words.
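A minimal PyTorch sketch of this input step, using BERT-style learned position embeddings (the layer sizes mirror BERT-base; the class name is illustrative):

```python
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Sketch of how token and position information combine into the encoder input.

    Sizes mirror BERT-base: a 30,522-token vocabulary, 768-dimensional embeddings
    and learned position embeddings for up to 512 positions.
    """
    def __init__(self, vocab_size=30522, hidden_size=768, max_positions=512):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1))    # 0, 1, ..., seq_len - 1
        # Each input vector is the sum of what the token is and where it sits.
        return self.token_embeddings(token_ids) + self.position_embeddings(positions)

embeddings = TransformerInput()(torch.randint(0, 30522, (1, 12)))
print(embeddings.shape)  # torch.Size([1, 12, 768])
```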

 

The transformer model then uses these word and positional embeddings to generate probability distributions over the input vocabulary for each of the masked tokens. The words with the highest predicted probability for each masked token are the model’s respective predictions for each token’s true value.3
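The sketch below makes this step concrete, assuming the bert-base-uncased checkpoint and inspecting the distribution the model assigns at a single masked position:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch: inspect the probability distribution a masked language model assigns
# over its vocabulary at a [MASK] position (bert-base-uncased is illustrative).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("But I do think it is their husbands' faults if [MASK] do fall.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # shape: (1, seq_len, vocab_size)

# Locate the [MASK] position and turn its logits into a probability distribution.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probabilities = logits[0, mask_position].softmax(dim=-1)

# The highest-probability vocabulary entries are the model's predictions.
top = probabilities.topk(5)
for token_id, probability in zip(top.indices, top.values):
    print(f"{tokenizer.decode(token_id):>12}  {probability.item():.3f}")
```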

 

Approaches to masked token prediction

Masked language modeling is a characteristic feature of the BERT transformer model’s pretraining—indeed, the two were introduced to the machine learning community together. Prior to BERT, language models were unidirectional, meaning they learned language representations by considering only the text that precedes a given word. BERT’s approach to a masked language modeling task, however, considers both the preceding and succeeding text.4 The primary difference between unidirectional and bidirectional approaches lies in which input tokens the transformer’s self-attention layer can use when computing each output value.

When predicting the next word in a sequence—or in our case, the missing word—a unidirectional model considers only those words that precede the missing value. Transformer decoders that operate this way are also called causal or backward-looking. When processing an input sequence, the decoder only considers those inputs up to and including the input token in question; the decoder does not have access to token inputs following the one under consideration. By contrast, a bidirectional encoder, as adopted in the BERT model, generates predictions using all the input tokens, those that both precede and follow the masked value.5
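In code, the distinction comes down to the attention mask each architecture applies; the following is a minimal PyTorch sketch with an arbitrary sequence length:

```python
import torch

seq_len = 6  # length of a short example input sequence (arbitrary)

# Causal (unidirectional) attention: position i may attend only to positions <= i,
# so a lower-triangular mask hides every later token from the decoder.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional attention, as in BERT's encoder: every position may attend to every
# other position, so context before and after the mask both inform the prediction.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())          # 1 = attention allowed, 0 = blocked
print(bidirectional_mask.int())
```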

To illustrate, let’s return to the aforementioned Othello quote: “But I do think it is their husbands’ faults if wives do fall.” Imagine that, for some reason, we have this whole text except for the word wives: “But I do think it is their husbands’ faults if ________ do fall.” We want to determine what fills this gap. This figure illustrates the difference in how the two approaches would process our example sentence:

In this figure, y signifies the predicted output for the masked token. The unidirectional transformer uses only the input values preceding the masked token to predict its value. The bidirectional transformer, however, uses embeddings from all of the input values—both those that precede and those that follow the mask—to predict the masked token’s value.

Recent research

Developers and researchers use masked language models to power many NLP tasks, such as named entity recognition, question answering and text classification. As with many domains of NLP, masked language modeling research has often focused on languages written in the Latin script, principally English. More recently, published experiments have developed and evaluated datasets for languages written in other scripts, such as Japanese and Russian, for masked language modeling and downstream tasks.6 Additionally, one research group proposes a weakly supervised method for pretraining multilingual masked language models. Specifically, they introduce a special masked token to enact a cross-lingual forward pass during pretraining on multilingual datasets. Their method shows marked improvement in cross-lingual classification with multilingual masked language models.7


Use cases

As mentioned, researchers often use masked language modeling as a means to improve model performance on downstream NLP tasks. Such tasks include:

Named entity recognition. This task uses models and neural networks to identify predefined entity categories in text, such as person names, city names and so forth. As with many machine learning objectives, a lack of suitable data has proven a hurdle in named entity recognition. To address this, researchers have explored masked language modeling as a form of data augmentation for named entity recognition, with notable success.8

Sentiment analysis. This task analyzes and classifies text as positive, negative or neutral, and is often used to classify large collections of online customer reviews. As with named entity recognition, researchers have explored masked language modeling as a data augmentation technique for sentiment analysis (a minimal sketch of this kind of substitution-based augmentation follows this list).9 Moreover, masked language modeling shows promise for domain adaptation in sentiment analysis: research suggests it helps the model focus on predicting words that carry large weights in sentiment classification.10
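The following is a generic sketch of masked-language-model word substitution for data augmentation, in the spirit of the work cited above rather than a reimplementation of any one method (bert-base-uncased is an illustrative checkpoint):

```python
import random
from transformers import pipeline

# Generic sketch of masked-LM word substitution for data augmentation: mask one
# word and let the model propose plausible replacements as new training examples.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def augment(sentence, n_variants=3):
    """Mask one randomly chosen word and let the model propose substitutes."""
    words = sentence.split()
    position = random.randrange(len(words))          # word to substitute
    masked = " ".join(
        unmasker.tokenizer.mask_token if i == position else word
        for i, word in enumerate(words)
    )
    # Each top-scoring fill yields a new, label-preserving training example.
    return [prediction["sequence"] for prediction in unmasker(masked, top_k=n_variants)]

print(augment("the service was quick and the staff were friendly"))
```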

Footnotes

1 Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3.

2 Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly Media, 2022.

3 Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3. Denis Rothman, Transformers for Natural Language Processing and Computer Vision, 3rd edition, Packt Publishing, 2024.

4 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019, https://aclanthology.org/N19-1423.

5 Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3.

6 Masahiro Kaneko, Aizhan Imankulova, Danushka Bollegala, and Naoaki Okazaki, "Gender Bias in Masked Language Models for Multiple Languages," Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, https://aclanthology.org/2022.naacl-main.197. Sheng Liang, Philipp Dufter, and Hinrich Schütze, "Monolingual and Multilingual Reduction of Gender Bias in Contextualized Representations," Proceedings of the 28th International Conference on Computational Linguistics, 2020, https://aclanthology.org/2020.coling-main.446.

7 Xi Ai and Bin Fang, "On-the-fly Cross-lingual Masking for Multilingual Pre-training," Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023, https://aclanthology.org/2023.acl-long.49.

8 Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan Miao, "MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER," Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022, https://aclanthology.org/2022.acl-long.160.

9 Larisa Kolesnichenko, Erik Velldal, and Lilja Øvrelid, "Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis," Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), 2023, https://aclanthology.org/2023.resourceful-1.6.

10 Nikolay Arefyev, Dmitrii Kharchev, and Artem Shelmanov, "NB-MLM: Efficient Domain Adaptation of Masked Language Models for Sentiment Analysis," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, https://aclanthology.org/2021.emnlp-main.717.