The primary goal of word embeddings is to represent words in a way that captures their semantic relationships and contextual information. Each word is mapped to a vector, a numerical representation in a continuous vector space, where the relative positions of vectors reflect the semantic similarities and relationships between words.
The reason vectors are used to represent words is that most machine learning algorithms, including neural networks, are incapable of processing plain text in its raw form. They require numbers as inputs to perform any task.
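To make this concrete, here is a minimal sketch of the simplest way to turn words into numbers, a one-hot encoding (the three-word vocabulary is an illustrative assumption, not a real corpus). Embeddings replace these sparse vectors with dense, learned ones:

```python
# Before a model can process text, each word must become numbers.
# The simplest scheme: a one-hot vector, all zeros except a single 1.
vocab = ["cat", "dog", "apple"]  # toy vocabulary for illustration

def one_hot(word, vocab):
    """Return a sparse vector with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("dog", vocab))  # [0, 1, 0]
```

One-hot vectors are as long as the vocabulary and carry no notion of similarity ("cat" and "dog" are as different as "cat" and "apple"); word embeddings address both limitations with short, dense vectors learned from context.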
The process of creating word embeddings involves training a model on a large corpus of text (e.g., Wikipedia or Google News). The corpus is preprocessed by tokenizing the text into words, removing stop words and punctuation, and performing other text-cleaning tasks.
A sliding context window is applied to the text, and for each target word, the surrounding words within the window are considered as context words. The word embedding model is trained to predict a target word based on its context words or vice versa.
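The sliding-window step can be sketched as follows; the sentence and window size of 2 are illustrative assumptions:

```python
# Extract (target, context) training pairs with a symmetric window of 2.
tokens = "the quick brown fox jumps".split()  # toy sentence
window = 2

pairs = []
for i, target in enumerate(tokens):
    # Context words: up to `window` positions on each side of the target.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

These pairs become the training examples: in the skip-gram formulation the model predicts the context word from the target, while in the CBOW formulation it predicts the target from its context words.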
This allows models to capture diverse linguistic patterns and assign each word a unique vector, which represents the word's position in a continuous vector space. Words with similar meanings are positioned close to each other, and the distance and direction between vectors encode the degree of similarity.
The training process involves adjusting the parameters of the embedding model to minimize the difference between predicted and actual words in context.
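As a rough illustration of that parameter adjustment, here is a minimal skip-gram training loop in NumPy with a full softmax over the vocabulary (real systems such as word2vec use approximations like negative sampling; the corpus, dimensions, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the quick brown fox jumps over the lazy dog".split()  # toy corpus
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))   # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word (output) weights
lr = 0.05

# (target, context) index pairs from a sliding window of 2
pairs = [(idx[tokens[i]], idx[tokens[j]])
         for i in range(len(tokens))
         for j in range(max(0, i - 2), min(len(tokens), i + 3))
         if i != j]

for _ in range(200):
    for t, c in pairs:
        v = W_in[t]                       # current target embedding
        scores = W_out @ v                # one score per vocabulary word
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # softmax: predicted context probs
        grad = p.copy()
        grad[c] -= 1.0                    # gradient of -log p[c] w.r.t. scores
        W_in[t] -= lr * (W_out.T @ grad)  # nudge target embedding
        W_out -= lr * np.outer(grad, v)   # nudge output weights

# After training, W_in[idx["quick"]] is the learned vector for "quick".
```

Each update moves the target word's vector so that its true context words become more probable, which is what gradually positions related words near each other in the space.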
Here's a simplified example of word embeddings for a very small corpus (6 words), where each word is represented as a 3-dimensional vector:
cat [0.2, -0.4, 0.7]
dog [0.6, 0.1, 0.5]
apple [0.8, -0.2, -0.3]
orange [0.7, -0.1, -0.6]
happy [-0.5, 0.9, 0.2]
sad [0.4, -0.7, -0.5]
In this example, each word (e.g., "cat," "dog," "apple") is associated with a unique vector. The values in the vector represent the word's position in a continuous 3-dimensional vector space. Words with similar meanings or contexts are expected to have similar vector representations. For instance, the vectors for "cat" and "dog" are close together, reflecting their semantic relationship. Conversely, the vectors for "happy" and "sad" point in roughly opposite directions, indicating their contrasting meanings.
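Similarity in embedding space is commonly measured with cosine similarity (near 1 for similar directions, near -1 for opposite ones). A quick check over the toy vectors above confirms the claims:

```python
import math

# The toy 3-dimensional vectors from the example above.
vectors = {
    "cat":   [0.2, -0.4,  0.7],
    "dog":   [0.6,  0.1,  0.5],
    "happy": [-0.5, 0.9,  0.2],
    "sad":   [0.4, -0.7, -0.5],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine(vectors["cat"], vectors["dog"]), 2))    # 0.66
print(round(cosine(vectors["happy"], vectors["sad"]), 2))  # -0.93
```

The positive score for "cat"/"dog" and the strongly negative score for "happy"/"sad" match the geometric intuition described above.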
The example above is highly simplified for illustration purposes. Actual word embeddings typically have hundreds of dimensions to capture more intricate relationships and nuances in meaning.