The primary goal of word embeddings is to represent words in a way that captures their semantic relationships and contextual information. Each word is mapped to a vector, a numerical representation in a continuous vector space, where the relative positions of vectors reflect the semantic similarities and relationships between words.
The reason vectors are used to represent words is that most machine learning algorithms, including neural networks, are incapable of processing plain text in its raw form. They require numbers as inputs to perform any task.
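To make this concrete, here is a minimal sketch of the simplest way to turn words into numbers, a one-hot encoding (the three-word vocabulary is an illustrative assumption, not a real corpus). Embeddings replace these sparse vectors with dense, learned ones:

```python
# Before a model can process text, each word must become numbers.
# The simplest scheme: a one-hot vector, all zeros except a single 1.
vocab = ["cat", "dog", "apple"]  # toy vocabulary for illustration

def one_hot(word, vocab):
    """Return a sparse vector with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("dog", vocab))  # [0, 1, 0]
```

One-hot vectors are as long as the vocabulary and carry no notion of similarity ("cat" and "dog" are as different as "cat" and "apple"); word embeddings address both limitations with short, dense vectors learned from context.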
The process of creating word embeddings involves training a model on a large corpus of text (e.g., Wikipedia or Google News). The corpus is preprocessed by tokenizing the text into words, removing stop words and punctuation, and performing other text-cleaning tasks.
A sliding context window is applied to the text, and for each target word, the surrounding words within the window are considered as context words. The word embedding model is trained to predict a target word based on its context words or vice versa.
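The sliding-window step can be sketched as follows; the sentence and window size of 2 are illustrative assumptions:

```python
# Extract (target, context) training pairs with a symmetric window of 2.
tokens = "the quick brown fox jumps".split()  # toy sentence
window = 2

pairs = []
for i, target in enumerate(tokens):
    # Context words: up to `window` positions on each side of the target.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

These pairs become the training examples: in the skip-gram formulation the model predicts the context word from the target, while in the CBOW formulation it predicts the target from its context words.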
This allows models to capture diverse linguistic patterns and assign each word a unique vector, which represents the word's position in a continuous vector space. Words with similar meanings are positioned close to each other, and the distance and direction between vectors encode the degree of similarity.
The training process involves adjusting the parameters of the embedding model to minimize the difference between predicted and actual words in context.
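As a rough illustration of that parameter adjustment, here is a minimal skip-gram training loop in NumPy with a full softmax over the vocabulary (real systems such as word2vec use approximations like negative sampling; the corpus, dimensions, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the quick brown fox jumps over the lazy dog".split()  # toy corpus
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))   # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word (output) weights
lr = 0.05

# (target, context) index pairs from a sliding window of 2
pairs = [(idx[tokens[i]], idx[tokens[j]])
         for i in range(len(tokens))
         for j in range(max(0, i - 2), min(len(tokens), i + 3))
         if i != j]

for _ in range(200):
    for t, c in pairs:
        v = W_in[t]                       # current target embedding
        scores = W_out @ v                # one score per vocabulary word
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # softmax: predicted context probs
        grad = p.copy()
        grad[c] -= 1.0                    # gradient of -log p[c] w.r.t. scores
        W_in[t] -= lr * (W_out.T @ grad)  # nudge target embedding
        W_out -= lr * np.outer(grad, v)   # nudge output weights

# After training, W_in[idx["quick"]] is the learned vector for "quick".
```

Each update moves the target word's vector so that its true context words become more probable, which is what gradually positions related words near each other in the space.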
Here's a simplified example of word embeddings for a very small corpus (6 words), where each word is represented as a 3-dimensional vector:
cat [0.2, -0.4, 0.7]
dog [0.6, 0.1, 0.5]
apple [0.8, -0.2, -0.3]
orange [0.7, -0.1, -0.6]
happy [-0.5, 0.9, 0.2]
sad [0.4, -0.7, -0.5]
In this example, each word (e.g., "cat," "dog," "apple") is associated with a unique vector. The values in the vector represent the word's position in a continuous 3-dimensional vector space. Words with similar meanings or contexts are expected to have similar vector representations. For instance, the vectors for "cat" and "dog" are close together, reflecting their semantic relationship. Conversely, the vectors for "happy" and "sad" point in roughly opposite directions, indicating their contrasting meanings.
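Similarity in embedding space is commonly measured with cosine similarity (near 1 for similar directions, near -1 for opposite ones). A quick check over the toy vectors above confirms the claims:

```python
import math

# The toy 3-dimensional vectors from the example above.
vectors = {
    "cat":   [0.2, -0.4,  0.7],
    "dog":   [0.6,  0.1,  0.5],
    "happy": [-0.5, 0.9,  0.2],
    "sad":   [0.4, -0.7, -0.5],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine(vectors["cat"], vectors["dog"]), 2))    # 0.66
print(round(cosine(vectors["happy"], vectors["sad"]), 2))  # -0.93
```

The positive score for "cat"/"dog" and the strongly negative score for "happy"/"sad" match the geometric intuition described above.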
The example above is highly simplified for illustration purposes. Actual word embeddings typically have hundreds of dimensions to capture more intricate relationships and nuances in meaning.