Word Mover’s Embedding: Universal Text Embedding from Word2Vec


Text representation plays an important role in many natural language processing (NLP) tasks, such as document classification and clustering, sense disambiguation, machine translation, and document matching. Since there are no explicit features in text, developing effective text representations is an important goal in AI and NLP research. A fundamental challenge in this respect is learning a universal text embedding that preserves the semantic meaning of each word and accounts for global contextual information, such as word order within a sentence or document. In our paper at the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018; “Word Mover’s Embedding: From Word2Vec to Document Embedding”), we presented Word Mover’s Embedding (WME), an unsupervised generic framework that learns continuous vector representations for text of variable length, such as a sentence, paragraph, or document. WME embeddings can be easily used for a wide range of downstream supervised and unsupervised tasks.

Towards universal text embedding

A recent empirically successful body of research makes use of distributional or contextual information together with simple neural network models to obtain vector-space representations of words and phrases. Among these, Word2Vec [1] and GloVe [2] are the most well-known and widely used; thanks to the models’ simplicity and scalability, they are trained over hundreds of billions of words and millions of named entities.

Encouraged by these successes, much effort has been devoted to learning semantic vector representations of sentences or documents. A simple but often effective approach is to use a weighted average over some or all of the embeddings of words in the document. While simple, such a document representation can easily lose important information, in part because it does not consider word order. A more sophisticated line of work [3][4][5] has focused on jointly learning embeddings for both words and paragraphs using models similar to Word2Vec. However, these models only use word order within a small context window; moreover, the quality of the word embeddings they learn may be limited by the size of the training corpus, which cannot scale to the large corpora used by the simpler word embedding models, and which may consequently weaken the quality of the document embeddings.
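The weighted-average baseline above can be sketched in a few lines. This is a minimal illustration, assuming uniform weights when none are given (TF-IDF weights are a common alternative); it is not the method proposed in the paper:

```python
import numpy as np

def average_doc_embedding(word_vectors, weights=None):
    """Represent a document as a weighted average of its word embeddings.

    word_vectors: (n_words, dim) array of pre-trained embeddings.
    weights: optional per-word weights (e.g. TF-IDF); uniform if omitted.
    """
    W = np.asarray(word_vectors, dtype=float)
    if weights is None:
        weights = np.ones(len(W))
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ W  # (dim,) document vector
```

Note that permuting the rows (i.e., reordering the words) leaves the result unchanged, which is exactly the loss of word-order information described above.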

Word Mover’s Distance: measuring semantic distance between two documents

A novel document distance metric called Word Mover’s Distance (WMD) was recently introduced [6] to measure the dissimilarity between two documents in Word2Vec embedding space. WMD, a special case of the Earth Mover’s Distance, is the distance between two text documents x, y ∈ χ that takes the alignments between words into account. Let |x| and |y| be the number of distinct words in x and y, and let f_x ∈ ℝ^|x| and f_y ∈ ℝ^|y| denote the normalized word-frequency vectors of the documents x and y, respectively (so that f_xᵀ1 = f_yᵀ1 = 1). Then the WMD distance between documents x and y is defined as:

WMD(x, y) := min_{F ∈ ℝ₊^(|x|×|y|)} ⟨C, F⟩,  subject to  F1 = f_x,  Fᵀ1 = f_y,

where F is the transportation flow matrix, with F_ij denoting the amount of flow traveling from the i-th word x_i in x to the j-th word y_j in y, and C is the transportation cost, with C_ij = dist(v_{x_i}, v_{y_j}) being the distance between the two words measured in the Word2Vec embedding space. Building on top of Word2Vec, WMD is particularly useful and accurate for measuring the distance between documents whose words are semantically close but syntactically different, as illustrated in Figure 1.
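As a concrete illustration, the linear program above can be solved with an off-the-shelf LP solver. This is a minimal sketch using SciPy, not the specialized transport solvers used in practice for large documents:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wmd(f_x, f_y, V_x, V_y):
    """Word Mover's Distance between two documents.

    f_x, f_y: normalized word-frequency vectors (each sums to 1).
    V_x, V_y: word-embedding matrices, one row per distinct word.
    """
    C = cdist(V_x, V_y)                # C[i, j] = dist(v_xi, v_yj)
    n, m = C.shape
    # Flow matrix F flattened row-major; minimize <C, F>
    # subject to row sums of F equal to f_x, column sums equal to f_y.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j F_ij = f_x[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i F_ij = f_y[j]
    b_eq = np.concatenate([f_x, f_y])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```

For two single-word documents the flow is forced to 1, so the WMD reduces to the embedding distance between the two words, matching the definition above.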

Illustration of Word Mover's Distance

Figure 1: An illustration of WMD. All non-stop words are marked in bold face. The orange triangles and the blue dots represent the word embeddings of documents x and y, respectively. WMD measures the distance between the two documents by aligning semantically similar words.

Where WMD falls short

Despite the state-of-the-art KNN-based classification accuracy it achieves over other methods, combining KNN and WMD incurs a very high computational cost. WMD itself is expensive to compute, with complexity O(L³ log L), where L is the document length; this is especially costly for long documents. When combined with KNN for document classification, the total cost grows to O(N²L³ log L), where N is the number of documents. More importantly, WMD is simply a distance that can only be combined with KNN or K-means, whereas many machine learning algorithms require a fixed-length feature representation as input.

WME via Word Mover’s Kernel

To obtain unsupervised semantic embeddings of texts of variable length, we extend a recently proposed distance kernel framework [7] to derive a positive-definite kernel from the alignment-aware document distance metric WMD. We start by defining the Word Mover’s Kernel:

k(x, y) := ∫ p(ω) φ_ω(x) φ_ω(y) dω,  where  φ_ω(x) := exp(−γ WMD(x, ω)).

Here, ω can be interpreted as a random document {v_j}_{j=1,…,D} consisting of D random word vectors in V, and p(ω) is a distribution over the space of all possible random documents Ω = ∪_{D=1,…,Dmax} V^D. The feature map φ_ω(x), derived from the WMD between x and all possible random documents ω ∈ Ω, is possibly infinite-dimensional.

An insightful interpretation of this kernel is

k(x, y) = exp(−γ softmin_{p(ω)} f(ω)),

where

softmin_{p(ω)} f(ω) := −(1/γ) log ∫ p(ω) e^{−γ f(ω)} dω  and  f(ω) = WMD(x, ω) + WMD(ω, y),

which is a version of the soft-minimum function parameterized by p(ω) and γ. When γ is large and f(ω) is Lipschitz-continuous, the value of this softmin variant is mostly determined by the minimum of f(ω). Note that, since WMD is a metric, by the triangle inequality we have

WMD(x, ω) + WMD(ω, y) ≥ WMD(x, y),

and equality holds if we allow the length Dmax of the random documents to be no smaller than L. Therefore, the proposed kernel, which is positive-definite by construction, serves as a good approximation to the WMD between any pair of documents x and y, as illustrated in Figure 2.
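The softmin behavior described above is easy to verify numerically. A small sketch, with made-up values of f(ω) and a uniform p(ω), showing that the soft minimum approaches the true minimum as γ grows:

```python
import numpy as np

def softmin(f_values, probs, gamma):
    """Soft minimum of f under p: -(1/gamma) * log E_p[exp(-gamma * f)]."""
    return -np.log(np.sum(probs * np.exp(-gamma * f_values))) / gamma

f = np.array([0.5, 1.0, 2.0])   # hypothetical values of f(omega)
p = np.full(3, 1.0 / 3.0)       # uniform p(omega)
# Small gamma blends all values of f; large gamma concentrates the
# result on min f = 0.5, as claimed above.
```

Because Σ p_i e^{−γ f_i} ≤ e^{−γ min f}, the softmin is always at least the true minimum, consistent with the triangle-inequality bound above.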

Illustration of Word Mover's Embedding

Figure 2: An illustration of WME. All non-stop words are marked in bold face. The black squares represent the random word embeddings of a random document ω. Each document first aligns itself with the random document to obtain the distances WMD(x, ω) and WMD(ω, y); then, by the triangle inequality, the distance WMD(x, y) between documents x and y can be approximated by WMD(x, ω) + WMD(ω, y).

WME via random features

Given the Word Mover’s Kernel, we can then use the Monte Carlo approximation:

k(x, y) ≈ ⟨Z(x), Z(y)⟩,  where  Z(x) := (1/√R) (φ_{ω_i}(x))_{i=1,…,R},

where {ω_i}_{i=1,…,R} are i.i.d. random documents drawn from p(ω), and Z(x) gives an R-dimensional vector representation of document x. We call this random approximation WME. Once computed, WME can be used as an input feature matrix by a linear classifier or by other, more advanced classifiers.

Compared to KNN-WMD, which requires O(N²L³ log L), our WME approximation only requires O(NRL log L) when D is a small constant, because each evaluation of WMD against a random document costs only O(D²L log L) due to the short length D of the random documents. For the document classification task, when documents are long or the number of documents is large, WME with a linear SVM can easily achieve the same classification accuracy as KNN-WMD with a 100x speed-up. More importantly, WME offers an effective trade-off between computational cost and accuracy by varying R and D, as shown in Figure 3.
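Putting the pieces together, the random-feature construction can be sketched as follows. This is a simplified illustration, not the paper's exact sampling scheme: random documents use a fixed length D and words drawn uniformly from a vector pool, and `wmd_fn` stands in for any WMD implementation:

```python
import numpy as np

def wme_features(docs, wmd_fn, vocab_vectors, R=128, D=5, gamma=1.0, seed=0):
    """Random-feature embedding approximating the Word Mover's Kernel.

    docs: list of (freqs, vectors) pairs describing each document.
    wmd_fn: callable computing WMD between two such pairs.
    vocab_vectors: (V, dim) pool of word embeddings to sample from.
    Returns Z such that <Z[i], Z[j]> approximates k(x_i, x_j).
    """
    rng = np.random.default_rng(seed)
    # Draw R short random documents omega_1..omega_R, each of length D.
    random_docs = []
    for _ in range(R):
        vecs = vocab_vectors[rng.choice(len(vocab_vectors), size=D)]
        random_docs.append((np.full(D, 1.0 / D), vecs))
    # Z[i, r] = phi_{omega_r}(x_i) = exp(-gamma * WMD(x_i, omega_r)).
    Z = np.empty((len(docs), R))
    for i, doc in enumerate(docs):
        for r, omega in enumerate(random_docs):
            Z[i, r] = np.exp(-gamma * wmd_fn(doc, omega))
    return Z / np.sqrt(R)
```

Since each WMD evaluation involves a length-D random document, the overall cost is linear in the number of documents N and in R, which is the source of the speed-up over KNN-WMD.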

Word Mover's Embedding can achieve a perfect trade-off between computational cost and accuracy

Figure 3: Train (blue) and test (red) accuracy when varying R with fixed D.

In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques (KNN-WMD and Word2Vec- and Doc2Vec-based methods), with significantly higher accuracy on tasks involving short documents.


Learning universal text embeddings can impact several important areas of machine learning and AI. Such embeddings are naturally suited to transfer learning (or domain adaptation), whereas most supervised approaches focus on developing compositional models that create a vector representation of sentences for a specific task. They also provide well-pretrained sentence- and document-level embeddings for machine translation and sentence matching. Finally, machine learning systems built on the Earth Mover’s Distance can leverage WME to significantly accelerate computation and to learn an effective semantics-preserving representation for their underlying applications.

We will present our EMNLP paper on Sunday, November 4, during the session 10E: Machine Learning (Posters and Demos), 11:00AM ‑ 12:30PM, in the Grand Hall.

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. NIPS 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. EMNLP 2014.
[3] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. ICML 2014.
[4] Minmin Chen. Efficient vector representation for documents through corruption. ICLR 2017.
[5] Matthew Peters et al. Deep contextualized word representations. NAACL 2018.
[6] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. ICML 2015.
[7] Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, Pradeep Ravikumar, and Michael Witbrock. D2KE: From Distance to Kernel and Embedding. 2018.
