
# Word Mover’s Embedding: Universal Text Embedding from Word2Vec

November 1, 2018 | Written by: IBM Research Editorial Staff


Text representation plays an important role in many natural language processing (NLP) tasks, such as document classification and clustering, sense disambiguation, machine translation, and document matching. Since there are no explicit features in text, developing effective text representations is an important goal in AI and NLP research. A fundamental challenge in this respect is learning universal text embeddings that preserve the semantic meaning of each word while accounting for global context, such as word order in a sentence or document. In our paper at the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018; “Word Mover’s Embedding: From Word2Vec to Document Embedding”), we presented Word Mover’s Embedding (WME), an unsupervised generic framework that learns continuous vector representations for texts of variable length, such as a sentence, paragraph, or document. WME embeddings can be easily used for a wide range of downstream supervised and unsupervised tasks.

# Towards universal text embedding

A recent empirically successful body of research makes use of distributional or contextual information together with simple neural network models to obtain vector-space representations of words and phrases. Among them, Word2Vec [1] and GloVe [2] are most well-known and widely used; they are trained over hundreds of billions of words and millions of named entities due to the models’ simplicity and scalability.

Encouraged by these successes, much effort has been devoted to learning semantic vector representations of sentences or documents. A simple but often effective approach is to use a weighted average over some or all of the embeddings of words in the document. While simple, such a document representation can easily lose important information, in part because it ignores word order. More sophisticated approaches [3][4][5] have focused on jointly learning embeddings for both words and paragraphs using models similar to Word2Vec. However, these use word order only within a small context window; moreover, the quality of the word embeddings learned by such models may be limited by the size of the training corpus, which cannot scale to the corpora used by the simpler word embedding models, and this may in turn weaken the quality of the document embeddings.

# Word Mover’s Distance: measuring semantic distance between two documents

A novel document distance metric called Word Mover’s Distance (WMD) was recently introduced [6] to measure the dissimilarity between two documents in the Word2Vec embedding space. WMD, a special case of the Earth Mover’s Distance, is a distance between two text documents x, y ∈ χ that takes into account the alignments between words. Let |x|, |y| be the number of distinct words in x and y, and let f_x ∈ R^{|x|} and f_y ∈ R^{|y|} denote the normalized frequency vectors of the words in documents x and y, respectively (so that f_x^T 1 = f_y^T 1 = 1). Then the WMD distance between documents x and y is defined as:

WMD(x, y) := min_{F ≥ 0} ⟨C, F⟩,  subject to  F 1 = f_x,  F^T 1 = f_y,

where F ∈ R^{|x|×|y|} is the transportation flow matrix, with F_ij denoting the amount of flow traveling from the i-th word x_i in x to the j-th word y_j in y, and C is the transportation cost, with C_ij = dist(v_{x_i}, v_{y_j}) being the distance between the two words measured in the Word2Vec embedding space. Building on top of Word2Vec, WMD is particularly useful and accurate for measuring the distance between documents with semantically close but syntactically different words, as illustrated in Figure 1.
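Computing WMD exactly means solving a small transportation linear program. As an illustrative sketch (not the paper’s implementation), here is the cheap “relaxed” WMD lower bound from [6] in plain Python, in which each word’s mass flows entirely to its nearest word in the other document; the toy 2-D vectors stand in for real Word2Vec embeddings:

```python
import math

def euclid(u, v):
    # Euclidean distance between two word vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(x_vecs, f_x, y_vecs, f_y):
    """Lower bound on WMD from [6]: relax one of the two flow
    constraints so each word's mass flows entirely to its nearest
    word in the other document; take the tighter (larger) bound."""
    to_y = sum(f * min(euclid(u, v) for v in y_vecs) for u, f in zip(x_vecs, f_x))
    to_x = sum(f * min(euclid(v, u) for u in x_vecs) for v, f in zip(y_vecs, f_y))
    return max(to_y, to_x)

# toy documents: two 2-D "word vectors" each, uniform word frequencies
x_vecs, f_x = [(0.0, 0.0), (1.0, 0.0)], [0.5, 0.5]
y_vecs, f_y = [(0.0, 0.1), (1.0, 0.1)], [0.5, 0.5]
print(relaxed_wmd(x_vecs, f_x, y_vecs, f_y))  # prints a small value: the documents are close
```

When either relaxed constraint set already admits the optimal flow, the bound is tight; in general the exact WMD requires an optimal-transport solver.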

# Where WMD falls short

Despite the state-of-the-art KNN-based classification accuracy it achieves over other methods, combining KNN and WMD incurs very high computational cost. WMD is expensive to compute, with complexity O(L^3 log(L)), where L is the number of distinct words in a document; this is especially costly for long documents. When combined with KNN for document classification, the total cost grows to O(N^2 L^3 log(L)), where N is the number of documents. More importantly, WMD is simply a distance that can only be combined with KNN or K-means, whereas many machine learning algorithms require a fixed-length feature representation as input.

# WME via Word Mover’s Kernel

To obtain unsupervised semantic embeddings of texts of variable length, we extend a recently proposed distance kernel framework [7] to derive a positive-definite kernel from the alignment-aware document distance metric WMD. We start by defining the Word Mover’s Kernel:

k(x, y) := ∫ p(ω) φ_ω(x) φ_ω(y) dω,  where  φ_ω(x) := exp(−γ WMD(x, ω)).

Here, ω can be interpreted as a random document {v_j}_{j=1,…,D} consisting of a collection of random word vectors in V, and p(ω) is a distribution over the space of all possible random documents Ω = ∪_{D=1,…,Dmax} V^D. φ_ω(x) is a possibly infinite-dimensional feature map derived from the WMD between x and all possible documents ω ∈ Ω.

An insightful interpretation of this kernel is:

k(x, y) = exp(−γ · softmin_{p(ω)} f(ω)),

where

softmin_{p(ω)} f(ω) := −(1/γ) log ∫ p(ω) e^{−γ f(ω)} dω

and f(ω) = WMD(x, ω) + WMD(ω, y); this is a version of the soft-minimum function parameterized by p(ω) and γ. When γ is large and f(ω) is Lipschitz-continuous, the value of this softmin variant is dominated by the minimum of f(ω). Note that, since WMD is a metric, by the triangle inequality we have

WMD(x, y) ≤ WMD(x, ω) + WMD(ω, y),

with equality when the allowed length D_max of the random documents is no smaller than L. Therefore, the proposed kernel serves as a good approximation to the WMD between any pair of documents x, y, as illustrated in Figure 2, while being positive-definite by construction.
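The soft-minimum behavior is easy to check numerically. A minimal sketch, assuming a discrete p(ω) supported on just three sample documents (an illustration only; the kernel integrates over all of Ω):

```python
import math

def softmin(f_vals, p_vals, gamma):
    """Soft minimum parameterized by p and gamma:
    -(1/gamma) * log( sum_i p_i * exp(-gamma * f_i) )."""
    return -math.log(sum(p * math.exp(-gamma * f)
                         for f, p in zip(f_vals, p_vals))) / gamma

f_vals = [2.0, 3.5, 5.0]   # f(w) = WMD(x, w) + WMD(w, y) for three sample documents
p_vals = [1/3, 1/3, 1/3]   # uniform p(w) over the three samples

for gamma in (1.0, 10.0, 100.0):
    print(gamma, softmin(f_vals, p_vals, gamma))
# as gamma grows, softmin decreases toward min(f_vals) = 2.0
```

The softmin always upper-bounds the true minimum here, and the gap shrinks at rate O(1/γ), which is why a large γ makes the kernel track the WMD closely.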

# WME via random features

Given the Word Mover’s Kernel, we can then use the Monte Carlo approximation:

k(x, y) ≈ ⟨Z(x), Z(y)⟩ = (1/R) Σ_{i=1}^{R} φ_{ω_i}(x) φ_{ω_i}(y),

where {ω_i}_{i=1,…,R} are i.i.d. random documents drawn from p(ω), and

Z(x) := (1/√R) (φ_{ω_1}(x), …, φ_{ω_R}(x))

gives a vector representation of document x. We call this random approximation WME. Once computed, WME can be used as an input feature matrix by a linear classifier or by more advanced classifiers.

Compared to KNN-WMD, which requires O(N^2 L^3 log(L)) time, our WME approximation requires only O(NRL log(L)) when D is treated as a constant, since each evaluation of WMD against a random document costs only O(D^2 L log(L)) thanks to the short length D of the random documents. For the document classification task, when documents are long or the number of documents is large, WME with a linear SVM can achieve the same classification accuracy as KNN-WMD with a 100x speed-up. More importantly, WME offers a smooth trade-off between computational cost and accuracy, as shown in Figure 3.
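The random-features construction can be sketched in a few lines. Assumptions for illustration: toy random word vectors in place of a Word2Vec vocabulary, and the cheap relaxed-WMD lower bound from [6] standing in for exact WMD (which would require an optimal-transport solver):

```python
import math
import random

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(x_vecs, f_x, y_vecs, f_y):
    # cheap lower bound on WMD [6], used here as a stand-in for exact WMD
    to_y = sum(f * min(euclid(u, v) for v in y_vecs) for u, f in zip(x_vecs, f_x))
    to_x = sum(f * min(euclid(v, u) for u in x_vecs) for v, f in zip(y_vecs, f_y))
    return max(to_y, to_x)

def wme_features(x_vecs, f_x, R=64, d=2, d_max=3, gamma=1.0, seed=0):
    """Z(x) = (1/sqrt(R)) * (phi_w1(x), ..., phi_wR(x)), where
    phi_w(x) = exp(-gamma * WMD(x, w)) and each w is a random document
    of D ~ Uniform{1..d_max} random word vectors."""
    rng = random.Random(seed)
    feats = []
    for _ in range(R):
        D = rng.randint(1, d_max)
        w_vecs = [tuple(rng.uniform(-1, 1) for _ in range(d)) for _ in range(D)]
        f_w = [1.0 / D] * D                      # uniform frequencies in w
        phi = math.exp(-gamma * relaxed_wmd(x_vecs, f_x, w_vecs, f_w))
        feats.append(phi / math.sqrt(R))
    return feats

# <Z(x), Z(y)> then approximates the Word Mover's Kernel k(x, y)
z = wme_features([(0.0, 0.0), (1.0, 0.0)], [0.5, 0.5])
```

Note that the same seed, i.e., the same set of random documents ω_i, must be shared across all documents so that their feature vectors live in a common space.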

In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques (KNN-WMD, Word2Vec-, and Doc2Vec-based methods), with significantly higher accuracy on texts of short length.

# Outlook

Learning universal text embeddings can impact several important areas of machine learning and AI. Such embeddings are naturally suited to transfer learning (or domain adaptation), since most supervised models focus on composing task-specific vector representations of sentences. They also provide well-pretrained sentence- and document-level embeddings for machine translation and sentence matching. Finally, machine learning systems built on the Earth Mover’s Distance can leverage WME both to significantly accelerate computation and to learn effective semantics-preserving representations for their underlying applications.

**We will present our EMNLP paper on Sunday, November 4, during the session 10E: Machine Learning (Posters and Demos), 11:00AM ‑ 12:30PM, in the Grand Hall.**

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. NIPS 2013.

[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. EMNLP 2014.

[3] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. ICML 2014.

[4] Minmin Chen. Efficient vector representation for documents through corruption. ICLR 2017.

[5] Matthew Peters et al. Deep contextualized word representations. NAACL 2018.

[6] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. ICML 2015.

[7] Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, Pradeep Ravikumar, and Michael Witbrock. D2KE: From distance to kernel and embedding. https://arxiv.org/abs/1802.04956, 2018.
