Measuring distances between entities is a fundamental function that powers many mission-critical applications across various sectors. Search, data mining, data discovery, clustering and classification all rely on discriminative distance measures between data objects. In addition to being accurate, today’s vast databases demand distance measures that are also scalable, i.e. can be computed efficiently even across large data collections.
The Earth Mover’s Distance (EMD), also known as Discrete Wasserstein distance, is a highly discriminative metric for measuring distance between probability distributions that has been applied successfully in various fields. It is known as EMD in the image processing field, Word Mover’s Distance (WMD) in text processing, and Wasserstein distance in optimization theory.
The Wasserstein distance, in addition to its applications in text and image retrieval, has important applications in the machine learning field. For instance, it can be used to evaluate the quality of the samples produced by a generative model by measuring the distance between the synthetic samples and original data. More recently, it has been shown that when training generative adversarial networks and auto-encoders, the stability of the model training and the accuracy of the samples generated can be improved if the Wasserstein metric is used as the loss function.
Fig 1: Illustration of the principle of the Iterative Constraint Transfers algorithm
Despite its high discriminative capability, the EMD has not yet found wide adoption. This is because of its high computational complexity — cubic on the size of the input distribution which prohibits its application in very large collections of millions or more objects. To mitigate this problem, several prior works proposed approximations of the EMD with reduced complexity, but also reduced accuracy.
In our work , which was recently presented at the ICML 2019 conference, we propose a new tight approximation to the EMD which maintains its accuracy, yet has linear complexity and can be computed efficiently across millions of objects. Furthermore, our new method alleviates several important limitations of prior approximations.
Word Movers Distance and approximations
In text retrieval, the Word Mover’s distance (WMD) has recently emerged as the state-of-the art semantic- similarity metric for unstructured text . WMD captures semantic similarity by capitalizing on the concept of word embedding, which map words into a high-dimensional vector space such that words that are semantically similar are in close proximity to each other. For instance, given the sentences “The Queen to tour Canada” and “Royal visit to Halifax”, despite having no words in common, after stop word removal, WMD can detect that they are similar.
WMD offers high accuracy, but suffers from a high computational complexity, which grows cubically with the number of unique words in the documents. Therefore, Kusner et al.  proposed an approximation of WMD, called Relaxed WMD or RWMD, which has quadratic complexity. In our prior work , we proposed an algorithm for computing RWMD which has linear average time complexity when applied to large data sets. This algorithm achieves sub-second (120ms) query performance on data sets of 1 million documents with average size of 105 words.
New distance measures
Having improved the scalability of RWMD to large collections of documents, we turn to investigate its accuracy. When does it become ineffective, and is it applicable to other signals, e.g. images? As it turns out, RWMD exhibits a higher approximation error compared to WMD when the overlap between histogram coordinates is high, as may be the case in long documents. Furthermore, RWMD becomes ineffective for dense histograms with background noise, such as large and uncurated text documents or images.