#### IBM Research-Zurich

# Linear-Complexity Earth Mover’s Distance Approximations for Efficient Similarity Search

July 16, 2019 | Written by: Haris Pozidis and Kubilay Atasu

Categorized: IBM Research-Zurich

Share this post:

Measuring distances between entities is a fundamental function that powers many mission-critical applications across various sectors. Search, data mining, data discovery, clustering and classification all rely on discriminative distance measures between data objects. In addition to being accurate, today’s vast databases demand distance measures that are also scalable, i.e. can be computed efficiently even across large data collections.

The Earth Mover’s Distance (EMD), also known as Discrete Wasserstein distance, is a highly discriminative metric for measuring distance between probability distributions that has been applied successfully in various fields. It is known as EMD in the image processing field, Word Mover’s Distance (WMD) in text processing, and Wasserstein distance in optimization theory.

The Wasserstein distance, in addition to its applications in text and image retrieval, has important applications in the machine learning field. For instance, it can be used to evaluate the quality of the samples produced by a generative model by measuring the distance between the synthetic samples and original data. More recently, it has been shown that when training generative adversarial networks and auto-encoders, the stability of the model training and the accuracy of the samples generated can be improved if the Wasserstein metric is used as the loss function.

Despite its high discriminative capability, the EMD has not yet found wide adoption. This is because of its high computational complexity — cubic on the size of the input distribution which prohibits its application in very large collections of millions or more objects. To mitigate this problem, several prior works proposed approximations of the EMD with reduced complexity, but also reduced accuracy.

In our work [1], which was recently presented at the ICML 2019 conference, we propose a new tight approximation to the EMD which maintains its accuracy, yet has linear complexity and can be computed efficiently across millions of objects. Furthermore, our new method alleviates several important limitations of prior approximations.

### Word Movers Distance and approximations

In text retrieval, the Word Mover’s distance (WMD) has recently emerged as the state-of-the art semantic- similarity metric for unstructured text [2]. WMD captures semantic similarity by capitalizing on the concept of word embedding, which map words into a high-dimensional vector space such that words that are semantically similar are in close proximity to each other. For instance, given the sentences “The Queen to tour Canada” and “Royal visit to Halifax”, despite having no words in common, after stop word removal, WMD can detect that they are similar.

WMD offers high accuracy, but suffers from a high computational complexity, which grows cubically with the number of unique words in the documents. Therefore, Kusner et al. [2] proposed an approximation of WMD, called Relaxed WMD or RWMD, which has quadratic complexity. In our prior work [3], we proposed an algorithm for computing RWMD which has linear average time complexity when applied to large data sets. This algorithm achieves sub-second (120ms) query performance on data sets of 1 million documents with average size of 105 words.

### New distance measures

Having improved the scalability of RWMD to large collections of documents, we turn to investigate its accuracy. When does it become ineffective, and is it applicable to other signals, e.g. images? As it turns out, RWMD exhibits a higher approximation error compared to WMD when the overlap between histogram coordinates is high, as may be the case in long documents. Furthermore, RWMD becomes ineffective for dense histograms with background noise, such as large and uncurated text documents or images.

Can we find more effective distance measures? This is the question we set out to address in our recent work [1]. For completeness, we note that another effective distance measure, which closely approximates the EMD, has been proposed by Cuturi in [4]. Its complexity, however, is quadratic in the histogram size.

We are measuring distances between discrete probability distributions or histograms. Given two histograms, we need to transfer all the weights from the bins of the first histogram into the bins of the second histogram: we impose outflow constraints to ensure that. We also need to reconstruct the second histogram as a result: we use inflow constraints to ensure that.

Our goal is to minimize the overall transportation cost. The complexity of computing the optimal solution to the EMD is cubical in the size of the histograms. We have discussed a linear complexity approximation, the RWMD. We propose a tighter relaxation of the EMD problem than RWMD, yet still achieve a linear time complexity.

Our solution, called Iterative Constrained Transfers (ICT) algorithm, addresses the weaknesses of the RWMD method, namely the fact that it relaxes either the inflow or the outflow constraints completely. We propose the addition of edge capacity constraints when relaxing the inflow or outflow constraints. For instance, if the weight of a source histogram bin is 5, and the weight of a destination histogram bin is 3, we prevent the transfer of more than 3 units between the two by adding a capacity constraint of 3 on the edge that links the two bins. The principle of the ICT algorithm is illustrated in Figure 1.

The new problem we formulated can be solved optimally by sorting the edges in increasing order of transportation cost and by iteratively transferring weights from the source bins into the destination bins under capacity constraints. This is the ICT algorithm. The ICT algorithm itself can be approximated by fixing the number of iterations it performs. The result is the Approximate Constrained Transfers (ACT) algorithm. It can be shown that both ACT and ICT are tighter lower bounds of EMD than RWMD. Importantly, our algorithms achieve linear complexity when comparing one query document with a database of documents.

We have evaluated our algorithms on the popular 20 Newsgroups (text) and MNIST (images) datasets. The results are shown in Figures 2 and 3, respectively. The left and right panes illustrate the runtime results and the color-coded search accuracy results, respectively. The darkest red shows the precision at top-1, and then top-2, top-4, etc. results are shown in lighter red shades.

On 20 Newsgroups, our new algorithms are 20,000 times faster than WMD and match its search accuracy. In case of MNIST, we are 100,000 times faster than WMD and even achieve a slightly higher search accuracy.

### Summary

We have developed new distance measures that are tight approximations to the Earth Mover’s Distance (EMD) and thus offer high search and classification accuracy; yet, unlike the EMD, can be computed fast, and are scalable to data collections of millions of objects. Moreover, the new distances are broadly applicable to dense and sparse histograms alike, and for example can discriminate both text and images even if there is background noise.

The new distance measures can be computed in linear time complexity in the histogram size. We have demonstrated 20,000-fold speed-up with respect to the WMD without any loss of accuracy. We believe that these new distance measures will benefit the data mining and machine learning communities alike, and are looking forward to seeing a wealth of new and exciting applications in different fields. Finally, these new algorithms and their efficient CPU and GPU implementations have been recently integrated in IBM’s Watson Machine Learning Community Edition software, version 1.6.1, which was released in June 2019. The documentation for our library can be found here.

References

[1] Atasu, Kubilay, and Thomas Mittelholzer. “Linear-Complexity Data-Parallel Earth Mover’s Distance Approximations.” In International Conference on Machine Learning, pp. 364-373. 2019.

[2] Kusner, Matt, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. “From word embeddings to document distances.” In International Conference on Machine Learning, pp. 957-966. 2015.

[3] Atasu, Kubilay, Thomas Parnell, Celestine Dünner, Manolis Sifalakis, Haralampos Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, and Abdel Labbi. “Linear-complexity relaxed word Mover’s distance with GPU acceleration.” In 2017 IEEE International Conference on Big Data (Big Data), pp. 889-896. IEEE, 2017.

[4] Cuturi, Marco. “Sinkhorn distances: Lightspeed computation of optimal transport.” In Advances in neural information processing systems, pp. 2292-2300. 2013.

**Haris Pozidis**

Manager, Cloud Storage and Analytics Group, IBM Research

**Kubilay Atasu**

Research Staff Member, IBM Research

### Who. What. Why. New IBM algorithm models how the order of prior actions impacts events

To address the problem of ordinal impacts, our team at IBM T. J. Watson Research Center has developed OGEMs – or Ordinal Graphical Event Models – new dynamic, probabilistic graphical models for events. These models are part of the broader family of statistical and causal models called graphical event models (GEMs) that represent temporal relations where the dynamics are governed by a multivariate point process.

### IBM’s Squawk Bot AI helps make sense of financial data flood

In our recent work, we detail an AI and machine learning mechanism able to assist in correlating a large body of text with numerical data series used to describe financial performance as it evolves over time. Our deep learning-based system pulls out from large amounts of textual data potentially relevant and useful textual descriptions that explain the performance of a financial metric of interest – without the need of human experts or labelled data.

### COVID-19 a year later: What have we learned?

We’ve learned a lot during the past year about how to address global crises, but in my mind, one lesson cannot be ignored: The need for more strategic collaborations across institutions and sectors.