#### IBM Research-Zurich

# Linear-Complexity Earth Mover’s Distance Approximations for Efficient Similarity Search

July 16, 2019 | Written by: Haris Pozidis and Kubilay Atasu

Categorized: IBM Research-Zurich

Share this post:

Measuring distances between entities is a fundamental function that powers many mission-critical applications across various sectors. Search, data mining, data discovery, clustering and classification all rely on discriminative distance measures between data objects. In addition to being accurate, today’s vast databases demand distance measures that are also scalable, i.e. can be computed efficiently even across large data collections.

The Earth Mover’s Distance (EMD), also known as Discrete Wasserstein distance, is a highly discriminative metric for measuring distance between probability distributions that has been applied successfully in various fields. It is known as EMD in the image processing field, Word Mover’s Distance (WMD) in text processing, and Wasserstein distance in optimization theory.

The Wasserstein distance, in addition to its applications in text and image retrieval, has important applications in the machine learning field. For instance, it can be used to evaluate the quality of the samples produced by a generative model by measuring the distance between the synthetic samples and original data. More recently, it has been shown that when training generative adversarial networks and auto-encoders, the stability of the model training and the accuracy of the samples generated can be improved if the Wasserstein metric is used as the loss function.

Despite its high discriminative capability, the EMD has not yet found wide adoption. This is because of its high computational complexity — cubic on the size of the input distribution which prohibits its application in very large collections of millions or more objects. To mitigate this problem, several prior works proposed approximations of the EMD with reduced complexity, but also reduced accuracy.

In our work [1], which was recently presented at the ICML 2019 conference, we propose a new tight approximation to the EMD which maintains its accuracy, yet has linear complexity and can be computed efficiently across millions of objects. Furthermore, our new method alleviates several important limitations of prior approximations.

### Word Movers Distance and approximations

In text retrieval, the Word Mover’s distance (WMD) has recently emerged as the state-of-the art semantic- similarity metric for unstructured text [2]. WMD captures semantic similarity by capitalizing on the concept of word embedding, which map words into a high-dimensional vector space such that words that are semantically similar are in close proximity to each other. For instance, given the sentences “The Queen to tour Canada” and “Royal visit to Halifax”, despite having no words in common, after stop word removal, WMD can detect that they are similar.

WMD offers high accuracy, but suffers from a high computational complexity, which grows cubically with the number of unique words in the documents. Therefore, Kusner et al. [2] proposed an approximation of WMD, called Relaxed WMD or RWMD, which has quadratic complexity. In our prior work [3], we proposed an algorithm for computing RWMD which has linear average time complexity when applied to large data sets. This algorithm achieves sub-second (120ms) query performance on data sets of 1 million documents with average size of 105 words.

### New distance measures

Having improved the scalability of RWMD to large collections of documents, we turn to investigate its accuracy. When does it become ineffective, and is it applicable to other signals, e.g. images? As it turns out, RWMD exhibits a higher approximation error compared to WMD when the overlap between histogram coordinates is high, as may be the case in long documents. Furthermore, RWMD becomes ineffective for dense histograms with background noise, such as large and uncurated text documents or images.

Can we find more effective distance measures? This is the question we set out to address in our recent work [1]. For completeness, we note that another effective distance measure, which closely approximates the EMD, has been proposed by Cuturi in [4]. Its complexity, however, is quadratic in the histogram size.

We are measuring distances between discrete probability distributions or histograms. Given two histograms, we need to transfer all the weights from the bins of the first histogram into the bins of the second histogram: we impose outflow constraints to ensure that. We also need to reconstruct the second histogram as a result: we use inflow constraints to ensure that.

Our goal is to minimize the overall transportation cost. The complexity of computing the optimal solution to the EMD is cubical in the size of the histograms. We have discussed a linear complexity approximation, the RWMD. We propose a tighter relaxation of the EMD problem than RWMD, yet still achieve a linear time complexity.

Our solution, called Iterative Constrained Transfers (ICT) algorithm, addresses the weaknesses of the RWMD method, namely the fact that it relaxes either the inflow or the outflow constraints completely. We propose the addition of edge capacity constraints when relaxing the inflow or outflow constraints. For instance, if the weight of a source histogram bin is 5, and the weight of a destination histogram bin is 3, we prevent the transfer of more than 3 units between the two by adding a capacity constraint of 3 on the edge that links the two bins. The principle of the ICT algorithm is illustrated in Figure 1.

The new problem we formulated can be solved optimally by sorting the edges in increasing order of transportation cost and by iteratively transferring weights from the source bins into the destination bins under capacity constraints. This is the ICT algorithm. The ICT algorithm itself can be approximated by fixing the number of iterations it performs. The result is the Approximate Constrained Transfers (ACT) algorithm. It can be shown that both ACT and ICT are tighter lower bounds of EMD than RWMD. Importantly, our algorithms achieve linear complexity when comparing one query document with a database of documents.

We have evaluated our algorithms on the popular 20 Newsgroups (text) and MNIST (images) datasets. The results are shown in Figures 2 and 3, respectively. The left and right panes illustrate the runtime results and the color-coded search accuracy results, respectively. The darkest red shows the precision at top-1, and then top-2, top-4, etc. results are shown in lighter red shades.

On 20 Newsgroups, our new algorithms are 20,000 times faster than WMD and match its search accuracy. In case of MNIST, we are 100,000 times faster than WMD and even achieve a slightly higher search accuracy.

### Summary

We have developed new distance measures that are tight approximations to the Earth Mover’s Distance (EMD) and thus offer high search and classification accuracy; yet, unlike the EMD, can be computed fast, and are scalable to data collections of millions of objects. Moreover, the new distances are broadly applicable to dense and sparse histograms alike, and for example can discriminate both text and images even if there is background noise.

The new distance measures can be computed in linear time complexity in the histogram size. We have demonstrated 20,000-fold speed-up with respect to the WMD without any loss of accuracy. We believe that these new distance measures will benefit the data mining and machine learning communities alike, and are looking forward to seeing a wealth of new and exciting applications in different fields. Finally, these new algorithms and their efficient CPU and GPU implementations have been recently integrated in IBM’s Watson Machine Learning Community Edition software, version 1.6.1, which was released in June 2019. The documentation for our library can be found here.

References

[1] Atasu, Kubilay, and Thomas Mittelholzer. “Linear-Complexity Data-Parallel Earth Mover’s Distance Approximations.” In International Conference on Machine Learning, pp. 364-373. 2019.

[2] Kusner, Matt, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. “From word embeddings to document distances.” In International Conference on Machine Learning, pp. 957-966. 2015.

[3] Atasu, Kubilay, Thomas Parnell, Celestine Dünner, Manolis Sifalakis, Haralampos Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, and Abdel Labbi. “Linear-complexity relaxed word Mover’s distance with GPU acceleration.” In 2017 IEEE International Conference on Big Data (Big Data), pp. 889-896. IEEE, 2017.

[4] Cuturi, Marco. “Sinkhorn distances: Lightspeed computation of optimal transport.” In Advances in neural information processing systems, pp. 2292-2300. 2013.

**Haris Pozidis**

Manager, Cloud Storage and Analytics Group, IBM Research

**Kubilay Atasu**

Research Staff Member, IBM Research

### We’ve moved! The IBM Research blog has a new home

In an effort better integrate the IBM Research blog with the IBM Research web experience, we have migrated to a new landing page: https://research.ibm.com/blog

### Pushing the boundaries of human-AI interaction at IUI 2021

At the 2021 virtual edition of the ACM International Conference on Intelligent User Interfaces (IUI), researchers at IBM will present five full papers, two workshop papers, and two demos.

### From HPC Consortium’s success to National Strategic Computing Reserve

Founded in March 2020 just as the pandemic’s wave was starting to wash over the world, the Consortium has brought together 43 members with supercomputing resources. Private and public enterprises, academia, government and technology companies, many of whom are typically rivals. “It is simply unprecedented,” said Dario Gil, Senior Vice President and Director of IBM Research, one of the founding organizations. “The outcomes we’ve achieved, the lessons we’ve learned, and the next steps we have to pursue are all the result of the collective efforts of these Consortium’s community.” The next step? Creating the National Strategic Computing Reserve to help the world be better prepared for future global emergencies.