Modern data processing systems should be capable of ingesting, storing and searching across a prodigious amount of textual information. Efficient text search encompasses both a high quality of the results and the speed of execution across millions of documents.
The amount of unstructured text-based data is growing every day. Querying, clustering, and classifying this big data requires similarity computations across large sets of documents. Whereas low-complexity similarity metrics are available, attention has been shifting towards more complex methods that achieve a higher accuracy by making use of semantic similarity.
Artificial Intelligence and deep learning have heralded a new era in document similarity by capitalizing on vast amounts of data to resolve issues related to text synonymy and polysemy. Today, the quality of search results is guaranteed by a new class of text representations based on neural networks, such as the word2vec representation. The word2vec approach uses neural networks to map words to an appropriate vector space, wherein words that are semantically synonymous will be close to each other. The Word Mover’s Distance (WMD) [Figure 1] proposed recently by Kusner et. al., capitalizes on such vector representations to capture the semantic similarity between text documents.
WMD tries to find the minimum cost to transform one text to another in the embedding space. An example of this is shown in Figure 1, which can effectively map the sentence “The Queen to tour Canada” to the sentence “Royal visit to Halifax”, even though (after “stopword” removal) they have no words in common.
WMD has been shown to work very effectively in practice, outperforming both in precision and in recall many traditional IR similarity measures. However, WMD is costly to compute because it solves a minimum cost flow problem, which requires cubic time in the average number of words in a document. Previous research has proposed a lower-bound to the WMD called Relaxed WDM, or RWMD, which in practice, retrieves documents very similar to those produced by WMD, but at a much lower cost. However, RWMD still exhibits quadratic execution time complexity and can prove costly when searching across millions of documents.
The result is that, when computing the similarity of one document with h words against n documents each having on average h words using a vector space of dimension m, the complexity of the brute-force RWMD is O(nh2m), whereas under our methodology the complexity is O(nhm).
The linear-complexity RWMD maps well onto the linear algebra primitives supported by modern GPU (Graphics Processing Unit) devices and can be distributed across multiple GPUs for parallel execution. To demonstrate the performance of the new method, we have used two datasets of news documents.
Set 1 comprises 1 million documents, with an average number of 150 unique words per document. Set 2 comprises 2.8 million words, with an average number of 30 words per document. A set of documents is posed as the query.
For each document in the query, the search retrieves the k-most-similar documents in the database using a cluster of 16 NVIDIA Tesla P100 GPUs. The comparison of the execution time per query document between WMD, RWMD, and the linear-complexity RWMD is given in Figure 2, which shows a ~2800x improvement in speed with respect to a CPU-based implementation of the quadratic-complexity RWMD method by combining GPU acceleration and algorithmic improvements, where ~150x improvement is due to the new algorithm. In addition, we demonstrate a ~30000x improvement in speed with respect to a GPU-accelerated WMD implementation.
These results suggest that we could use the high-quality matches of the RWMD to query — in sub-second time — at least 100 million documents using only a modest computational infrastructure.
IBM is currently evaluating the technology internally to ingest business news to keep sellers updated on client news and industry trends.
We also believe semantic similarity can also help us address the “fake news” problem. For example, imagine that we have received a piece of news from an untrusted source and we would like to understand whether it’s fabricated or real. The first way to achieve this is to locate the correct and accurate information it is based on, if at all. Now imagine that we have a database of trusted documents. We can locate the documents that talk about the same topic in the trusted database using semantic similarity. Then, we can use these documents to validate or disprove the contents of the untrusted news, for instance, using question answering schemes.
 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.CoRR, abs/1301.3781, 2013.
 Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. ICML 2015: 957–966.
 Kubilay Atasu, Thomas Parnell, Celestine Dünner, Manolis Sifalakis, Haralampos Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi. Linear-Complexity Relaxed Word Mover’s Distance with GPU Acceleration. IEEE Big Data 2017.