#### AI

# IBM-Stanford team’s solution of a longstanding problem could greatly boost AI

December 16, 2020 | Written by: Soumyadip Ghosh and Mark Squillante

Categorized: AI

Share this post:

*A team of mathematicians has resolved an issue with the ‘optimal transport’ technique that compares distributions in machine learning – effectively getting rid of the infamous ‘curse of dimensionality’*

It all started with military barracks — and math.

In the 1940s during World War II, Russian mathematician Leonid Kantorovich wanted to minimize the time soldiers would spend getting from their barracks to the front line. In science speak, this planning problem of minimizing the distance between two positions is dubbed ‘optimal transport’ and has been tickling mathematicians’ brainwaves for decades. Even centuries, to an extent, going back to 1781 and the father of differential geometry Gaspard Monge, with his early ideas in the field.

Optimal transport includes the basic problem of comparing probability distributions — mathematical functions that give the probabilities of different possible outcomes of an experiment. An optimal transport-based measure known as the Wasserstein distance — from another Russian mathematician Leonid Vaseršteĭn who introduced the idea in 1969 — is now increasingly used in statistics and machine learning for this purpose.

But brilliant as it might be, until now the Wasserstein distance has been thought to suffer from a hiccup – a so-called ‘curse of dimensionality.’ In other words, in theory the Wasserstein distance calculations had to crumple, with computation time growing significantly, as the problem dimension surged. But in practice, the calculations worked – meaning real-world machine learning problems have been demonstrating an increasing gap between the ‘curse’ defined by the theory and the efficiency of the empirical performance in practice.

This gap is exactly what we’ve managed to address in our recent research, by Nian Si (Stanford), Jose Blanchet (Stanford), Soumyadip Ghosh (IBM Research), and Mark Squillante (IBM Research).

Presented at this year’s leading global AI conference NeurIPS 2020, our paper shows that with our approach, application of the Wasserstein distance metric not only works — but that it beats the “curse of dimensionality.” We detail how the empirical Wasserstein distance converges at the standard rate, totally independent of the dimension of the problem.

Our mathematical formulation makes it possible for the first time to solve — in a computationally efficient way — problems over an infinite-dimensional hypothesis class of functions that can arise in practice. Our work can be applied in any real-world machine learning or statistical application where the dimension of the hypothesis class is not small — as is the case in most real-world applications.

### Saying goodbye to the ‘curse’

In the paper, we establish new mathematical outcomes that extend the existing results of optimal transport. This includes reducing the computational burden by showing the problem is equivalent to one that is easier and more computationally efficient to solve. We also establish new results on the statistical convergence rate for Wasserstein distances between an empirical distribution and the true (unknown) distribution over a class of functions (the ‘hypothesis class’) which can be infinite dimensional.

The ‘hypothesis class’ is a class of functions that makes it possible for a user to highlight the set of problem characteristics that are of greatest importance as part of solving a machine learning problem.

Given the substantial power of the Wasserstein distance to discriminate between two distributions based on a very broad and detailed set of characteristics, one can use an appropriate hypothesis class over which to constrain the Wasserstein distance in order to focus on the set of characteristics that are most important for the machine learning problem. In practice such computations were thought to be possible only when the hypothesis class is finite dimensional.

To arrive at our results, we first establish a new strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality from the existing optimal transport theory, which now includes infinite-dimensional hypothesis classes. It is this duality result that supports more efficient computational methods by rendering an equivalent (dual) problem which can be solved more easily and efficiently.

We also consider a few examples of applying our theoretical results, including in particular one example where the dimension of the hypothesis class is infinite. Further studying this infinite-dimensional case, we then establish new results on the rate of statistical convergence for the empirical Wasserstein distance.

This result shows that convergence is at the standard parametric rate – thus our formulation can be used to beat the curse of dimensionality. Our final theoretical contributions extend both the strong duality and statistical convergence results to a more general mathematical setting (‘non-compact’ sample space). Please see Figure 1 for a more precise description of aspect of the infinite-dimensional hypothesis class example together with our statistical convergence rates for the empirical Wasserstein distance.

We then conducted preliminary numerical experiments based on our theoretical results in the context of a simple hypothesis testing example. Our theory now allows users to specify preferred characteristics using an infinite-dimensional hypothesis class.

The experimental outcome shows that our method, based on our theoretical results, can efficiently distinguish between two distributions with user-preferred characteristics while providing good accuracy. These experimental results provide further insights and help to clarify why, despite the curse of dimensionality, the Wasserstein distance metric still performs well. And that, independent of the dimension of the hypothesis class, and across a wide range of machine learning and statistical applications.

From a theoretical perspective, we’ve established a general and fairly complete set of mathematical results to solve a fundamental gap between theory and practice in machine learning and statistical applications. At the same time, we are interested in studying statistical convergence for general function classes and in developing efficient algorithms to compute .

Our next big step is also to expand the preliminary numerical experiments in the paper – which were intended to be an initial demonstration of our theoretical results — to a broad set of applications that leverage our theoretical insights.

It may have started with military barracks — but there is an absolutely infinite universe of possibilities to explore with the wonderful notion of optimal transport.

## Inventing What’s Next.

#### Stay up to date with the latest announcements, research, and events from IBM Research through our newsletter.

**Soumyadip Ghosh**

Research Staff Member, IBM Research

**Mark Squillante**

Distinguished Research Staff Member and Manager, Foundations of Probability and Dynamics, Math Sciences, IBM Research

### Moving beyond the self-reported scale: Objectively measuring chronic pain with AI

Together with Boston Scientific, we are presenting research that details the feasibility and progress towards our new pain measurement method at the 2021 North American Neuromodulation Society Annual Meeting.

### How the world’s first smartwatch inspired cutting-edge AI

Between 2000 and 2001, IBM Research made headlines when it launched an internet-enabled designer watch running Linux, an open-source operating system. Dubbed WatchPad, its aim was to demonstrate the capabilities of the then-novel OS for mobile and embedded devices.

### Who. What. Why. New IBM algorithm models how the order of prior actions impacts events

To address the problem of ordinal impacts, our team at IBM T. J. Watson Research Center has developed OGEMs – or Ordinal Graphical Event Models – new dynamic, probabilistic graphical models for events. These models are part of the broader family of statistical and causal models called graphical event models (GEMs) that represent temporal relations where the dynamics are governed by a multivariate point process.