
IBM-Stanford team’s solution of a longstanding problem could greatly boost AI


A team of mathematicians has resolved an issue with the ‘optimal transport’ technique that compares distributions in machine learning – effectively getting rid of the infamous ‘curse of dimensionality’

 

It all started with military barracks — and math.

In the 1940s, during World War II, the Russian mathematician Leonid Kantorovich wanted to minimize the time soldiers spent traveling from their barracks to the front line. In science speak, this planning problem of moving things between two positions at minimal cost is dubbed ‘optimal transport,’ and it has been tickling mathematicians’ brainwaves for decades. Even for centuries, to an extent: the field traces back to 1781 and the early ideas of Gaspard Monge, the father of differential geometry.

Optimal transport includes the basic problem of comparing probability distributions — mathematical functions that give the probabilities of the different possible outcomes of an experiment. An optimal transport-based measure known as the Wasserstein distance — named after another Russian mathematician, Leonid Vaseršteĭn, who introduced the idea in 1969 — is now increasingly used in statistics and machine learning for this purpose.
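To make this concrete, here is a minimal sketch using SciPy’s one-dimensional Wasserstein distance; the Gaussian samples and the 0.5 mean shift are illustrative choices of our own, not an example from the paper:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
p = rng.normal(loc=0.0, scale=1.0, size=1000)  # sample from N(0, 1)
q = rng.normal(loc=0.5, scale=1.0, size=1000)  # sample from N(0.5, 1)

# Wasserstein-1 distance between the two empirical distributions;
# for these samples it lands close to the true mean shift of 0.5.
print(wasserstein_distance(p, q))
```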

But brilliant as it might be, until now the Wasserstein distance has been thought to suffer from a hiccup: the so-called ‘curse of dimensionality.’ In other words, in theory the Wasserstein distance calculations had to crumble, with the number of samples and the computation needed growing dramatically, as the problem dimension surged. But in practice the calculations worked, meaning real-world machine learning problems have been demonstrating a widening gap between the ‘curse’ predicted by the theory and the efficiency observed empirically.
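The statistical side of this curse is easy to see numerically. The sketch below (our own illustration, using the open-source POT library rather than anything from the paper) computes the Wasserstein distance between two independent samples of the same standard Gaussian: it vanishes quickly with sample size in one dimension, but barely moves in high dimension:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)

def empirical_w1(n, d):
    """W1 distance between two independent n-point samples
    drawn from the same standard Gaussian in d dimensions."""
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d))
    cost = ot.dist(x, y, metric="euclidean")  # pairwise distance matrix
    weights = np.full(n, 1.0 / n)             # uniform empirical weights
    return ot.emd2(weights, weights, cost)    # exact optimal transport cost

# The distance between samples of identical distributions decays
# quickly in low dimension but stays stubbornly large in high dimension.
for d in (1, 10, 100):
    print(d, [round(empirical_w1(n, d), 2) for n in (100, 1000)])
```

Classical results put this decay at roughly n^{-1/d} in dimension d, which is exactly the dimension dependence the ‘curse’ refers to.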

This gap is exactly what our recent research addresses. The work is by Nian Si (Stanford), Jose Blanchet (Stanford), Soumyadip Ghosh (IBM Research), and Mark Squillante (IBM Research).

Presented at NeurIPS 2020, this year’s leading global AI conference, our paper shows that with our approach the application of the Wasserstein distance metric not only works, but beats the ‘curse of dimensionality.’ We detail how the empirical Wasserstein distance converges at the standard rate, independent of the dimension of the problem.

Our mathematical formulation makes it possible, for the first time, to solve in a computationally efficient way problems over an infinite-dimensional hypothesis class of functions that can arise in practice. Our work can be applied in any real-world machine learning or statistical application where the dimension of the hypothesis class is not small, which is the case in most real-world applications.

Saying goodbye to the ‘curse’

In the paper, we establish new mathematical results that extend the existing theory of optimal transport. This includes reducing the computational burden by showing that the problem is equivalent to one that is easier and more efficient to solve. We also establish new results on the statistical convergence rate of the Wasserstein distance between an empirical distribution and the true (unknown) distribution over a class of functions (the ‘hypothesis class’) which can be infinite dimensional.

Figure 1: The robust Wasserstein profile R_n, the empirical distance to a set of distributions satisfying linear constraints, is a key performance metric that should drop to zero on the order of n^{-1/2} in general, and on the order of n^{-1} under stronger conditions (n being the size of a problem dataset). This was known to be true for finite sets of functions F_B. We show that this is also true when F_B is, more usefully, an infinite set of functions, a case that had previously been thought to suffer the ‘curse of dimensionality’. Here B(Ω) is the hypothesis class.

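In symbols, a schematic rendering of the profile described in the caption looks as follows; this is our illustrative reading of “distance to a set of distributions satisfying linear constraints,” and the exact cost and constraint form used in the paper may differ:

```latex
% Schematic form of the robust Wasserstein profile of Figure 1.
% P_n is the empirical distribution of the n data points, W is the
% Wasserstein distance, and the linear constraints E_Q[f] = 0 over
% the function set F_B are an assumed illustrative form.
R_n \;=\; \inf\Big\{\, W(P_n, Q) \;:\; \mathbb{E}_{Q}[f] = 0 \ \text{ for all } f \in F_B \,\Big\}
```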

 

The ‘hypothesis class’ is a class of functions that lets a user single out the set of problem characteristics that matter most when solving a machine learning problem. For example, restricting attention to linear test functions focuses the comparison of two distributions on their means along every direction.

Given the substantial power of the Wasserstein distance to discriminate between two distributions based on a very broad and detailed set of characteristics, one can constrain the Wasserstein distance over an appropriate hypothesis class in order to focus on the characteristics that are most important for the machine learning problem. In practice, such computations were thought to be possible only when the hypothesis class is finite dimensional.

To arrive at our results, we first establish a new strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality from classical optimal transport theory to include infinite-dimensional hypothesis classes. It is this duality result that supports more efficient computational methods, by rendering an equivalent (dual) problem that can be solved more easily and efficiently.
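For reference, the classical Kantorovich-Rubinstein duality that our result generalizes expresses the Wasserstein-1 distance as a supremum over the (infinite-dimensional) class of 1-Lipschitz test functions:

```latex
% Classical Kantorovich-Rubinstein duality for the Wasserstein-1 distance:
% the primal transport problem equals a supremum over 1-Lipschitz functions.
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big( \mathbb{E}_{P}[f] \;-\; \mathbb{E}_{Q}[f] \Big)
```

Dual formulations like this one turn an optimization over transport plans into an optimization over test functions, which is, at a high level, the same lever our strong duality result pulls for constrained, infinite-dimensional hypothesis classes.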

We also consider a few examples of applying our theoretical results, including in particular one example where the dimension of the hypothesis class is infinite. Further studying this infinite-dimensional case, we then establish new results on the rate of statistical convergence for the empirical Wasserstein distance.

This result shows that convergence is at the standard parametric rate of n^{-1/2}; our formulation can thus be used to beat the curse of dimensionality. Our final theoretical contributions extend both the strong duality and statistical convergence results to a more general mathematical setting (a ‘non-compact’ sample space). Please see Figure 1 for a more precise description of aspects of the infinite-dimensional hypothesis class example, together with our statistical convergence rates for the empirical Wasserstein distance.

We then conducted preliminary numerical experiments based on our theoretical results in the context of a simple hypothesis testing example. Our theory now allows users to specify preferred characteristics using an infinite-dimensional hypothesis class.

The experimental outcome shows that our method, based on our theoretical results, can efficiently distinguish between two distributions with user-preferred characteristics while providing good accuracy. These experimental results provide further insight and help to clarify why, despite the supposed curse of dimensionality, the Wasserstein distance metric still performs well, independent of the dimension of the hypothesis class and across a wide range of machine learning and statistical applications.
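For a flavor of such an experiment, here is a generic two-sample testing sketch, not the procedure from the paper: it uses the one-dimensional Wasserstein distance along a user-chosen projection direction w (a hypothetical stand-in for a ‘preferred characteristic’) as the test statistic, calibrated with a permutation test:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def wasserstein_permutation_test(x, y, n_perm=1000):
    """Two-sample test using the 1-D Wasserstein distance as the
    statistic, calibrated by a permutation test (a generic sketch,
    not the procedure from the paper)."""
    observed = wasserstein_distance(x, y)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if wasserstein_distance(perm[:len(x)], perm[len(x):]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value

# Encode a "user-preferred characteristic" as a projection direction w
# (hypothetical choice) and test along that single coordinate.
w = np.array([1.0, 0.0, 0.0])
x = rng.standard_normal((200, 3)) @ w                      # N(0, I) projected
y = (rng.standard_normal((200, 3)) + [0.7, 0.0, 0.0]) @ w  # shifted mean
print(wasserstein_permutation_test(x, y))
```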

From a theoretical perspective, we have established a general and fairly complete set of mathematical results that closes a fundamental gap between theory and practice in machine learning and statistical applications. At the same time, we remain interested in studying statistical convergence for general function classes and in developing efficient algorithms for these computations.

Our next big step is to expand the preliminary numerical experiments in the paper, which were intended as an initial demonstration of our theoretical results, to a broad set of applications that leverage our theoretical insights.

It may have started with military barracks, but there is an infinite universe of possibilities to explore with the wonderful notion of optimal transport.

 


 

Soumyadip Ghosh
Research Staff Member, IBM Research

Mark Squillante
Distinguished Research Staff Member and Manager, Foundations of Probability and Dynamics, Math Sciences, IBM Research
