# Optimal Transport for Label Switching: Using Geometry to Solve Problems in AI

Authored by: Pierre Monteiller (ENS), Sebastian Claici (MIT), Edward Chien (MIT), Farzaneh Mirzazadeh (IBM), Justin Solomon (MIT), Mikhail Yurochkin (IBM)

Machine learning and AI play an essential role in critical decision-making that affects all of our lives. Today, AI systems do everything from helping approve your mortgage to aiding medical professionals in diagnosing a health condition to recognizing when your car strays outside of your lane while driving. Despite many successes, the research community and practitioners are becoming increasingly aware of limitations of today’s approaches. One of the key challenges, which is of great importance in safety critical applications, is how to recognize and quantify uncertainty. Decision support systems should know when they don’t know.

A branch of statistics called Bayesian inference quantifies uncertainty. Instead of searching for a single answer, a Bayesian model looks for the posterior distribution of the parameters. The posterior distribution indicates the certainty of each answer. However, learning from this distribution is challenging because it assigns a confidence value to every choice of parameters. Thus, learning from the true posterior can easily become intractable.

Our new paper from the MIT-IBM Watson AI Lab and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), “Alleviating Label Switching with Optimal Transport,” considers how the optimal transport can efficiently “summarize” this uncertainty for a class of popular decision making problems. Optimal transport is the mathematical theory of calculating the most efficient way to get resources from one place to another. We show how this theory is relevant not just to matching supply and demand for logistics problems but also to more abstract challenges in Bayesian inference. Alleviating Label Switching with Optimal Transport will be featured at the 2019 NeurIPS conference.

## Label Switching

Imagine that we want to identify the date at which an important historical event occured. We no longer have access to historical records, but we believe that we can infer the approximate date by asking sufficiently many people around the world and averaging their responses. An ambiguity occurs because there is no globally standard date format–dd/mm/yy, mm/dd/yy, and yy/mm/dd are all used throughout the world–and thus a naive elementwise averaging can lead to a meaningless estimate. Variants of this issue occur to a much larger scale in many ML problems.

In Bayesian inference, decision-making is often associated with finding the posterior mean. Label switching is the situation when simply averaging Markov Chain Monte Carlo (MCMC) samples does not give the desired posterior mean. It is a side effect of symmetry in Bayesian models and hampers conventional inference methods.

Consider a dataset  to which we fit a model with  parameters. Our task is to summarize the posterior likelihood function $$p(\theta_1,\ldots,\theta_k)$$. A problem arises if the likelihood $$p$$ happens to be invariant to some action on the $$\theta_i$$. For example, if we try to fit a mixture of Gaussians, then the likelihood remains unchanged if we permute the $$\theta_i$$’s.

To illustrate why this can cause problems when computing posterior means, consider the following simple example: We are given draws from $$0.5N(-1,0.25) + 0.5N(1,0.25)$$, and we fit a mixture of two Gaussians with unknown means, but fixed variance $$0.25$$. Due to label switching, draws from the posterior distribution of the two means will be close to either $$(-1,1)$$ or $$(1,-1)$$. Naively averaging here will lead to the uninformative mean of $$(0,0)$$, the average of $$(-1,1)$$ and $$(1,-1)$$.

## The math behind our solution

Our paper demonstrates how to tackle issues surrounding label switching using the theory of optimal transport. First, let’s examine the high level picture. We are interested in the mean of the posterior distribution, denoted $$E_{\theta\sim P (D)}[\theta]$$, where $$E_{\theta\sim P (D)}$$ stands for the expectation (average) under draws from the posterior distribution.

We cannot take this average naively since there is some group $$G$$ (such as permutations) acting on each $$x$$ such that $$P(D) = P(g \cdot D)$$ for all group members $$g$$, and this symmetry can lead to useless means.

The figure on the left shows where this ambiguity comes from. The solid red x represents a sample from the posterior distribution, while the solid blue disk is our current estimate of the posterior mean. Due to label switching, the light red samples are equally likely to be sampled, and thus an update in the direction of just one of them is incorrect.

Instead we can look for a transformation $$f$$ of our parameter space with two properties:

1. Transformed parameters are invariant to group action: $$g \cdot f(\theta) = f(\theta)$$ for any $$\theta$$ and $$g$$.
2. The transformation $$f$$ has an inverse $$f^{-1}$$ that recovers points in parameter space.

Looking again at the figure above, an ideal update would take all possible symmetries of the red sample into account. This leads us to the following idea: Instead of computing expectations between samples, we can treat each sample as a uniform distribution over its orbit under the group action as shown on the left.

If we take this lifting as our transform $$f$$, we have to solve two problems. First, what is a good notion of expectation when our random variables are themselves distributions? Second, even if we have such a notion, we must show that the expectation is well behaved for $$f^{-1}$$ to be defined.

We leverage optimal transport to tackle both of these issues.

## Optimal transport

Suppose we want to construct buildings in certain locations throughout the world, and the materials required for their construction are at other locations.  Consider the question: “What is the most cost-efficient way of transporting the required materials to each of the construction sites?” We can use optimal transport to solve this problem.

If the cost of moving material from point $$A$$ to point $$B$$ is the same as the squared distance between $$A$$ and $$B$$, then the total cost is known as the 2-Wasserstein distance $$W_2$$ between the distributions. Wasserstein distances have recently demonstrated its utility for many problems in AI, such as Word Mover Distance in Natural Language Processing and a Wasserstein-inspired generative adversarial network (GAN).

In analogy to the definition of a weighted average of points $$x_1, \ldots, x_n$$ in Euclidean space as the minimizer of $$\sum_{i=1}^n \lambda_i \| x- x_i \|^2$$ over vectors $$x \in R^n$$, we can use the 2-Wasserstein distance to define a weighted average of distributions $$\mu_1, \ldots, \mu_n$$ in probability space as a solution of the optimization problem

${arg\, min}_{\nu \in P(X)} \sum_{i=1}^n \lambda_i W^2_2 (\mu_i, \nu).$

This is known as the Wasserstein barycenter of the distributions $$\mu_i$$, and its solution $$\nu$$ is the distribution that is closest to all $$\mu_i$$ under the Wasserstein distance simultaneously. If we take this as our notion of average, then we can define the expectation in our label switching approach as

${inf}_{\nu\in P (X)}E_{\theta\sim P (D)}\left[W_2^2 (f (\theta),\nu)\right]$

where, recall, $$f(\theta)$$ takes a sample from the posterior and turns it into a distribution over all points in the orbit under group actions.

This notion of average allows us to invert the orbit operation and go from a distribution to a single point by applying the quotient map of the group. In the paper, we show that this is a well-defined operation that yields a single point $$\mathrm{\Theta}$$.

While this approach is infeasible for large groups, we can usually work directly in the quotient space and use stochastic gradient descent to update our estimate of the mean represented as a single point rather than its orbit in $$G$$. This more scalable approach is illustrated in the last figure.

Many challenges remain, but our work makes an important step toward wider adoption of Bayesian inference in practice and can be easily combined with popular probabilistic programming languages. The use case for optimal transport provides new applications for determining and measuring uncertainty, an important component of machine learning and AI system support for decision making.

Previous Post

Next Post

More AI stories

### We’ve moved! The IBM Research blog has a new home

In an effort better integrate the IBM Research blog with the IBM Research web experience, we have migrated to a new landing page: https://research.ibm.com/blog