The ability to infer abstract high-level concepts from raw sensory inputs is a key part of human intelligence. Developing models that recapitulate this ability is an important goal in AI research. A fundamental challenge in this respect is disentangling the underlying factors of variation that give rise to the observed data. For example, factors of variation underlying images of people may include hair color, head tilt, or degree of smile. In our recent ICLR 2018 paper (Variational Inference of Disentangled Latent Concepts from Unlabeled Observations, written by Abhishek Kumar, Prasanna Sattigeri and Avinash Balakrishnan), we describe a principled approach for unsupervised learning of disentangled hidden factors from a large pool of unlabeled observations. Disentanglement of latent factors is an important building block in the development and widespread acceptance of machine learning solutions.
Moving to unsupervised data
Most real-world scenarios involve raw observations without any supervision about the generative factors. For such data, we can rely on latent generative models such as the variational autoencoder (VAE) (Kingma and Welling, 2014), which aim to maximize the likelihood of generating new examples that match the observed data. VAE models have a natural inference mechanism built in and thus allow principled enhancements to the learning objective that encourage disentanglement in the latent space.
Variational Autoencoder (VAE): a brief overview
A VAE starts with a generative model of the data that samples latents z from a prior p(z) and then samples the observation x from pθ(x|z), where θ denotes the parameters of the generator (or decoder). The problem of inference is to compute the posterior of the latents conditioned on the observation x:
pθ(z|x) = pθ(x|z) p(z) / ∫pθ(x|z) p(z) dz
This posterior is typically intractable, so a VAE approximates it with a recognition model qϕ(z|x), parameterized by ϕ, that encodes an inverse map from the observations to the approximate posteriors. The recognition model parameters are learned by solving the optimization problem:
min over ϕ of E_x[ KL(qϕ(z|x) ‖ pθ(z|x)) ]
where the outer expectation is over the true data distribution p(x), from which we have samples. This can be shown to be equivalent to maximizing what is referred to as the evidence lower bound (ELBO):
max over θ, ϕ of E_x[ E_{z∼qϕ(z|x)}[log pθ(x|z)] − KL(qϕ(z|x) ‖ p(z)) ]
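The ELBO combines a reconstruction term with a KL term that pulls each approximate posterior toward the prior. The following is a minimal sketch of a single-sample Monte Carlo estimate, assuming a diagonal Gaussian recognition model and a standard normal prior p(z) = N(0, I). Both are common choices but are assumptions here, not something this post specifies. The names `gaussian_kl_to_standard_normal`, `elbo_estimate`, and the `decode_log_lik` callable are hypothetical, introduced only for illustration.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per row.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def elbo_estimate(x, mu, logvar, decode_log_lik, rng):
    """Single-sample Monte Carlo estimate of the batch-averaged ELBO.

    x:              observations, shape (batch, D)
    mu, logvar:     encoder outputs for q(z|x), each of shape (batch, d)
    decode_log_lik: hypothetical callable returning log p(x|z), shape (batch,)
    rng:            a numpy.random.Generator
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps        # reparameterized sample from q(z|x)
    reconstruction = decode_log_lik(x, z)      # estimates E_q[log p(x|z)]
    kl = gaussian_kl_to_standard_normal(mu, logvar)
    return float(np.mean(reconstruction - kl))

# Toy usage: a fixed linear decoder with a unit-variance Gaussian likelihood.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))                # hypothetical decoder weights

def toy_log_lik(x, z):
    diff = x - z @ W
    return -0.5 * np.sum(diff**2, axis=1) - 0.5 * x.shape[1] * np.log(2 * np.pi)

x = rng.standard_normal((8, 4))
mu = rng.standard_normal((8, 2))
logvar = np.zeros((8, 2))
print(elbo_estimate(x, mu, logvar, toy_log_lik, rng))
```

In a real model, `mu` and `logvar` would come from the encoder network and the log-likelihood from the decoder network, with gradients taken through the reparameterized sample.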
Where VAE falls short
To infer disentangled factors, the inferred prior (or expected variational posterior), qϕ(z) = ∫qϕ(z│x)p(x)dx, should be factorizable along its dimensions. This can be achieved by minimizing a suitable distance between the inferred prior qϕ(z) and the disentangled generative prior p(z). We can also define the expected posterior as pθ(z) = ∫pθ(z│x)p(x)dx. If we take the KL-divergence as our choice of distance, then by relying on its pairwise convexity it can be shown that the distance between qϕ(z) and pθ(z) is bounded by the objective of variational inference, which maximizing the ELBO minimizes: KL(qϕ(z) ‖ pθ(z)) ≤ E_x[ KL(qϕ(z|x) ‖ pθ(z|x)) ].
This is why the original VAE has been observed to exhibit some disentangling behavior on simple datasets such as MNIST. However, this behavior does not carry over to more complex datasets unless extra supervision on the generative factors is provided. This can be due to: (i) the true data distribution p(x) and the modeled data distribution pθ(x) = ∫pθ(x│z)p(z)dz being far apart, which in turn causes p(z) and pθ(z) to be far apart; and (ii) the non-convexity of the ELBO objective, which prevents us from reaching the global optimum.
Matching the distributions to get disentanglement
To explicitly encourage disentanglement during inference, we added a term D(qϕ(z) ‖ p(z)), a distance between the inferred prior and the generative prior, to the objective (Figure 1).
We adopted a simple yet effective approach of matching the moments of the two distributions. Matching the covariance of the two distributions will amount to decorrelating the dimensions of the inferred prior. We call this modified ELBO model “Disentangled Inferred Prior-VAE” or DIP-VAE.
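One way to picture the covariance-matching idea is as a penalty that pushes the covariance of the inferred prior toward the identity. The sketch below assumes a standard normal prior p(z) = N(0, I) and a Gaussian recognition model with means `mu` and optional diagonal variances `var`. The function names and the weights `lambda_od` and `lambda_d` are hypothetical, and the exact regularizer used in the paper may differ in its details.

```python
import numpy as np

def inferred_prior_covariance(mu, var=None):
    """Estimate the covariance of the inferred prior q(z) from a batch.

    mu:  encoder means, shape (n, d)
    var: optional encoder diagonal variances, shape (n, d)

    With var=None this is the covariance of the encoder means alone.
    With var given, the law of total covariance is used:
    Cov[z] = E_x[Cov[z|x]] + Cov_x[E[z|x]].
    """
    cov = np.atleast_2d(np.cov(mu, rowvar=False, bias=True))
    if var is not None:
        cov = cov + np.diag(np.mean(var, axis=0))
    return cov

def moment_matching_penalty(mu, var=None, lambda_od=1.0, lambda_d=1.0):
    """Penalize deviation of the inferred-prior covariance from the identity.

    Off-diagonal entries are pushed to 0 (decorrelation) and diagonal
    entries to 1 (matching the assumed unit-variance prior).
    """
    cov = inferred_prior_covariance(mu, var)
    diag = np.diag(cov)
    off_diag = cov - np.diag(diag)
    return float(lambda_od * np.sum(off_diag**2) + lambda_d * np.sum((diag - 1.0)**2))

# Two perfectly correlated latent dimensions are penalized;
# decorrelated unit-variance dimensions are not.
correlated = np.array([[1.0, 1.0], [-1.0, -1.0]])
decorrelated = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(moment_matching_penalty(correlated), moment_matching_penalty(decorrelated))
```

During training, a penalty of this kind would be subtracted from the ELBO, so the weights control how strongly decorrelation is traded against reconstruction quality.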
A new metric for measuring disentanglement
We also proposed a new metric to evaluate the degree of disentanglement, assuming that the ground truth values of the attributes to disentangle are known. We refer to this as the Separated Attribute Predictability (SAP) score. We found this score to align well with the qualitative disentanglement observed in the decoder’s output during latent traversals. To compute SAP, we first constructed a d × k score matrix S (for d latents and k generative factors) whose (i, j)th entry is the linear regression or classification score (depending on the generative factor type) of predicting the jth factor using only the ith latent (Figure 2).
For each column of the score matrix, which corresponds to a generative factor, we calculated the difference between the top two entries (corresponding to the two most predictive latent dimensions) and then took the mean of these differences as the final SAP score. A high SAP score indicates that each generative factor is primarily captured in only one latent dimension. Figure 3 qualitatively shows the mapping of a few selected latents to real-world concepts for CelebA face images (Liu et al., 2015).
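The SAP computation described above can be sketched as follows for continuous generative factors. As an assumption of this sketch, the regression score is taken to be the R² of a one-variable linear regression, which equals the squared Pearson correlation. Factors that need a classification score are not covered, and the function name `sap_score` is hypothetical.

```python
import numpy as np

def sap_score(latents, factors):
    """Separated Attribute Predictability for continuous factors.

    latents: shape (n, d), with d >= 2 latent dimensions
    factors: shape (n, k) ground-truth generative factors
    Returns (sap, scores), where scores is the d x k score matrix.
    """
    d, k = latents.shape[1], factors.shape[1]
    scores = np.zeros((d, k))
    for i in range(d):
        for j in range(k):
            z, f = latents[:, i], factors[:, j]
            if z.std() == 0 or f.std() == 0:
                continue  # a constant column has no predictive power
            r = np.corrcoef(z, f)[0, 1]
            scores[i, j] = r**2  # R^2 of regressing factor j on latent i
    # Sort each column ascending and keep the two largest scores.
    top_two = np.sort(scores, axis=0)[-2:, :]
    sap = float(np.mean(top_two[1] - top_two[0]))
    return sap, scores

# Two independent factors, each copied into its own latent dimension,
# versus both latents carrying the same mixture of the two factors.
factors = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
mixed = np.column_stack([factors.sum(axis=1), factors.sum(axis=1)])
print(sap_score(factors, factors)[0], sap_score(mixed, factors)[0])
```

In the toy usage, the cleanly separated latents yield a large gap between the best and second-best latent for each factor, while the mixed latents yield no gap, since both dimensions predict each factor equally well.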
Disentangled representations can impact several important areas in machine learning and AI. They can lead to decisions that are potentially comprehensible by humans, improving interpretability. They are arguably better suited to transfer learning, as most of the key underlying generative factors appear segregated along feature dimensions and can be selectively shared across various tasks. These representations can also enable controllable generation of new data through generative models. We also believe that disentanglement is a useful prior for representations learned without supervision, enabling better transfer to supervised tasks later on. Finally, machine learning systems built on disentangled representations will be more amenable to inspection, which helps build trust in them.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 2015.