December 5, 2019 | Written by: Youssef Mroueh, Mattia Rigotti, and Inkit Padhi
Interpretable and Reproducible Discoveries With Deep Neural Networks
A central question in data science is determining the variables or features in a dataset that are relevant for the prediction of a particular outcome. The problem is known as feature selection, and it has important applications in many disciplines including business and science. Genomics is one example, as establishing which specific genes are responsible for a particular disease can impact crucial medical decisions.
Feature selection is also a cornerstone of interpretability, a requirement that favors the deployment of decision models that are both usable and trustworthy. As a consequence, interpretability is increasingly sought after in many industries such as healthcare and finance. Reliability, in terms of controlling the False Discovery Rate (FDR) of selecting experimental variables, is also fundamental in basic research, as a means to mitigate the reproducibility crisis discussed in recent scientific literature.
At the 2019 Conference and Workshop on Neural Information Processing Systems (NeurIPS) on December 8 – 14 in Vancouver, British Columbia, IBM Research AI will showcase a new paper, “Sobolev Independence Criterion.” In this paper, our team proposes a novel interpretable dependency measure, called Sobolev Independence Criterion (SIC), that provides feature importance scores that can be used for controlling FDR.
Our NeurIPS paper will be presented on December 10 from 10:45AM to 12:45PM at the East Exhibition Hall B + C #46. The code for reproducing the results obtained in the paper is available here.
Feature Selection and FDR Control
Given a high dimensional vector x (e.g., gene expressions), the feature selection problem consists of selecting the subset of variables xj that are relevant to predict a response variable y (e.g., presence or absence of a disease). This is typically done by creating a prediction model that is rich enough to capture the dependencies between x and y and defines importance scores for each feature, and a selection method that uses those scores to rank and select the features while controlling for false discoveries.
A principled way to determine the key features is to treat feature selection as a multiple hypothesis testing problem, with one conditional-dependence hypothesis per feature. This technique is at the heart of the Holdout Randomization Test (HRT) and the Knockoff Filter, both introduced recently for FDR control.
The idea is to test the hypothesis that xj is not important to determine y. If this (null) hypothesis is rejected by the data, we can say that xj is indeed relevant. The powerful aspect of this approach is that the null hypothesis can be expressed as the assumption that y and feature xj are independent given the other features. It is then straightforward to simulate this scenario through sampling and to evaluate how compatible it is with the dependency observed in the data (i.e., to compute p-values).
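The testing procedure described above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of HRT, not the reference implementation: the predictive model is plain least squares, the conditional distribution of xj given the other features is assumed to be Gaussian with a linear mean, and the names fit_linear, predict_linear and hrt_p_value are hypothetical helpers introduced only for this example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def hrt_p_value(X_tr, y_tr, X_te, y_te, j, n_draws=200, rng=None):
    """p-value for the null 'y is independent of x_j given the other features'."""
    rng = np.random.default_rng() if rng is None else rng

    # 1. Fit the predictive model once, on the training split only.
    model = fit_linear(X_tr, y_tr)

    def heldout_mse(X):
        return np.mean((y_te - predict_linear(model, X)) ** 2)

    observed = heldout_mse(X_te)

    # 2. Assumed conditional model: x_j | x_-j is Gaussian with a linear mean.
    others = [k for k in range(X_tr.shape[1]) if k != j]
    cond = fit_linear(X_tr[:, others], X_tr[:, j])
    resid_std = np.std(X_tr[:, j] - predict_linear(cond, X_tr[:, others]))
    cond_mean_te = predict_linear(cond, X_te[:, others])

    # 3. Simulate the null by resampling x_j on the held-out split.
    null_losses = np.empty(n_draws)
    for b in range(n_draws):
        X_null = X_te.copy()
        X_null[:, j] = cond_mean_te + resid_std * rng.standard_normal(len(X_te))
        null_losses[b] = heldout_mse(X_null)

    # 4. Small p-value: the real x_j predicts y better than its null copies.
    return (1 + np.sum(null_losses <= observed)) / (n_draws + 1)

# Toy data: correlated features, only feature 0 drives the response.
rng = np.random.default_rng(0)
n, d = 600, 5
Z = rng.standard_normal((n, d))
X = Z + 0.5 * Z[:, [0]]  # shared component makes the features correlated
y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

p_relevant = hrt_p_value(X_tr, y_tr, X_te, y_te, j=0, rng=rng)
p_irrelevant = hrt_p_value(X_tr, y_tr, X_te, y_te, j=3, rng=rng)
print(p_relevant, p_irrelevant)
```

The model is fit only once and all resampling happens on held-out data, which is what keeps this kind of test cheap. In practice the per-feature p-values would then be passed to a multiple-testing procedure to control the FDR.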
Until now, the most popular methods for feature selection have been sparse linear models like Lasso and Elastic Nets. While such linear models have the advantage of being interpretable, and their weights can be used as feature importance scores, they fail to capture non-linear dependencies. Random Forests are another popular technique that can capture non-linear dependencies, but they have the downside that they only give rise to heuristic feature importance scores. Deep networks are known to have high capacity for capturing non-linear dependencies, and some post hoc heuristics have been proposed to define feature importance for deep networks. The aim of our work is to define proper feature statistics for deep networks that can be used for principled feature selection in conjunction with FDR control methods such as HRT and knockoffs.
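To make the limitation of linear weights concrete, here is a small toy example. It uses unregularized least squares as a stand-in for Lasso or Elastic Net (an assumption made to keep the sketch dependency-free), and a response that depends on the first feature only through its square.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.standard_normal((n, 2))
# y depends on x0 only, but through a non-linear (quadratic) relation.
y = x[:, 0] ** 2 + 0.1 * rng.standard_normal(n)

# Linear model with intercept: its weights serve as importance scores.
A = np.column_stack([x, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_weights = coef[:2]

# The dependency is obvious once the right non-linearity is considered.
nonlinear_corr = np.corrcoef(x[:, 0] ** 2, y)[0, 1]

print("linear weights:", linear_weights)
print("corr(x0^2, y): ", nonlinear_corr)
```

Both linear weights come out close to zero even though y is almost fully determined by x0, so a selection rule based on those weights would discard the only relevant feature.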
SIC: Feature Selection as Dependency Maximization with Nonlinear Sparsity Prior
Our work starts by proposing a new measure of statistical dependency. Called the Sobolev Independence Criterion (SIC), it expresses the dependency between an input vector x=(x1…xd) and a response variable y as a discrepancy between the joint distribution pxy and the product of marginals pxpy, as quantified by a witness function f. The witness function can be parametrized with a deep neural network, and its role is to discriminate between samples from the observed joint distribution and samples in which x and y are drawn independently. As shown in the paper, the dependency of y on a covariate xj can then be expressed, via the witness function, through the magnitude of the partial derivative of f with respect to xj. Therefore, to measure dependency while focusing on a small subset of features, we can regularize the witness function f by using sparsity-inducing gradient penalties. The sparsity-inducing gradient penalty is the non-linear equivalent of the l1 norm for linear models. SIC is given in the following equation:
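As a sketch, the criterion described above can be written in the following form. The function class \mathcal{F}, the weights \lambda and \rho, and the factors of 1/2 are notational assumptions that do not appear in this post; the paper gives the exact statement.

```latex
\mathrm{SIC}(p_{xy}, p_x p_y) = \sup_{f \in \mathcal{F}} \;
  \mathbb{E}_{p_{xy}} f(x,y) \;-\; \mathbb{E}_{p_x p_y} f(x,y)
  \;-\; \frac{\lambda}{2} \Big( \sum_{j=1}^{d}
      \sqrt{ \mathbb{E}_{p_x p_y} \Big| \frac{\partial f(x,y)}{\partial x_j} \Big|^{2} } \Big)^{2}
  \;-\; \frac{\rho}{2}\, \mathbb{E}_{p_x p_y} f^{2}(x,y)
```

The first two terms measure how well f separates the joint distribution from the product of marginals, the third term is the sparsity-inducing gradient penalty, and the last term is an l2 penalty on f.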
Note that we also added an l2 regularizer to ensure stability of the estimation.
In the paper, we give an efficient method to optimize this cost function by appealing to a variational trick called the η-trick (see this blog post). As a result, the gradient penalty decomposes into terms that can be naturally regarded as feature importance scores:
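The variational identity behind this trick can be sketched as follows, with the weights written η here. The shorthand a_j for the root-mean-square partial derivative of the witness function is an assumption introduced for readability.

```latex
\Big( \sum_{j=1}^{d} a_j \Big)^{2}
  = \inf_{\eta_j > 0,\; \sum_{j} \eta_j = 1} \; \sum_{j=1}^{d} \frac{a_j^{2}}{\eta_j},
\qquad
a_j = \sqrt{ \mathbb{E}_{p_x p_y} \Big| \frac{\partial f(x,y)}{\partial x_j} \Big|^{2} },
\qquad
\eta_j^{\star} = \frac{a_j}{\sum_{k=1}^{d} a_k}
```

The identity follows from the Cauchy-Schwarz inequality, and when all a_j are positive the infimum is attained at \eta^{\star}. Replacing the squared sum with the weighted sum on the right turns the non-smooth penalty into a quantity that is easy to optimize jointly over f and η.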
Note that η forms a distribution over the features that is easy to interpret: a low value of ηk means that the feature xk was not important in explaining the dependency between x and y. Conveniently, these importance scores can be straightforwardly combined with HRT and knockoffs to control the FDR.
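As an illustration of how such scores could be computed from the gradients of a witness function, here is a minimal sketch. The witness f(x, y) = y * tanh(x · w) is a hypothetical, untrained toy function with a hand-written gradient, and importance_scores is a helper named only for this example; in the actual method f would be a trained neural network.

```python
import numpy as np

def importance_scores(grad):
    """Turn per-feature gradient magnitudes into scores that sum to 1.

    grad: array of shape (n_samples, d) holding df/dx_j at each sample.
    """
    magnitudes = np.sqrt(np.mean(grad ** 2, axis=0))  # sqrt(E |df/dx_j|^2)
    return magnitudes / magnitudes.sum()

rng = np.random.default_rng(0)
n, d = 1000, 4
x = rng.standard_normal((n, d))
y = rng.standard_normal(n)
# Shuffling y breaks the pairing, mimicking samples from the product of marginals.
y_shuffled = rng.permutation(y)

# Hypothetical witness f(x, y) = y * tanh(x . w); features 1 and 3 are unused.
w = np.array([2.0, 0.0, 1.0, 0.0])
pre_activation = x @ w
# df/dx_j = y * (1 - tanh(x . w)^2) * w_j
grad = (y_shuffled * (1.0 - np.tanh(pre_activation) ** 2))[:, None] * w[None, :]

eta = importance_scores(grad)
print(np.round(eta, 3))
```

For this toy witness the scores are proportional to |w|, so features 0 and 2 receive about 2/3 and 1/3 of the mass and the unused features receive zero.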
In our NeurIPS paper, we validated the Sobolev Independence Criterion (SIC) on real biological data and on synthetic benchmarks and demonstrated its effectiveness for capturing non-linear dependencies. In one such synthetic dataset, proposed by Feng and Simon, only 6 out of 50 correlated features are relevant to predict a response variable through a highly non-linear multivariate relation. Below we show the results of deploying SIC on 250 instances sampled from this synthetic dataset, in comparison to the competing algorithms Elastic Net and Random Forest. In all cases, we control FDR using HRT. A “hybrid” method that applies our gradient penalty to a predictive neural network, instead of a witness function, achieves intermediate performance. However, crucially, we see that SIC recovers a substantially higher true positive rate (power).
The Sobolev Independence Criterion allows for the use of expressive nonlinear models for feature selection with FDR control. It is a first step towards the use of deep neural networks for interpretable modeling with statistical guarantees. In future work, we are excited to explore the use of the Sobolev Independence Criterion for instance-wise feature selection, as developed by Chen et al., and to extend its applications to fairness and privacy-aware learning.
Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection.
W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.
Leo Breiman. Random forests. Mach. Learn., 2001.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York Inc., 2001.
Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. J. Mach. Learn. Res., 2013.
Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.
Jianbo Chen et al. Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814, 2018.