# Sobolev Independence Criterion

December 5, 2019 | Written by: Youssef Mroueh, Mattia Rigotti, and Inkit Padhi

**Interpretable and Reproducible Discoveries With Deep Neural Networks**

A central question in data science is determining the variables or features in a dataset that are relevant for the prediction of a particular outcome. The problem is known as feature selection, and it has important applications in many disciplines including business and science. Genomics is one example, as establishing which specific genes are responsible for a particular disease can impact crucial medical decisions.

Feature selection is also a cornerstone of interpretability, a requirement that favors the deployment of decision models that are both usable and trustworthy. As a consequence, interpretability is increasingly sought after in many industries like healthcare and finance. Reliability, in terms of controlling the False Discovery Rate (FDR) of selecting experimental variables, is also fundamental in basic research, as a means to mitigate the reproducibility crisis discussed in recent scientific literature [1].

At the 2019 Conference and Workshop on Neural Information Processing Systems (NeurIPS) on December 8 – 14 in Vancouver, British Columbia, IBM Research AI will showcase a new paper, “Sobolev Independence Criterion.” In this paper, our team proposes a novel interpretable dependency measure, called Sobolev Independence Criterion (SIC), that provides feature importance scores that can be used for controlling FDR.

Our NeurIPS paper will be presented on December 10 from 10:45AM to 12:45PM at the East Exhibition Hall B + C #46. The code for reproducing the results obtained in the paper is available here.

**Feature Selection and FDR Control**

Given a high dimensional vector *x* (e.g., gene expressions), the feature selection problem consists of selecting the subset of variables *xj* that are relevant to predict a response variable *y* (e.g., presence or absence of a disease). This is typically done by creating a prediction model that is rich enough to capture the dependencies between *x* and *y* and defines importance scores for each feature, and a selection method that uses those scores to rank and select the features while controlling for false discoveries.

A principled way to determine the key features is to treat feature selection as a multiple hypothesis testing problem, with one conditional dependence test per feature. This technique is at the heart of the Holdout Randomization Test (HRT) [2] and the Knockoff Filter introduced recently for FDR control [1].

The idea is to test the hypothesis that *xj* is not important for determining *y*. If this (null) hypothesis is rejected by the data, we can say that *xj* is indeed relevant. The powerful aspect of this approach is that the null hypothesis can be expressed as the assumption that *y* and feature *xj* are independent given the other features. It is then straightforward to simulate this scenario through sampling and to evaluate how compatible it is with the dependency observed in the data (i.e., to compute p-values).
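The sampling idea above can be sketched in a few lines of Python. This is a simplified randomization test in the spirit of HRT, not the procedure from [2]: the function names, the absolute-correlation statistic, and the assumption of independent standard-normal features (so that the conditional distribution of *xj* given the rest is just a standard normal) are all illustrative choices. The p-values are then passed to the Benjamini-Hochberg procedure, one common way to control FDR.

```python
import numpy as np

def randomization_p_value(x, y, j, sample_xj, statistic, n_draws=200, seed=0):
    """p-value for the null 'y is independent of x_j given the other features'."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y, j)
    exceed = 0
    for _ in range(n_draws):
        x_null = x.copy()
        # Replace x_j by a draw from its conditional given the other features.
        x_null[:, j] = sample_xj(x, j, rng)
        exceed += int(statistic(x_null, y, j) >= observed)
    # The +1 terms keep the p-value valid and strictly positive.
    return (1 + exceed) / (1 + n_draws)

def benjamini_hochberg(p_values, q=0.1):
    """Indices selected by the Benjamini-Hochberg procedure at target FDR q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(passed)[0])
    return np.sort(order[: k + 1])

# Toy data (assumption): three independent N(0, 1) features, only x_0 matters.
rng = np.random.default_rng(1)
x = rng.standard_normal((500, 3))
y = 2.0 * x[:, 0] + rng.standard_normal(500)

def abs_corr(x, y, j):
    return abs(np.corrcoef(x[:, j], y)[0, 1])

def gaussian_sampler(x, j, rng):
    # Valid only because the toy features are independent standard normals.
    return rng.standard_normal(x.shape[0])

p_values = [randomization_p_value(x, y, j, gaussian_sampler, abs_corr)
            for j in range(3)]
selected = benjamini_hochberg(p_values, q=0.1)
```

In practice the conditional sampler is the hard part: it has to model the distribution of each feature given all the others, and the statistic would come from a trained model rather than from a plain correlation.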

**Prior Art**

Until now, the most popular methods for feature selection have been sparse linear models [4] such as Lasso and Elastic Net. While such linear models are interpretable and their weights can be used as feature importance scores, they fail to capture non-linear dependencies. Random Forests [3] are another popular technique; they can capture non-linear dependencies, but yield only heuristic feature importance scores. Deep networks are known to have high capacity for capturing non-linear dependencies, and some *post hoc* heuristics have been proposed to define feature importance for them. The aim of our work is to define proper feature statistics for deep networks that can be used for principled feature selection in conjunction with FDR control methods such as HRT and knockoffs.
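To make the limitation of linear weights concrete, here is a small sketch on toy data (the data-generating process is an assumption chosen for illustration). The response depends on the first feature only through its square, so an ordinary least-squares fit assigns it a weight close to zero even though the dependence is strong.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal((n, 2))
# y depends on x_0 only, and only non-linearly; x_1 is irrelevant.
y = x[:, 0] ** 2 + 0.1 * rng.standard_normal(n)

# Ordinary least squares with an intercept column.
design = np.column_stack([x, np.ones(n)])
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
linear_weights = weights[:2]

# The dependence is obvious once the right non-linearity is used.
nonlinear_corr = np.corrcoef(x[:, 0] ** 2, y)[0, 1]

print("linear weights:", linear_weights)
print("corr(x_0^2, y):", nonlinear_corr)
```

Both linear weights come out near zero, so a score based on them would rank the relevant feature no higher than the irrelevant one.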

**SIC: Feature Selection as Dependency Maximization with a Nonlinear Sparsity Prior**

Our work starts by proposing a new measure of statistical dependency. Called Sobolev Independence Criterion (SIC), it expresses the dependency between an input vector *x* = (*x1*, …, *xd*) and a response variable *y* as a discrepancy between the joint distribution *pxy* and the product of marginals *pxpy*, as quantified by a witness function *f*. The witness function can be parametrized with a deep neural network, and its role is to discriminate between samples from the observed joint distribution and samples in which *x* and *y* are drawn independently. As stated in the paper, it is then possible to express the dependency of *y* on a covariate *xj*, thanks to the witness function, through the magnitude of the partial derivative |∂*f*/∂*xj*|. Therefore, to measure dependency while focusing on a small subset of features, we can regularize the witness function *f* by using sparsity-inducing gradient penalties [5]. The sparsity-inducing gradient penalty is the non-linear equivalent of the *l*1 norm for linear models. SIC is given in the following equation:
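Written out in LaTeX, the objective described above takes roughly the following form. This is a reconstruction from the description in this post and the notation of the paper; the symbols $\lambda$ and $\rho$ for the two regularization weights and $\mathcal{F}$ for the function class are assumptions here.

$$
\mathrm{SIC}(p_{xy}, p_x p_y) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{p_{xy}} f(x,y) - \mathbb{E}_{p_x p_y} f(x,y) - \frac{\lambda}{2} \left( \sum_{j=1}^{d} \sqrt{\mathbb{E}_{p_x p_y} \left| \frac{\partial f(x,y)}{\partial x_j} \right|^2} \right)^{2} - \frac{\rho}{2} \, \mathbb{E}_{p_x p_y} f^2(x,y)
$$

The first two terms measure the discrepancy between the joint distribution and the product of marginals, the third is the sparsity-inducing gradient penalty, and the last is the stabilizing *l*2 term.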

Note that we also added an *l*2 regularizer to ensure the stability of the estimation.
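As a rough illustration of how such an objective could be estimated from samples, the sketch below evaluates it for one fixed witness function. Several things are assumptions made for illustration: the function name `sic_objective`, the weights `lam` and `rho`, the use of a permutation of *y* to approximate samples from the product of marginals, finite differences in place of automatic differentiation, and the hand-picked witness function. In the actual method the witness function is a neural network that is trained to maximize the objective; evaluating a fixed function only gives a lower bound.

```python
import numpy as np

def sic_objective(f, x, y, lam=0.1, rho=0.1, eps=1e-4, seed=0):
    """Empirical SIC-style objective for a fixed witness function f(x, y)."""
    rng = np.random.default_rng(seed)
    # Shuffling y breaks the pairing, approximating samples from p_x * p_y.
    y_perm = y[rng.permutation(len(y))]

    mean_joint = f(x, y).mean()
    f_prod = f(x, y_perm)
    mean_prod = f_prod.mean()

    # Per-feature root-mean-square partial derivative, via central differences.
    d = x.shape[1]
    grad_norms = np.empty(d)
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        df = (f(x + step, y_perm) - f(x - step, y_perm)) / (2 * eps)
        grad_norms[j] = np.sqrt(np.mean(df ** 2))

    sparsity_penalty = 0.5 * lam * grad_norms.sum() ** 2
    l2_penalty = 0.5 * rho * np.mean(f_prod ** 2)
    value = mean_joint - mean_prod - sparsity_penalty - l2_penalty
    return value, grad_norms

# Toy data (assumption): y depends on x_0 only.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 3))
y = x[:, 0] + 0.1 * rng.standard_normal(2000)

def witness(x, y):
    # Hypothetical hand-picked witness; a trained network would go here.
    return 0.5 * x[:, 0] * y

value, grad_norms = sic_objective(witness, x, y)
print("objective:", value)
print("per-feature gradient norms:", grad_norms)
```

The gradient norms are non-zero only for the first feature, which is the one the witness function actually uses to tell dependent pairs from independent ones.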

In the paper, we give an efficient method to optimize this cost function by appealing to a variational trick called the *η*-trick (see this blog post), which decomposes the gradient penalty into terms that can naturally be regarded as feature importance scores:
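The variational identity behind this kind of decomposition can be stated as follows. It is a standard identity, written here with assumed notation: $a_j$ stands for the per-feature gradient norm $\sqrt{\mathbb{E}_{p_x p_y} |\partial f / \partial x_j|^2}$.

$$
\left( \sum_{j=1}^{d} a_j \right)^{2} = \min_{\eta_j > 0,\; \sum_j \eta_j = 1} \; \sum_{j=1}^{d} \frac{a_j^2}{\eta_j},
\qquad
\eta_j^{\star} = \frac{a_j}{\sum_{k=1}^{d} a_k}
$$

The squared sum on the left is not smooth where some $a_j$ vanishes, whereas the right-hand side is a weighted sum of squares for fixed $\eta$, which is easier to optimize. The minimizing weights $\eta_j^{\star}$ are non-negative and sum to one.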

Note that *η* forms a distribution on the features that is easy to interpret: a low value of *ηk* means that the feature *xk* was not important in explaining the dependency between *x* and *y*. Conveniently, these feature importance scores can be straightforwardly combined with HRT and knockoffs to control the FDR.
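A short numerical sketch of these scores follows, assuming the per-feature gradient norms have already been computed (the values below are made up for illustration, and the function names are hypothetical). It normalizes the norms into a distribution and checks that this choice attains the squared-sum penalty, while another distribution such as the uniform one gives a larger value.

```python
import numpy as np

def eta_scores(grad_norms, eps=1e-12):
    """Normalize per-feature gradient norms into importance scores summing to 1."""
    a = np.abs(np.asarray(grad_norms, dtype=float))
    return a / max(a.sum(), eps)

def eta_objective(grad_norms, eta):
    """Weighted sum of squares sum_j a_j^2 / eta_j for a given distribution eta."""
    a = np.asarray(grad_norms, dtype=float)
    return np.sum(a ** 2 / eta)

# Made-up gradient norms for four features (assumption for illustration).
a = np.array([0.9, 0.05, 0.6, 0.01])
eta = eta_scores(a)
uniform = np.full(len(a), 1.0 / len(a))

print("importance scores:", eta)
print("penalty (sum a)^2:", a.sum() ** 2)
print("objective at eta: ", eta_objective(a, eta))
print("objective at uniform:", eta_objective(a, uniform))
```

Features with a small score contribute little to the witness function's gradient, which is what makes the scores usable as input to a ranking or testing procedure.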

**Results**

In our NeurIPS paper, we validated the Sobolev Independence Criterion (SIC) on real biological data and on synthetic benchmarks and demonstrated its effectiveness for capturing non-linear dependencies. In one such synthetic dataset, proposed in [6], only 6 out of 50 correlated features are relevant for predicting a response variable through a highly non-linear multivariate relation. Below we show the results of deploying SIC on 250 instances sampled from this synthetic dataset, in comparison to the competing algorithms Elastic Net and Random Forest. In all cases, we control FDR using HRT. A “hybrid” method that applies our gradient penalty to a predictive neural network, instead of a witness function, achieves intermediate performance. Crucially, however, we see that SIC recovers a substantially higher true positive rate (power).

**Conclusion**

The Sobolev Independence Criterion allows for the use of expressive nonlinear models for feature selection with FDR control. It is a first step towards the use of deep neural networks for interpretable modeling with statistical guarantees. In future work, we are excited to explore the use of the Sobolev Independence Criterion for instance-wise feature selection as developed in [7], and to extend its applications to fairness-aware and privacy-aware learning.

**References**

[1] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection.

[2] W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test: Principled and easy black box feature selection. *arXiv preprint arXiv:1811.00645*, 2018.

[3] Leo Breiman. Random forests. *Mach. Learn.*, 2001.

[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. *The Elements of Statistical Learning*. Springer New York Inc., 2001.

[5] Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. *J. Mach. Learn. Res.*, 2013.

[6] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.

[7] Chen, Jianbo, et al. “Learning to explain: An information-theoretic perspective on model interpretation.” arXiv preprint arXiv:1802.07814 (2018)

**Youssef Mroueh**

Research Staff Member, IBM Research

**Mattia Rigotti**

IBM Research Staff Member

**Inkit Padhi**

Research Engineer, IBM Research
