Editor’s note: this article is by Vitaly Feldman, research scientist at IBM Research-Almaden.
From discovering new particles and clinical studies to predicting election results and evaluating credit scores, scientific progress and industrial innovation increasingly rely on statistical data analysis. While incredibly useful, data analysis is also notoriously easy to misuse, even when the analyst has the best of intentions. Problems stemming from such misuse can be costly and contribute to a wider concern about the reproducibility of research findings, most notably in medical research. The issue is hotly debated in the scientific community and has attracted a lot of public attention in recent years.
In recent research with a team of computer scientists from industry and academia, I made progress in understanding and addressing some of the ways in which data analysis can go wrong. Our work, Preserving Validity in Adaptive Data Analysis, published this week in Science, deals with important, but fairly technical and subtle, statistical issues that arise in the practice of data analysis. I’ll outline them below. For the less patient, here is also an elevator-pitch description (minus the elevator) of our work in the video, below.
The main difficulty in obtaining valid results is that it is hard to tell whether a pattern observed in the data represents a true relationship that holds in the real world, or is merely a coincidental artifact of the randomness in the process of data collection. The standard way of expressing confidence in the result of an analysis, such as a certain trait being correlated with a disease, is a p-value.
Informally, a p-value for a certain result measures the probability of obtaining a result at least that strong in the absence of any actual relationship (referred to as the null hypothesis). A small p-value indicates high confidence in the validity of the result, with a p-value below 0.05 commonly accepted as sufficient to declare a finding statistically significant.
The guarantee that a p-value provides has a critical caveat, however: it applies only if the analysis procedure was chosen before the data was examined. At the same time, the practice of data analysis goes well beyond using a predetermined analysis. New analyses are chosen on the basis of data exploration and previous analyses of a dataset, and are performed on the same dataset. While useful, such adaptive analysis invalidates the standard p-value computations. Using incorrect p-values is likely to lead to false discoveries. An analyst might conclude, for example, that a certain diet increases the risk of diabetes, when in reality it has no effect at all.
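The effect of choosing the analysis after looking at the data can be checked with a small simulation. The sketch below is only an illustration under assumed settings (100 samples, 50 unrelated candidate variables, 2,000 simulated datasets; the function names are hypothetical). It calibrates a 5 percent cutoff on a variable fixed in advance, then counts how often a second fixed variable and the best-looking variable exceed that cutoff.

```python
import numpy as np

def abs_correlations(rng, n_samples=100, n_variables=50):
    """Absolute correlation of each variable with an unrelated outcome."""
    variables = rng.standard_normal((n_samples, n_variables))
    outcome = rng.standard_normal(n_samples)
    v = variables - variables.mean(axis=0)
    o = outcome - outcome.mean()
    r = v.T @ o / (np.linalg.norm(v, axis=0) * np.linalg.norm(o))
    return np.abs(r)

def false_discovery_rates(n_trials=2000, seed=0):
    """Return (rate for a pre-chosen variable, rate for the best variable)."""
    rng = np.random.default_rng(seed)
    all_r = np.array([abs_correlations(rng) for _ in range(n_trials)])
    # Calibrate the 5% cutoff empirically on a variable fixed in advance.
    cutoff = np.quantile(all_r[:, 0], 0.95)
    # A different variable, also fixed in advance: non-adaptive analysis.
    fixed_rate = float(np.mean(all_r[:, 1] > cutoff))
    # The variable that looks best in each dataset: adaptive analysis.
    adaptive_rate = float(np.mean(all_r.max(axis=1) > cutoff))
    return fixed_rate, adaptive_rate

fixed_rate, adaptive_rate = false_discovery_rates()
print(f"false discoveries, variable fixed in advance: {fixed_rate:.2f}")
print(f"false discoveries, best variable picked from data: {adaptive_rate:.2f}")
```

With a variable fixed in advance the false discovery rate should stay near the nominal 5 percent, while picking the strongest of many unrelated variables clears the same cutoff in most simulated datasets.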
The mistakes that misapplications of standard p-value computations cause can be easily observed using any basic data analysis tool (like MS Excel). For example, let’s create a fake dataset for a study of the effect of 50 different foods on the academic performance of students.
For each student, the data will include a consumption level for each of the 50 foods and an (academic) grade. Let’s create the dataset consisting of 100 students by choosing all the values randomly and independently from the normal (or Gaussian) distribution. Clearly, in the true data distribution we used, food consumption and the students’ grades are completely unrelated. A natural first step in our analysis of this dataset is to identify which foods have the highest (positive or negative) correlation with the grade. Below is an example outcome of this step in which I highlighted in red three foods with particularly strong correlations in the data.
A common next step would be to use least-squares linear regression to check whether a simple linear combination of the three strongly correlated foods can predict the grade. It turns out that a little combination goes a long way: we discover that a linear combination of the three selected foods can explain a significant fraction of the variance in the grade (plotted below). The regression analysis also reports that the p-value of this result is 0.00009, meaning that the probability of this happening purely by chance is less than 1 in 10,000.
Recall that no relationship exists in the true data distribution, so this discovery is clearly false. This spurious effect is known to experts as Freedman’s paradox. It arises because the variables (foods) used in the regression were chosen using the data itself.
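The same experiment can be reproduced outside a spreadsheet. The following is a minimal sketch, assuming NumPy and the sizes used above (100 students, 50 foods, 3 selected foods); the function names and the choice to repeat the experiment 200 times are illustrative assumptions. Instead of a p-value, it compares the fraction of variance explained by foods selected from the data with that explained by three foods fixed in advance.

```python
import numpy as np

def r_squared(X, y):
    """Fraction of variance in y explained by least squares on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

def freedman_trial(rng, n_students=100, n_foods=50, n_selected=3):
    """One fake study in which foods and grades are independent noise."""
    foods = rng.standard_normal((n_students, n_foods))
    grade = rng.standard_normal(n_students)
    corr = np.array([np.corrcoef(foods[:, j], grade)[0, 1]
                     for j in range(n_foods)])
    # Adaptive step: keep the foods with the strongest observed correlation.
    top = np.argsort(-np.abs(corr))[:n_selected]
    selected = r_squared(foods[:, top], grade)
    # Baseline: foods chosen before looking at the data.
    fixed = r_squared(foods[:, :n_selected], grade)
    return selected, fixed

rng = np.random.default_rng(0)
results = np.array([freedman_trial(rng) for _ in range(200)])
mean_selected, mean_fixed = results.mean(axis=0)
print(f"mean R^2, foods selected from the data: {mean_selected:.3f}")
print(f"mean R^2, foods fixed in advance:       {mean_fixed:.3f}")
```

Because every value is pure noise, any gap between the two averages comes from the selection step alone, which is exactly the dependence that the standard p-value computation ignores.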
Despite the fundamental nature of adaptivity in data analysis, little work has been done to understand and mitigate its effects on the validity of results. The only known safe approach to adaptive analysis is to use a separate holdout dataset to validate any finding obtained via adaptive analysis. Such an approach is standard in machine learning: a dataset is split into training and validation data, with the training set used for learning a predictor, and the validation (holdout) set used to estimate the accuracy of that predictor.
Because the predictor is independent of the holdout dataset, such an estimate is a valid estimate of the true prediction accuracy. In practice, however, the holdout dataset is rarely used only once, and the predictor often depends on the holdout data in a complicated way. Such dependence invalidates the estimates of accuracy based on the holdout set as the predictor may be overfitting to the holdout set.
I’ve been working on approaches for dealing with adaptivity in machine learning for the past two years. During a chance conversation with Aaron Roth, a professor at Penn, while at a big data workshop, I found out that he and several colleagues were also working on an approach to this problem. That conversation turned into a fruitful collaboration with Microsoft’s Cynthia Dwork, Google’s Moritz Hardt, University of Toronto’s Toni Pitassi, Samsung’s Omer Reingold, and Aaron.
We found that challenges of adaptivity can be addressed using techniques developed for privacy-preserving data analysis. These techniques rely on the notion of differential privacy that guarantees that the data analysis is not too sensitive to the data of any single individual. We rigorously demonstrated that ensuring differential privacy of an analysis also guarantees that the findings will be statistically valid. We then also developed additional approaches to the problem based on a new way to measure how much information an analysis reveals about a dataset.
The Thresholdout Algorithm
Using our new approach we designed an algorithm, called Thresholdout, that allows an analyst to reuse the holdout set of data for validating a large number of results, even when those results are produced by an adaptive analysis. The Thresholdout algorithm is very easy to implement and is based on two key ideas.
First, the validation should not reveal any information about the holdout dataset if the analyst does not overfit to the training set.
Second, adding a small amount of noise to any validation result can prevent the analyst from overfitting to the holdout set.
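These two ideas can be sketched in a few lines. The class below is a simplified illustration and not a reference implementation: the threshold, the noise scale, the use of Laplace noise, and all names are assumptions made for this example. When the training and holdout estimates agree up to a noisy threshold, it answers with the training estimate, so nothing about the holdout set is revealed; otherwise it answers with a noisy holdout estimate.

```python
import numpy as np

class Thresholdout:
    """Simplified reusable-holdout sketch (parameter values are illustrative)."""

    def __init__(self, threshold=0.04, sigma=0.01, seed=0):
        self.threshold = threshold   # tolerated train/holdout gap (assumed value)
        self.sigma = sigma           # noise scale (assumed value)
        self.holdout_answers = 0     # how often the holdout set was actually used
        self.rng = np.random.default_rng(seed)

    def query(self, train_values, holdout_values):
        """Estimate the mean of a statistic.

        train_values and holdout_values hold the statistic evaluated on each
        training point and on each holdout point, respectively.
        """
        train_mean = float(np.mean(train_values))
        holdout_mean = float(np.mean(holdout_values))
        noisy_threshold = self.threshold + self.rng.laplace(scale=2 * self.sigma)
        if abs(train_mean - holdout_mean) <= noisy_threshold:
            # Idea 1: the estimates agree, so return the training estimate.
            return train_mean
        # Idea 2: the estimates disagree, so return the holdout estimate plus noise.
        self.holdout_answers += 1
        return holdout_mean + float(self.rng.laplace(scale=self.sigma))

# Usage: a statistic that looks good on training data but not on the holdout.
checker = Thresholdout()
answer = checker.query(train_values=np.full(1000, 0.70),
                       holdout_values=np.full(1000, 0.50))
print(f"validated estimate: {answer:.3f}")
```

The counter `holdout_answers` is a design choice of this sketch: it makes visible how often the analyst's queries actually touched the holdout data, which is the quantity one would want to keep small.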
To illustrate the benefits of using our approach, we showed how it prevents overfitting in a setting inspired by Freedman’s paradox. In this experiment, the analyst wants to build an algorithm that can accurately classify data points into two classes, given a dataset of correctly labeled points. The analyst first finds a set of variables that have the largest correlation with the class. However, to avoid spurious correlations, the analyst validates the correlations on the holdout set and uses only those variables whose correlation agrees with the correlation on the training set. The analyst then creates a simple linear threshold classifier on the selected variables.
We tested this procedure on a dataset of 20,000 points in which the values of 10,000 attributes are drawn independently from the normal distribution, and the class is chosen uniformly at random. There is no correlation between a data point and its class label, and no classifier can achieve true accuracy better than 50 percent. Nevertheless, reusing a standard holdout leads to a reported accuracy of more than 63±0.4 percent (when selecting 500 out of 10,000 variables) on both the training set and the holdout set.
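A scaled-down version of the standard-holdout half of this experiment can be sketched as follows. The sizes (1,000 points per set, 1,000 attributes, 200 selected variables), the exact selection rule, and the function name are assumptions chosen so the example runs quickly, so the numbers it prints will not match the ones reported above.

```python
import numpy as np

def holdout_reuse_experiment(n=1000, d=1000, k=200, seed=0):
    """Return (training, holdout, fresh) accuracy of a classifier built on noise."""
    rng = np.random.default_rng(seed)

    def sample(m):
        X = rng.standard_normal((m, d))        # attributes: independent noise
        y = rng.choice([-1.0, 1.0], size=m)    # class: uniformly random
        return X, y

    (X_tr, y_tr), (X_ho, y_ho), (X_fr, y_fr) = sample(n), sample(n), sample(n)

    # Correlation of every attribute with the class on each set.
    corr_tr = X_tr.T @ y_tr / n
    corr_ho = X_ho.T @ y_ho / n

    # "Validate" on the holdout: keep attributes whose correlation sign agrees,
    # then take the k strongest of those according to the training set.
    agree = np.flatnonzero(np.sign(corr_tr) == np.sign(corr_ho))
    chosen = agree[np.argsort(-np.abs(corr_tr[agree]))[:k]]
    weights = np.sign(corr_tr[chosen])

    def accuracy(X, y):
        # Simple linear threshold classifier on the selected attributes.
        return float(np.mean(np.sign(X[:, chosen] @ weights) == y))

    return accuracy(X_tr, y_tr), accuracy(X_ho, y_ho), accuracy(X_fr, y_fr)

train_acc, holdout_acc, fresh_acc = holdout_reuse_experiment()
print(f"training accuracy: {train_acc:.3f}")
print(f"holdout accuracy:  {holdout_acc:.3f}")
print(f"fresh accuracy:    {fresh_acc:.3f}")
```

Since the holdout set took part in choosing the variables, its accuracy estimate is no longer independent of the classifier; only the fresh set, which played no role in the selection, gives an honest estimate.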
We then executed the same algorithm with our Thresholdout algorithm for holdout reuse. Thresholdout prevents the algorithm from overfitting to the holdout set and gives a valid estimate of classifier accuracy, close to 50 percent. In the plot below we show the accuracy on the training set and the holdout set (the green line) obtained for various numbers of selected variables, averaged over 100 independent executions. For comparison, the plot also includes the accuracy of the classifier on another fresh dataset of 10,000 points, which serves as the gold standard for validation.
Beyond this illustration, the reusable holdout gives the analyst a simple, general and principled method to perform multiple validation steps where previously the only known safe approach was to collect a fresh holdout set each time a function depends on the outcomes of previous validations. We are now looking forward to exploring the practical applications of this technique.