*Science*, deals with important, but fairly technical and subtle, statistical issues that arise in the practice of data analysis. I'll outline them below. For the less patient, there is also an elevator-pitch description (minus the elevator) of our work in the video below.

The *p*-value for a certain result measures the probability of obtaining the result in the absence of any actual relationship (referred to as the *null hypothesis*). A small *p*-value indicates high confidence in the validity of the result, with 0.05 commonly accepted as sufficient to declare a finding statistically significant.
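As a concrete illustration (not from the article), a *p*-value can be estimated by simulating the null hypothesis directly: shuffle one variable to destroy any real relationship, and count how often the shuffled data shows a correlation at least as strong as the observed one. The dataset below is hypothetical, a sketch of the idea rather than a standard statistical routine:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, trials=1000, seed=0):
    """Fraction of random shuffles whose |correlation| reaches the observed one.

    Shuffling ys breaks any real link to xs, so this simulates the null
    hypothesis of "no actual relationship".
    """
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical example: consumption of one food vs. grades, with a real
# built-in relationship, so the estimated p-value comes out small.
rng = random.Random(1)
food = [rng.gauss(0, 1) for _ in range(100)]
grades = [f + rng.gauss(0, 1) for f in food]  # grades really depend on food here
print(permutation_p_value(food, grades))
```

Crucially, this guarantee is about the *procedure*: the shuffling simulation is only a fair model of chance if the analysis was fixed before the data was seen, which is exactly the caveat discussed next.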

The *p*-value has a critical caveat, however: it applies only if the analysis procedure was chosen before the data was examined. The practice of data analysis goes well beyond using a predetermined analysis, though. New analyses are chosen on the basis of data exploration and previous analyses of a dataset, and are performed on the same dataset. While useful, such *adaptive* analysis invalidates the standard *p*-value computations. Using incorrect *p*-values is likely to lead to false discoveries. An analyst might conclude, for example, that a certain diet increases the risk of diabetes, when in reality it has no effect at all.

The false discoveries that incorrect *p*-value computations cause can be easily observed using any basic data analysis tool (such as MS Excel). For example, let's create a fake dataset for a study of the effect of 50 different foods on the academic performance of students.


A common next step would be to use least-squares linear regression to check whether a simple linear combination of the three foods that appear most strongly correlated with the grade can predict the grade. It turns out that a little combination goes a long way: we discover that a linear combination of the three selected foods can explain a significant fraction of variance in the grade (plotted below). The regression analysis also reports that the *p*-value of this result is 0.00009, meaning that the probability of this happening purely by chance is less than 1 in 10,000.
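This kind of spurious "discovery" is easy to reproduce. The sketch below is a simplified, hypothetical version of the experiment (one selected food rather than three, pure Python): every column is independent noise, yet the food picked *after* looking at the data appears to explain a real fraction of the variance in the grade.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
n_students, n_foods = 100, 50

# Everything is noise: grades are independent of every food column.
grades = [rng.gauss(0, 1) for _ in range(n_students)]
foods = [[rng.gauss(0, 1) for _ in range(n_students)] for _ in range(n_foods)]

# Adaptive step: examine the data and keep the food most correlated
# with the grade. With 50 chances, some correlation will look strong.
correlations = [pearson(col, grades) for col in foods]
best = max(correlations, key=abs)

# A least-squares fit on the selected food "explains" best**2 of the
# variance in the grade, even though the true explained variance is zero.
print(f"selected |r| = {abs(best):.2f}, apparent R^2 = {best ** 2:.2f}")
```

The selection step is exactly what standard *p*-value formulas do not account for: they assume the food was chosen before the data was examined, not because it scored best among 50 candidates.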

A standard way to guard against such false discoveries is to use a separate *holdout* dataset to validate any finding obtained via adaptive analysis. Such an approach is standard in machine learning: a dataset is split into training and validation data, with the training set used for learning a predictor, and the validation (holdout) set used to estimate the accuracy of that predictor.
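A sketch of why the split helps, continuing the hypothetical foods-and-grades setup: a food selected because it looked most predictive on the training half shows essentially no signal on the untouched holdout half.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(7)
n_students, n_foods = 200, 50

# Pure noise again: grades are independent of every food column.
grades = [rng.gauss(0, 1) for _ in range(n_students)]
foods = [[rng.gauss(0, 1) for _ in range(n_students)] for _ in range(n_foods)]

half = n_students // 2
train_g, hold_g = grades[:half], grades[half:]

# Adaptive selection happens only on the training half...
best_i = max(range(n_foods),
             key=lambda i: abs(pearson(foods[i][:half], train_g)))

train_r = pearson(foods[best_i][:half], train_g)
hold_r = pearson(foods[best_i][half:], hold_g)
# ...so the holdout half gives an honest estimate: the apparent signal shrinks.
print(f"train r = {train_r:.2f}, holdout r = {hold_r:.2f}")
```

The holdout estimate is unbiased precisely because the analyst never looked at that half while choosing the food, which is also why the guarantee is fragile once the holdout set starts being reused.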

This guarantee holds only as long as the holdout set is used once, however: validating a sequence of adaptively chosen analyses against the same holdout set leads to *overfitting* to the holdout set.

**The Thresholdout Algorithm**

- First, validation need not reveal any information about the holdout dataset as long as the analyst is not overfitting to the training set.
- Second, adding a small amount of noise to every validation result can prevent the analyst from overfitting to the holdout set.
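The two ideas above can be sketched as follows. This is my reading of the mechanism, not the paper's reference implementation, and the parameter names and default values (`threshold`, `sigma`) are illustrative: the algorithm answers queries of the form "what is the average of f over the data?", comparing the training and holdout answers before deciding what to reveal.

```python
import random

class Thresholdout:
    """Sketch of the Thresholdout idea; parameters are illustrative."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, seed=0):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma = threshold, sigma
        self.rng = random.Random(seed)

    def query(self, f):
        """Answer 'what is the average of f over the data?'."""
        t = sum(map(f, self.train)) / len(self.train)
        h = sum(map(f, self.holdout)) / len(self.holdout)
        # Idea 1: while train and holdout agree (up to a noisy margin), answer
        # with the training value alone -- it reveals nothing about the holdout.
        if abs(t - h) < self.threshold + self.rng.gauss(0, self.sigma):
            return t
        # Idea 2: when they disagree (a sign of overfitting to the training
        # set), answer with the holdout value perturbed by noise, limiting
        # how much each answer can leak about the holdout set.
        return h + self.rng.gauss(0, self.sigma)
```

The design choice to answer from the training set in the common case is what lets the holdout set survive many queries: information about it is only spent when the analyst has actually started to overfit.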