# Preserving Validity in Adaptive Data Analysis

August 6, 2015 | Written by: IBM Research Editorial Staff


Our recent paper, published in *Science*, deals with important, but fairly technical and subtle, statistical issues that arise in the practice of data analysis. I’ll outline them below. For the less patient, there is also an elevator-pitch description (minus the elevator) of our work in the video below.

The standard measure of the statistical significance of a finding is the *p*-value. The *p*-value for a certain result measures the probability of obtaining the result in the absence of any actual relationship (referred to as the *null hypothesis*). A small *p*-value indicates high confidence in the validity of the result, with 0.05 commonly accepted as sufficient to declare a finding statistically significant.

The guarantee a *p*-value provides has a critical caveat, however: it applies only if the analysis procedure was chosen before the data was examined. At the same time, the practice of data analysis goes well beyond using a predetermined procedure. New analyses are chosen on the basis of data exploration and previous analyses of a dataset, and are then performed on the same dataset. While useful, such *adaptive* analysis invalidates the standard *p*-value computations. Using incorrect *p*-values is likely to lead to false discoveries. An analyst might conclude, for example, that a certain diet increases the risk of diabetes, when in reality it has no effect at all.

The false discoveries that such invalid *p*-value computations cause can be easily observed using any basic data analysis tool (such as MS Excel). For example, let’s create a fake dataset for a study of the effect of 50 different foods on the academic performance of students: both the grades and the food-consumption data are generated at random, so that in truth no food has any effect on the grade. Examining the data, we then select the three foods that happen to be most strongly correlated with the grade.


A common next step would be to use least-squares linear regression to check whether a simple linear combination of the three strongly correlated foods can predict the grade. It turns out that a little combination goes a long way: we discover that a linear combination of the three selected foods explains a significant fraction of the variance in the grade (plotted below). The regression analysis also reports that the *p*-value of this result is 0.00009, meaning that the probability of obtaining such a fit purely by chance is less than 1 in 10,000.
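This effect is easy to reproduce programmatically. The sketch below (in Python with NumPy rather than Excel; the dataset sizes and random seed are arbitrary choices for illustration, not the ones from the original demonstration) generates purely random grades and food data, adaptively selects the three most correlated foods, and still obtains a sizeable fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_foods = 100, 50

# Purely random data: the foods have no real effect on the grades.
foods = rng.standard_normal((n_students, n_foods))
grades = rng.standard_normal(n_students)

# Adaptive step: look at the data and pick the 3 most correlated foods.
corrs = np.array([np.corrcoef(foods[:, j], grades)[0, 1]
                  for j in range(n_foods)])
top3 = np.argsort(np.abs(corrs))[-3:]

# Least-squares regression of the grade on the selected foods.
X = np.column_stack([np.ones(n_students), foods[:, top3]])
beta, *_ = np.linalg.lstsq(X, grades, rcond=None)
resid = grades - X @ beta
r2 = 1.0 - resid.var() / grades.var()
print(f"R^2 of the 'best' 3 foods: {r2:.3f}")  # a nontrivial fit despite pure noise
```

A standard *p*-value computed for this regression would treat the three foods as if they had been fixed in advance, ignoring that they were cherry-picked from 50 candidates — which is exactly why it comes out misleadingly small.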

A natural way to address this problem is to use a separate *holdout* dataset to validate any finding obtained via adaptive analysis. Such an approach is standard in machine learning: a dataset is split into training and validation data, with the training set used for learning a predictor, and the validation (holdout) set used to estimate the accuracy of that predictor.

The catch is that this estimate is valid only as long as the holdout set is used once. If the analyst returns to the holdout set repeatedly, choosing each new analysis based on previous validation results, the guarantee breaks down and leads to *overfitting* to the holdout set.
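The overfitting caused by reusing the holdout set can be sketched as follows (a hedged illustration in Python/NumPy; the dataset sizes and the sign-agreement selection rule are assumptions of this sketch, loosely inspired by the kind of experiment in our paper): on data where the labels are pure noise, a classifier selected with the help of the holdout set still scores well above chance on that same holdout set.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 500

# Labels are independent of the features: no classifier can truly beat 50%.
X = rng.standard_normal((2 * n, d))
y = rng.choice([-1.0, 1.0], size=2 * n)
X_tr, y_tr = X[:n], y[:n]
X_ho, y_ho = X[n:], y[n:]

# Adaptive reuse of the holdout: keep only the features whose correlation
# with the labels has the same sign on BOTH the training and holdout sets.
c_tr = X_tr.T @ y_tr / n
c_ho = X_ho.T @ y_ho / n
keep = np.sign(c_tr) == np.sign(c_ho)

# Simple linear classifier built from the selected correlation signs.
w = np.where(keep, np.sign(c_tr), 0.0)
acc_ho = np.mean(np.sign(X_ho @ w) == y_ho)
print(f"holdout accuracy: {acc_ho:.2f}")  # well above 50%, yet pure noise
```

Because the holdout set influenced which features were kept, its accuracy estimate is no longer an unbiased measure of performance on fresh data, where the same classifier would score about 50%.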

**The Thresholdout Algorithm**

Our paper introduces an algorithm, called Thresholdout, that allows a holdout set to be reused many times. It is based on two observations:

- First, the validation need not reveal any information about the holdout dataset as long as the analyst does not overfit to the training set.
- Second, adding a small amount of noise to every validation result can prevent the analyst from overfitting to the holdout set.
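Based on these two observations, the core of the algorithm can be sketched in a few lines (a minimal illustration only: the threshold and noise values are placeholders, the noise distribution is simplified, and the budget accounting from the paper is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01):
    """Sketch of a Thresholdout-style validation query.

    train_vals / holdout_vals: per-sample values of the statistic being
    validated (e.g., a predictor's per-example accuracy) on each set.
    """
    mean_tr = np.mean(train_vals)
    mean_ho = np.mean(holdout_vals)
    # Noisy threshold test: does the training estimate already generalize?
    if abs(mean_tr - mean_ho) < threshold + rng.normal(0.0, sigma):
        # No overfitting detected: answer from the training set, so the
        # response reveals nothing new about the holdout data.
        return mean_tr
    # Otherwise answer from the holdout set, with a little noise added
    # to limit what the analyst can learn about it.
    return mean_ho + rng.normal(0.0, sigma)
```

When the training and holdout estimates agree, the analyst only ever sees training-set numbers; when they diverge, the noisy holdout answer corrects the estimate while limiting the information leaked about the holdout set — which is what lets it be reused across many adaptive queries.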
