Be Brave In Machine Learning
JeanFrancoisPuget
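There is a lot of confusion about the role of test data in machine learning. The typical outcome is overfitting, a plague that must be avoided at all reasonable cost. The confusion comes from blurring two fundamentally different roles for test data: validation data, which is used to select a model, and hold out data, which is used only to estimate how the selected model will perform on new, unforeseen data.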
So far so good.
Things can go wrong when performance isn't great on hold out data. It is then tempting to go back to model selection and try other models until performance on hold out data is good enough.
The issue is that we just used hold out data as validation data. Therefore, performance on hold out data is no longer indicative of the performance we will get on new, unforeseen data. This may result in disappointing outcomes when the model is deployed in a production environment.
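A small simulation can make this concrete. The sketch below is an illustration built on assumptions that are not part of the original argument: the labels are coin flips unrelated to any feature, and each candidate "model" simply predicts the label to be one of 200 random binary features, so every candidate has a true accuracy of 50%. The names `make_data` and `accuracy` are hypothetical helpers.

```python
import random

def make_data(rng, n_rows, n_features):
    """Return (features, labels); labels are coin flips unrelated to the features."""
    features = [[int(rng.random() < 0.5) for _ in range(n_features)]
                for _ in range(n_rows)]
    labels = [int(rng.random() < 0.5) for _ in range(n_rows)]
    return features, labels

def accuracy(features, labels, k):
    """Accuracy of the candidate 'model' that predicts the label to be feature k."""
    hits = sum(1 for row, y in zip(features, labels) if row[k] == y)
    return hits / len(labels)

rng = random.Random(0)   # fixed seed, chosen arbitrarily
n_models = 200           # assumed number of candidate models

holdout_x, holdout_y = make_data(rng, 100, n_models)   # small hold out set
fresh_x, fresh_y = make_data(rng, 2000, n_models)      # stands in for production data

# Misuse: choose the model by looking at its hold out performance.
best = max(range(n_models), key=lambda k: accuracy(holdout_x, holdout_y, k))

holdout_acc = accuracy(holdout_x, holdout_y, best)
fresh_acc = accuracy(fresh_x, fresh_y, best)
print(f"hold out accuracy of the selected model: {holdout_acc:.2f}")
print(f"accuracy of the same model on fresh data: {fresh_acc:.2f}")
```

Under these assumptions the selected model should look clearly better than chance on the hold out data, while its accuracy on fresh data stays close to 50%. The gap comes entirely from having used the hold out data to make the selection.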
The cure is to really use test data as sketched above. This recommendation is well documented in the machine learning literature. See for instance Elem
If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.
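The three-way split described in this passage can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `three_way_split`, the 20% fractions and the seed are all assumptions made for the example, since the passage itself gives no proportions.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation and test lists.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                      # goes in the "vault"
    validation = shuffled[n_test:n_test + n_val]  # used for model selection
    train = shuffled[n_test + n_val:]             # used to fit the models
    return train, validation, test

rows = list(range(1000))  # stand-in for 1000 labelled examples
train, validation, test = three_way_split(rows)
print(len(train), len(validation), len(test))  # prints: 600 200 200
```

With such a split, every candidate model is fitted on `train` and compared on `validation`; `test` is evaluated once, for the final chosen model only.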
If this is not clear enough, here is a great slide by Kaggle grandmaster Owen:
This is spot on: if performance on hold out data is poor, then we must be brave enough to abandon the project instead of using the hold out data to select another model.