Correcting Overtraining
Thinking back on the logistic regression analysis previously performed, you recall that the training and holdout samples correctly predicted a similar percentage of cases, about 80%. By contrast, the neural network had a higher percentage of correct cases in the training sample, and the holdout sample did a considerably worse job of predicting customers who actually defaulted (45.8% correct for the holdout sample versus 59.7% for the training sample). Combined with the stopping rule reported in the model summary table, this makes you suspect that the network may be overtraining; that is, it is chasing spurious patterns that appear in the training data by random variation.
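The diagnostic here is the gap between training and holdout performance on actual defaulters. A minimal sketch of that check, using the percentages quoted above (the 10-point threshold is an illustrative assumption, not an SPSS rule):

```python
# Compare sensitivity (percent of actual defaulters correctly predicted)
# on the training sample versus the holdout sample. A large gap suggests
# the network may be overtraining.

def overtraining_gap(train_pct, holdout_pct, threshold=10.0):
    """Return the train-holdout gap and whether it exceeds the threshold."""
    gap = train_pct - holdout_pct
    return gap, gap > threshold

gap, suspect = overtraining_gap(59.7, 45.8)
print(round(gap, 1), suspect)  # 13.9 True
```

A similar training/holdout accuracy gap would not by itself prove overtraining, but it is the usual first sign to investigate.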
Fortunately, the solution is relatively simple: specify a testing sample to help keep the network "on track." We created the partition variable so that it would exactly recreate the training and holdout samples used in the logistic regression analysis; however, logistic regression has no concept of a "testing" sample. Let's take a portion of the training sample and reassign it to a testing sample.
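The reassignment step can be sketched outside of SPSS as follows. This is a minimal illustration, not the SPSS partition mechanism: the label values, the 20% testing fraction, and the fixed seed are all assumptions chosen for the example.

```python
# Starting from a two-way training/holdout split, reassign a random
# portion of the training cases to a new testing sample, leaving the
# holdout cases untouched so the holdout comparison stays valid.
import random

def add_testing_partition(partition, test_frac=0.2, seed=1):
    """partition: list of 'train'/'holdout' labels, one per case.
    Returns a new list in which test_frac of the training cases
    are relabeled 'test'."""
    rng = random.Random(seed)
    train_idx = [i for i, p in enumerate(partition) if p == "train"]
    n_test = int(len(train_idx) * test_frac)
    test_idx = set(rng.sample(train_idx, n_test))
    return ["test" if i in test_idx else p for i, p in enumerate(partition)]

labels = ["train"] * 80 + ["holdout"] * 20
new = add_testing_partition(labels)
print(new.count("train"), new.count("test"), new.count("holdout"))  # 64 16 20
```

Only training cases are eligible for reassignment; the holdout sample is deliberately left as-is so results remain comparable to the logistic regression analysis.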