Validation
Validation allows you to assess how well your tree structure generalizes to a larger population. Two validation methods are available: crossvalidation and split-sample validation.
Crossvalidation
Crossvalidation divides the sample into a number of subsamples, or folds. Tree models are then generated, excluding the data from each subsample in turn. The first tree is based on all of the cases except those in the first sample fold, the second tree is based on all of the cases except those in the second sample fold, and so on. For each tree, misclassification risk is estimated by applying the tree to the subsample excluded in generating it.
- You can specify a maximum of 25 sample folds. The higher the value, the fewer cases are excluded from each tree model.
- Crossvalidation produces a single, final tree model. The crossvalidated risk estimate for the final tree is calculated as the average of the risks for all of the trees.
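The fold-and-average procedure described above can be sketched in miniature. This is an illustrative sketch only, not SPSS's actual implementation: cases are shuffled into k folds, a model is fit on the cases outside each fold, misclassification risk is measured on the held-out fold, and the k risks are averaged. The `majority_fit` stand-in (a classifier that always predicts the majority class) is a hypothetical placeholder for the real tree-growing step.

```python
import random
from collections import Counter

def crossvalidated_risk(cases, labels, fit, k=10, seed=0):
    """Average misclassification risk over k folds (illustrative sketch)."""
    idx = list(range(len(cases)))
    random.Random(seed).shuffle(idx)        # random assignment to folds
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    risks = []
    for fold in folds:
        # Fit on every case NOT in this fold...
        train = [i for i in idx if i not in fold]
        model = fit([cases[i] for i in train], [labels[i] for i in train])
        # ...then estimate risk on the excluded fold.
        errors = sum(model(cases[i]) != labels[i] for i in fold)
        risks.append(errors / len(fold))
    return sum(risks) / k                   # crossvalidated risk estimate

# Hypothetical stand-in for tree growing: predict the training majority class.
def majority_fit(X, y):
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: majority
```

With 70 cases of one class and 30 of another, the majority model errs exactly on the minority cases, so the averaged estimate works out to 0.3.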
Split-Sample Validation
With split-sample validation, the model is generated using a training sample and tested on a hold-out sample.
- You can specify a training sample size, expressed as a percentage of the total sample size, or a variable that splits the sample into training and testing samples.
- If you use a variable to define training and testing samples, cases with a value of 1 for the variable are assigned to the training sample, and all other cases are assigned to the testing sample. The variable cannot be the dependent variable, weight variable, influence variable, or a forced independent variable.
- You can display results for both the training and testing samples or just the testing sample.
- Split-sample validation should be used with caution on data files with a small number of cases. Small training samples may yield poor models, since some categories may not contain enough cases to adequately grow the tree.
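The two assignment rules in the bullets above, a training percentage or an indicator variable where the value 1 marks the training sample, can be sketched as follows. This is an illustrative sketch of the assignment logic only, not SPSS internals; the function names are hypothetical.

```python
import random

def split_by_percentage(n_cases, train_pct, seed=0):
    """Randomly assign cases to training/testing by a training percentage."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    cut = round(n_cases * train_pct / 100)
    return sorted(idx[:cut]), sorted(idx[cut:])   # (training, testing)

def split_by_variable(indicator):
    """Cases with indicator value 1 go to training; all others to testing."""
    training = [i for i, v in enumerate(indicator) if v == 1]
    testing = [i for i, v in enumerate(indicator) if v != 1]
    return training, testing
```

For example, `split_by_variable([1, 0, 1, 2, 1])` places cases 0, 2, and 4 in the training sample and cases 1 and 3 in the testing sample: any value other than 1, not just 0, means testing.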
To Validate a Decision Tree
This feature requires the Decision Trees option.
- From the menus choose:
- In the main Decision Trees dialog, click Validation.
- Select Crossvalidation or Split-sample validation.
Note: Both validation methods randomly assign cases to sample groups. To reproduce the exact same results in a subsequent analysis, set the random number seed (Transform menu, Random Number Generators) before running the analysis the first time, and reset the seed to that same value before the subsequent analysis. See the topic Random Number Generators for more information.
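The effect of fixing the seed can be illustrated in miniature (Python's `random` module stands in here for the SPSS random number generator; this is a conceptual sketch, not SPSS behavior):

```python
import random

def assign_to_groups(n_cases, seed):
    """Randomly flag each case as training (True) or testing (False)."""
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(n_cases)]

# Resetting the seed to the same value reproduces the exact assignment.
first = assign_to_groups(20, seed=12345)
second = assign_to_groups(20, seed=12345)   # identical to `first`
```

Without fixing the seed, each run draws a different assignment, so validation results generally differ between runs.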