VALIDATION Subcommand (TREE command)

The VALIDATION subcommand allows you to assess how well your tree structure generalizes to a larger population.

Split-sample validation and cross-validation are available.
By default, validation is not performed.
If you want to be able to reproduce validated results later, use SET SEED before the TREE procedure to initialize the random number seed.
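
As a minimal sketch of a reproducible validation run, the seed can be set immediately before the procedure. The seed value 12345 is arbitrary and chosen only for illustration; the variable names are reused from the example in this topic.

```spss
* Fix the random number seed so the sample assignment can be repeated.
SET SEED=12345.
TREE risk [o] BY income age creditscore
 /VALIDATION TYPE=SPLITSAMPLE(50).
```

Rerunning both commands together on the same data, in the same case order, should reproduce the same training and testing samples.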

Each keyword in the subcommand is followed by an equals sign (=) and the value for that keyword.

Example

TREE risk [o] BY income age creditscore
 /VALIDATION TYPE=SPLITSAMPLE(25) OUTPUT=TESTSAMPLE.

TYPE Keyword

NONE. The tree model is not validated. This is the default.

SPLITSAMPLE(percent). Split-sample validation. The model is generated using a training sample and tested on a hold-out sample. The value or variable specified in parentheses determines the training sample size. Specify either a percentage greater than 0 and less than 100, or a numeric variable whose values determine how cases are assigned: cases with a value of 1 for the variable are assigned to the training sample, and all other cases are assigned to the testing sample. The variable cannot be the dependent variable, weight variable, influence variable, or a forced independent variable. See the topic INFLUENCE Subcommand (TREE command) for more information.

Note: Split-sample validation should be used with caution on small data files (data files with a small number of cases). Small training samples may yield poor models, since there may not be enough cases in some categories to adequately grow the tree.
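
The variable form of SPLITSAMPLE can be sketched as follows. The variable name training and the 70% proportion are hypothetical, chosen only for illustration. The sketch assumes that a 0/1 flag drawn with the RV.BERNOULLI function is an acceptable way to build the assignment variable; any numeric variable coded 1 for training cases serves the same purpose.

```spss
* Flag roughly 70 percent of cases for training (1 = training, 0 = testing).
SET SEED=12345.
COMPUTE training = RV.BERNOULLI(0.7).
EXECUTE.
* Use the flag variable, not a percentage, to define the samples.
TREE risk [o] BY income age creditscore
 /VALIDATION TYPE=SPLITSAMPLE(training) OUTPUT=BOTHSAMPLES.
```

Because the assignment is stored in a variable, the same partition can be reused across several TREE runs without relying on the random number seed.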

CROSSVALIDATION(value). Cross-validate the tree model. The sample is divided into a number of subsamples, or folds. Tree models are then generated, excluding the data from each fold in turn: the first tree is based on all of the cases except those in the first fold, the second tree is based on all of the cases except those in the second fold, and so on. For each tree, misclassification risk is estimated by applying the tree to the fold excluded in generating it. Specify a positive integer between 2 and 25 in parentheses. The higher the value, the fewer cases are excluded for each tree model. Cross-validation produces a single, final tree model. The cross-validated risk estimate for the final tree is calculated as the average of the risks for all of the trees. CROSSVALIDATION is ignored with a warning if FORCE is also specified.
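
A cross-validation request might look like the following sketch. The fold count of 10 is an arbitrary illustrative choice, and the variable names are reused from the example in this topic.

```spss
* Request 10-fold cross-validation of the tree model.
SET SEED=12345.
TREE risk [o] BY income age creditscore
 /VALIDATION TYPE=CROSSVALIDATION(10).
```

With 10 folds, each intermediate tree is grown on roughly 90% of the cases and its risk is estimated on the remaining 10%. The reported cross-validated risk is the average of those 10 estimates.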

OUTPUT Keyword

With split-sample validation, the OUTPUT keyword controls the output generated. This setting is ignored if SPLITSAMPLE is not specified.

BOTHSAMPLES. Output is produced for training and test samples. This is the default. Choose this option if you want to compare results for each partition.

TESTSAMPLE. Output is produced for the test sample only. Choose this option if you want validated results only.