VALIDATION Subcommand (TREE command)
The VALIDATION subcommand allows you to assess how well your tree structure generalizes to a larger population.
- Split-sample validation and cross-validation are available.
- By default, validation is not performed.
- If you want to be able to reproduce validated results later, use SET SEED before the TREE procedure to initialize the random number seed.
Each keyword in the subcommand is followed by an equals sign (=) and the value for that keyword.
Example
TREE risk [o] BY income age creditscore
/VALIDATION TYPE=SPLITSAMPLE(25) OUTPUT=TESTSAMPLE.
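A variation of the same example shows one way to make the validated results reproducible. This is a minimal sketch: the seed value 12345 is arbitrary and chosen only for illustration, and it assumes the active random number generator is the one controlled by SET SEED.

```
* Fix the random number seed before running TREE.
SET SEED=12345.
TREE risk [o] BY income age creditscore
  /VALIDATION TYPE=SPLITSAMPLE(25) OUTPUT=TESTSAMPLE.
```

Rerunning both commands with the same seed and the same data should assign the same cases to the training and testing samples.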
TYPE Keyword
NONE. The tree model is not validated. This is the default.
SPLITSAMPLE(percent). Split-sample validation. The model is generated using a training sample and tested on a hold-out sample. The value or variable specified in parentheses determines the training sample size. Enter a percent value greater than 0 and less than 100, or a numeric variable, the values of which determine how cases are assigned to the training or testing samples: cases with a value of 1 for the variable are assigned to the training sample, and all other cases are assigned to the testing sample. The variable cannot be the dependent variable, weight variable, influence variable or a forced independent variable. See the topic INFLUENCE Subcommand (TREE command) for more information.
Note: Split-sample validation should be used with caution on small data files (data files with a small number of cases). Small training sample sizes may yield poor models since there may not be enough cases in some categories to adequately grow the tree.
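The variable form of SPLITSAMPLE can be sketched as follows. The variable name trainflag, the seed value, and the 0.7 probability are hypothetical choices for illustration. The flag is built here with the RV.BERNOULLI function, which is only one possible way to create it; any numeric variable coded 1 for training cases would serve.

```
* Hypothetical flag: 1 assigns a case to training, 0 to testing.
SET SEED=12345.
COMPUTE trainflag = RV.BERNOULLI(0.7).
EXECUTE.
TREE risk [o] BY income age creditscore
  /VALIDATION TYPE=SPLITSAMPLE(trainflag) OUTPUT=BOTHSAMPLES.
```

Because the assignment is stored in the data file, the same partition can be reused across runs or compared across models, which a percent value alone does not guarantee.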
CROSSVALIDATION(value). Cross-validate the tree model. The sample is divided into a number of subsamples, or folds. Tree models are then generated, excluding the data from each subsample in turn. The first tree is based on all of the cases except those in the first sample fold, the second tree is based on all of the cases except those in the second sample fold, and so on. For each tree, misclassification risk is estimated by applying the tree to the subsample excluded in generating it. Specify an integer between 2 and 25 in parentheses. The higher the value, the fewer cases are excluded for each tree model. Cross-validation produces a single, final tree model. The cross-validated risk estimate for the final tree is calculated as the average of the risks for all of the trees. CROSSVALIDATION is ignored with a warning if FORCE is also specified.
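A cross-validated run might look like the following sketch, which reuses the variables from the earlier example. The choice of 10 folds and the seed value are assumptions made for illustration. OUTPUT is omitted because it applies only to split-sample validation.

```
* 10 folds: each of the 10 trees is grown with one fold held out.
SET SEED=12345.
TREE risk [o] BY income age creditscore
  /VALIDATION TYPE=CROSSVALIDATION(10).
```

With 10 folds, each intermediate tree is grown on roughly 90 percent of the cases and its risk is estimated on the remaining fold. The reported cross-validated risk is the average of those 10 estimates.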
OUTPUT Keyword
With split-sample validation, the OUTPUT keyword controls the output generated. This setting is ignored if SPLITSAMPLE is not specified.
BOTHSAMPLES. Output is produced for training and test samples. This is the default. Choose this option if you want to compare results for each partition.
TESTSAMPLE. Output is produced for the test sample only. Choose this option if you want validated results only.