Overview (TREE command)
The TREE procedure creates a tree-based model.
It classifies cases into groups or predicts values of a dependent
variable based on values of predictor variables. The procedure provides
validation tools for exploratory and confirmatory classification analysis.
Options
Model. You can specify the dependent (target) variable and one or more independent (predictor) variables. Optionally you can force one independent variable into the model as the first variable.
Growing Method. Four growing algorithms are available: CHAID (the default), Exhaustive CHAID, CRT, and QUEST. Each performs a type of recursive splitting. First, all predictors are examined to find the one that gives the best classification or prediction by splitting the sample into subgroups (nodes). The process is applied recursively, dividing the subgroups into smaller and smaller groups. It stops when one or more stopping criteria are met.
The four growing methods have different performance characteristics and features:
- CHAID chooses predictors that have the strongest interaction with the dependent variable. Predictor categories are merged if they are not significantly different with respect to the dependent variable (Kass, 1980).
- Exhaustive CHAID is a modification of CHAID that examines all possible splits for each predictor (Biggs et al., 1991).
- CRT is a family of methods that maximizes within-node homogeneity (Breiman et al., 1984).
- QUEST trees are computed rapidly, but the method is available only if the dependent variable is nominal (Loh and Shih, 1997).
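For example, the growing method is selected on the METHOD subcommand. The following sketch requests Exhaustive CHAID; the variable names are illustrative, and the exact keyword forms are described in the METHOD subcommand section:

TREE risk BY income age creditscore
  /METHOD TYPE=EXHAUSTIVECHAID.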
Stopping Criteria. You can set parameters that limit the size of the tree and control the minimum number of cases in each node.
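For example, tree size and node size limits can be set on the GROWTHLIMIT subcommand. This is a minimal sketch; the variable names and limit values are illustrative:

TREE risk BY income age creditscore
  /GROWTHLIMIT MAXDEPTH=3 MINPARENTSIZE=100 MINCHILDSIZE=50.

Here the tree is limited to three levels below the root, and nodes with fewer than 100 cases are not split; no child node with fewer than 50 cases is created.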
Validation. You can assess how well your tree structure generalizes to a larger sample. Split-sample partitioning and cross-validation are supported. Partitioning divides your data into a training sample, from which the tree is grown, and a testing sample, on which the tree is tested. Cross-validation involves dividing the sample into a number of smaller samples. Trees are generated excluding the data from each subsample in turn. For each tree, misclassification risk is estimated using data for the subsample that was excluded in generating it. A cross-validated risk estimate is calculated as the average risk across trees.
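For example, a cross-validated risk estimate can be requested on the VALIDATION subcommand. In this sketch the sample is divided into 10 subsamples; the variable names are illustrative:

TREE risk BY income age creditscore
  /VALIDATION TYPE=CROSSVALIDATE(10).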
Output. Default output includes a tree diagram and risk statistics. Classification accuracy is reported if the dependent variable is categorical. Optionally, you can obtain charts of gain- and profit-related measures as well as classification rules that can be used to select or score new cases. You can also save the model’s predictions to the active dataset, including assigned segment (node), predicted class/value, and predicted probability.
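For example, model predictions can be saved to the active dataset with the SAVE subcommand. This sketch saves the assigned node, the predicted value, and the predicted probability; the variable names are illustrative:

TREE risk BY income age creditscore
  /SAVE NODEID PREDVAL PREDPROB.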
Basic Specification
- The basic specification is a dependent variable and one or more independent variables.
Operations
- The tree is grown until one or more stopping criteria are met. The default growing method is CHAID.
- The type of model depends on the measurement level of the dependent variable. If the dependent variable is scale (continuous), a prediction model is computed. If it is categorical (nominal or ordinal), a classification model is generated.
- Measurement level determines allowable combinations of predictor values within a node. For ordinal and scale predictors, only adjacent categories/values may occur in a node. There are no restrictions on grouping of nominal categories.
- TREE honors the SET SEED value if split-sample model validation is requested.
- SPLIT FILE is ignored by the TREE procedure.
- If a WEIGHT variable is defined, the weights are treated as replication weights. Fractional weights are rounded.
Syntax Rules
- The minimum specification is a dependent variable, the keyword BY, and one or more independent variables.
- All subcommands are optional.
- Only a single instance of each subcommand is allowed.
- A keyword may be specified only once within a subcommand.
- Equals signs (=) shown in the syntax chart are required.
- Subcommand names and keywords must be spelled in full.
- Subcommands may be used in any order.
- CHAID and Exhaustive CHAID: A categorical dependent variable may not have more than 126 categories. If the dependent variable is categorical, then the limit for a categorical predictor is also 126 categories.
- CRT: A nominal predictor may not have more than 32 categories.
- QUEST: If a predictor is nominal, then the limit for the dependent variable (which must be nominal) is 127 categories. A nominal predictor may not have more than 25 categories.
Examples
TREE risk BY income age creditscore employment.
- A tree model is computed that estimates credit risk using an individual’s income, age, credit score, and employment category as predictor variables.
- The default method, CHAID, is used to grow the tree.
- Since measurement level is not specified, it is obtained from the data dictionary for each model variable. If no measurement levels have been defined, numeric variables are treated as scale and string variables are treated as nominal.
TREE risk [o] BY income [o] age [s] creditscore [s] employment [n]
/METHOD TYPE=CRT
/VALIDATION TYPE=SPLITSAMPLE
/SAVE NODEID PREDVAL.
- A tree model is computed that estimates credit risk using an individual’s income, age, credit score, and employment category as predictor variables.
- Age and credit score will be treated as scale variables, risk and income as ordinal, and employment category as nominal.
- The CRT method, which performs binary splits, is used to grow the tree.
- Split-sample validation is requested. By default, 50% of cases are assigned to the training sample. Remaining cases are used to validate the tree.
- Two variables are saved to the active dataset: node (segment) identifier and predicted value.