IBM Data WH Decision Tree Build Options

The following build options are available for tree growth:

Growth Measure. These options control the way tree growth is measured.

  • Impurity Measure. This measure evaluates the best place to split the tree. It is a measurement of the variability in a subgroup or segment of data. A low impurity measurement indicates a group where most members have similar values for the criterion or target field.

    The supported measurements are Entropy and Gini. These measurements are based on probabilities of category membership for the branch.

  • Maximum tree depth. The maximum number of levels to which the tree can grow below the root node, that is, the number of times the sample is split recursively. The default value of this property is 10, and the maximum value that you can set is 62.
    Note: If the viewer in the model nugget shows the textual representation of the model, a maximum of 12 levels of the tree is displayed.
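
As a rough illustration of the two impurity measures, the following Python sketch computes Entropy and Gini from the category proportions of the records in a branch. The function names are hypothetical and the formulas are the standard textbook definitions. The base-2 logarithm is an assumption, and the product's own implementation might differ in detail.

```python
import math
from collections import Counter

def category_probabilities(labels):
    """Return the proportion of each target category among the records in a branch."""
    if not labels:
        raise ValueError("a branch must contain at least one record")
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]

def entropy(labels):
    """Shannon entropy of the target categories (base-2 logarithm is an assumption)."""
    return sum(-p * math.log2(p) for p in category_probabilities(labels))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared category proportions."""
    return 1.0 - sum(p * p for p in category_probabilities(labels))
```

With either measure, a branch in which every record has the same target value scores 0, and the score rises as the categories become more evenly mixed.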

Splitting Criteria. These options control when to stop splitting the tree.

  • Minimum improvement for splits. The minimum amount by which impurity must be reduced before a new split is created in the tree. The goal of tree building is to create subgroups with similar output values, which minimizes the impurity within each node. If the best split for a branch reduces the impurity by less than this minimum, the branch is not split.
  • Minimum number of instances for a split. The minimum number of records that a node must contain before it can be split. When fewer than this number of unsplit records remain, no further splits are made. You can use this field to prevent the creation of small subgroups in the tree.
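
The two stopping checks above can be sketched in Python as follows. This is a minimal sketch, not the product's implementation: the function name and parameters are hypothetical, and measuring the improvement as the parent impurity minus the size-weighted average of the child impurities is an assumption.

```python
def should_split(parent_impurity, children, min_improvement, min_instances):
    """Return True if a candidate split passes both stopping checks.

    children is a list of (record_count, impurity) pairs, one per child node
    that the candidate split would create.
    """
    n_records = sum(count for count, _ in children)

    # Check 1: too few records remain in this node to split it.
    if n_records == 0 or n_records < min_instances:
        return False

    # Impurity after the split, weighting each child by its size (assumed).
    weighted = sum(count * impurity for count, impurity in children) / n_records

    # Check 2: the split must reduce impurity by at least the minimum improvement.
    return parent_impurity - weighted >= min_improvement
```

For example, `should_split(0.5, [(50, 0.0), (50, 0.0)], 0.02, 50)` accepts the split because it removes all impurity, whereas the same call with child impurities of 0.5 and 0.49 rejects it because the reduction is smaller than 0.02.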

Statistics. This parameter defines how many statistics are included in the model. Select one of the following options:

  • All. All column-related statistics and all value-related statistics are included.
    Note: This option includes the maximum number of statistics and might therefore affect the performance of your system. If you do not want to view the model in graphical format, specify None.
  • Columns. Column-related statistics are included.
  • None. Only statistics that are required to score the model are included.