Decision tree growing

Decision tree growing creates a decision tree from a data set. Splits are selected, and class labels are assigned to leaves when no further splits are required or possible.

Growing starts from a single root node. The input is a table that contains the training data set: several columns that represent attributes and a single column that represents the class attribute.

The expected output is a representation of the decision tree model that is based on the training data set.

When a node is split, each of the created descendant nodes corresponds to the subset of the training data set that matches the corresponding split outcome. Further splits of these nodes produce new nodes that correspond to progressively smaller subsets, and so on. Nodes that are not split further become leaves.
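The overall growing loop can be pictured as a short recursive procedure. The following sketch is illustrative rather than a definitive implementation: it assumes that each training instance is an (attributes, class_label) pair, and the helpers should_stop, choose_split, and partition are hypothetical placeholders whose possible forms are sketched after the list of operations below.

```python
from collections import Counter

class Node:
    """One node of the growing tree; the root node has level (depth) 1."""
    def __init__(self, subset, depth):
        # Operation 2: label every node with the most frequent class in its subset.
        self.label = Counter(label for _, label in subset).most_common(1)[0][0]
        self.depth = depth
        self.split = None         # chosen split; stays None for leaves
        self.children = {}        # split outcome -> child Node

def grow(subset, depth, should_stop, choose_split, partition):
    """Recursively grow a tree for one subset of the training data set."""
    node = Node(subset, depth)
    if should_stop(subset, depth):                 # operation 1: stopping criteria
        return node                                # node becomes a leaf
    node.split = choose_split(subset)              # operation 3: split selection
    if node.split is None:                         # no impurity-improving split
        return node
    for outcome, child_subset in partition(subset, node.split).items():
        node.children[outcome] = grow(child_subset, depth + 1,
                                      should_stop, choose_split, partition)
    return node
```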

Decision tree growing is a repetition of the following operations:

  1. Stopping criteria

    This operation determines whether a node is split, or whether it becomes a leaf because it is not split further. The decision is based on the subset of the training data set that corresponds to the node; a sketch of this check is shown after this list.

    No split is done if any of the following conditions applies:

    • All instances in the corresponding subset are of the same class
    • The number of instances in the corresponding subset is less than a specified minimum
    • The level of the current node is greater than a specified maximum, where the level of the root node is 1, and the level of its descendant nodes is 2, and so on
    • The improvement in class impurity that is achieved by the best available split is less than a specified minimum
  2. Class label assignment

    The most frequent class in the corresponding subset is associated with each tree node; a sketch of this majority-class rule is shown after this list. This operation is useful when the tree is subsequently pruned, because pruning turns some internal nodes into leaves. It also makes the tree structure easier to read. Class labels are assigned to all nodes, not only to leaves.

  3. Split selection

    This operation assigns minimum-impurity splits to nodes for which the stopping criteria were not satisfied. The set of candidate splits includes binary equality-based splits for all discrete attributes and binary inequality-based splits for all continuous attributes. To evaluate a candidate split, an impurity measure is applied to the subsets that correspond to its outcomes; a sketch of this evaluation follows.
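The stopping criteria of operation 1 might be checked as in the following sketch. The threshold names min_instances, max_level, and min_improvement are assumed for illustration and do not correspond to documented option names. Because the impurity-improvement condition depends on the best available split, the sketch accepts that value as an optional argument; in the grow sketch above the same condition is covered implicitly when choose_split returns no split.

```python
def should_stop(subset, depth, min_instances=2, max_level=10,
                min_improvement=0.01, best_split_improvement=None):
    """Return True if the node is not split further and becomes a leaf."""
    classes = {label for _, label in subset}
    if len(classes) <= 1:                  # all instances are of the same class
        return True
    if len(subset) < min_instances:        # subset is smaller than the minimum
        return True
    if depth > max_level:                  # node level exceeds the maximum
        return True
    if (best_split_improvement is not None
            and best_split_improvement < min_improvement):
        return True                        # best split improves impurity too little
    return False
```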
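Class label assignment (operation 2) amounts to taking the most frequent class of the node's subset, which the Node constructor in the first sketch already does inline. Factored out as a helper, assuming the same row format, it is a one-liner:

```python
from collections import Counter

def majority_class(subset):
    """Most frequent class label in the subset; ties are broken arbitrarily."""
    return Counter(label for _, label in subset).most_common(1)[0][0]
```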
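Split selection (operation 3) can be sketched with the Gini index as the impurity measure; the text does not prescribe a particular measure, so Gini is an assumption. Candidate splits are binary: an equality test for each observed value of a discrete attribute and an inequality test for each observed value of a continuous (numeric) attribute. The candidate whose outcome subsets have the lowest weighted impurity is returned, and a partition helper produces the outcome subsets that the grow sketch above expects.

```python
from collections import Counter

def gini(subset):
    """Gini impurity of a subset; 0 means all instances share one class."""
    counts = Counter(label for _, label in subset)
    total = len(subset)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def weighted_impurity(subsets):
    """Average impurity of the outcome subsets, weighted by their sizes."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets if s)

def choose_split(subset):
    """Return the minimum-impurity binary split as (attribute, value, test)."""
    best, best_impurity = None, float("inf")
    for attr in subset[0][0]:
        for value in {row[attr] for row, _ in subset}:
            if isinstance(value, (int, float)):
                # Numeric values are treated as continuous: inequality-based split.
                test = lambda row, a=attr, v=value: row[a] <= v
            else:
                # Discrete attribute: equality-based split.
                test = lambda row, a=attr, v=value: row[a] == v
            outcomes = [[(r, c) for r, c in subset if test(r)],
                        [(r, c) for r, c in subset if not test(r)]]
            if not outcomes[0] or not outcomes[1]:
                continue                   # split does not separate any instances
            impurity = weighted_impurity(outcomes)
            if impurity < best_impurity:
                best, best_impurity = (attr, value, test), impurity
    return best

def partition(subset, split):
    """Split the subset into the two outcome subsets of the chosen test."""
    _attr, _value, test = split
    return {"true": [(r, c) for r, c in subset if test(r)],
            "false": [(r, c) for r, c in subset if not test(r)]}
```

With these pieces together, grow(training_rows, 1, should_stop, choose_split, partition) grows a tree in which every node carries a majority-class label and the leaves are the nodes for which a stopping criterion applied.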