Decision tree

Decision trees are more complex models than the one-way and two-way drivers. They extend this sequence as combination models. The main difference is that decision trees support discovery of interactions among multiple predictors, and therefore provide deeper insights than the drivers.

Overview

Given the target field, the algorithm searches across all other data fields and adds them to the model to improve its strength in predicting the target values. The search across different predictors is iterative; after the search adds one predictor, it continues to add the next predictor that improves the model the most. The goal is to find the best set of predictors and an optimal way of combining them so that an optimal model is computed.

The insights that are obtained from decision trees are presented in the form of decision rules, where a combination of predictors and their corresponding values provides a single prediction for the target value. Decision rules are ranked by strength so that you can easily find the rules that are the most relevant and interesting. The decision rules that are generated by the decision tree are mutually exclusive, and together they form a complete rule set: a corresponding rule exists for any combination of predictor values in the data.

Also available is the overall decision tree predictive strength, which provides the relative improvement over the basic model. The results are available through three different visualizations: sunburst, tree, and decision rules. Each has certain advantages in displaying the decision tree structure and the content of the corresponding decision rules. Overall decision tree predictive strength is also available in the driver analysis visualization.
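For example, with customer churn as a hypothetical target, a single decision rule might read: IF contract type is month-to-month AND tenure is 12 months or fewer, THEN the predicted churn is yes. Because the rule set is mutually exclusive and complete, every row in the data matches exactly one such rule.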

Algorithms

The decision tree model is computed after data preparation and after all the one-way drivers are built. The first tree predictor is selected as the top one-way driver. Categories of the predictor are merged when the adverse impact on the predictive strength is smaller than a certain threshold. The next step is to find the best predictor to split each tree node that consists of the merged categories. The process continues until a stopping rule applies to a tree node. Possible stopping conditions are that all categories for every candidate predictor are merged into a single node, or that the number of nodes exceeds the maximum number of nodes. Categories with fewer than a minimum number of rows are always merged with another category, so none of the nodes in the tree can contain fewer than the minimum number of rows. The same procedure is used for continuous and categorical targets; only the impurity function differs.
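A minimal sketch, in Python, of this growing loop follows. The names grow_tree and best_split are illustrative, not product APIs; the split search is passed in as a callable (an impurity computation that such a callable can use is sketched under Impurity functions), and MAX_NODES follows the documented maximum from the Stopping rules section.

    MAX_NODES = 36  # documented maximum node count (see Stopping rules)

    def grow_tree(rows, best_split):
        # Grow the tree until no node can be split or the node count
        # exceeds MAX_NODES. best_split(rows) returns a list of child
        # row subsets with merged categories, or None when a stopping
        # rule applies to the node.
        root = {"rows": rows, "children": []}
        frontier = [root]
        node_count = 1
        while frontier and node_count <= MAX_NODES:
            node = frontier.pop(0)
            children = best_split(node["rows"])
            if not children:
                continue  # stopping rule applies; node stays a leaf
            node["children"] = [{"rows": c, "children": []} for c in children]
            frontier.extend(node["children"])
            node_count += len(children)
        return root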

Details

Impurity functions

Impurity function values are used as the main criterion for splitting and merging potential tree nodes. The impurity function total for continuous targets is the sum of squares per node, while the Gini impurity measure is used for categorical targets. The Gini impurity total per node is computed by subtracting the sum of squared count proportions across all target categories from one, and multiplying the result by the number of rows in the node. Improvement in the impurity function value is the information gain.
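A minimal Python sketch of the two impurity totals, following the definitions above; the function names are illustrative:

    from collections import Counter

    def sum_of_squares_total(values):
        # Continuous target: sum of squared deviations from the node mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    def gini_total(labels):
        # Categorical target: one minus the sum of squared category
        # proportions, multiplied by the number of rows in the node.
        n = len(labels)
        proportions = (count / n for count in Counter(labels).values())
        return (1.0 - sum(p ** 2 for p in proportions)) * n

    # For example, a node with 90 "yes" rows and 10 "no" rows:
    # gini_total(["yes"] * 90 + ["no"] * 10) evaluates to about 18.0,
    # that is, (1 - (0.9**2 + 0.1**2)) * 100.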

When splitting each node, IBM® Cognos Analytics with Watson looks for the predictor field with the largest information gain, computed as the total impurity across all potential child nodes subtracted from the parent node impurity. Before Cognos Analytics selects the predictor, it attempts to merge some of the potential child nodes that initially correspond to each predictor category. Information loss is computed by subtracting the impurity of the non-merged nodes from the impurity of the merged node. As long as the information loss is smaller than a threshold, the nodes are merged. This process helps to create relatively small trees that are easy to visualize and comprehend, while still preserving the overall strength of the tree.
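The following is a sketch of the merge step under the assumption of a greedy pairwise strategy, which this documentation does not confirm; impurity_total can be gini_total or sum_of_squares_total from the previous sketch, and the threshold value is a placeholder because the product threshold is not documented here.

    def merge_children(children, impurity_total, threshold):
        # Repeatedly merge the pair of candidate child nodes whose merge
        # loses the least information, and stop as soon as every
        # remaining merge would lose at least the threshold. Each child
        # is a list of target values.
        children = [list(c) for c in children]
        while len(children) > 1:
            best = None
            for i in range(len(children)):
                for j in range(i + 1, len(children)):
                    # Information loss: impurity of the merged node minus
                    # the impurity of the non-merged nodes.
                    loss = impurity_total(children[i] + children[j]) - (
                        impurity_total(children[i]) + impurity_total(children[j]))
                    if best is None or loss < best[0]:
                        best = (loss, i, j)
            loss, i, j = best
            if loss >= threshold:
                break  # merging would lose too much information
            children[i] = children[i] + children[j]
            del children[j]
        return children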

Stopping rules

Candidate nodes are always merged if they are based on fewer than 25 rows. If all categories of a predictor are merged, the predictor cannot be used for splitting that node. When none of the predictors can split a specific node, the process stops for that node. The overall process of generating the tree stops when none of the nodes can be split or when the number of generated nodes exceeds 36.

Variable importance

Variable importance corresponds to the relative tree error reduction when the corresponding predictor is included in the tree. It is computed by comparing the errors of the initial tree and a restricted tree that is generated from the remaining predictors in the initial tree. The error of the initial tree is subtracted from the error of the restricted tree, and the result is divided by the error of the restricted tree. Variables with zero or negative importance are removed from the tree. The tree error is computed as the sum of squares for continuous targets and as the classification error for categorical targets.
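The importance formula reduces to a one-line computation. In this sketch, the two error values are assumed to come from the initial tree and from the restricted tree that is generated without the predictor:

    def variable_importance(initial_tree_error, restricted_tree_error):
        # Relative error reduction from including the predictor;
        # variables scoring zero or less are removed from the tree.
        return (restricted_tree_error - initial_tree_error) / restricted_tree_error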

Predictive strength

Predictive strength for a tree with a continuous target is computed similarly to key drivers, based on the contents of the leaf nodes. The variance contribution of each leaf node is added up and divided by the overall variance for the data; this is the relative error for the tree. It is subtracted from one to obtain a predictive strength that is compatible with the R-squared measure that is used for key drivers.
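As a sketch, assuming sum_of_squares_total from the Impurity functions example and leaves that are given as lists of target values:

    def predictive_strength_continuous(leaves):
        # Relative error: summed leaf variance contributions divided by
        # the overall variance of the data; strength is its complement.
        all_values = [v for leaf in leaves for v in leaf]
        within = sum(sum_of_squares_total(leaf) for leaf in leaves)
        return 1.0 - within / sum_of_squares_total(all_values)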

For categorical targets, Cognos Analytics computes classification accuracy based on the classification error summed over all the leaf nodes. The relative classification accuracy improvement over the basic model, also known as adjusted count R-square, is reported as the tree predictive strength. It is computed by subtracting the tree error from the basic model error and dividing the result by the basic model error. For example, the classification accuracy of the model can be as high as 95%, but if the majority class appears in 90% of the rows in the data, then the predictive strength of the tree is reported as only 50%. This is parallel to the continuous target case, where the basic model is represented by the overall mean value and predictive strength measured by R-squared is based on the tree's relative improvement in reducing the overall variance.
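In code, the same computation and the worked example look as follows; a tree error of 5% against a basic model error of 10% yields the reported 50%:

    def predictive_strength_categorical(tree_error, basic_model_error):
        # Adjusted count R-square: relative accuracy improvement over a
        # basic model that always predicts the majority class.
        return (basic_model_error - tree_error) / basic_model_error

    # 95% accuracy with a 90% majority class: (0.10 - 0.05) / 0.10 -> 0.5
    print(predictive_strength_categorical(0.05, 0.10))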

Cognos Analytics displays only the trees that have a predictive strength larger than 10%. A tree for a continuous target is displayed in a driver analysis or spiral visualization only if its predictive strength is higher than the predictive strength of the strongest key driver. Otherwise, it is not displayed in these charts because the key drivers already provide all the relevant insights.

Predictive strength for a decision tree is computed by using the same data that is used for generating the decision tree. This is known to introduce bias and to provide optimistic estimates of the decision tree performance on similar data from the same data source. Cognos Analytics reduces this discrepancy by tuning the algorithm so that overfitting the training data is minimized.