IBM Data WH Decision Tree Node - Tree Pruning
You can use the pruning options to specify pruning criteria for the decision tree. The intention of pruning is to reduce the risk of overfitting by removing overgrown subgroups that do not improve the expected accuracy on new data.
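The idea described above — collapsing a subtree to a leaf when the subtree does not improve expected accuracy on held-out data — can be illustrated with a minimal sketch. This is a toy version of reduced-error pruning under assumed names (`Node`, `prune`); it is not the node's actual implementation.

```python
# A minimal sketch of reduced-error pruning. Each node knows how many
# held-out records it would classify correctly if it were a leaf; names
# and structure here are illustrative assumptions.

class Node:
    def __init__(self, correct, total, children=None):
        self.correct = correct        # correct predictions if this node is a leaf
        self.total = total            # held-out records reaching this node
        self.children = children or []

    def subtree_correct(self):
        """Correct predictions made by the full subtree below this node."""
        if not self.children:
            return self.correct
        return sum(c.subtree_correct() for c in self.children)

def prune(node):
    """Collapse a subtree to a leaf when that does not lose accuracy."""
    for child in node.children:
        prune(child)
    if node.children and node.correct >= node.subtree_correct():
        node.children = []            # the split did not help: remove it

# A split whose children together are no better than the parent leaf
root = Node(correct=8, total=10, children=[Node(5, 6), Node(3, 4)])
prune(root)
print(len(root.children))  # 0 -> subtree removed
```

The subtree is kept only when it strictly improves the estimated accuracy, which is why overgrown subgroups that merely fit noise in the training data get removed.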
Pruning measure. The default pruning measure, Accuracy, ensures that the estimated accuracy of the model remains within acceptable limits after removing a leaf from the tree. Use the alternative, Weighted Accuracy, if you want to take the class weights into account while applying pruning.
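The difference between the two measures can be sketched as follows. This is an illustrative example, assuming per-class weights supplied as a dictionary; the function names are not part of the product.

```python
# Sketch of the two pruning measures on a toy held-out set. With
# Weighted Accuracy, each record counts by its true class's weight,
# so errors on a highly weighted (e.g. rare) class hurt more.

def accuracy(y_true, y_pred):
    """Plain accuracy: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_accuracy(y_true, y_pred, class_weights):
    """Accuracy with each record weighted by its true class's weight."""
    correct = sum(class_weights[t] for t, p in zip(y_true, y_pred) if t == p)
    total = sum(class_weights[t] for t in y_true)
    return correct / total

y_true = ["yes", "yes", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "yes"]
weights = {"yes": 3.0, "no": 1.0}   # assumed class weights: "yes" counts triple

print(accuracy(y_true, y_pred))                     # 3/5 = 0.6
print(weighted_accuracy(y_true, y_pred, weights))   # 5/9, about 0.556
```

Because the missed "yes" record carries a weight of 3, the weighted measure drops further than the plain one; pruning under Weighted Accuracy therefore protects branches that matter for the heavily weighted class.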
Data for pruning. You can use some or all of the training data to estimate the expected accuracy on new data. Alternatively, you can use a separate pruning dataset from a specified table for this purpose.
- Use all training data. This option (the default) uses all the training data to estimate the model accuracy.
- Use % of training data for pruning. Use this option to split the data into two sets, one for training and one for pruning, using the percentage specified here for the pruning data.
Select Replicate results if you want to specify a random seed, which ensures that the data is partitioned in the same way each time you run the stream. You can either specify an integer in the Seed used for pruning field or click Generate, which creates a pseudo-random integer for you.
- Use data from an existing table. Specify the table name of a separate pruning dataset for estimating model accuracy. Doing so is considered more reliable than using training data. However, this option may result in the removal of a large subset of data from the training set, thus reducing the quality of the decision tree.
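The percentage-split option above can be sketched as a seeded, reproducible partition. This is only an illustration of the idea, assuming a simple shuffle-and-cut scheme; the node's actual partitioning logic may differ.

```python
import random

# Illustrative sketch of "Use % of training data for pruning" with a
# replicable seed: the same seed always yields the same partition,
# which is what the Replicate results option guarantees.

def split_for_pruning(rows, pruning_pct, seed):
    rng = random.Random(seed)            # fixed seed -> deterministic shuffle
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_prune = round(len(shuffled) * pruning_pct / 100)
    # First slice is held out for pruning, the rest is used for training
    return shuffled[n_prune:], shuffled[:n_prune]

rows = list(range(10))
train, prune = split_for_pruning(rows, 30, seed=42)

# Rerunning with the same seed reproduces the identical partition
train2, prune2 = split_for_pruning(rows, 30, seed=42)
assert (train, prune) == (train2, prune2)
```

With a 30% pruning share, 3 of the 10 rows are held out for pruning and 7 remain for training, and every run with seed 42 produces the same two sets.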