IBM® Db2® for z/OS® models - Regression Tree build options - Tree Pruning

You can use the pruning options to specify pruning criteria for the regression tree. The intention of pruning is to reduce the risk of overfitting by removing overgrown subgroups that do not improve the expected accuracy on new data.
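As a rough illustration of the idea, the following minimal sketch applies generic reduced-error pruning to a tiny hand-built tree. It is not the algorithm that Db2 for z/OS uses internally. The dictionary-based tree layout and the names predict, tree_mse, and prune are hypothetical and chosen only for this example.

```python
# Generic reduced-error pruning on a toy regression tree (illustrative only).
# Assumed layout: every node stores "value" (the mean target of its training
# rows); internal nodes also store "feature", "threshold", "left", "right".

def predict(node, row):
    """Walk from the given node down to a leaf and return the leaf value."""
    while "left" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]

def tree_mse(root, rows, targets):
    """Mean squared error of the whole tree on the pruning data."""
    return sum((predict(root, r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def prune(node, root, rows, targets):
    """Bottom-up: turn a subtree into a leaf unless that increases the error."""
    if "left" not in node:
        return
    prune(node["left"], root, rows, targets)
    prune(node["right"], root, rows, targets)
    before = tree_mse(root, rows, targets)
    saved = dict(node)  # shallow copy so the split can be restored
    for key in ("feature", "threshold", "left", "right"):
        del node[key]
    if tree_mse(root, rows, targets) > before:
        node.update(saved)  # pruning hurt accuracy, so keep the split

# Toy tree whose right-hand subtree is overgrown.
tree = {
    "feature": 0, "threshold": 5, "value": 7.0,
    "left": {"value": 1.0},
    "right": {
        "feature": 0, "threshold": 8, "value": 10.0,
        "left": {"value": 9.0},
        "right": {"value": 11.0},
    },
}
pruning_rows = [[2], [7], [9]]
pruning_targets = [1.0, 11.0, 9.0]
prune(tree, tree, pruning_rows, pruning_targets)
```

In this toy case the right-hand subtree is collapsed into a single leaf because doing so lowers the error on the pruning data, while the root split is kept because removing it would raise the error.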

Pruning measure. The pruning measure ensures that the estimated accuracy of the model remains within acceptable limits after removing a leaf from the tree. You can select one of the following measures.

  • mse. Mean squared error - (default) measures the average squared difference between the predicted and the actual values. Lower values indicate a closer fit.
  • r2. R-squared - measures the proportion of variation in the dependent variable explained by the regression model.
  • Pearson. Pearson's correlation coefficient - measures the strength of relationship between linearly dependent variables that are normally distributed.
  • Spearman. Spearman's correlation coefficient - a rank-based measure that detects monotonic nonlinear relationships that appear weak according to Pearson's correlation, but which may actually be strong.
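To make the differences between the four measures concrete, the following sketch computes each one from a list of actual and predicted values by using their standard textbook definitions. It is an illustration only, not the product's implementation, and the function names mse, r2, pearson, ranks, and spearman are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average squared difference (lower is better)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's coefficient: Pearson's correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotonic but nonlinear relationship (y = x squared).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
linear_strength = pearson(actual, predicted)
monotonic_strength = spearman(actual, predicted)
```

For the sample data, Spearman's coefficient reports a perfect monotonic relationship, whereas Pearson's coefficient is lower because the relationship is not linear.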

Data for pruning. You can use some or all of the training data to estimate the expected accuracy on new data. Alternatively, you can use a separate pruning dataset from a specified table for this purpose.

  • Use all training data. This option (the default) uses all the training data to estimate the model accuracy.
  • Use % of training data for pruning. Use this option to split the data into two sets, one for training and one for pruning, using the percentage specified here for the pruning data.

    Select Replicate results if you want to specify a random seed to ensure that the data is partitioned in the same way each time you run the stream. You can either specify an integer in the Seed used for pruning field, or click Generate, which will create a pseudo-random integer.

  • Use data from an existing table. Specify the table name of a separate pruning dataset for estimating model accuracy. Because this data is not used to build the tree, the resulting estimate is considered more reliable than one based on the training data.
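The effect of the percentage split and the Replicate results seed can be pictured with the following minimal sketch. It assumes a simple shuffle-based partition. The way the product actually partitions the rows is not documented here, and the function split_for_pruning and its parameters are hypothetical.

```python
import random

def split_for_pruning(rows, pruning_percent, seed=None):
    """Split rows into (training, pruning) sets.

    pruning_percent is the share of rows reserved for pruning (0-100).
    Passing the same seed reproduces the same partition on every run;
    with seed=None the partition generally differs between runs.
    """
    rng = random.Random(seed)  # private generator, global state untouched
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    pruning_count = round(len(rows) * pruning_percent / 100)
    pruning_indices = set(indices[:pruning_count])
    training = [r for i, r in enumerate(rows) if i not in pruning_indices]
    pruning = [r for i, r in enumerate(rows) if i in pruning_indices]
    return training, pruning

rows = list(range(100))
train_a, prune_a = split_for_pruning(rows, pruning_percent=30, seed=12345)
train_b, prune_b = split_for_pruning(rows, pruning_percent=30, seed=12345)
```

Both calls use the same seed, so they return identical training and pruning sets. This is the behavior that a fixed seed is intended to guarantee.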