C5.0 Node Model Options
This feature is available in SPSS® Modeler Professional and SPSS Modeler Premium.
Model name. Specify the name of the model to be produced.
- Auto. With this option selected, the model name will be generated automatically, based on the target field name(s). This is the default.
- Custom. Select this option to specify your own name for the model nugget that will be created by this node.
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. See Building Split Models for more information.
Output type. Specify whether the resulting model nugget is a Decision tree or a Rule set.
Group symbolics. If this option is selected, C5.0 will attempt to combine symbolic values that have similar patterns with respect to the output field. If this option is not selected, C5.0 will create a child node for every value of the symbolic field used to split the parent node. For example, if C5.0 splits on a COLOR field (with values RED, GREEN, and BLUE), it will create a three-way split by default. However, if this option is selected, and the records where COLOR = RED are very similar to records where COLOR = BLUE, it will create a two-way split, with the GREENs in one group and the BLUEs and REDs together in the other.
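The COLOR example can be illustrated with a minimal sketch. This is not the C5.0 grouping procedure, whose internals are not described here. It is a simplified stand-in that groups values whose target-class distributions are close. The function names, the BUY target field, the sample records, and the `tolerance` threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def class_distribution(records, field, target):
    """Map each value of `field` to its distribution over `target` classes."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[field]][rec[target]] += 1
    return {
        value: {cls: n / sum(c.values()) for cls, n in c.items()}
        for value, c in counts.items()
    }

def group_similar_values(dist, tolerance=0.15):
    """Greedily group values whose class distributions differ by at most
    `tolerance` (an illustrative threshold, not a C5.0 parameter)."""
    groups = []
    for value in sorted(dist):
        for group in groups:
            rep = dist[group[0]]  # compare against the group's first member
            classes = set(rep) | set(dist[value])
            gap = max(abs(rep.get(c, 0.0) - dist[value].get(c, 0.0)) for c in classes)
            if gap <= tolerance:
                group.append(value)
                break
        else:
            groups.append([value])  # no similar group found: start a new one
    return groups

# Hypothetical training records: RED and BLUE behave alike, GREEN does not.
records = (
    [{"COLOR": "RED", "BUY": "yes"}] * 8 + [{"COLOR": "RED", "BUY": "no"}] * 2
    + [{"COLOR": "BLUE", "BUY": "yes"}] * 7 + [{"COLOR": "BLUE", "BUY": "no"}] * 3
    + [{"COLOR": "GREEN", "BUY": "yes"}] * 2 + [{"COLOR": "GREEN", "BUY": "no"}] * 8
)
groups = group_similar_values(class_distribution(records, "COLOR", "BUY"))
print(groups)
```

With this sample data, the sketch puts RED and BLUE in one group and GREEN in another, which corresponds to the two-way split described above.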
Use boosting. C5.0 can improve its accuracy rate with a method called boosting, which builds multiple models in sequence. The first model is built in the usual way. The second model is built to focus on the records that the first model misclassified. The third model is built to focus on the second model's errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option controls how many models are used for the boosted model.
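The weighted voting step can be sketched as follows. This shows only how separate predictions might be combined into one. It does not show how C5.0 builds the models or derives their weights. The function name, the labels, and the weight values are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across all models."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three boosted models classify the same case. The weights are illustrative.
predictions = ["churn", "stay", "stay"]
weights = [0.9, 0.3, 0.4]
winner = weighted_vote(predictions, weights)
print(winner)
```

In this example the single heavily weighted model outvotes the other two, so a weighted vote can differ from a simple majority vote.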
Cross-validate. If this option is selected, C5.0 will use a set of models built on subsets of the training data to estimate the accuracy of a model built on the full dataset. This is useful if your dataset is too small to split into traditional training and testing sets. The cross-validation models are discarded after the accuracy estimate is calculated. You can specify the number of folds, that is, the number of models used for cross-validation. Note that in previous versions of IBM® SPSS Modeler, building the model and cross-validating it were two separate operations. In the current version, no separate model-building step is required. Model building and cross-validation are performed at the same time.
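The idea of estimating accuracy from fold models can be sketched in plain Python. This is a generic k-fold procedure under stated assumptions, not the SPSS Modeler implementation. The function names, the interleaved fold assignment, and the toy majority-class learner are all hypothetical.

```python
from collections import Counter

def k_fold_indices(n_records, n_folds):
    """Yield (train, held_out) index lists, one pair per fold."""
    folds = [list(range(i, n_records, n_folds)) for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_records) if i not in held]
        yield train, held_out

def cross_validated_accuracy(records, labels, n_folds, fit, predict):
    """Build one model per fold and score it on that fold's held-out records."""
    correct = 0
    for train, held_out in k_fold_indices(len(records), n_folds):
        model = fit([records[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, records[i]) == labels[i] for i in held_out)
    # The fold models are discarded here; only the estimate is kept.
    return correct / len(records)

# Toy learner standing in for C5.0: always predicts the majority class.
def fit_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, record):
    return model

records = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
accuracy = cross_validated_accuracy(records, labels, 5, fit_majority, predict_majority)
print(accuracy)
```

Each record is held out exactly once, so every record contributes to the estimate even when the dataset is too small for a separate test set.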
Mode. For Simple training, most of the C5.0 parameters are set automatically. Expert training allows more direct control over the training parameters.
Simple Mode Options
Favor. By default, C5.0 will try to produce the most accurate tree possible. In some instances, this can lead to overfitting, which can result in poor performance when the model is applied to new data. Select Generality to use algorithm settings that are less susceptible to this problem.
Note: Models built with the Generality option selected are not guaranteed to generalize better than other models. When generality is a critical issue, always validate your model against a held-out test sample.
Expected noise (%). Specify the expected percentage of noisy or erroneous data in the training set.
Expert Mode Options
Pruning severity. Determines the extent to which the decision tree or rule set will be pruned. Increase this value to obtain a smaller, more concise tree. Decrease it to obtain a more accurate tree. This setting affects local pruning only (see "Use global pruning" below).
Minimum records per child branch. The size of subgroups can be used to limit the number of splits in any branch of the tree. A branch of the tree will be split only if two or more of the resulting subbranches would contain at least this many records from the training set. The default value is 2. Increase this value to help prevent overtraining with noisy data.
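The splitting condition can be expressed as a short check. This is a minimal sketch of the rule as stated above. The function name and the sample subbranch sizes are hypothetical.

```python
def split_allowed(subbranch_sizes, min_records=2):
    """Return True if at least two subbranches would each contain
    at least `min_records` training records."""
    qualifying = sum(1 for size in subbranch_sizes if size >= min_records)
    return qualifying >= 2

print(split_allowed([5, 1, 1]))                 # only one subbranch qualifies
print(split_allowed([5, 3, 1]))                 # two subbranches qualify
print(split_allowed([5, 3, 1], min_records=4))  # a higher minimum blocks the split
```

Raising the minimum makes candidate splits fail this check more often, which is why a larger value yields fewer splits on noisy data.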
Use global pruning. Trees are pruned in two stages. First, a local pruning stage examines subtrees and collapses branches to increase the accuracy of the model. Second, a global pruning stage considers the tree as a whole and may collapse weak subtrees. Global pruning is performed by default. To omit the global pruning stage, deselect this option.
Winnow attributes. If this option is selected, C5.0 will examine the usefulness of the predictors before starting to build the model. Predictors that are found to be irrelevant are then excluded from the model-building process. This option can be helpful for models with many predictor fields and can help prevent overfitting.