Auto Numeric node model options
The Model tab of the Auto Numeric node enables you to specify the number of models to be saved, along with the criteria used to compare models.
Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Cross-validate. Cross-validation trains the model on a subset of known data (a training dataset) and tests it against held-out data (a validation dataset or testing set). The goal of cross-validation is to assess the model's ability to predict new data that was not used in estimating it, in order to flag problems such as overfitting or selection bias.
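The idea can be illustrated with a minimal k-fold sketch in plain Python. The dataset and the mean-value "model" below are illustrative stand-ins, not what the Auto Numeric node runs internally:

```python
# Minimal k-fold cross-validation sketch (standard library only).
# Each fold is held out once as the test set while the remaining
# records form the training set.

def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

def cross_validate(y, k=5):
    """Score a trivial model (predict the training mean) on each fold."""
    errors = []
    for train, test in k_fold_splits(len(y), k):
        mean = sum(y[i] for i in train) / len(train)        # "train"
        mse = sum((y[i] - mean) ** 2 for i in test) / len(test)
        errors.append(mse)
    return sum(errors) / len(errors)                        # average test error

score = cross_validate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], k=5)
```

Because every record is used for testing exactly once, the averaged error estimates how the model would perform on data it has not seen.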
Create split models. Builds a separate model for each possible value of input fields that are specified as split fields. See Building Split Models for more information.
Rank models by. Specifies the criteria used to compare models.
- Correlation. The Pearson correlation between the observed value for each record and the value predicted by the model. Correlation measures the linear association between two variables; values range from –1, for a perfect negative relationship, through 0, indicating no linear relationship, to +1, for a perfect positive relationship. A model with a negative correlation therefore ranks lowest of all.
- Number of fields. The number of fields used as predictors in the model. Choosing models that use fewer fields may streamline data preparation and improve performance in some cases.
- Relative error. The relative error is the ratio of the variance of the observed values from those predicted by the model to the variance of the observed values from the mean. In practical terms, it compares how well the model performs relative to a null or intercept model that simply returns the mean value of the target field as the prediction. For a good model, this value should be less than 1, indicating that the model is more accurate than the null model. A model with a relative error greater than 1 is less accurate than the null model and is therefore not useful. For linear regression models, the relative error equals 1 minus the square of the correlation and adds no new information. For nonlinear models, the relative error is unrelated to the correlation and provides an additional measure for assessing model performance.
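To make the ranking criteria concrete, here is a minimal Python sketch of the correlation and relative-error computations. The observed and predicted values are invented for illustration and are not produced by the node:

```python
from math import sqrt

def pearson(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = sqrt(sum((o - mo) ** 2 for o in observed))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)

def relative_error(observed, predicted):
    """Squared error of the model divided by that of a mean-only null model."""
    mo = sum(observed) / len(observed)
    model_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    null_err = sum((o - mo) ** 2 for o in observed)   # predict the mean
    return model_err / null_err

obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.1, 1.9, 3.2, 3.8, 5.0]   # hypothetical model predictions
```

For these values the correlation is close to +1 and the relative error is well below 1, indicating a model that is far more accurate than simply predicting the mean.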
Rank models using. If a partition is in use, you can specify whether ranks are based on the training partition or the testing partition. With large datasets, use of a partition for preliminary screening of models may greatly improve performance.
Number of models to use. Specifies the maximum number of models to be shown in the model nugget produced by the node. The top-ranking models are listed according to the specified ranking criterion. Increasing this limit will enable you to compare results for more models but may slow performance. The maximum allowable value is 100.
Calculate predictor importance. For models that produce an appropriate measure of importance, you can display a chart that indicates the relative importance of each predictor in estimating the model. Typically you will want to focus your modeling efforts on the predictors that matter most, and consider dropping or ignoring those that matter least. Note that predictor importance may extend the time needed to calculate some models, and is not recommended if you simply want a broad comparison across many different models. It is more useful once you have narrowed your analysis to a handful of models that you want to explore in greater detail. See Predictor Importance for more information.
Calculate ensemble distribution graph. Controls whether ensemble distribution graphs are included in the generated auto model output. When turned off, auto modeling performance is improved.
Do not keep models if. Specifies threshold values for correlation, relative error, and number of fields used. A model that fails any one of these criteria is discarded and will not be listed in the summary report.
- Correlation less than. The minimum correlation (in terms of absolute value) for a model to be included in the summary report.
- Number of fields used is greater than. The maximum number of fields to be used by any model to be included.
- Relative error is greater than. The maximum relative error for any model to be included.
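The discard-then-rank behavior described above can be sketched as follows; the model records, metric values, and thresholds are hypothetical illustrations, not node output:

```python
# Hypothetical candidate models with the three screening metrics.
models = [
    {"name": "Regression", "correlation": 0.92, "fields": 4, "relative_error": 0.18},
    {"name": "Neural Net", "correlation": 0.95, "fields": 9, "relative_error": 0.12},
    {"name": "CHAID",      "correlation": 0.41, "fields": 3, "relative_error": 0.87},
]

def keep(m, min_corr=0.5, max_fields=10, max_rel_err=0.8):
    """A model failing any one threshold is discarded."""
    return (abs(m["correlation"]) >= min_corr
            and m["fields"] <= max_fields
            and m["relative_error"] <= max_rel_err)

# Discard failing models, then rank the survivors by the chosen
# criterion (correlation here) and keep the top N.
ranked = sorted((m for m in models if keep(m)),
                key=lambda m: m["correlation"], reverse=True)
top = ranked[:2]   # the "Number of models to use" limit
```

Note that the correlation threshold is applied to the absolute value, matching the "in terms of absolute value" wording above.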
Optionally, you can configure the node to stop execution the first time a model is generated that meets all specified criteria. See the topic Automated Modeling Node Stopping Rules for more information.