Auto Numeric node

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods, enabling you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options. For example, you could predict housing values using neural net, linear regression, C&RT, and CHAID models to see which performs best, and you could try out different combinations of stepwise, forward, and backward regression methods. The node explores every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best for use in scoring or further analysis.

A municipality wants to more accurately estimate real estate taxes and to adjust values for specific properties as needed without having to inspect every property. Using the Auto Numeric node, the analyst can generate and compare a number of models that predict property values based on building type, neighborhood, size, and other known factors.
The node requires a single target field (with the role set to Target) and at least one input field (with the role set to Input). The target must be a continuous (numeric range) field, such as age or income. Input fields can be continuous or categorical, with the limitation that some inputs may not be appropriate for some model types. For example, C&R Tree models can use categorical string fields as inputs, while linear regression models cannot use these fields and will ignore them if specified. The requirements are the same as when using the individual modeling nodes. For example, a CHAID model works the same whether generated from the CHAID node or the Auto Numeric node.
Frequency and weight fields
Frequency and weight fields give extra importance to some records over others. A weight field is useful when, for example, the user knows that the build dataset under-represents a section of the parent population. A frequency field is useful when one record represents a number of identical cases. If specified, a frequency field can be used by the C&R Tree and CHAID algorithms. A weight field can be used by the C&R Tree, CHAID, Regression, and GenLin algorithms. Other model types ignore these fields and build the models anyway. Frequency and weight fields are used only for model building and are not considered when evaluating or scoring models.
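As a small worked illustration of what "one record represents a number of identical cases" means, the sketch below compares a frequency-weighted mean with the mean of the same data written out one row per case. The values and frequencies are made up for illustration, and the sketch says nothing about how any particular algorithm uses these fields internally.

```python
from statistics import mean

# Hypothetical records: (property value, frequency).
# A frequency of 3 means the record stands for 3 identical cases.
records = [(100.0, 3), (200.0, 1)]

# Write every case out as its own row.
expanded = [value for value, freq in records for _ in range(freq)]

# Compute the same statistic directly from the compact form.
total_cases = sum(freq for _, freq in records)
frequency_weighted_mean = sum(value * freq for value, freq in records) / total_cases

print(mean(expanded), frequency_weighted_mean)
```

Both expressions describe the same four cases, so they yield the same mean.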
If you attach a Table node to the model nugget for the Auto Numeric node, the table contains several new fields whose names begin with a $ prefix.
The names of the fields that are generated during scoring are based on the target field, but with a standard prefix. Different model types use different sets of prefixes.
For example, $G, $R, and $C are the prefixes for predictions that are generated by the Generalized Linear model, CHAID model, and C5.0 model, respectively. $X typically indicates a field that is generated by an ensemble, and $XR, $XS, and $XF are the ensemble prefixes used when the target field is a Continuous, Categorical, or Flag field, respectively.
Prefixes that end in E ($..E) are used for the prediction confidence of a Continuous target. For example, $XRE is the prefix for the ensemble prediction confidence of a Continuous target, and $GE is the prefix for the prediction confidence of a single Generalized Linear model.
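If the scored table is exported and inspected outside the product, the generated fields can be picked out by their $ prefix. The following is a minimal sketch, not product functionality. The column names are hypothetical, the helper `describe_generated_fields` is invented for illustration, and the sketch assumes that a hyphen separates the prefix from the target field name. The prefix meanings are only those listed above.

```python
# Prefix meanings taken from the text above; the list is not exhaustive.
PREFIX_MEANINGS = {
    "$G": "Generalized Linear prediction",
    "$R": "CHAID prediction",
    "$C": "C5.0 prediction",
    "$XR": "ensemble prediction (Continuous target)",
    "$XS": "ensemble prediction (Categorical target)",
    "$XF": "ensemble prediction (Flag target)",
    "$XRE": "ensemble prediction confidence (Continuous target)",
    "$GE": "Generalized Linear prediction confidence",
}

def describe_generated_fields(column_names, separator="-"):
    """Map each $-prefixed column to the meaning of its prefix.

    Assumption: the prefix and the target field name are joined by
    `separator`. Columns without a $ prefix are skipped.
    """
    described = {}
    for name in column_names:
        if not name.startswith("$"):
            continue
        prefix = name.split(separator, 1)[0]
        described[name] = PREFIX_MEANINGS.get(prefix, "unrecognized prefix")
    return described

# Hypothetical columns from a scored table with the target field "price".
columns = ["neighborhood", "size", "price", "$XR-price", "$XRE-price"]
result = describe_generated_fields(columns)
for field, meaning in result.items():
    print(field, "->", meaning)
```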

Supported model types

Supported model types include Neural Net, C&R Tree, CHAID, Regression, GenLin, Nearest Neighbor, SVM, XGBoost Linear, GLE, and XGBoost-AS.

Cross-validation settings

The node properties include cross-validation settings. Cross-validation is a technique for testing how well a machine learning model generalizes to unseen data, which helps to detect overfitting. It is also a resampling procedure that you can use to evaluate a model when you have limited data.

K-fold is a popular and easy way to perform cross-validation. It generally gives a less biased estimate of model performance than a single train/test partition, because every observation from the original dataset is used for testing exactly once and for training in the other k-1 rounds. The general procedure of k-fold cross-validation is as follows:
  1. Shuffle the dataset randomly.
  2. Split the dataset into k folds (groups).
  3. For each unique fold/group:
    1. Take the fold/group as a hold out or test dataset.
    2. Take the remaining groups as a training dataset.
    3. Fit a model on the training set and evaluate it on the test set.
    4. Retain the evaluation score and discard the model.
  4. Summarize the overall evaluation of the model using the retained k-fold evaluation scores.
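
The procedure above can be sketched in plain Python. This is an illustrative sketch only, not the node's implementation. The helper names `fit_line` and `k_fold_scores`, the toy data, and the choice of mean absolute error as the evaluation score are all assumptions made for the example, and a one-variable least-squares line stands in for whichever model is being evaluated.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a stand-in for any model."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def k_fold_scores(xs, ys, k=5, seed=0):
    """Return one mean-absolute-error score per fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)         # 1. shuffle the dataset
    folds = [indices[i::k] for i in range(k)]    # 2. split into k folds
    scores = []
    for test_idx in folds:                       # 3. each fold is held out once
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx])
        errors = [abs(ys[i] - (a + b * xs[i])) for i in test_idx]
        scores.append(mean(errors))              # keep the score, drop the model
    return scores

# Toy data: an exactly linear relationship, so every fold scores near zero.
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x for x in xs]

scores = k_fold_scores(xs, ys, k=5, seed=42)
overall = mean(scores)                           # 4. summarize across folds
print(len(scores), overall)
```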

Cross-validation is currently supported by the Auto Classifier node and the Auto Numeric node. Double-click the node to open its properties. When you select the Cross-validate option, the single train/test partition is disabled and the Auto nodes use k-fold cross-validation to evaluate the selected algorithms.

You can specify the Number of folds (K). The default is 5, with a range of 3 to 10. To keep the sampling repeatable during cross-validation, so that the final evaluation measures for generated models are consistent across executions, select the Repeatable Cross Validation partition assignment option. You can also set the Random seed to a specific value so that the resulting model is exactly reproducible. Alternatively, click Generate so that the same sequence of random values is always generated, in which case running the node always yields the same generated model.
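
To illustrate why a fixed seed makes the results repeatable, the sketch below assigns records to folds with a seeded random number generator and shows that two runs with the same seed produce the same assignment. The helper `assign_folds` is hypothetical and does not reflect how the node assigns partitions internally. It only mirrors the default of 5 folds and the range of 3 to 10 described above.

```python
import random

def assign_folds(n_records, k=5, seed=None):
    """Return a list giving the fold number (0..k-1) of each record.

    Hypothetical helper: k defaults to 5 and must lie between 3 and 10,
    mirroring the settings described in the text.
    """
    if not 3 <= k <= 10:
        raise ValueError("number of folds must be between 3 and 10")
    rng = random.Random(seed)       # a fixed seed gives a fixed shuffle
    order = list(range(n_records))
    rng.shuffle(order)
    fold_of = [0] * n_records
    for position, record in enumerate(order):
        fold_of[record] = position % k
    return fold_of

first = assign_folds(12, k=5, seed=1234)
second = assign_folds(12, k=5, seed=1234)
print(first == second)              # identical assignment on every run
```

With `seed=None` the generator is seeded from a system source, so the assignment can differ between runs and the evaluation measures can differ with it.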