TwoStep Cluster Node Model Options
Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that data from only the training partition is used to build the model.
Standardize numeric fields. By default, TwoStep will standardize all numeric input fields to the same scale, with a mean of 0 and a variance of 1. To retain the original scaling for numeric fields, deselect this option. Symbolic fields are not affected.
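As an illustration of what standardizing to a mean of 0 and a variance of 1 does, here is a minimal Python sketch. The function name `standardize` and the sample fields are hypothetical, and the use of the population standard deviation is an assumption; this page does not say which form TwoStep uses internally.

```python
import statistics

def standardize(values):
    """Rescale a numeric field to mean 0 and variance 1 (z-scores)."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divide by N).
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant field has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Two hypothetical fields on very different scales.
income = [20_000, 35_000, 50_000, 65_000, 80_000]
age = [22, 31, 40, 49, 58]

print(standardize(income))
print(standardize(age))
```

After standardization both fields contribute on the same scale, so a field measured in large units (such as income) does not dominate the distance calculations.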
Exclude outliers. If you select this option, records that don't seem to fit into a substantive cluster will be automatically excluded from the analysis. This prevents such cases from distorting the results.
Outlier detection occurs during the preclustering step. When this option is selected, subclusters with few records relative to other subclusters are considered potential outliers, and the tree of subclusters is rebuilt excluding those records. The size below which subclusters are considered to contain potential outliers is controlled by the Percentage option. Some of those potential outlier records can be added to the rebuilt subclusters if they are similar enough to any of the new subcluster profiles. The rest of the potential outliers that cannot be merged are considered outliers and are added to a "noise" cluster and excluded from the hierarchical clustering step.
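The first part of this procedure, flagging small subclusters, can be sketched as follows. This is an illustration only: the function name and sample sizes are hypothetical, and measuring the Percentage against the largest subcluster is an assumption, since the text above says only "relative to other subclusters". The later step of merging similar records back into the rebuilt subclusters is not shown.

```python
def potential_outlier_subclusters(sizes, percentage):
    """Return the names of subclusters small enough to be potential outliers.

    sizes: mapping of subcluster name -> number of records
    percentage: the Percentage option, as a number from 0 to 100
    """
    # Assumption: the cutoff is a percentage of the largest subcluster's size.
    largest = max(sizes.values())
    cutoff = largest * percentage / 100.0
    return [name for name, n in sizes.items() if n < cutoff]

# Hypothetical subcluster sizes from the preclustering step.
sizes = {"A": 400, "B": 350, "C": 60, "D": 12}
print(potential_outlier_subclusters(sizes, 25))  # ['C', 'D']
```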
When scoring data with a TwoStep model that uses outlier handling, new cases that are more than a certain threshold distance (based on the log-likelihood) from the nearest substantive cluster are considered outliers and are assigned to the "noise" cluster with the name -1.
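The scoring rule can be pictured with a short Python sketch. The names here are hypothetical, and plain Euclidean distance to a cluster center is used only as a stand-in: the actual threshold is based on the log-likelihood, and this page does not give its value.

```python
import math

NOISE_CLUSTER = -1  # name of the "noise" cluster, as stated above

def score_record(record, cluster_centers, threshold, distance=math.dist):
    """Assign a record to its nearest cluster, or to the noise cluster.

    cluster_centers: mapping of cluster id -> center coordinates
    threshold: maximum allowed distance to the nearest cluster (assumed value)
    distance: stand-in distance function (Euclidean by default)
    """
    nearest_id, nearest_distance = min(
        ((cid, distance(record, center)) for cid, center in cluster_centers.items()),
        key=lambda pair: pair[1],
    )
    return NOISE_CLUSTER if nearest_distance > threshold else nearest_id

centers = {1: (0.0, 0.0), 2: (10.0, 10.0)}
print(score_record((1.0, 1.0), centers, threshold=3.0))  # 1
print(score_record((5.0, 5.0), centers, threshold=3.0))  # -1
```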
Cluster label. Specify the format for the generated cluster membership field. Cluster membership can be indicated as a String with the specified Label prefix (for example, "Cluster 1", "Cluster 2", and so on) or as a Number.
Automatically calculate number of clusters. TwoStep can rapidly evaluate a large number of cluster solutions and choose the optimal number of clusters for the training data. Specify the range of solutions to try by setting the Minimum and the Maximum number of clusters.
Specify number of clusters. If you know how many clusters to include in your model, select this option and enter the number of clusters.
Distance measure. This selection determines how the similarity between two clusters is computed.
- Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.
- Euclidean. The Euclidean measure is the "straight line" distance between two clusters. It can be used only when all of the variables are continuous.
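The two measures can be compared with a small Python sketch. Everything here is illustrative: the function names and data layout are hypothetical, the Euclidean version assumes the distance is taken between cluster centers, and the log-likelihood version follows the form commonly given in descriptions of the TwoStep algorithm (the decrease in log-likelihood when two clusters are merged) rather than anything stated on this page.

```python
import math
from collections import Counter
from statistics import fmean, pvariance

def euclidean_distance(cluster_a, cluster_b):
    """Straight-line distance between two cluster centers.
    Each cluster is a list of records; each record is a tuple of numbers."""
    center_a = [fmean(col) for col in zip(*cluster_a)]
    center_b = [fmean(col) for col in zip(*cluster_b)]
    return math.dist(center_a, center_b)

def _xi(records, field_variances):
    """Log-likelihood term for one cluster (assumed form).
    Each record is (continuous_values, categorical_values)."""
    n = len(records)
    total = 0.0
    # Continuous fields: normal model, variance padded with the overall variance.
    for col, overall_var in zip(zip(*(r[0] for r in records)), field_variances):
        total += 0.5 * math.log(overall_var + pvariance(col))
    # Categorical fields: multinomial model, contributes the entropy.
    for col in zip(*(r[1] for r in records)):
        total -= sum((c / n) * math.log(c / n) for c in Counter(col).values())
    return -n * total

def log_likelihood_distance(cluster_a, cluster_b, field_variances):
    """Decrease in log-likelihood caused by merging the two clusters."""
    merged = cluster_a + cluster_b
    return (_xi(cluster_a, field_variances) + _xi(cluster_b, field_variances)
            - _xi(merged, field_variances))

# Euclidean: continuous fields only.
print(euclidean_distance([(0, 0), (2, 0)], [(4, 4), (6, 4)]))

# Log-likelihood: one continuous field and one categorical field per record.
a = [((0.0,), ("x",)), ((1.0,), ("x",)), ((2.0,), ("x",))]
b = [((10.0,), ("y",)), ((11.0,), ("y",)), ((12.0,), ("y",))]
c = [((1.0,), ("x",)), ((2.0,), ("x",)), ((3.0,), ("x",))]
overall = [1.0]  # hypothetical overall variance of the (standardized) field
print(log_likelihood_distance(a, b, overall))
print(log_likelihood_distance(a, c, overall))
```

In this sketch, cluster `a` is much farther from `b` than from `c`, because merging `a` with `b` mixes both very different numeric values and different categories.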
Clustering criterion. This selection controls which criterion the automatic clustering algorithm uses to determine the number of clusters. You can specify either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
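To show how the two criteria differ, the sketch below applies the standard definitions (AIC = -2 ln L + 2m, BIC = -2 ln L + m ln N, where m is the number of parameters and N the number of records) to a set of candidate solutions. The log-likelihoods and parameter counts are made-up values, and simply taking the minimum is an assumption; this page does not describe the exact selection rule the node applies.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_records):
    """Bayesian Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_records)

# Hypothetical candidates: number of clusters -> (log-likelihood, parameter count)
candidates = {
    2: (-1250.0, 8),
    3: (-1180.0, 12),
    4: (-1172.0, 16),
    5: (-1170.0, 20),
}
n_records = 500

best_by_bic = min(candidates, key=lambda k: bic(*candidates[k], n_records))
best_by_aic = min(candidates, key=lambda k: aic(*candidates[k]))
print(best_by_bic, best_by_aic)  # 3 4
```

With these made-up numbers BIC selects three clusters and AIC selects four, because BIC penalizes each additional parameter more heavily once the data set has more than a handful of records.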