The Expert tab of the Auto Classifier node enables you to apply a partition
(if available), select the algorithms to use, and specify stopping rules.
Select models. By default, all models are selected
to be built; however, if you have
Analytic Server, you can
choose to restrict the models to those that can run on
Analytic Server and
preset them so that they either build split models or are ready to process very large data
sets.
Note: Local building of Analytic Server models
within the Auto Classifier node is not
supported.
Models used. Use the check boxes in the column on the left to
select the model types (algorithms) to include in the comparison. The more types you select, the
more models will be created and the longer the processing time will be.
Model type. Lists the available algorithms (see below).
Model parameters. For each model type, you can use the default
settings or select Specify to choose options for each model type. The
specific options are similar to those available in the separate modeling nodes, with the difference
that multiple options or combinations can be selected. For example, if comparing Neural Net models,
rather than choosing one of the six training methods, you can choose all of them to train six models
in a single pass.
Number of models. Lists the number of models produced for each
algorithm based on current settings. When combining options, the number of models can quickly add
up, so pay close attention to this number, particularly when using large data sets.
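For example, choosing two kernel types and three values of a regularization parameter for a single
model type already yields 2 x 3 = 6 models, and the totals multiply across every option you combine.
A quick back-of-the-envelope check in Python (the option grids here are illustrative, not actual
node settings):

    from math import prod

    # Hypothetical option grids; the actual choices come from each
    # model type's Specify dialog.
    svm_options = {"kernel": ["RBF", "polynomial"], "C": [1, 5, 10]}
    neural_net_options = {"training_method": ["Quick", "Dynamic", "Multiple",
                                              "Prune", "RBFN", "Exhaustive prune"]}

    def count_models(options):
        # One model per combination of option values.
        return prod(len(values) for values in options.values())

    print(count_models(svm_options))         # 2 * 3 = 6 models
    print(count_models(neural_net_options))  # 6 models, one per training method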
Restrict maximum time spent building a single model. (K-Means,
Kohonen, TwoStep, SVM, KNN, Bayes Net and Decision List models only) Sets a maximum time limit for
any one model. For example, if a particular model requires an unexpectedly long time to train
because of some complex interaction, you probably don't want it to hold up your entire modeling
run.
Note: If the target is a nominal (set) field, the Decision List option is unavailable.
Supported Algorithms
The Support Vector Machine (SVM) node enables you to classify data into one of two groups
without overfitting. SVM works well with wide data sets, such as those with a very large number of
input fields.
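SPSS Modeler builds these models internally, but the underlying technique is easy to demonstrate. A
minimal sketch with scikit-learn (a stand-in library, not the SPSS node) on a deliberately wide
data set:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # A "wide" data set: far more input fields than records.
    X, y = make_classification(n_samples=60, n_features=500,
                               n_informative=10, random_state=0)

    # The regularization parameter C trades training accuracy against a
    # wider margin, which is what guards against overfitting.
    model = SVC(kernel="rbf", C=1.0).fit(X, y)
    print(model.score(X, y))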
The k-Nearest Neighbor (KNN) node associates a new case with the category or value of
the k objects nearest to it in the predictor space, where k is an integer. Similar
cases are near each other and dissimilar cases are distant from each other.
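A minimal scikit-learn sketch of the same idea (illustration only, not the SPSS node):

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # k = 5: a new case receives the majority class of the five
    # training cases nearest to it in the predictor space.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict(X[:3]))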
Discriminant analysis makes more stringent assumptions than logistic regression but can be a
valuable alternative or supplement to a logistic regression analysis when those assumptions are met.
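For illustration, linear discriminant analysis in scikit-learn (a stand-in, not the SPSS node):

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = make_classification(n_samples=200, random_state=0)

    # LDA's stringent assumptions: normally distributed predictors with
    # a common covariance matrix across groups.
    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.score(X, y))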
The Bayesian Network node enables you to build a probability model by combining observed and
recorded evidence with real-world knowledge to establish the likelihood of occurrences. The node
focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for
classification.
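scikit-learn offers no TAN or Markov Blanket learner, so as a loose illustration only, here is
plain naive Bayes, the simpler model that TAN extends by allowing a tree of dependencies between
predictors:

    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=200, random_state=0)

    # Naive Bayes assumes predictors are independent given the class;
    # TAN relaxes exactly this assumption.
    nb = GaussianNB().fit(X, y)
    print(nb.predict_proba(X[:2]))  # estimated likelihoods per class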
The Decision List node identifies subgroups, or segments, that show a higher or lower
likelihood of a given binary outcome relative to the overall population. For example, you might look
for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You
can incorporate your business knowledge into the model by adding your own custom segments and
previewing alternative models side by side to compare the results. Decision List models consist of a
list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the
first rule that matches determines the outcome.
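A toy sketch of the rule-list mechanics in plain Python (invented field names and rules, not output
from the node):

    # Each rule pairs a condition with an outcome. Rules are applied in
    # order; the first condition that matches determines the outcome.
    rules = [
        (lambda c: c["tenure_months"] > 48, "unlikely to churn"),
        (lambda c: c["complaints"] >= 3,    "likely to churn"),
    ]

    def classify(case, default="no prediction"):
        for condition, outcome in rules:
            if condition(case):
                return outcome  # first matching rule wins
        return default          # the remainder, covered by no segment

    print(classify({"tenure_months": 60, "complaints": 0}))  # unlikely to churn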
Logistic regression is a statistical technique for classifying records based on values of
input fields. It is analogous to linear regression but takes a categorical target field instead of a
numeric range.
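A minimal scikit-learn illustration (not the SPSS node):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=0)

    # As in linear regression, the model is a weighted sum of the input
    # fields, but the result is mapped to a probability for each category.
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba(X[:2]))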
The CHAID node generates decision trees using chi-square statistics to identify optimal
splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that
some splits have more than two branches. Target and input fields can be numeric range (continuous)
or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of
examining all possible splits but takes longer to compute.
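CHAID itself ships in SPSS Modeler rather than in common Python libraries, but its core step,
scoring a candidate split with a chi-square test of independence, can be sketched with pandas and
SciPy (illustration only, toy data):

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        "region":  ["north", "north", "south", "east", "east", "south"],
        "churned": [1, 1, 0, 0, 1, 0],
    })

    # Cross-tabulate a candidate predictor against the target and test for
    # independence; CHAID favors the split with the strongest association.
    # A three-category predictor like this one can yield a three-way split.
    table = pd.crosstab(df["region"], df["churned"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)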
The QUEST node provides a binary classification method for building decision trees, designed
to reduce the processing time required for large C&R Tree analyses while also reducing the
tendency found in classification tree methods to favor inputs that allow more splits. Input fields
can be numeric ranges (continuous), but the target field must be categorical. All splits are binary.
The Classification and Regression (C&R) Tree node generates a decision tree that allows
you to predict or classify future observations. The method uses recursive partitioning to split the
training records into segments by minimizing the impurity at each step, where a node in the tree is
considered “pure” if 100% of cases in the node fall into a specific category of the target field.
Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all
splits are binary (only two subgroups).
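For illustration, the same recursive-partitioning idea with scikit-learn (a stand-in, not the SPSS
node):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # criterion="gini" minimizes impurity at each step; every split in the
    # printed tree has exactly two branches.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=2,
                                  random_state=0).fit(X, y)
    print(export_text(tree))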
The C5.0 node builds either a decision tree or a rule set. The model works by splitting the
sample based on the field that provides the maximum information gain at each level. The target field
must be categorical. Multiple splits into more than two subgroups are allowed.
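C5.0 is proprietary, so as a rough stand-in for its splitting criterion, an entropy-based
scikit-learn tree (which, unlike C5.0, splits only into two subgroups):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # criterion="entropy" selects the split with the maximum information
    # gain at each level, the criterion the C5.0 description refers to.
    model = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(model.get_depth())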
The Neural Net node uses a simplified model of the way the human brain processes information.
It works by simulating a large number of interconnected simple processing units that resemble
abstract versions of neurons. Neural networks are powerful general function estimators and require
minimal statistical or mathematical knowledge to train or apply.
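A small multilayer perceptron in scikit-learn gives the flavor (illustration only, not the SPSS
node):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    # Two hidden layers of simple interconnected units; training tunes the
    # connection weights, with no distributional assumptions to check.
    net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500,
                        random_state=0).fit(X, y)
    print(net.score(X, y))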
Linear regression models predict a continuous target based on linear relationships
between the target and one or more predictors.
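For illustration (synthetic data and scikit-learn, not the SPSS node):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # The fitted coefficients recover the linear relationship in the data.
    reg = LinearRegression().fit(X, y)
    print(reg.coef_)  # approximately [ 3.0, -2.0 ]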
The Linear Support Vector Machine (LSVM) node enables you to classify data into one of two
groups without overfitting. LSVM is linear and works well with wide data sets, such as those with a
very large number of records.
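Again as a generic stand-in (scikit-learn, not the node), a linear SVM; dropping the kernel
computation is what lets it scale to many records:

    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Many records, linear decision boundary.
    X, y = make_classification(n_samples=20_000, n_features=20,
                               random_state=0)
    lsvm = LinearSVC().fit(X, y)
    print(lsvm.score(X, y))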
The Random Trees node is similar to the existing C&RT node; however, the Random Trees
node is designed to process big data to create a single tree and displays the resulting model in the
output viewer that was added in SPSS® Modeler version 17. The Random Trees
node generates a decision tree that you use to predict or classify future observations. The
method uses recursive partitioning to split the training records into segments by minimizing the
impurity at each step, where a node in the tree is considered pure if 100% of cases in
the node fall into a specific category of the target field. Target and input fields can be numeric
ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two
subgroups).
The Tree-AS node is similar to the existing CHAID node; however, the Tree-AS node is designed
to process big data to create a single tree and displays the resulting model in the output viewer
that was added in SPSS Modeler version
17. The node generates a decision tree by using chi-square statistics (CHAID) to identify optimal
splits. This use of CHAID can generate nonbinary trees, meaning that some splits have more than two
branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID
is a modification of CHAID that does a more thorough job of examining all possible splits but takes
longer to compute.
XGBoost Tree© is an advanced implementation of a gradient boosting algorithm
with a tree model as the base model. Boosting algorithms iteratively learn weak classifiers and then
add them to a final strong classifier. XGBoost Tree is very flexible and provides many parameters
that can be overwhelming to most users, so the XGBoost Tree node in SPSS Modeler exposes the core features and
commonly used parameters. The node is implemented in Python.
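As an illustration using the open-source xgboost Python package directly (shown on its own, not
through the node; the parameter values here are illustrative, not the node's defaults):

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Gradient boosting: each new tree is a weak learner fitted to the
    # residual errors of the ensemble built so far.
    booster = XGBClassifier(n_estimators=100, max_depth=3,
                            learning_rate=0.1).fit(X, y)
    print(booster.predict_proba(X[:2]))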
XGBoost© is an advanced implementation of a gradient boosting algorithm.
Boosting algorithms iteratively learn weak classifiers and then add them to a final strong
classifier. XGBoost is very flexible and provides many parameters that can be overwhelming to most
users, so the XGBoost-AS node in SPSS Modeler exposes the core features and
commonly used parameters. The XGBoost-AS node is implemented in Spark.
Note: If you select Tree-AS to run on Analytic Server, it will fail to build a model when there is
a Partition node upstream. In this case, to make Auto Classifier work with other modeling nodes on
Analytic Server, deselect the Tree-AS model type.