IBM® SPSS® Modeler offers a variety of modeling
methods taken from machine learning, artificial intelligence, and statistics. The methods available
on the Modeling palette allow you to derive new information from your data and to develop predictive
models. Each method has certain strengths and is best suited for particular types of problems.
The
IBM SPSS Modeler Applications Guide provides
examples for many of these methods, along with a general introduction to the modeling process. This
guide is available as an online tutorial, and also in PDF format.
Modeling methods are divided into these categories:
- Supervised
- Association
- Segmentation
Supervised Models
Supervised models use the values of one or more input fields to predict the value of
one or more output, or target, fields. Examples of these techniques include decision trees (the C&R Tree, QUEST, CHAID, and C5.0 algorithms), regression (linear, logistic, generalized linear, and Cox regression algorithms), neural networks, support vector machines, and Bayesian networks.
Supervised models help organizations to predict a known result, such as whether a customer
will buy or leave or whether a transaction fits a known pattern of fraud. Modeling techniques
include machine learning, rule induction, subgroup identification, statistical methods, and multiple
model generation.
Supervised nodes
The Auto Classifier node creates and compares a number of different models for binary
outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for
a given analysis. A number of modeling algorithms are supported, making it possible to select the
methods you want to use, the specific options for each, and the criteria for comparing the results.
The node generates a set of models based on the specified options and ranks the best candidates
according to the criteria you specify.
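As an illustration of the idea (outside SPSS Modeler), the following minimal Python sketch trains several candidate binary classifiers and ranks them by a comparison criterion, here cross-validated accuracy. The scikit-learn estimators are stand-ins for Modeler's own algorithms.

# Train several candidate models on the same binary target and rank them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "svm": SVC(),
}

# Rank candidates by mean cross-validated accuracy (the comparison criterion).
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")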
The Auto Numeric node estimates and compares models for continuous numeric range outcomes
using a number of different methods. The node works in the same manner as the Auto Classifier node,
allowing you to choose the algorithms to use and to experiment with multiple combinations of options
in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear
regression, generalized linear regression, and support vector machines (SVM). Models can be compared
based on correlation, relative error, or number of variables used.
The Classification and Regression (C&R) Tree node generates a decision tree that allows
you to predict or classify future observations. The method uses recursive partitioning to split the
training records into segments by minimizing the impurity at each step, where a node in the tree is
considered “pure” if 100% of cases in the node fall into a specific category of the target field.
Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all
splits are binary (only two subgroups).
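For illustration, scikit-learn's DecisionTreeClassifier implements CART, the same family of algorithm: binary splits chosen to reduce impurity. This sketch is a stand-in, not the Modeler node itself.

# Grow a small CART tree: each split minimizes Gini impurity and is binary.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))  # every split is one test with exactly two branches

Setting criterion="entropy" instead gives splitting based on information gain, which is closer in spirit to the C5.0 node described below, although C5.0 additionally allows multiway splits.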
The QUEST node provides a binary classification method for building decision trees, designed
to reduce the processing time required for large C&R Tree analyses while also reducing the
tendency found in classification tree methods to favor inputs that allow more splits. Input fields
can be numeric ranges (continuous), but the target field must be categorical. All splits are binary.
The CHAID node generates decision trees using chi-square statistics to identify optimal
splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that
some splits have more than two branches. Target and input fields can be numeric range (continuous)
or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of
examining all possible splits but takes longer to compute.
The C5.0 node builds either a decision tree or a rule set. The model works by splitting the
sample based on the field that provides the maximum information gain at each level. The target field
must be categorical. Multiple splits into more than two subgroups are allowed.
The Decision List node identifies subgroups, or segments, that show a higher or lower
likelihood of a given binary outcome relative to the overall population. For example, you might look
for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You
can incorporate your business knowledge into the model by adding your own custom segments and
previewing alternative models side by side to compare the results. Decision List models consist of a
list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the
first rule that matches determines the outcome.
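The first-match semantics can be illustrated in a few lines of Python; the rules and field names below are invented for the example.

# A decision list: try each rule in order; the first match decides the outcome.
rules = [
    (lambda r: r["tenure_months"] > 48 and r["complaints"] == 0, "low churn risk"),
    (lambda r: r["contract"] == "month-to-month", "high churn risk"),
]
default_outcome = "average churn risk"  # the remainder segment

def apply_decision_list(record):
    for condition, outcome in rules:
        if condition(record):
            return outcome
    return default_outcome

print(apply_decision_list({"tenure_months": 60, "complaints": 0,
                           "contract": "annual"}))  # -> low churn risk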
Linear regression models predict a continuous target based on linear relationships
between the target and one or more predictors.
The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of
your data. Principal components analysis (PCA) finds linear combinations of the input fields that do
the best job of capturing the variance in the entire set of fields, where the components are
orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors
that explain the pattern of correlations within a set of observed fields. For both approaches, the
goal is to find a small number of derived fields that effectively summarize the information in the original set of fields.
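A minimal PCA sketch with scikit-learn (a stand-in for the node) shows how a few orthogonal components can summarize many inputs:

# Derive 3 orthogonal components that capture most of the variance in 12 fields.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))          # 200 records, 12 input fields
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]       # introduce some correlation

pca = PCA(n_components=3)
components = pca.fit_transform(X)       # 3 derived fields per record
print(pca.explained_variance_ratio_)    # share of variance each component captures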
The Feature Selection node screens input fields for removal based on a set of criteria (such
as the percentage of missing values); it then ranks the importance of remaining inputs relative to a
specified target. For example, given a data set with hundreds of potential inputs, which are most
likely to be useful in modeling patient outcomes?
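The two stages can be sketched with pandas and scikit-learn as stand-ins: screen fields by their percentage of missing values, then rank the survivors by an importance measure relative to the target (here mutual information; the node's own measures differ).

# Stage 1: screen by missing percentage. Stage 2: rank survivors vs. a target.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.DataFrame(np.random.default_rng(0).normal(size=(300, 6)),
                  columns=[f"x{i}" for i in range(6)])
df.loc[df.sample(frac=0.6, random_state=0).index, "x5"] = np.nan  # mostly missing
target = (df["x0"] + df["x1"] > 0).astype(int)

# Stage 1: drop fields whose percentage of missing values exceeds a threshold.
kept = [c for c in df.columns if df[c].isna().mean() <= 0.5]

# Stage 2: rank remaining inputs by mutual information with the target.
scores = mutual_info_classif(df[kept].fillna(0), target, random_state=0)
print(sorted(zip(kept, scores), key=lambda kv: -kv[1]))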
Discriminant analysis makes more stringent assumptions than logistic regression but can be a
valuable alternative or supplement to a logistic regression analysis when those assumptions are met.
Logistic regression is a statistical technique for classifying records based on values of
input fields. It is analogous to linear regression but takes a categorical target field instead of a
numeric range.
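A minimal scikit-learn sketch of logistic regression on a binary target (a stand-in for the node):

# Classify records and report per-class probabilities for a binary target.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))  # class probabilities for each record
print(clf.score(X_test, y_test))      # classification accuracy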
The Generalized Linear model expands the general linear model so that the dependent variable
is linearly related to the factors and covariates through a specified link function. Moreover, the
model allows the dependent variable to have a non-normal distribution. It covers the functionality of a wide range of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models.
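As a sketch of the same idea, statsmodels (a stand-in) can fit a generalized linear model with a Poisson family, whose default log link relates the expected count linearly to the predictors:

# GLM with a non-normal (Poisson) target and a log link function.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 covariates
mu = np.exp(X @ np.array([0.3, 0.5, -0.2]))      # log link: log(mu) = X @ beta
y = rng.poisson(mu)                              # count outcome

result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.summary())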
A generalized linear mixed model (GLMM) extends the linear model so that the target can have
a non-normal distribution, is linearly related to the factors and covariates via a specified link
function, and so that the observations can be correlated. Generalized linear mixed models cover a
wide variety of models, from simple linear regression to complex multilevel models for non-normal
longitudinal data.
The Cox regression node enables you to build a survival model for time-to-event data in the
presence of censored records. The model produces a survival function that predicts the probability that the event of interest has not yet occurred at a given time (t) for given values of the input variables.
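A minimal sketch using the third-party lifelines package (an assumption for illustration; Modeler does not use lifelines). The bundled rossi example data set has a duration column (week) and an event indicator (arrest):

# Fit a Cox proportional hazards model on time-to-event data with censoring.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()  # hazard ratios for each input variable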
The Support Vector Machine (SVM) node enables you to classify data into one of two groups
without overfitting. SVM works well with wide data sets, such as those with a very large number of
input fields.
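A minimal scikit-learn sketch (a stand-in for the node) on deliberately wide data:

# SVM classification on "wide" data: many input fields, few records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # C and the kernel control the margin
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))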
The Bayesian Network node enables you to build a probability model by combining observed and
recorded evidence with real-world knowledge to establish the likelihood of occurrences. The node
focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for
classification.
The Self-Learning Response Model (SLRM) node enables you to build a model in which a single
new case, or small number of new cases, can be used to reestimate the model without having to
retrain the model using all data.
The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated
Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data
and produces forecasts of future performance. This Time Series node is similar to the previous Time
Series node that was deprecated in SPSS Modeler version 18. However, this
newer Time Series node is designed to harness the power of IBM SPSS Analytic Server to process
big data, and display the resulting model in the output viewer that was added in SPSS Modeler version 17.
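As a sketch of one of the model families the node estimates, a univariate ARIMA model can be fit with statsmodels (a stand-in) and used to forecast:

# Fit ARIMA(1,1,1) to a monthly series and forecast the next 12 periods.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=120)),
                   index=pd.date_range("2015-01-01", periods=120, freq="MS"))

fit = ARIMA(series, order=(1, 1, 1)).fit()  # AR(1), first difference, MA(1)
print(fit.forecast(steps=12))               # 12 months of future forecasts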
The k-Nearest Neighbor (KNN) node associates a new case with the category or value of
the k objects nearest to it in the predictor space, where k is an integer. Similar
cases are near each other and dissimilar cases are distant from each other.
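A minimal scikit-learn sketch (a stand-in for the node) with k = 5:

# Classify each new case by the majority category of its 5 nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))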
The Spatio-Temporal Prediction (STP) node uses data that contains location information, input fields for prediction (predictors), a time field, and a target field. Each location has numerous rows in the data that represent the values of each predictor at each time of measurement. After the data is analyzed, it can be used to predict target values at any location within the shape data that is used in the analysis.
Association Models
Association models find patterns in your data where one or more entities (such as events,
purchases, or attributes) are associated with one or more other entities. The models construct rule
sets that define these relationships. Here the fields within the data can act as both inputs and
targets. You could find these associations manually, but association rule algorithms do so much more
quickly, and can explore more complex patterns. Apriori and CARMA models are examples of such algorithms. One other type of association model is a sequence detection model, which finds sequential patterns in time-structured data.
Association models are most useful when predicting multiple outcomes—for example,
customers who bought product X also bought Y and Z. Association models associate a particular
conclusion (such as the decision to buy something) with a set of conditions. The advantage of
association rule algorithms over the more standard decision tree algorithms (C5.0 and C&R Tree) is
that associations can exist between any of the attributes. A decision tree algorithm will build
rules with only a single conclusion, whereas association algorithms attempt to find many rules, each
of which may have a different conclusion.
Association nodes
The Apriori node extracts a set of rules from the data, pulling out the rules with the
highest information content. Apriori offers five different methods of selecting rules and uses a
sophisticated indexing scheme to process large data sets efficiently. For large problems, Apriori is
generally faster to train; it has no arbitrary limit on the number of rules that can be retained,
and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields
all be categorical but delivers better performance because it is optimized for this type of data.
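The quantities the algorithm works with, support and confidence, can be illustrated by brute force over a toy transaction list; Apriori itself prunes this search far more efficiently, and the items and thresholds below are invented for the example.

# Compute support and confidence for candidate one-item rules by brute force.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
for a, b in combinations(items, 2):
    for antecedent, consequent in (({a}, {b}), ({b}, {a})):
        rule_support = support(antecedent | consequent)
        confidence = rule_support / support(antecedent)
        if rule_support >= 0.6 and confidence >= 0.7:
            print(f"{antecedent} -> {consequent}  "
                  f"support={rule_support:.2f} confidence={confidence:.2f}")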
The CARMA model extracts a set of rules from the data without requiring you to specify input
or target fields. In contrast to
Apriori, the CARMA node offers
build settings for rule support (support for both antecedent and consequent) rather than just
antecedent support. This means that the rules generated can be used for a wider variety of
applications—for example, to find a list of products or services (antecedents) whose consequent is
the item that you want to promote this holiday season.
The Sequence node discovers association rules in sequential or time-oriented data. A sequence
is a list of item sets that tends to occur in a predictable order. For example, a customer who
purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The
Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass
method for finding sequences.
The Association Rules node is similar to the Apriori node; however, unlike Apriori, the Association Rules node can process list data. In addition, the Association Rules node can be used with IBM SPSS Analytic Server to process big data and take advantage of faster parallel processing.
Segmentation Models
Segmentation models divide the data into segments, or clusters, of records that have similar
patterns of input fields. Because segmentation models are concerned only with the input fields, they have no concept of output or target fields. Examples of segmentation models are Kohonen networks, K-Means clustering, TwoStep clustering, and anomaly detection.
Segmentation models (also known as "clustering models") are useful in cases where the
specific result is unknown (for example, when identifying new patterns of fraud, or when identifying
groups of interest in your customer base). Clustering models focus on identifying groups of similar
records and labeling the records according to the group to which they belong. This is done without
the benefit of prior knowledge about the groups and their characteristics, and it distinguishes
clustering models from the other modeling techniques in that there is no predefined output or target
field for the model to predict. There are no right or wrong answers for these models. Their value is
determined by their ability to capture interesting groupings in the data and provide useful
descriptions of those groupings. Clustering models are often used to create clusters or segments
that are then used as inputs in subsequent analyses (for example, by segmenting potential customers
into homogeneous subgroups).
Segmentation nodes
The Auto Cluster node estimates and compares clustering models, which identify groups of
records that have similar characteristics. The node works in the same manner as other automated
modeling nodes, allowing you to experiment with multiple combinations of options in a single
modeling pass. Models can be compared using basic measures that attempt to filter and rank the usefulness of the cluster models, as well as a measure based on the importance of particular fields.
The K-Means node clusters the data set into distinct groups (or clusters). The method defines
a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers
until further refinement can no longer improve the model. Instead of trying to predict an outcome,
k-means uses a process known as unsupervised learning to uncover patterns in the set of input
fields.
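A minimal scikit-learn sketch (a stand-in for the node):

# Cluster records into a fixed number of groups with no target field.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # labels unused

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)   # cluster membership for each record
print(km.cluster_centers_)   # final cluster centers after iteration stops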
The Kohonen node generates a type of neural network that can be used to cluster the data set
into distinct groups. When the network is fully trained, records that are similar should be close
together on the output map, while records that are different will be far apart. You can look at the
number of observations captured by each unit in the model nugget to identify the strong units. This
may give you a sense of the appropriate number of clusters.
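The core of the training loop, finding each record's best-matching unit and pulling that unit and its map neighbors toward the record, can be sketched in NumPy; this is an illustration, not the node's exact implementation:

# Minimal self-organizing map (Kohonen network) training loop.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # 500 records, 4 input fields
grid_h, grid_w = 6, 6                               # 6x6 output map
W = rng.normal(size=(grid_h, grid_w, X.shape[1]))   # one weight vector per unit

# Each unit's (row, col) position on the map, for neighborhood distances.
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

n_epochs = 20
for epoch in range(n_epochs):
    lr = 0.5 * (1 - epoch / n_epochs)                    # decaying learning rate
    radius = max(1.0, 3.0 * (1 - epoch / n_epochs))      # shrinking neighborhood
    for x in X:
        # Best-matching unit: the unit whose weights are closest to the record.
        dists = np.linalg.norm(W - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood: units near the BMU on the map move more.
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        W += lr * h[..., None] * (x - W)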
The TwoStep node uses a two-step clustering method. The first step makes a single pass
through the data to compress the raw input data into a manageable set of subclusters. The second
step uses a hierarchical clustering method to progressively merge the subclusters into larger and
larger clusters. TwoStep has the advantage of automatically estimating the optimal number of
clusters for the training data. It can handle mixed field types and large data sets efficiently.
The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to
patterns of “normal” data. With this node, it is possible to identify outliers even if they do not
fit any previously known patterns and even if you are not exactly sure what you are looking for.
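The Modeler node is clustering-based; in the same spirit, the following sketch (an illustration, not the node's exact algorithm) flags records that lie unusually far from their nearest cluster center:

# Score each record by its distance to the nearest cluster center; flag extremes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = np.vstack([X, [[12, -12], [-12, 12]]])   # inject two obvious outliers

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.min(km.transform(X), axis=1)       # distance to nearest center
threshold = dist.mean() + 3 * dist.std()
print(np.where(dist > threshold)[0])         # indices of anomalous records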
In-Database Mining Models
IBM SPSS Modeler supports integration with data
mining and modeling tools that are available from database vendors, including Oracle Data
Miner and
Microsoft Analysis Services. You can build, score, and store models inside the database—all from
within the IBM SPSS Modeler application. For
full details, see the
IBM SPSS Modeler In-Database Mining
Guide.
IBM SPSS Statistics Models
If you have a copy of IBM SPSS Statistics installed and
licensed on your computer, you can access and run certain IBM SPSS Statistics routines from
within IBM SPSS Modeler to build and score
models.