IBM® SPSS® Modeler offers a variety of modeling
methods taken from machine learning, artificial intelligence, and statistics. The methods available
on the Modeling palette allow you to derive new information from your data and to develop predictive
models. Each method has certain strengths and is best suited for particular types of problems.
The IBM SPSS Modeler Applications Guide provides
examples for many of these methods, along with a general introduction to the modeling process. This
guide is available as an online tutorial and also in PDF format.
Modeling methods are divided into these categories:
- Supervised
- Association
- Segmentation
Supervised Models
Supervised models use the values of one or more input fields to predict the value of
one or more output, or target, fields. Examples of these techniques include decision trees
(C&R Tree, QUEST, CHAID, and C5.0 algorithms), regression (linear, logistic, generalized linear,
and Cox regression algorithms), neural networks, support vector machines, and Bayesian networks.
Supervised models help organizations to predict a known result, such as whether a customer
will buy or leave or whether a transaction fits a known pattern of fraud. Modeling techniques
include machine learning, rule induction, subgroup identification, statistical methods, and multiple
model generation.
Supervised nodes

The Auto Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify.
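
To see the "build many models, rank them, keep the best" pattern in code, here is a minimal sketch in open-source Python with scikit-learn. It is not the Modeler node itself; the candidate algorithms, synthetic data, and accuracy criterion are illustrative assumptions.

```python
# A minimal sketch of automated model comparison for a binary target.
# Candidate algorithms, synthetic data, and the accuracy criterion are
# illustrative assumptions, not the node's actual defaults.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "svm": SVC(),
}

# Build every candidate, score it with 5-fold cross-validation, and
# rank the results, mirroring the node's build-and-rank workflow.
ranked = sorted(
    ((cross_val_score(model, X, y, cv=5).mean(), name)
     for name, model in candidates.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```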

The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods. The node works in the same manner as the Auto Classifier node, allowing you to choose the algorithms to use and to experiment with multiple combinations of options in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear regression, generalized linear regression, and support vector machines (SVM). Models can be compared based on correlation, relative error, or number of variables used.

The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
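
scikit-learn's DecisionTreeClassifier implements the same CART-style recursive binary partitioning, so a minimal sketch of the technique (not the Modeler node) looks like this; the depth limit and the iris data are illustrative choices.

```python
# A CART-style tree in scikit-learn: recursive binary partitioning that
# minimizes Gini impurity at each split.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Every split in the printed tree is binary: one test, two subgroups.
print(export_text(tree))
```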

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. Input fields can be numeric ranges (continuous), but the target field must be categorical. All splits are binary.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute.
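
There is no CHAID implementation in scikit-learn, but the core idea of scoring candidate splits by chi-square statistics can be sketched with pandas and SciPy. The field names and data below are hypothetical.

```python
# Sketch of CHAID's split-selection idea only: score each candidate
# predictor by the chi-square of its crosstab with the target.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "region":  ["north", "south", "north", "east", "south", "east"] * 20,
    "plan":    ["basic", "basic", "premium", "premium", "basic", "premium"] * 20,
    "churned": [1, 0, 0, 1, 0, 1] * 20,
})

# CHAID would split on the most significant predictor; unlike binary
# trees, it may then branch once per (merged) category.
for predictor in ["region", "plan"]:
    chi2, p, _, _ = chi2_contingency(pd.crosstab(df[predictor], df["churned"]))
    print(f"{predictor}: chi2={chi2:.2f}, p={p:.4f}")
```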

The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
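
C5.0 itself is proprietary; as a rough open-source analogue, an entropy-based scikit-learn tree uses the same maximum-information-gain selection principle (though scikit-learn trees remain binary, unlike C5.0's multiway splits).

```python
# An entropy-based tree splits on the field with maximum information
# gain at each level; note this analogue produces only binary splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(model.predict(X[:5]))  # predicted categories for the first five records
```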

The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome.

Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
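
As a minimal sketch of the underlying technique (not the Modeler node), the following fits a linear model to synthetic data with scikit-learn and recovers the known coefficients.

```python
# Fit a continuous target as a linear function of two predictors on
# synthetic data; the fitted coefficients recover the known values.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [3, -2] and 0
```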

The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields.
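
A minimal PCA sketch with scikit-learn shows the idea of replacing many correlated fields with a few orthogonal components; keeping two components is an illustrative choice.

```python
# Derive a few orthogonal components that capture most of the variance
# in the input fields.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

print(pca.explained_variance_ratio_)  # variance captured per component
X_reduced = pca.transform(X)          # 4 original fields -> 2 derived fields
```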

The Feature Selection node screens input fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes?
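
The screen-then-rank workflow can be sketched in open-source Python; the missing-value threshold and the mutual-information ranking below are illustrative stand-ins for the node's criteria, not its exact rules.

```python
# Screen-then-rank sketch: drop fields with too many missing values,
# then rank the survivors against the target.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

# Screening step (no values are missing in this synthetic data, but the
# rule is shown for completeness).
keep = df.columns[df.isna().mean() < 0.5]

# Ranking step: score the surviving inputs relative to the target.
scores = mutual_info_classif(df[keep], y, random_state=0)
print(sorted(zip(scores, keep), reverse=True)[:5])
```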

Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met.
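
A minimal sketch of classical discriminant analysis with scikit-learn; this illustrates the technique, not the Modeler node's options or assumption checks.

```python
# Classical linear discriminant analysis on the iris data.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]))  # predicted categories for the first five records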

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
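
A minimal logistic regression sketch with scikit-learn on synthetic data:

```python
# Classify records with a categorical (here two-class) target from
# numeric inputs.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))        # predicted classes
print(clf.predict_proba(X[:3]))  # class probabilities
```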

The Generalized Linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates through a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models.
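
As a minimal sketch of one family the node covers, the following fits a Poisson model with a log link in statsmodels; the data is synthetic.

```python
# A Poisson GLM with a log link for count data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.poisson(lam=np.exp(0.5 + 0.8 * x))  # counts, log-linear in x

model = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(model.params)  # approximately [0.5, 0.8]
```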

A generalized linear mixed model (GLMM) extends the linear model so that the target can have a non-normal distribution and is linearly related to the factors and covariates via a specified link function, and so that the observations can be correlated. Generalized linear mixed models cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data.

The Cox regression node enables you to build a survival model for time-to-event data in the presence of censored records. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time (t) for given values of the input variables.
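
A minimal Cox model sketch using the third-party lifelines package (not part of Modeler); the column names and the tiny data set are hypothetical.

```python
# Cox proportional hazards on time-to-event data with censoring.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "duration": [5, 8, 12, 3, 9, 15, 7, 11],  # time to event or censoring
    "event":    [1, 0, 1, 1, 0, 1, 1, 0],     # 1 = event observed, 0 = censored
    "age":      [52, 61, 48, 70, 55, 44, 66, 58],
})

cph = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
cph.print_summary()  # hazard ratio for each input variable
```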

The Support Vector Machine (SVM) node enables you to classify data into one of two groups without overfitting. SVM works well with wide data sets, such as those with a very large number of input fields.
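
A minimal SVM sketch with scikit-learn on wide synthetic data; the RBF kernel and the value of C (which governs the overfitting trade-off) are illustrative choices.

```python
# A two-class SVM on wide data (many input fields).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```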

The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with real-world knowledge to establish the likelihood of occurrences. The node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification.

The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or small number of new cases, can be used to reestimate the model without having to retrain the model using all data.

The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecasts of future performance. This Time Series node is similar to the previous Time Series node that was deprecated in SPSS Modeler version 18. However, this newer Time Series node is designed to harness the power of IBM SPSS Analytic Server to process big data and to display the resulting model in the output viewer that was added in SPSS Modeler version 17.
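
A minimal univariate ARIMA sketch with statsmodels illustrates the forecasting idea; the (1, 1, 1) order and the synthetic series are illustrative assumptions.

```python
# A univariate ARIMA(1,1,1) fit and 12-step forecast.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=120))  # a synthetic random-walk series

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.forecast(steps=12))  # forecasts for the next 12 periods
```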

The k-Nearest Neighbor (KNN) node associates a new case with the category or value of the k objects nearest to it in the predictor space, where k is an integer. Similar cases are near each other and dissimilar cases are distant from each other.
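
A minimal k-nearest-neighbor sketch with scikit-learn; k = 5 is an illustrative choice.

```python
# A new case takes the majority category of its k nearest cases in the
# predictor space.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))  # predicted categories for the first three records
```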

The Spatio-Temporal Prediction (STP) node uses data that contains location information, input fields for prediction (predictors), a time field, and a target field. Each location has numerous rows in the data that represent the values of each predictor at each time of measurement. After the data is analyzed, it can be used to predict target values at any location within the shape data that is used in the analysis.
Association Models
Association models find patterns
in your data where one or more entities (such as events, purchases,
or attributes) are associated with one or more other entities. The
models construct rule sets that define these relationships. Here the
fields within the data can act as both inputs and targets. You could
find these associations manually, but association rule algorithms
do so much more quickly, and can explore more complex patterns. Apriori
and CARMA models are examples of the use of such algorithms. One other
type of association model is a sequence detection model, which finds
sequential patterns in time-structured data.
Association models are most useful when predicting multiple outcomes—for example,
customers who bought product X also bought Y and Z. Association models associate a particular
conclusion (such as the decision to buy something) with a set of conditions. The advantage of
association rule algorithms over the more standard decision tree algorithms (C5.0 and C&R Tree) is
that associations can exist between any of the attributes. A decision tree algorithm will build
rules with only a single conclusion, whereas association algorithms attempt to find many rules, each
of which may have a different conclusion.
Association nodes

The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large data sets efficiently. For large problems, Apriori is generally faster to train; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data.
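
A minimal association-rule sketch using the third-party mlxtend package (not the Modeler node); the one-hot transaction table and the support and confidence thresholds are illustrative.

```python
# Frequent itemsets and rules from one-hot encoded transactions.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

transactions = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]],
    columns=["razor", "aftershave", "shaving_cream"],
).astype(bool)

frequent = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```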

The CARMA model extracts a set of rules from the data without requiring you to specify input or target fields. In contrast to Apriori, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support. This means that the rules generated can be used for a wider variety of applications—for example, to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season.

The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences.

The Association Rules node is similar to the Apriori node; however, unlike Apriori, the Association Rules node can process list data. In addition, the Association Rules node can be used with IBM SPSS Analytic Server to process big data and take advantage of faster parallel processing.
Segmentation Models
Segmentation models divide the
data into segments, or clusters, of records that have similar patterns
of input fields. Because segmentation models consider only the input fields, they have
no concept of output or target fields. Examples of segmentation models are Kohonen
networks, K-Means clustering, TwoStep clustering, and anomaly detection.
Segmentation models (also known as "clustering models") are useful in cases where the
specific result is unknown (for example, when identifying new patterns of fraud, or when identifying
groups of interest in your customer base). Clustering models focus on identifying groups of similar
records and labeling the records according to the group to which they belong. This is done without
the benefit of prior knowledge about the groups and their characteristics, and it distinguishes
clustering models from the other modeling techniques in that there is no predefined output or target
field for the model to predict. There are no right or wrong answers for these models. Their value is
determined by their ability to capture interesting groupings in the data and provide useful
descriptions of those groupings. Clustering models are often used to create clusters or segments
that are then used as inputs in subsequent analyses (for example, by segmenting potential customers
into homogeneous subgroups).
Segmentation nodes

The Auto Cluster node estimates and compares clustering models, which identify groups of records that have similar characteristics. The node works in the same manner as other automated modeling nodes, allowing you to experiment with multiple combinations of options in a single modeling pass. Models can be compared using basic measures that attempt to filter and rank the usefulness of the cluster models, as well as a measure based on the importance of particular fields.

The K-Means node clusters the data set into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields.
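
A minimal k-means sketch with scikit-learn on synthetic data; four clusters is an illustrative choice.

```python
# Fix the number of clusters, then iteratively reassign records and
# move the centers until they stabilize.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])      # cluster membership per record
print(km.cluster_centers_)  # final cluster centers
```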

The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters.
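
A Kohonen network is a self-organizing map, so a minimal sketch is possible with the third-party minisom package (not the Modeler node); the grid size and iteration count are illustrative choices, not recommendations.

```python
# A self-organizing map: similar records map to nearby output units.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

som = MiniSom(5, 5, input_len=4, random_seed=0)  # a 5x5 output map
som.train_random(data, num_iteration=1000)

# The winning unit for a record acts as its cluster on the map.
print(som.winner(data[0]))
```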

The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large data sets efficiently.
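
TwoStep itself is proprietary, but the two stages can be imitated in scikit-learn: compress the records into many small subclusters, then merge the subcluster centers hierarchically. Note that this sketch fixes the final cluster count rather than estimating it automatically.

```python
# Two-stage clustering sketch: compress, then hierarchically merge.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Step 1: one cheap pass that yields a manageable set of subclusters.
pre = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

# Step 2: hierarchical merging of the 50 subcluster centers.
merge = AgglomerativeClustering(n_clusters=4).fit(pre.cluster_centers_)

labels = merge.labels_[pre.labels_]  # map each record via its subcluster
print(np.bincount(labels))           # records per final cluster
```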

The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of “normal” data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for.
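
Modeler's Anomaly Detection is cluster-based; as a loose open-source analogue (a different algorithm, named plainly), scikit-learn's IsolationForest likewise flags nonconforming cases without needing labeled examples.

```python
# Flag cases that do not conform to the bulk of the data, with no
# labeled examples of anomalies required.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier

iso = IsolationForest(random_state=0).fit(X)
flags = iso.predict(X)           # -1 marks an anomaly, 1 marks normal
print(np.where(flags == -1)[0])  # the planted outlier should appear here
```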
In-Database Mining Models
IBM SPSS Modeler supports integration with data
mining and modeling tools that are available from database vendors, including Oracle Data
Miner and
Microsoft Analysis Services. You can build, score, and store models inside the database—all from
within the IBM SPSS Modeler application. For
full details, see the IBM SPSS Modeler In-Database Mining
Guide.
IBM SPSS Statistics Models
If you have a copy of IBM SPSS Statistics installed and
licensed on your computer, you can access and run certain IBM SPSS Statistics routines from
within IBM SPSS Modeler to build and score
models.