Several phases of the data mining process use graphs and charts
to explore data brought into IBM® SPSS® Modeler. For example,
you can connect a Plot or Distribution node to a data source to gain
insight into data types and distributions. You can then perform record
and field manipulations to prepare the data for downstream modeling
operations. Another common use of graphs is to check the distribution
of, and relationships between, newly derived fields.
The Graphs palette contains the following nodes:
|
The Graphboard node offers many different types of graphs in
a single node. Using this node, you can choose the data fields you
want to explore and then select a graph from those available for the
selected data. The node automatically filters out any graph types
that would not work with the field choices. |
|
The Plot node shows the relationship between numeric fields. You can create a plot by using
points (a scatterplot) or lines. |
|
The Distribution node shows the occurrence of symbolic (categorical)
values, such as mortgage type or gender. Typically, you might use
the Distribution node to show imbalances in the data, which you could
then rectify using a Balance node before creating a model. |
|
The Histogram node shows the occurrence of values for numeric
fields. It is often used to explore the data before manipulations
and model building. Similar to the Distribution node, the Histogram
node frequently reveals imbalances in the data. |
|
The Collection node shows the distribution of values for one
numeric field relative to the values of another. (It creates graphs
that are similar to histograms.) It is useful for illustrating a variable
or field whose values change over time. Using 3-D graphing, you can
also include a symbolic axis displaying distributions by category.
|
|
The Multiplot node creates a plot that displays multiple Y fields
over a single X field. The Y fields are plotted as
colored lines; each is equivalent to a Plot node with Style set to Line and
X Mode set to Sort. Multiplots are useful
when you want to explore the fluctuation of several variables over
time. |
|
The Web node illustrates the strength of the relationship between
values of two or more symbolic (categorical) fields. The graph uses
lines of various widths to indicate connection strength. You might
use a Web node, for example, to explore the relationships among
the items purchased at an e-commerce site. |
|
The Time Plot node displays one or more sets of time series
data. Typically, you would first use a Time Intervals node to create
a TimeLabel field, which would be used to label the x axis. |
|
The Evaluation node helps to evaluate and compare predictive
models. The evaluation chart shows how well models predict particular
outcomes. It sorts records based on the predicted value and confidence
of the prediction. It splits the records into groups of equal size
(quantiles) and then plots the value of the business criterion
for each quantile from highest to lowest. Multiple models are shown
as separate lines in the plot. |
|
The Map Visualization node can accept multiple input connections and display geospatial data
on a map as a series of layers. Each layer is a single geospatial field; for example, the base layer
might be a map of a country, then above that you might have one layer for roads, one layer for
rivers, and one layer for towns. |
|
The E-Plot (Beta) node shows the relationship between numeric fields. It is similar to the
Plot node, but its options differ and its output uses a new graphing interface specific to this
node. Use this beta-level node to experiment with new graphing features. |
|
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a tool for visualizing
high-dimensional data. It converts affinities of data points to probabilities. The t-SNE node in
SPSS Modeler is implemented in Python
and requires the scikit-learn© Python library. |
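The procedure described for the Evaluation node (sort records by prediction, split them into equal-size quantiles, then report a criterion per quantile) can be sketched in plain Python. This is only an illustration of the idea, not SPSS Modeler code: the function name `quantile_hit_rates`, the single combined `score` per record, and the use of hit rate as the criterion are all assumptions made for the example, and the sample data is invented.

```python
from typing import List, Tuple


def quantile_hit_rates(
    records: List[Tuple[float, bool]], n_quantiles: int
) -> List[float]:
    """Return the hit rate per quantile, best-scored quantile first.

    Each record is (score, hit). `score` is a hypothetical model score
    (higher means a more confident prediction of the target outcome) and
    `hit` says whether the outcome actually occurred.
    """
    if n_quantiles <= 0 or not records or len(records) % n_quantiles != 0:
        raise ValueError("need a non-empty record list divisible by n_quantiles")

    # Sort by score, highest first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)

    # Split into groups of equal size and compute the criterion for each.
    size = len(ranked) // n_quantiles
    rates = []
    for q in range(n_quantiles):
        group = ranked[q * size:(q + 1) * size]
        hits = sum(1 for _, hit in group if hit)
        rates.append(hits / size)
    return rates


# Made-up scored records, for illustration only.
sample = [
    (0.95, True), (0.90, True), (0.80, True), (0.75, False),
    (0.60, True), (0.55, False), (0.40, False), (0.35, True),
    (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]
rates = quantile_hit_rates(sample, n_quantiles=4)
print(rates)
```

A model that ranks records well concentrates hits in the first quantiles, so the resulting values fall from left to right. Running the same function on the scores of a second model would give a second series to compare, analogous to the separate lines in an evaluation chart.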
When you have added a graph node to a stream, you can double-click
the node to open a dialog box for specifying options. Most graphs
contain a number of unique options presented on one or more tabs.
There are also several tab options common to all graphs. The following topics
contain more information about these common options.
When you have configured the options for a graph node,
you can run it from within the dialog box or as part of a stream.
In the generated graph window, you can generate Derive (Set and Flag)
and Select nodes based on a selection or region of data, effectively
"subsetting" the data. For example, you might use this powerful feature
to identify and exclude outliers.
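Conceptually, a node generated from a selected region applies a condition on the plotted fields. The following Python sketch shows that idea for a rectangular region. It is not SPSS Modeler code or its API: the function `split_by_region`, the field names `age` and `income`, and the sample records are hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]


def split_by_region(
    records: List[Record],
    x_field: str,
    y_field: str,
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
) -> Tuple[List[Record], List[Record]]:
    """Partition records into (inside, outside) a rectangular plot region."""
    inside: List[Record] = []
    outside: List[Record] = []
    for rec in records:
        in_x = x_range[0] <= rec[x_field] <= x_range[1]
        in_y = y_range[0] <= rec[y_field] <= y_range[1]
        (inside if in_x and in_y else outside).append(rec)
    return inside, outside


# Invented records; the last one has an unusually high income.
data = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": 61000},
    {"age": 29, "income": 48000},
    {"age": 38, "income": 950000},
]

# Treat the upper band of the plot as the selected region.
outliers, kept = split_by_region(
    data, "age", "income", x_range=(20, 60), y_range=(500000, 1000000)
)
print(len(outliers), len(kept))
```

Keeping only `kept` corresponds to excluding the selected records, while tagging each record as inside or outside the region corresponds to deriving a flag field.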