Several phases of the data mining process use graphs and charts to explore
data brought into IBM® SPSS® Modeler. For example,
you can connect a Plot or Distribution node to a data source to gain insight into data types and
distributions. You can then perform record and field manipulations to prepare the data for
downstream modeling operations. Another common use of graphs is to check the distributions of newly
derived fields and the relationships between them.
The Graphs palette contains the following nodes:

The Graphboard node offers many different types of graphs in a single node. Using this
node, you can choose the data fields you want to explore and then select a graph from those
available for the selected data. The node automatically filters out any graph types that would not
work with the field choices.

The Plot node shows the relationship between numeric fields. You can create a plot by using
points (a scatterplot) or lines.

The Distribution node shows the occurrence of symbolic (categorical) values, such as mortgage
type or gender. Typically, you might use the Distribution node to show imbalances in the data, which
you could then rectify using a Balance node before creating a model.
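As a rough illustration of what an imbalance looks like in numbers, the following Python sketch counts how often each category occurs and derives a reduction factor that would shrink every category to the size of the rarest one. It is not SPSS Modeler code: the function names and the sample values are hypothetical, and the factors that a Balance node uses in practice are ones you configure yourself.

```python
from collections import Counter


def category_distribution(values):
    """Return {category: (count, proportion)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}


def reduction_factors(values):
    """Return, per category, the factor that shrinks it to the rarest category's size."""
    counts = Counter(values)
    if not counts:
        return {}
    smallest = min(counts.values())
    return {cat: smallest / n for cat, n in counts.items()}


# Made-up sample: a flag-like "mortgage" field with far more "y" than "n" values.
mortgage = ["y", "y", "y", "n", "y", "y", "n", "y"]

dist = category_distribution(mortgage)
factors = reduction_factors(mortgage)

for cat, (count, share) in sorted(dist.items()):
    print(f"{cat}: count={count}, share={share:.2f}, factor={factors[cat]:.2f}")
```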
The Histogram node shows the occurrence of values for numeric fields. It is often used to
explore the data before manipulations and model building. Similar to the Distribution node, the
Histogram node frequently reveals imbalances in the data.

The Collection node shows the distribution of values for one numeric field relative to the
values of another. (It creates graphs that are similar to histograms.) It is useful for illustrating
a variable or field whose values change over time. Using 3-D graphing, you can also include a
symbolic axis displaying distributions by category.

The Multiplot node creates a plot that displays multiple Y fields over a single
X field. The Y fields are plotted as colored lines; each is equivalent to a Plot node
with Style set to Line and X Mode set to Sort.
Multiplots are useful when you want to explore the fluctuation of several variables over time.

The Web node illustrates the strength of the relationship between values of two or more
symbolic (categorical) fields. The graph uses lines of various widths to indicate connection
strength. You might use a Web node, for example, to explore the relationship between the purchase of
a set of items at an e-commerce site.
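To make the idea of connection strength concrete, the following Python sketch counts how often two items appear together in the same purchase. It assumes that strength is a plain co-occurrence count, which is only one possible measure and is not necessarily how the Web node computes line widths. The function name and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def link_strengths(baskets):
    """Count, for every pair of items, the number of baskets that contain both."""
    links = Counter()
    for basket in baskets:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            links[pair] += 1
    return links


# Made-up purchases; each inner list is one customer's basket.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]

links = link_strengths(baskets)

# Stronger links would be drawn as wider lines.
for (a, b), strength in links.most_common():
    print(f"{a} - {b}: {strength}")
```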
The Time Plot node displays one or more sets of time series data. Typically, you would first
use a Time Intervals node to create a TimeLabel field, which would be used to label the
x axis.

The Evaluation node helps to evaluate and compare predictive models. The evaluation chart
shows how well models predict particular outcomes. It sorts records based on the predicted value and
confidence of the prediction. It splits the records into groups of equal size (quantiles) and
then plots the value of the business criterion for each quantile from highest to lowest. Multiple
models are shown as separate lines in the plot.
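The sort-and-split procedure described for the evaluation chart can be sketched in Python as follows. This is a simplified illustration, not the node's implementation: it assumes that each record carries a single score and a true/false hit, and it uses the hit rate and cumulative gains as the plotted criterion. The function names and the sample data are hypothetical.

```python
def rank_by_score(scored):
    """Sort (score, hit) pairs so that the highest-scored records come first."""
    return sorted(scored, key=lambda record: record[0], reverse=True)


def quantile_hit_rates(scored, n_quantiles=10):
    """Return the share of hits in each equal-sized quantile, best-scored first."""
    ranked = rank_by_score(scored)
    n = len(ranked)
    rates = []
    for q in range(n_quantiles):
        group = ranked[q * n // n_quantiles:(q + 1) * n // n_quantiles]
        hits = sum(1 for _, hit in group if hit)
        rates.append(hits / len(group) if group else 0.0)
    return rates


def cumulative_gains(scored, n_quantiles=10):
    """Return the share of all hits captured up to and including each quantile."""
    ranked = rank_by_score(scored)
    n = len(ranked)
    total_hits = sum(1 for _, hit in ranked if hit)
    gains = []
    for q in range(1, n_quantiles + 1):
        hits = sum(1 for _, hit in ranked[:q * n // n_quantiles] if hit)
        gains.append(hits / total_hits if total_hits else 0.0)
    return gains


# Made-up model output: (predicted score, whether the outcome actually occurred).
scored = [
    (0.5, False), (0.9, True), (0.3, True), (0.7, False),
    (0.8, True), (0.2, False), (0.6, True), (0.4, False),
]

rates = quantile_hit_rates(scored, n_quantiles=4)
gains = cumulative_gains(scored, n_quantiles=4)
print(rates)
print(gains)
```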
The Map Visualization node can accept multiple input connections and display geospatial data
on a map as a series of layers. Each layer is a single geospatial field; for example, the base layer
might be a map of a country, then above that you might have one layer for roads, one layer for
rivers, and one layer for towns.

The E-Plot (Beta) node shows the relationship between numeric fields. It is similar to the
Plot node, but its options differ and its output uses a new graphing interface specific to this
node. Use the beta-level node to experiment with new graphing features.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a tool for visualizing
high-dimensional data. It converts affinities of data points to probabilities. This t-SNE node in
SPSS Modeler is implemented in Python
and requires the scikit-learn © Python library.

When you have added a graph node to a stream, you can double-click the node to
open a dialog box for specifying options. Most graphs contain a number of unique options presented
on one or more tabs. There are also several tab options common to all graphs. The following topics
contain more information about these common options.

When you have configured the options for a graph node, you can run it from
within the dialog box or as part of a stream. In the generated graph window, you can generate Derive
(Set and Flag) and Select nodes based on a selection or region of data, effectively "subsetting" the
data. For example, you might use this powerful feature to identify and exclude outliers.
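
The effect of generating nodes from a region can be pictured with the following Python sketch. It is an analogy only, not code that SPSS Modeler generates: a rectangular region on two fields is turned into a true/false flag (comparable to a Derive Flag node) and into a filter that drops the matching records (comparable to a Select node in discard mode). The field names, the region bounds, and the sample records are hypothetical.

```python
def in_region(record, x_field, y_field, x_range, y_range):
    """Return True when the record falls inside the rectangular region."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    return x_lo <= record[x_field] <= x_hi and y_lo <= record[y_field] <= y_hi


# Made-up records; the third one has an unusually high income.
records = [
    {"age": 34, "income": 42000},
    {"age": 51, "income": 58000},
    {"age": 29, "income": 950000},
    {"age": 45, "income": 61000},
]

# Hypothetical region drawn around the suspicious point on an age/income plot.
region = {
    "x_field": "age",
    "y_field": "income",
    "x_range": (20, 40),
    "y_range": (500000, 1000000),
}

# Flag-style result: every record is kept and gains a true/false field.
flagged = [dict(r, outlier=in_region(r, **region)) for r in records]

# Select-style result: records inside the region are discarded.
kept = [r for r in records if not in_region(r, **region)]

print([r["outlier"] for r in flagged])
print(len(kept))
```

Flagging keeps the full data set and lets downstream nodes decide what to do with the marked records, whereas selecting removes them from the stream at that point.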