Common Graph Nodes Features

Several phases of the data mining process use graphs and charts to explore data brought into IBM® SPSS® Modeler. For example, you can connect a Plot or Distribution node to a data source to gain insight into data types and distributions. You can then perform record and field manipulations to prepare the data for downstream modeling operations. Another common use of graphs is to check the distributions of newly derived fields and the relationships between them.

The Graphs palette contains the following nodes:

The Graphboard node offers many different types of graphs in one single node. Using this node, you can choose the data fields you want to explore and then select a graph from those available for the selected data. The node automatically filters out any graph types that would not work with the field choices.

The Plot node shows the relationship between numeric fields. You can create a plot by using points (a scatterplot) or lines.

The Distribution node shows the occurrence of symbolic (categorical) values, such as mortgage type or gender. Typically, you might use the Distribution node to show imbalances in the data, which you could then rectify using a Balance node before creating a model.
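As an illustration of the kind of imbalance a distribution graph reveals, the following minimal Python sketch counts the occurrences of each category and works out a reduction factor for each one. The sample values and the function names are hypothetical, and the factors are a generic undersampling calculation, not output produced by the Balance node.

```python
from collections import Counter

def category_counts(values):
    """Count how often each category occurs, most frequent first."""
    return Counter(values).most_common()

def undersampling_factors(values):
    """Factor per category that would shrink it to the size of the rarest one."""
    counts = Counter(values)
    smallest = min(counts.values())
    return {category: smallest / n for category, n in counts.items()}

# Hypothetical field values, for illustration only.
mortgage = ["y"] * 6 + ["n"] * 2

print(category_counts(mortgage))        # counts per category
print(undersampling_factors(mortgage))  # factor of 1.0 leaves a category unchanged
```

A factor below 1.0 indicates an overrepresented category; the rarest category always receives a factor of 1.0.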

The Histogram node shows the occurrence of values for numeric fields. It is often used to explore the data before manipulations and model building. Similar to the Distribution node, the Histogram node frequently reveals imbalances in the data.

The Collection node shows the distribution of values for one numeric field relative to the values of another. (It creates graphs that are similar to histograms.) It is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category.

The Multiplot node creates a plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines; each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you want to explore the fluctuation of several variables over time.

The Web node illustrates the strength of the relationship between values of two or more symbolic (categorical) fields. The graph uses lines of various widths to indicate connection strength. You might use a Web node, for example, to explore the relationship between the purchase of a set of items at an e-commerce site.
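The connection strength that a web graph draws as line width can be thought of as a co-occurrence count. The following Python sketch, which uses hypothetical purchase records and a hypothetical link_strengths function, counts how often each pair of field values appears in the same record. It is a simplified illustration and does not reproduce the Web node's own options or thresholds.

```python
from collections import Counter
from itertools import combinations

def link_strengths(records, fields):
    """Count how often each pair of (field, value) items occurs in the same record."""
    strengths = Counter()
    for record in records:
        items = [(field, record[field]) for field in fields]
        for a, b in combinations(items, 2):
            strengths[(a, b)] += 1
    return strengths

# Hypothetical purchase records: T means the item was bought.
purchases = [
    {"wine": "T", "cheese": "T", "fish": "F"},
    {"wine": "T", "cheese": "T", "fish": "T"},
    {"wine": "F", "cheese": "T", "fish": "F"},
]

strengths = link_strengths(purchases, ["wine", "cheese", "fish"])
for (a, b), count in strengths.most_common(3):
    print(a, b, count)
```

Pairs with higher counts correspond to the thicker lines in a web graph.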

The Time Plot node displays one or more sets of time series data. Typically, you would first use a Time Intervals node to create a TimeLabel field, which would be used to label the x axis.

The Evaluation node helps to evaluate and compare predictive models. The evaluation chart shows how well models predict particular outcomes. It sorts records based on the predicted value and confidence of the prediction. It splits the records into groups of equal size (quantiles) and then plots the value of the business criterion for each quantile from highest to lowest. Multiple models are shown as separate lines in the plot.
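The sort, split, and plot steps described for the evaluation chart can be sketched in a few lines of Python. This sketch makes several assumptions: each record carries a single score that already combines the predicted value and its confidence, the business criterion is the cumulative share of hits (a gains-style measure), and the function name and sample data are hypothetical.

```python
def gains_by_quantile(records, n_quantiles):
    """records is a list of (score, hit) pairs, where hit is True or False.

    Returns the cumulative share of all hits captured after each quantile,
    with records ranked from highest score to lowest.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    total_hits = sum(1 for _, hit in ranked if hit)
    size = len(ranked) / n_quantiles
    gains = []
    for q in range(1, n_quantiles + 1):
        cutoff = round(q * size)  # number of records in the first q quantiles
        hits = sum(1 for _, hit in ranked[:cutoff] if hit)
        gains.append(hits / total_hits if total_hits else 0.0)
    return gains

# Hypothetical scored records: (score, whether the outcome actually occurred).
scored = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, False), (0.2, True), (0.1, False),
]

print(gains_by_quantile(scored, 4))  # [0.5, 0.75, 0.75, 1.0]
```

To compare models, you would compute one such series per model and draw each as a separate line.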

The Map Visualization node can accept multiple input connections and display geospatial data on a map as a series of layers. Each layer is a single geospatial field; for example, the base layer might be a map of a country, then above that you might have one layer for roads, one layer for rivers, and one layer for towns.

The E-Plot (Beta) node shows the relationship between numeric fields. It is similar to the Plot node, but its options differ and its output uses a new graphing interface specific to this node. Use the beta-level node to experiment with new graphing features.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a tool for visualizing high-dimensional data. It converts affinities of data points to probabilities. The t-SNE node in SPSS Modeler is implemented in Python and requires the scikit-learn Python library.
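To make the phrase "converts affinities of data points to probabilities" more concrete, the following Python sketch computes Gaussian affinities between points and normalizes each row so that it sums to 1. It covers only this first step of t-SNE and simplifies it: a single fixed bandwidth (sigma) is assumed for every point, whereas t-SNE normally tunes the bandwidth per point from a perplexity setting. The function name and sample points are hypothetical, and this is not the code used by the t-SNE node.

```python
import math

def conditional_probabilities(points, sigma=1.0):
    """Row i holds p(j|i): the Gaussian affinity of point j to point i,
    normalized over all j other than i. sigma is an assumed fixed bandwidth."""
    n = len(points)
    rows = []
    for i in range(n):
        affinities = []
        for j in range(n):
            if i == j:
                affinities.append(0.0)  # a point is not its own neighbor
            else:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinities.append(math.exp(-sq_dist / (2 * sigma ** 2)))
        total = sum(affinities)
        rows.append([a / total for a in affinities])
    return rows

# Hypothetical points: the first two are close together, the third is far away.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
probabilities = conditional_probabilities(points)
for row in probabilities:
    print([round(p, 4) for p in row])
```

Nearby points receive most of the probability mass in each row, which is the neighborhood structure that the low-dimensional layout then tries to preserve.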

When you have added a graph node to a stream, you can double-click the node to open a dialog box for specifying options. Most graphs contain a number of unique options presented on one or more tabs. There are also several tab options common to all graphs. The following topics contain more information about these common options.

When you have configured the options for a graph node, you can run it from within the dialog box or as part of a stream. In the generated graph window, you can generate Derive (Set and Flag) and Select nodes based on a selection or region of data, effectively "subsetting" the data. For example, you might use this feature to identify and exclude outliers.
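The effect of generating nodes from a region can be sketched in plain Python. The following example assumes a rectangular region on two numeric fields; the function names, field names, and records are all hypothetical, and the code only mimics the outcome of a generated flag field or Select node rather than showing what Modeler generates.

```python
def in_region(record, x_field, y_field, x_range, y_range):
    """True when the record falls inside the rectangular region."""
    x_low, x_high = x_range
    y_low, y_high = y_range
    return (x_low <= record[x_field] <= x_high
            and y_low <= record[y_field] <= y_high)

def flag_region(records, flag_name, **region):
    """Like a derived flag field: copy each record and add a T/F marker."""
    return [dict(r, **{flag_name: "T" if in_region(r, **region) else "F"})
            for r in records]

def select_region(records, mode="include", **region):
    """Like a Select node: keep (include) or drop (discard) records in the region."""
    keep_inside = mode == "include"
    return [r for r in records if in_region(r, **region) == keep_inside]

# Hypothetical records; the last one has a suspiciously high income.
customers = [
    {"age": 34, "income": 42000},
    {"age": 51, "income": 58000},
    {"age": 29, "income": 910000},
]
outlier_region = dict(x_field="age", y_field="income",
                      x_range=(0, 120), y_range=(500000, 1000000))

flagged = flag_region(customers, "outlier", **outlier_region)
cleaned = select_region(customers, mode="discard", **outlier_region)
print(len(cleaned))  # 2
```

Flagging keeps every record and marks the ones in the region, while selecting in discard mode removes them from the data that flows downstream.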