Building the Stream

Figure 1. Modeling stream

To build a stream that will create a model, we need at least three elements:

  • A source node that reads in data from some external source, in this case an IBM® SPSS® Statistics data file.
  • A source or Type node that specifies field properties, such as measurement level (the type of data that the field contains), and the role of each field as a target or input in modeling.
  • A modeling node that generates a model nugget when the stream is run.

In this example, we're using a CHAID modeling node. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method that builds decision trees by using chi-square statistics to determine the best places to split the tree.
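To make the role of the chi-square statistic more concrete, the following minimal Python sketch computes the Pearson chi-square statistic for a table that cross-tabulates the branches of a candidate split against the target categories. The function name and all counts are hypothetical, chosen only for illustration. This is not the CHAID node's implementation, which involves further steps.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts.

    Assumes every row and every column has a nonzero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = float(sum(row_totals))
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if branch and target were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are branches of a candidate split,
# columns are two target categories (for example Bad and Good).
weak_split = [[10, 20], [20, 40]]    # same target mix in both branches
strong_split = [[45, 5], [10, 40]]   # branches separate the target well

weak_stat = chi_square(weak_split)
strong_stat = chi_square(strong_split)
```

A larger statistic indicates a stronger association between the split and the target, so in this sketch the second table would be the better candidate.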

If measurement levels are specified in the source node, the separate Type node can be eliminated. Functionally, the result is the same.

This stream also has Table and Analysis nodes that will be used to view the scoring results after the model nugget has been created and added to the stream.

The Statistics File source node reads data in IBM SPSS Statistics format from the tree_credit.sav data file, which is installed in the Demos folder. (A special variable named $CLEO_DEMOS is used to reference this folder under the current IBM SPSS Modeler installation. This ensures the path will be valid regardless of the current installation folder or version.)
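The idea behind $CLEO_DEMOS can be illustrated with a small Python sketch that expands a placeholder in a path. IBM SPSS Modeler resolves the variable internally. The installation folders below are hypothetical, and `string.Template` is used only because its `$NAME` syntax happens to look the same.

```python
from string import Template

# The path as it would be written in the source node
demo_path = Template("$CLEO_DEMOS/tree_credit.sav")


def resolve(demos_folder):
    """Expand the placeholder for one (hypothetical) installation."""
    return demo_path.substitute(CLEO_DEMOS=demos_folder)


# Two hypothetical installation folders; the stored path never changes.
path_a = resolve("/opt/modeler_a/Demos")
path_b = resolve("/opt/modeler_b/Demos")
```

The same stored path yields a valid file location for either installation, which is the point of referencing the folder through a variable.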

Figure 2. Reading data with a Statistics File source node

The Type node specifies the measurement level for each field. The measurement level is a category that indicates the type of data in the field. Our source data file uses three different measurement levels.

A Continuous field (such as the Age field) contains continuous numeric values, while a Nominal field (such as the Credit rating field) has two or more distinct values with no inherent order, for example Bad, Good, or No credit history. An Ordinal field (such as the Income level field) describes data with multiple distinct values that have an inherent order, in this case Low, Medium, and High.

Figure 3. Setting the target and input fields with the Type node

For each field, the Type node also specifies a role, which indicates the part that the field plays in modeling. The role is set to Target for the Credit rating field, which indicates whether or not a given customer defaulted on the loan. This is the target: the field whose value we want to predict.

The role is set to Input for the other fields. Input fields, sometimes known as predictors, are fields whose values are used by the modeling algorithm to predict the value of the target field.
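The information that the Type node carries can be pictured as a small table of field properties. The sketch below is a plain Python illustration that uses only the three fields named in this section. The dictionary layout and the `split_roles` helper are assumptions made for illustration, not part of any Modeler API.

```python
# Measurement level and role for the fields named in this example.
# The dictionary layout is an illustrative assumption, not a Modeler API.
fields = {
    "Age": {"measurement": "Continuous", "role": "Input"},
    "Income level": {"measurement": "Ordinal", "role": "Input"},
    "Credit rating": {"measurement": "Nominal", "role": "Target"},
}


def split_roles(field_specs):
    """Return (targets, inputs) as sorted lists of field names."""
    targets = sorted(n for n, s in field_specs.items() if s["role"] == "Target")
    inputs = sorted(n for n, s in field_specs.items() if s["role"] == "Input")
    return targets, inputs


targets, inputs = split_roles(fields)
```

A modeling step that relies on predefined roles would then predict the fields in `targets` from the fields in `inputs`.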

The CHAID modeling node generates the model.

On the Fields tab in the modeling node, the option Use predefined roles is selected, which means the target and inputs will be used as specified in the Type node. We could change the field roles at this point, but for this example we'll use them as they are.

  1. Click the Build Options tab.
    Figure 4. CHAID modeling node, Fields tab

    This tab has several options for specifying the kind of model we want to build.

    We want a brand-new model, so we'll use the default option Build new model.

    We also just want a single, standard decision tree model without any enhancements, so we'll also keep the default objective option Build a single tree.

    While we can optionally launch an interactive modeling session that allows us to fine-tune the model, this example simply generates a model using the default mode setting Generate model.

    Figure 5. CHAID modeling node, Build Options tab

    For this example, we want to keep the tree fairly simple, so we'll limit the tree growth by raising the minimum number of cases for parent and child nodes.

  2. On the Build Options tab, select Stopping Rules from the navigator pane on the left.
  3. Select the Use absolute value option.
  4. Set Minimum records in parent branch to 400.
  5. Set Minimum records in child branch to 200.
Figure 6. Setting the stopping criteria for decision tree building
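The effect of the two thresholds set above can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the CHAID node's code. It assumes that the parent's record count equals the sum of its child branches, and the function and constant names are hypothetical.

```python
MIN_PARENT = 400  # Minimum records in parent branch
MIN_CHILD = 200   # Minimum records in child branch


def split_allowed(child_sizes, min_parent=MIN_PARENT, min_child=MIN_CHILD):
    """Return True if a split producing these child sizes passes both rules."""
    # Assumption: every parent record lands in exactly one child branch
    parent_size = sum(child_sizes)
    if parent_size < min_parent:
        return False  # the node is too small to be split at all
    # Every resulting branch must be large enough
    return all(size >= min_child for size in child_sizes)


balanced = split_allowed([250, 250])      # 500-record parent, both children large enough
small_parent = split_allowed([150, 150])  # 300-record parent, below 400
small_child = split_allowed([450, 50])    # one child below 200
```

Raising either threshold rejects more candidate splits, which is why these settings keep the tree small.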

We can use all the other default options for this example, so click Run to create the model. (Alternatively, right-click on the node and choose Run from the context menu, or select the node and choose Run from the Tools menu.)
