Introduction to modeling

This tutorial provides an introduction to modeling with SPSS® Modeler. A model is a set of rules, formulas, or equations that can be used to predict an outcome based on a set of input fields or variables. For example, a financial institution might use a model to predict whether loan applicants are likely to be good or bad risks, based on information that is already known about them.

Preview the tutorial

Watch this video to preview the steps in this tutorial. The video is a companion to the written tutorial and provides a visual way to learn the concepts and tasks in this documentation. The user interface that is shown in the video might differ slightly from the current one.

Try the tutorial

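In this tutorial, you will complete these tasks:

  • Task 1: Open the sample project
  • Task 2: Examine the Data Asset and Type nodes
  • Task 3: Configure the Modeling node
  • Task 4: Explore the model
  • Task 5: Evaluate the model
  • Task 6: Score the model with new data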

Sample modeler flow and data set

This tutorial uses the Introduction to Modeling flow in the sample project. The data file used is tree_credit.csv. The following image shows the sample modeler flow.

Figure 1. Sample modeler flow

The ability to predict an outcome is the central goal of predictive analytics, and understanding the modeling process is the key to using SPSS Modeler flows.

The model in this example shows how a bank can predict whether future loan applicants might default on their loans. The model is built from data on customers who previously took loans from the bank, so their data is already stored in the bank's database. The model uses this data to determine how likely applicants are to default.

An important part of any model is the data that goes into it. The bank maintains a database of historical information on customers, including whether they repaid the loans (Credit rating = Good) or defaulted (Credit rating = Bad). The bank wants to use this existing data to build the model. The following fields are used:

Field name     Description
Credit_rating  Credit rating: 0=Bad, 1=Good, 9=missing values
Age            Age in years
Income         Income level: 1=Low, 2=Medium, 3=High
Credit_cards   Number of credit cards held: 1=Less than five, 2=Five or more
Education      Level of education: 1=High school, 2=College
Car_loans      Number of car loans taken out: 1=None or one, 2=More than two
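
If you inspect the data outside SPSS Modeler, it can help to translate the coded values into labels. The following Python sketch is illustrative only. It assumes that the file stores the numeric codes listed in the table. The names `CODEBOOK` and `decode_record` are hypothetical, and the sample record is invented for the example rather than taken from tree_credit.csv.

```python
# Hypothetical codebook built from the field table in this tutorial.
CODEBOOK = {
    "Credit_rating": {0: "Bad", 1: "Good", 9: None},  # 9 marks a missing value
    "Income": {1: "Low", 2: "Medium", 3: "High"},
    "Credit_cards": {1: "Less than five", 2: "Five or more"},
    "Education": {1: "High school", 2: "College"},
    "Car_loans": {1: "None or one", 2: "More than two"},
}


def decode_record(record):
    """Return a copy of record with coded fields replaced by their labels.

    Fields without a codebook entry (such as Age) pass through unchanged.
    A code that is not in the codebook raises KeyError so that bad data surfaces.
    """
    decoded = {}
    for field, value in record.items():
        mapping = CODEBOOK.get(field)
        decoded[field] = mapping[value] if mapping is not None else value
    return decoded


# Invented example record, not a row from tree_credit.csv.
example = {"Credit_rating": 1, "Age": 35, "Income": 2,
           "Credit_cards": 1, "Education": 2, "Car_loans": 1}
print(decode_record(example))
```

Mapping the code 9 to `None` keeps missing credit ratings distinguishable from real ratings.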

This example uses a decision tree model, which classifies records (and predicts a response) by using a series of decision rules.

Figure 2. A decision tree model

For example, this decision rule classifies a record as having a good credit rating when the income falls in the medium range and the number of credit cards is less than 5.

IF income = Medium 
AND cards <5
THEN -> 'Good'
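
The same rule can be written as a small function. This is a minimal Python sketch of the single rule shown above, not of the full tree. The function name `classify` and its parameters are hypothetical, and records that the rule does not cover return `None`.

```python
def classify(income, cards):
    """Apply the single example rule; return None when the rule does not fire."""
    if income == "Medium" and cards < 5:
        return "Good"
    return None  # other rules in the tree would handle these records


print(classify("Medium", 3))  # Good
print(classify("Medium", 7))  # None
```

A complete decision tree is a collection of such rules, one for each path from the root to a terminal node.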

Using a decision tree model, you can analyze the characteristics of the two groups of customers and predict the likelihood of loan defaults.

While this example uses a CHAID (Chi-squared Automatic Interaction Detection) model, it is intended as a general introduction, and most of the concepts apply broadly to other modeling types in SPSS Modeler.

Task 1: Open the sample project

The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:

  1. In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
  2. Click SPSS Modeler Project.
  3. Click the Assets tab to see the data sets and modeler flows.

Check your progress

The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.

Sample project

Task 2: Examine the Data Asset and Type nodes

The Introduction to Modeling modeler flow includes several nodes. Follow these steps to examine the Data Asset and Type nodes.

  1. From the Assets tab, open the Introduction to Modeling modeler flow, and wait for the canvas to load.
  2. Double-click the tree_credit.csv node. This node is a Data Asset node that points to the tree_credit.csv file in the project. If you specify measurements in the source node, you don’t need to include a separate Type node in the flow.
  3. Review the File format properties.
  4. Optional: Click Preview data to see the full data set.
  5. Double-click the Type node. This node specifies field properties, such as the measurement level and the role of each field as a target or input in modeling. The measurement level is a category that indicates the type of data in the field. The source data file uses three different measurement levels:
    • A Continuous field (such as the Age field) contains continuous numeric values.
    • A Nominal field (such as the Education field) has two or more distinct values: in this case, College or High school.
    • An Ordinal field (such as the Income level field) describes data with multiple distinct values that have an inherent order: in this case, Low, Medium, and High.
    Figure 3. Type node

    For each field, the Type node also specifies a role to indicate the part that each field plays in modeling. The role is set to Target for the field Credit rating, which is the field that indicates whether a customer defaulted on the loan. The target is the field for which you want to predict the value.

    The other fields have the Role set to Input. Input fields are sometimes known as predictors, or fields whose values are used by the modeling algorithm to predict the value of the target field.

  6. Optional: Click Preview data to see the data with the Type properties applied.
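
The field properties that the Type node records in step 5 can be pictured as a small lookup table. The following Python sketch is a conceptual illustration, not the SPSS Modeler API. The names `FieldType`, `FIELD_TYPES`, and `split_roles` are hypothetical. Only the measurement levels and roles that this tutorial states are filled in, and the measurement level is left as `None` where the tutorial does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldType:
    name: str
    measurement: Optional[str]  # None where the tutorial does not state it
    role: str                   # "Input" or "Target"


FIELD_TYPES = [
    FieldType("Credit rating", None, "Target"),
    FieldType("Age", "Continuous", "Input"),
    FieldType("Education", "Nominal", "Input"),
    FieldType("Income level", "Ordinal", "Input"),
]


def split_roles(fields):
    """Return (target_name, input_names); exactly one target is expected."""
    targets = [f.name for f in fields if f.role == "Target"]
    inputs = [f.name for f in fields if f.role == "Input"]
    if len(targets) != 1:
        raise ValueError(f"expected exactly one target field, found {len(targets)}")
    return targets[0], inputs


target, inputs = split_roles(FIELD_TYPES)
print(target, inputs)
```

Separating the target from the inputs in this way mirrors what a modeling algorithm needs to know before it can train: which field to predict and which fields to predict it from.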

Check your progress

The following image shows the Type node. You are now ready to configure the Modeling node.

Type node

Task 3: Configure the Modeling node

A modeling node generates a model nugget when the flow runs. This example uses a CHAID node. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method that builds decision trees by using chi-square statistics to determine the best places to make the splits in the tree. Follow these steps to configure the Modeling node:

  1. Double-click the Credit rating (CHAID) node to see its properties.
  2. In the Fields section, notice the Use settings defined in this node option. This option tells the node to use the target and fields specified here instead of using the field information in the Type node. For this tutorial, leave the Use settings defined in this node option turned off.
  3. Expand the Objectives section. In this case, the default values are appropriate. Your objective is to Build new model, Create a standard model, and Generate a model node after run.
  4. Expand the Stopping Rules section. To keep the tree fairly simple for this example, limit the tree growth by raising the minimum number of cases for parent and child nodes.
    1. Select Use absolute value.
    2. Set Minimum records in parent branch to 400.
    3. Set Minimum records in child branch to 200.
  5. Click Save.
  6. Hover over the Credit rating (CHAID) node, and click the Run icon.
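
To get a feel for the chi-square statistic and the stopping rules from step 4, consider the following Python sketch. It is a simplified illustration under stated assumptions, not the CHAID implementation in SPSS Modeler: it computes only the Pearson chi-square statistic for one candidate split and omits other parts of CHAID, such as category merging and significance testing. The function names `chi_square` and `may_split` and the example counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    table is a list of rows; each row holds the counts of each target value
    (for example [bad, good]) for one category of the candidate predictor.
    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def may_split(parent_size, child_sizes, min_parent=400, min_child=200):
    """Check the absolute stopping rules: parent and every child are large enough."""
    return parent_size >= min_parent and all(n >= min_child for n in child_sizes)


# Hypothetical counts: rows are predictor categories, columns are [Bad, Good].
print(chi_square([[10, 30], [30, 10]]))  # 20.0
print(chi_square([[20, 20], [20, 20]]))  # 0.0 (predictor tells us nothing)
print(may_split(1000, [600, 400]))       # True
print(may_split(500, [350, 150]))        # False (one child has fewer than 200 records)
```

A larger statistic indicates a stronger association between the predictor and the target, which is why a predictor with a high value is a better candidate for a split. Raising the minimum record counts blocks more splits and therefore produces a smaller tree.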

Check your progress

The following image shows the flow with the model results. You are now ready to explore the model.

Results pane

Task 4: Explore the model

Running the modeler flow adds a model nugget to the canvas with a link to the Modeling node from which it was created. Follow these steps to view the model details:

  1. In the Outputs and models pane, click the model with the name Credit rating to view the model.
  2. Click Model Information to see basic information about the model.
  3. Click Feature Importance to see the relative importance of each predictor in estimating the model. From this chart, you can see that Income level is easily the most significant in this case, with Number of credit cards as the next most significant factor.
    Figure 4. Feature Importance chart
  4. Click Top Decision Rules to see details in the form of a rule set, which is essentially a series of rules that can be used to assign individual records to child nodes based on the values of different input fields. A prediction of Good or Bad is returned for each terminal node in the decision tree. Terminal nodes are those tree nodes that are not split further. In each case, the prediction is determined by the mode, or most common response, for records that fall within that node.
    Figure 5. CHAID model nugget, rule set
  5. Click Tree Diagram to see the same model in the form of a tree, with a node at each decision point. Hover over branches and nodes to explore details.
    Figure 6. Tree diagram in the model nugget

    Looking at the start of the tree, the first node (node 0) gives a summary for all the records in the data set. Just over 40% of the cases in the data set are classified as a bad risk. 40% is quite a high proportion, but the tree might give clues as to what factors might be responsible.

    The first split is by Income level. Records where the income level is in the Low category are assigned to node 2, and it's no surprise to see that this category contains the highest percentage of loan defaulters. Clearly, lending to customers in this category carries a high risk. However, almost 18% of the customers in this category didn’t default, so the prediction is not always correct. No model can feasibly predict every response, but a good model should allow you to predict the most likely response for each record based on the available data.

    In the same way, if you look at the high-income customers (node 1), you can see that most customers (over 88%) are a good risk. But more than 1 in 10 of these customers still defaulted. Can the lending criteria be refined further to minimize the risk here?

    Notice how the model divided these customers into two subcategories (nodes 4 and 5), based on the number of credit cards held. For high-income customers, if the bank lends only to customers with fewer than five credit cards, it can increase its success rate from 88% to almost 97%, an even more satisfactory outcome.

    Figure 7. High-income customers with fewer than five credit cards

    But what about those customers in the Medium income category (node 3)? They’re much more evenly divided between Good and Bad ratings. Again, the subcategories (nodes 6 and 7 in this case) can help. This time, lending only to those medium-income customers with fewer than five credit cards increases the percentage of Good ratings from 58% to 86%, a significant improvement.

    Figure 8. Tree view of medium-income customers
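
The way a terminal node arrives at its prediction can be sketched in a few lines of Python. This is an illustration of the mode rule described in step 4, using invented counts. The function name `node_prediction` is hypothetical, and the share that it returns is a plain proportion, which is not necessarily the confidence value that SPSS Modeler reports.

```python
from collections import Counter


def node_prediction(responses):
    """Return (predicted_value, share) for the records in one terminal node.

    The prediction is the mode of the responses; share is the fraction of
    records in the node that have that value. Ties go to the value seen first.
    """
    counts = Counter(responses)
    value, count = counts.most_common(1)[0]
    return value, count / len(responses)


# Hypothetical node with 82 Bad and 18 Good records (illustrative counts only).
node = ["Bad"] * 82 + ["Good"] * 18
print(node_prediction(node))  # ('Bad', 0.82)
```

Every record that lands in this node is predicted as Bad, so the 18 Good records in the example are misclassified. This is why a mixed terminal node always produces some incorrect predictions.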

Check your progress

The following image shows the model details. You are now ready to evaluate the model.

Model information

Task 5: Evaluate the model

You can browse the model to understand how scoring works. However, to evaluate how accurately the model works, you need to score some records and compare the responses that the model predicted to the actual results. Scoring is the process of applying the model to records to generate predictions. To evaluate the model, you can score the same records that were used to estimate the model, which lets you compare the observed and predicted responses for each record. Follow these steps to evaluate the model:

  1. Attach the Table node to the model nugget.
  2. Hover over the Table node, and click the Run icon.
  3. In the Outputs and models pane, click the output results with the name Table to view the results.

    The table displays the predicted scores in the $R-Credit rating field, which the model created. You can compare these values to the original Credit rating field that contains the actual responses.

    By convention, the names of the fields that were generated during scoring are based on the target field, but with a standard prefix.
    • $G and $GE are prefixes for predictions that the Generalized Linear Model generates
    • $R is the prefix for predictions that the CHAID model generates
    • $RC is for confidence values
    • $X is typically generated by using an ensemble
    • $XR, $XS, $XF are used as prefixes in cases where the target field is a Continuous, Categorical, Set, or Flag field

    A confidence value is the model's own estimation, on a scale from 0.0 to 1.0, of how accurate each predicted value is.

    Figure 9. Table showing generated scores and confidence values

    As expected, the predicted value matches the actual responses for many records, but not all. The reason for this is that each CHAID terminal node has a mix of responses. The prediction matches the most common one, but it is wrong for all the others in that node. (Recall the 18% minority of low-income customers who did not default.)

    To avoid this issue, you could continue splitting the tree into smaller and smaller branches until every node is 100% pure, that is, all Good or all Bad with no mixed responses. But such a model is complicated and is unlikely to generalize well to other data sets.

    To find out exactly how many predictions are correct, you could read through the table and tally the number of records where the value of the predicted field $R-Credit rating matches the value of Credit rating. However, it is easier to use an Analysis node, which automatically tallies the records where these values match.

  4. Connect the model nugget to the Analysis node.
  5. Hover over the Analysis node, and click the Run icon.
  6. In the Outputs and models pane, click the output results with the name Analysis to view the results.

    The analysis shows that for 1960 out of 2464 records (over 79%) the value that the model predicted matched the actual response.

    Figure 10. Analysis results comparing observed and predicted responses

    This result is limited by the fact that the records that you scored are the same ones that you used to estimate the model. In a real situation, you could use a Partition node to split the data into separate samples for training and evaluation. By using one sample partition to generate the model and another sample to test it, you can get a better indication of how well it generalizes to other data sets.

    You can use the Analysis node to test the model against records for which you already know the actual result. The next stage illustrates how you can use the model to score records for which you don't know the outcome. For example, this data set might include people who are not currently customers of the bank, but who are prospective targets for a promotional mailing.
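
The tally that the Analysis node performs, and the partitioning idea mentioned in step 6, can be sketched in Python as follows. This is a conceptual illustration, not how SPSS Modeler implements either node. The function names `accuracy` and `partition`, the 50% default split, and the seed are assumptions made for the example. The field names Credit rating and $R-Credit rating come from this tutorial.

```python
import random


def accuracy(records, actual="Credit rating", predicted="$R-Credit rating"):
    """Return (matches, total, fraction) comparing actual and predicted fields."""
    matches = sum(1 for r in records if r[actual] == r[predicted])
    total = len(records)
    return matches, total, matches / total


def partition(records, train_fraction=0.5, seed=0):
    """Shuffle a copy of records and split it into (training, testing) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# The figure reported by the Analysis node in this tutorial.
print(f"{1960 / 2464:.1%}")  # 79.5%
```

Measuring accuracy on the testing list only, after the model is estimated on the training list, gives a more realistic indication of how the model performs on records that it has not seen.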

Check your progress

The following image shows the flow with the output results. You are now ready to score the model with new data.

Complete modeler flow

Task 6: Score the model with new data

Earlier, you scored the records that were used to estimate the model so that you could evaluate how accurate the model was. In this task, you score a different set of records from the ones used to create the model. Predicting unknown outcomes is a central goal of modeling with a target field: you study records for which you know the outcome to identify patterns so that you can predict outcomes that you don't yet know.

You can update the existing Data Asset or Import node to point to a different data file. Or you can add a Data Asset or Import node that reads in the data you want to score. Either way, the new data set must contain the same input fields that are used by the model (Age, Income level, Education, and so on), but not the target field Credit rating.

Alternatively, you can add the model nugget to any flow that includes the expected input fields. Whether the data is read from a file or a database, the source type does not matter as long as the field names and types match the ones that are used by the model.
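
Before you score a new data set, you can check that it contains the input fields that the model expects. The following Python sketch shows one way to do that check outside SPSS Modeler. The names `MODEL_INPUTS` and `missing_inputs` are hypothetical, the input field names are assumed to match the field table earlier in this tutorial, and the example header is invented. The sketch checks field names only, not field types.

```python
# Assumed input field names, taken from the field table earlier in this tutorial.
MODEL_INPUTS = ["Age", "Income", "Credit_cards", "Education", "Car_loans"]


def missing_inputs(columns, required=MODEL_INPUTS):
    """Return the required input fields that are absent from columns."""
    present = set(columns)
    return [name for name in required if name not in present]


new_data_columns = ["Age", "Income", "Education", "Car_loans"]  # hypothetical file header
print(missing_inputs(new_data_columns))  # ['Credit_cards']
```

An empty list means that every expected input field is present. The target field is not part of the check because new records do not need it.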

Check your progress

The following image shows the completed flow.

Complete modeler flow

Summary

The Introduction to Modeling example flow demonstrates the basic steps for creating, evaluating, and scoring a model.

  • The Modeling node estimates the model by studying records for which the outcome is known, and creates a model nugget. This process is sometimes referred to as training the model.
  • The model nugget can be added to any flow with the expected fields to score records. By scoring the records for which you already know the outcome (such as existing customers), you can evaluate how well the model performs.
  • After you're satisfied that the model performs acceptably, you can score new data (such as prospective customers) to predict how they will respond.
  • The data used to train or estimate the model can be referred to as the analytical or historical data. The scoring data might also be referred to as the operational data.

Next steps

You are now ready to try other SPSS Modeler tutorials.