Quick start: Build a model using SPSS Modeler

You can create, train, and deploy models using SPSS Modeler. Read about SPSS Modeler, then watch a video and follow a tutorial that’s suitable for beginners and requires no coding.

Required service: Watson Studio (which includes SPSS Modeler)

Your basic workflow includes these tasks:

Create a project. Projects are where you can collaborate with others to work with data.
Add an SPSS Modeler flow to the project.
Configure the nodes on the canvas, and run the flow.
Review the model details and save the model.
Deploy and test your model.

Read about SPSS Modeler

With SPSS Modeler flows, you can quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the long-established SPSS Modeler client software and the industry-standard CRISP-DM model it uses, the flows interface supports the entire data mining process, from data to better business results.

SPSS Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the node palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems.

Watch a video about creating a model using SPSS Modeler

Watch this video to see how to create and run an SPSS Modeler flow to train a machine learning model.

Video disclaimer: Some minor steps and graphical elements in this video differ from your Cloud Pak for Data deployment. This video shows the Cloud Pak for Data as a Service user interface.

This video provides a visual method as an alternative to following the written steps in this documentation.

Try a tutorial to create a model using SPSS Modeler

In this tutorial, you will complete these tasks:

Create a project.
Add a data set to your project.
Create the SPSS Modeler flow.
Add the nodes to the SPSS Modeler flow.
Run the SPSS Modeler flow and explore the model details.
Evaluate the model.
Deploy and test the model with new data.

This tutorial will take approximately 30 minutes to complete.

Example data

The data set used in this tutorial is from the University of California, Irvine, and is the result of an extensive study based on hospital admissions over a period of time. The model will use three important factors to help predict chronic kidney disease.

Task 1: Create a project

You need a project to store the SPSS Modeler flow.

If you have an existing project, open it. If you don't have an existing project, click Create a project on the home page or click New project on your Projects page.
Select Analytics project as the project type.
Select Create an empty project.
On the Create a project screen, add a name and optional description for the project.
Click Create.

For more information or to watch a video, see Creating a project.

Task 2: Add the data set to your project

The data set used in this tutorial is available in the Gallery.

Download the chronic_kidney_disease_full.csv file (39 KB).
Add the chronic_kidney_disease_full.csv file to your project:
1. From your project, click Add to project > Data.
2. In the Load pane that opens, browse to select the chronic_kidney_disease_full.csv file, and click Open. Stay on the page until the load completes. The chronic_kidney_disease_full.csv file is added to your project as a data asset.
From your project's Assets page, locate the chronic_kidney_disease_full.csv file.

Task 3: Create the SPSS Modeler flow

Now add the SPSS Modeler flow to the project.

Click Add to project, and select Modeler flow.
Type a name and description for the flow.
Click Create. This opens up the Flow Editor that you'll use to create the flow.

Task 4: Add the nodes to the SPSS Modeler flow

After you load the data, you must transform the data. You'll be creating a simple flow by dragging transformers and estimators onto the canvas and connecting them to the data source. Use the following nodes from the palette:

Data Asset: loads the CSV file from the project
Partition: divides the data into training and testing segments
Type: sets the data type. Use it to designate the class field as a target type.
C5.0: a classification algorithm
Analysis: view the model and check its accuracy
Table: preview the data with predictions
From the Import section, drag the Data Asset node onto the canvas.
1. Double-click the Data Asset node to select the data set.
2. Click Change data asset in the pane that opens.
3. Select Data assets in the page that opens.
4. Select chronic_kidney_disease_full.csv.
5. Click OK.
6. View the Data Asset properties.
7. Click Save.
From the Field Operations section, drag the Partition node onto the canvas.
1. Connect the Data Asset node to the Partition node.
2. Double-click the Partition node to view its properties. The default partition divides half of the data for training and the other half for testing.
3. Click Save.
From the Field Operations section, drag the Type node onto the canvas.
1. Connect the Partition node to the Type node.
2. Double-click the Type node to view its properties. The Type node specifies the measurement level for each field. This source data file uses the following measurement levels: Continuous, Categorical, Nominal, Ordinal, and Flag.
3. Search for the class field. The role indicates the part that each field plays in modeling. Change the Role of the class field to Target, which marks it as the field you want to predict.
4. Click Save.
From the Modeling section, drag the C5.0 node onto the canvas.
1. Connect the Type node to the C5.0 node.
2. Double-click the C5.0 node to view its properties. By default, the C5.0 algorithm builds a decision tree. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples can't be split any further. Finally, the lowest-level splits are reexamined, and those that don't contribute significantly to the value of the model are removed.
3. Check Use custom field roles.
4. For Target, select class.
5. In the Inputs section, click Add columns.
6. Select age, sc, dm.
7. Click OK.
8. Click Save.

When you're done creating the flow, it should look like the following image.

flow showing Data Asset node, Partition node, Type node, and C5.0 class node
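
The information-gain idea behind the C5.0 splits described in the steps above can be sketched in a few lines of Python. This is only an illustration of plain information gain on categorical fields, not the actual C5.0 implementation, which may differ in its details. The function names and the four sample records are made up for this sketch; only the field names dm, appet, and class come from the tutorial data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Entropy of the target minus the weighted entropy after splitting on a field."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder

# Made-up records for illustration only; not taken from the tutorial data set.
rows = [
    {"dm": "yes", "appet": "poor", "class": "ckd"},
    {"dm": "yes", "appet": "good", "class": "ckd"},
    {"dm": "no", "appet": "good", "class": "notckd"},
    {"dm": "no", "appet": "poor", "class": "notckd"},
]

# Score each candidate field and pick the one with the highest gain.
gains = {f: information_gain(rows, f, "class") for f in ("dm", "appet")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy sample, dm separates the two classes completely while appet does not separate them at all, so a tree builder that maximizes information gain would split on dm first.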

Task 5: Run the SPSS Modeler flow and explore the model details

Now that you have designed the flow, you can run the flow and examine the tree diagram to see the decision points.

Right-click the C5.0 node and select Run. Running the flow generates a new model nugget on the canvas.
Right-click the model nugget and select View Model to view the model details.
View the Model Information, which provides a model summary.
Click Top Decision Rules. A table displays a series of rules that were used to assign individual records to child nodes based on the values of different input fields.
Click Feature Importance. A chart shows the relative importance of each predictor in estimating the model. From this, you can see that serum creatinine is easily the most significant factor, with diabetes being the next most significant factor.
Click Tree Diagram. The same model is displayed in the form of a tree, with a node at each decision point.
1. Select the Display labels on branches option.
2. Hover over Node 0, which provides a summary for all the records in the data set. Just under 40% of the cases in the data set are classified as not diagnosed with kidney disease. The tree can provide additional clues as to what factors might be responsible.
3. Notice the two branches stemming from Node 0, which indicate a split by serum creatinine.
4. Hover over Node 6, which shows records where the serum creatinine is greater than 1.25. In this case, 100% of those patients have a positive kidney disease diagnosis.
5. Hover over Node 1, which shows records where the serum creatinine is less than or equal to 1.25. Almost 80% of those patients don't have a positive kidney disease diagnosis, but almost 20% with lower serum creatinine were still diagnosed with kidney disease.
6. The branch from Node 1 is split by diabetes. Hover over Node 2, which shows patients with low serum creatinine and diagnosed diabetes. 100% of these patients were also diagnosed with kidney disease.
7. Hover over Node 3. For patients with low serum creatinine and no diabetes, about 85% were not diagnosed with kidney disease, but about 15% of them were still diagnosed with kidney disease.
8. The branch from Node 3 is split by the last significant factor, age. Hover over Node 4 to see that 75% of young patients with low serum creatinine and no diabetes were at risk of getting kidney disease.
9. Hover over Node 5. Only 11% of patients over 16 years old with low serum creatinine and no diabetes were at risk of getting kidney disease.
10. Use the breadcrumb navigation to navigate back to your model.
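
The tree walked through in the steps above can be read as a short chain of rules. The following Python sketch restates those rules and returns the majority class at each leaf. It is not code exported from SPSS Modeler. The function name is hypothetical, the age cutoff of 16 is inferred from the Node 5 description, and the label notckd is an assumption about how the data set names the negative class.

```python
def predict_class(sc, dm, age):
    """Follow the tree described above and return the majority class at the leaf."""
    if sc > 1.25:      # Node 6: high serum creatinine, 100% diagnosed
        return "ckd"
    if dm == "yes":    # Node 2: low serum creatinine with diabetes, 100% diagnosed
        return "ckd"
    if age <= 16:      # Node 4: young patients, 75% at risk (cutoff is an assumption)
        return "ckd"
    return "notckd"    # Node 5: only 11% at risk, so the majority class is negative

# A 62-year-old with diabetes and serum creatinine of 1.8 reaches Node 6.
print(predict_class(sc=1.8, dm="yes", age=62))
```

Because the leaves in Nodes 4 and 5 are not pure, the real model attaches a confidence score to each prediction; this sketch returns only the label.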

Task 6: Evaluate the model

Use the Analysis and Table nodes to evaluate the model.

From the Outputs section, drag the Analysis node onto the canvas.
Connect the Model nugget to the Analysis node.
Right-click the Analysis node, and select Run.
From the Outputs panel, open the Analysis, which shows that the model correctly predicted a kidney disease diagnosis almost 95% of the time. Close the Analysis.
Right-click the Analysis node, and select Save branch as a model.
1. For the Model name, type Kidney Disease Analysis.
2. Click Save.
From the Outputs section, drag the Table node onto the canvas.
1. Connect the Model nugget to the Table node.
2. Right-click the Table node, and select Preview.
3. When the Preview displays, scroll to the last two columns. The $C-Class column contains the prediction of kidney disease, and the $CC-Class column indicates the confidence score for that prediction.
4. Close the Preview.
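
The accuracy figure that the Analysis node reports is, in essence, the share of records where the predicted class matches the actual class. A minimal Python sketch of that calculation follows. The function name and the four sample rows are made up for illustration; the column names follow the Table preview described above, and the label notckd is an assumption.

```python
def accuracy(rows, actual="class", predicted="$C-Class"):
    """Fraction of rows whose predicted label equals the actual label."""
    correct = sum(1 for r in rows if r[actual] == r[predicted])
    return correct / len(rows)

# Made-up rows for illustration only; not output from the tutorial flow.
rows = [
    {"class": "ckd", "$C-Class": "ckd", "$CC-Class": 0.97},
    {"class": "notckd", "$C-Class": "notckd", "$CC-Class": 0.89},
    {"class": "ckd", "$C-Class": "notckd", "$CC-Class": 0.62},
    {"class": "notckd", "$C-Class": "notckd", "$CC-Class": 0.91},
]

print(f"accuracy = {accuracy(rows):.2f}")
```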

Task 7: Deploy and test the model with new data

Lastly, you can deploy this model and predict the outcome with new data.

Return to the project's Assets tab.
Scroll to the Models section, and open the Kidney Disease Analysis model.
Click Promote to deployment space.
Choose an existing deployment space. If you don't have a deployment space, you can create a new one:
1. Provide a space name.
2. Click Create.
3. Click Close.
Click the deployment space link that appears or use the navigation menu to navigate to Deployments and select your deployment space.
Hover over the model and click the rocket icon to deploy the model.
1. Select Online as the Deployment type.
2. Specify a name for the deployment.
3. Click Create.
Go to the Deployments tab and wait for the model to be deployed.
When the deployment is complete, click the deployment name to view the deployment details page.
Go to the Test tab. You can test the deployed model from the deployment details page in two ways: test with a form or test with JSON code.

Click the icon to Provide input data as JSON, then copy the following test data and paste it in the area for the JSON text:

{"input_data":[{"fields":["age","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot","hemo","pcv","wbcc","rbcc","htn","dm","cad","appet","pe","ane","class"], "values":[["62","80","1.01","2","3","normal","normal","notpresent","notpresent","423","53","1.8","","","9.6","31","7500","","no","yes","no","poor","no","yes","ckd"]]}]}

Click Predict to predict whether a 62-year-old with diabetes and a serum creatinine value of 1.8 would likely be diagnosed with kidney disease. The resulting prediction indicates that this patient has a high probability of a kidney disease diagnosis.
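
If you prefer to assemble the test payload in code rather than pasting it, the JSON structure shown above can be built from a simple mapping of field names to values. The following Python sketch uses only the standard library. The function names are hypothetical, and scoring_url and token are placeholders that you would replace with the endpoint shown on your deployment details page and a valid bearer token. The record here lists only three fields for brevity, whereas the test data above supplies all of the fields.

```python
import json
import urllib.request

def build_payload(record):
    """Wrap one record in the input_data structure used by the test data above."""
    fields = list(record.keys())
    values = [[record[f] for f in fields]]
    return {"input_data": [{"fields": fields, "values": values}]}

def score(scoring_url, token, payload):
    """POST the payload to a scoring endpoint (URL and token are placeholders)."""
    req = urllib.request.Request(
        scoring_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Subset of the tutorial's test record, shown for brevity.
record = {"age": "62", "sc": "1.8", "dm": "yes"}
payload = build_payload(record)
print(json.dumps(payload))
```

The score function is defined but not called here, because it needs a live deployment and credentials.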

Next steps

Now you can use this data set for further analysis. For example, you can perform tasks such as:

Additional resources

Find more SPSS Modeler tutorials
View videos about machine learning
Try these additional tutorials to get more hands-on experience with building models in notebooks and using AutoAI:
- Build models using Jupyter notebooks
- Automate model building in Watson Studio