IBM Business Analytics Proven Practices: Introduction to Python Scripting in IBM SPSS Modeler

Product(s): IBM SPSS Modeler; Area of Interest: Scripting

This document provides an introduction to Python scripting in IBM SPSS Modeler.

Julian Clinton, Software Engineer, IBM

Julian Clinton works for IBM on the SPSS branded products and has over 25 years of experience in software development. He has worked on designing and developing SPSS Modeler for over 15 years and has also worked on Analytical Decision Management, primarily on CPLEX optimization integration. He is also co-author of CRISP-DM, the de facto standard process for data mining.



16 April 2014

Introduction

Purpose of Document

This document provides an introduction to Python scripting in IBM SPSS Modeler.

Applicability

The examples outlined in this document require IBM SPSS Modeler 16 or later. The examples were tested using IBM SPSS Modeler 16 which uses Jython 2.5.1 to provide the Python language support.

Exclusions and Exceptions

This document provides an overview of the IBM SPSS Modeler 16 scripting features but is not intended to be a complete description of all available functionality.

Assumptions

This document assumes readers have experience with IBM SPSS Modeler. Ideally readers should also have some experience of the legacy Modeler scripting language and/or some experience of Python.


Overview

IBM SPSS Modeler has included scripting support for many years as a way of providing advanced control over the configuration and execution of streams, and the management of execution results. From Modeler 16, Python has become the default language for scripting, replacing the bespoke scripting language (although that is still available for compatibility reasons).

This document provides an overview of Python scripting in Modeler 16 by developing a basic stream script into a more sophisticated example. The document covers both the scripting Application Programmer Interface (API) and the scripting environment within the Modeler User Interface (UI).


Script Types

Modeler supports three types of script:

  • stream scripts: these are used to control execution of a single stream and are stored within the stream
  • supernode scripts: these are used to control the behaviour of supernodes and are stored within the supernode
  • standalone (or session) scripts: these can be used to co-ordinate execution across a number of different streams and are stored as text files

In this tutorial, we will convert the standard "druglearn" example stream to use a stream script for execution and then extend the functionality of that script to perform additional tasks.


Step 1: Converting The "druglearn" Stream To Use Script Execution

We first need to open the "druglearn" example stream.

  1. From the main Modeler window "File" menu, select "Open Stream..." and navigate to the "Demos/streams" folder under the Modeler installation folder.
  2. Locate and select "druglearn.str", then press OK.

Within the Modeler UI, you should now see the stream shown in Figure 1.

Figure 1: The 'druglearn' Stream

We now want to change the stream to use a script rather than directly executing the terminal nodes.

  1. From the Modeler main window "Tools" menu, locate the "Stream Properties" sub-menu and select "Execution...". You should then see a window containing the script area and an output area containing "Messages" and "Debug" sub-tabs as shown in Figure 2.
    Figure 2: The Execution Tab
  2. Ensure the script language is set to Python by setting the radio control above the script area. Because the "druglearn" stream was created in a version of Modeler before version 16, it will have defaulted to using the legacy scripting language.
  3. Generate a default script to execute the terminal nodes in the stream. To do this, locate the button to generate the default script in the button bar above the script area (shown highlighted in Figure 3).
    Figure 3: The 'Insert Default Script' Button
    Click it and you will see the default script has been added to the script area as shown in Figure 4.
    Figure 4: The Default Script Inserted In The Script Area
  4. To ensure the script is used as the standard execution method when the stream is executed, find the "On stream execution" control at the bottom of the "Execution" tab and select "Run this script". Then click the dialog's "OK" button to commit the changes and dismiss the dialog.

We can now run the stream using the "Tools" menu "Run" option. The script will be executed and the new models will be produced as before.

You may want to save the stream with a different name via the "File" -> "Save As..." option to preserve what you have achieved so far.


Looking At The Script

Before going further, it's worth reviewing the script.

diagram = modeler.script.diagram()

This stores the diagram that the script is associated with in a local variable called "diagram". Note that the "modeler.script" module is automatically imported into any Modeler script and provides functions that allow scripts to access their operating context. For a stream script, the diagram() function returns the stream that the stream script is being executed in.

diagram.findByID("id5IKMPB9VL87").run(None)	# "Drug":c50
diagram.findByID("id8Z7NXVQQMR8").run(None)	# "Drug":neuralnetwork

These two lines search the diagram for the nodes with the specified IDs. In this case the nodes are the two model builders at the end of the stream.

Rather than searching by ID, we could search by node type e.g.:

diagram.findByType("c50", None).run(None)
diagram.findByType("neuralnetwork", None).run(None)

The first parameter takes the type of node to be located while the second takes the node label. Note: be careful when searching using the node label. Nodes such as source nodes, model builders and graph nodes allow auto-labelling which means the labels can change when values in the node change. Another consideration is if a script needs to run in multiple locales since the default node labels may be translated differently. In these situations either search using the specific node ID or specify a custom label via the "setLabel()" function e.g.:

node.setLabel("Node1")

The nodes are then executed using the "run()" function.
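
The difference between the lookup styles can be tried outside Modeler with a small stand-in. The sketch below is not the Modeler API: StubNode and StubDiagram are hypothetical classes that only mimic the "findByID", "findByType" and "setLabel" calls used above, and it assumes that a None label means "match any label" and that a failed search returns None. It shows a label-based search breaking when a label changes, while searches by ID, by type or by a custom label keep working.

```python
class StubNode(object):
    """Hypothetical stand-in for a node: an ID, a type name and a label."""
    def __init__(self, node_id, type_name, label):
        self.node_id = node_id
        self.type_name = type_name
        self.label = label

    def setLabel(self, label):
        self.label = label


class StubDiagram(object):
    """Hypothetical stand-in for a diagram holding a list of nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def findByID(self, node_id):
        for node in self.nodes:
            if node.node_id == node_id:
                return node
        return None     # assumption: no match yields None

    def findByType(self, type_name, label):
        # Assumption: a label of None means "do not filter on the label"
        for node in self.nodes:
            label_ok = label is None or node.label == label
            if node.type_name == type_name and label_ok:
                return node
        return None


builder = StubNode("id5IKMPB9VL87", "c50", "Drug")
diagram = StubDiagram([builder])

found_by_label = diagram.findByType("c50", "Drug")   # matches for now
builder.label = "Drug 2"                             # simulate the label changing
lost_by_label = diagram.findByType("c50", "Drug")    # no longer matches
found_by_id = diagram.findByID("id5IKMPB9VL87")      # unaffected by the label
found_by_type = diagram.findByType("c50", None)      # unaffected by the label

builder.setLabel("Node1")                            # script-defined custom label
found_by_custom = diagram.findByType("c50", "Node1")
```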


Using The Debug Tab

The "Debug" sub-tab allows you to query the state of a script that has been run via the "Run Script" button highlighted in Figure 5.

Figure 5: The 'Run Script' Button

If you run the default script using the button rather than executing the stream via the main window, the environment of the script (i.e. the objects and variables created during that execution) is maintained and can be queried using the "Debug" tab's evaluation command line shown in Figure 6.

Figure 6: The Evaluation Command Line

For example, the default script creates a new variable called "diagram". We can print the value of that variable by typing a Python command into the command line:

print diagram

and then pressing the ENTER key. The command is echoed in the text area above the command line (prefixed with the ">>>" prompt) along with the result of evaluating the command as shown in Figure 7.

Figure 7: The Results Of Evaluating An Expression

We can continue to evaluate Python expressions to help debug the script or to check that additions to the script will work as expected.

Note that the command line maintains a command history so by pressing the up or down keys, you can access and edit previous commands.


Other Tools For Building Scripts

Before going on to modify the script, it's worth covering a few additional features that can help when developing scripts.

Find/Replace

The Find/Replace dialog allows you to search and optionally replace text items. It can be invoked via the "Edit Options" menu highlighted in Figure 8.

Figure 8: The ‘Edit Options’ Menu Button

or by the CTRL+F keyboard shortcut.

If you have already specified some text to be searched for, pressing F3 will find the next occurrence of that text.

Auto-Suggest

Auto-suggest provides a quick way of looking up possible function names and syntax. By typing the first few letters and then pressing CTRL+SPACE, a popup menu will appear with possible completions based on the characters typed. For example, if you want to get a list of the functions beginning with "find", you can type "find" then press CTRL+SPACE and the popup menu shown in Figure 9 will appear.

Figure 9: The Auto-suggest Popup Menu

You can then use the cursor keys to move up and down the list. When you have chosen the correct one, press ENTER. Alternatively, you can dismiss the popup by pressing the ESC key.

Note that the auto-suggest popup displays any commands and functions starting with the specified characters even if those commands are not valid for the specific object in the script. The name after the hyphen ("-") specifies the type of object that the function is defined for to help you identify relevant functions.

Block Indent/Unindent

In Python, indentation is significant in defining the scope of code blocks. The script area provides a way of moving selected lines of text to increase or decrease the indentation. You can select a block of Python text so that it is highlighted as shown in Figure 10.

Figure 10: Highlighting A Block Of Text To Indent

If you then press TAB, this will increase the indentation as shown in Figure 11.

Figure 11: After Indenting The Text

Similarly, pressing SHIFT+TAB will decrease the indentation.

Toggle Comments

Sometimes it is useful to replace or disable some code temporarily. This can be achieved quickly by selecting the relevant lines as shown in Figure 12.

Figure 12: Highlighting A Block Of Text To Comment Out

You can then use the CTRL+T shortcut to toggle comments and get the results shown in Figure 13.

Figure 13: After Commenting Out The Text Block

Modifying The Syntax Highlighting Behavior

If you wish to change the way script and CLEM expressions are displayed, find the main window "Tools" menu and on the "Options" sub-menu, select "User Options...". On the "Syntax" tab you can modify the colors and font styles used by the syntax highlighter.

Figure 14: Syntax Highlighting Preferences

The tab includes a preview which shows how your changes will affect the display.


Step 2: Scoring Data With The Built Model

Earlier we generated a default script for the "druglearn" example stream. Since this is the default script, it doesn't provide any additional functionality over non-script execution, i.e. it simply executes each terminal node. The next step is to capture the models that were built, create model appliers for those model outputs and then score the data into a display table.

Objects that are built via stream execution can be captured in a list passed to the "run" function. For example:

c50node = diagram.findByType("c50", None)
results = []
c50node.run(results)

Assuming execution completed correctly, results will now contain the model output object created by the algorithm. A corresponding model applier node can be added to the stream, connected to the model builder's predecessor and then a table node can be connected to the model applier.

Python includes a "len" function which returns the length of a sequence such as a list. The script can use it to check that some results were produced:

if len(results) > 0:
    ...
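
The capture-and-check pattern can be wrapped in a small helper. This is a sketch only: StubBuilder is a hypothetical stand-in whose "run" function behaves as described above (it appends what it builds to the list it is given), the "model-output" string is a placeholder for a real model output object, and first_result is a helper name introduced here for illustration.

```python
class StubBuilder(object):
    """Hypothetical stand-in for a model builder node."""
    def __init__(self, succeeds=True):
        self.succeeds = succeeds

    def run(self, results):
        # Append the built object only when the caller supplied a list
        if self.succeeds and results is not None:
            results.append("model-output")


def first_result(node):
    """Run the node and return its first output, or None if there was none."""
    results = []
    node.run(results)
    if len(results) > 0:
        return results[0]
    return None


built = first_result(StubBuilder())           # a result was captured
nothing = first_result(StubBuilder(False))    # nothing was produced
```

Returning None when the list is empty lets the calling code decide how to react, instead of failing on an index lookup into an empty list.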

Note: the level of indentation of text in Python is significant in determining the scope of a block of code. The Modeler script area indents code using the space character so if code is copied from another source, ensure that indentation also uses spaces rather than, say, tabs. If the copied code uses tabs, a Python exception will be thrown when the script is executed, even though the pasted code appears to be at the same indentation level.
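
One way to check copied code before pasting it is to test it in a standard Python interpreter. The sketch below assumes a four-space indent, matching the examples in this document. The names has_tabs, tabs_to_spaces and compiles are helpers introduced here, and because the exact exception raised for mixed indentation may differ between Python implementations, the check catches the general SyntaxError.

```python
# A made-up pasted snippet: the first body line is indented with
# spaces, the second with a tab character.
pasted = "if True:\n    x = 1\n\ty = 2\n"

def has_tabs(source):
    """Return True if the source contains a tab character anywhere."""
    return "\t" in source

def tabs_to_spaces(source, width=4):
    """Expand each tab to spaces up to the next multiple of `width`."""
    return source.expandtabs(width)

def compiles(source):
    """Return True if Python accepts the source, False on a syntax error."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:
        # Indentation errors are reported as subclasses of SyntaxError
        return False

cleaned = tabs_to_spaces(pasted)
```

In this example the pasted text is rejected while the cleaned text compiles, even though both look identical in an editor that displays tabs as four columns.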

It can be useful to ensure that nodes are created in a position that allows them to be seen easily when the stream is being viewed in the Modeler UI. To do this, the new model applier node will be positioned below the model builder. We can use the node's getXPosition() and getYPosition() functions to find out the builder's position and then offset the model applier by a certain amount:

if len(results) > 0:
    x = c50node.getXPosition()
    y = c50node.getYPosition()
    label = c50node.getLabel()
    applyc50 = diagram.createModelApplierAt(results[0], label, x, y+96)

We can then attach the model applier to the model builder's predecessor. Since the majority of non-source nodes can only be connected to a single input, we can use the diagram's "predecessorAt" function to access the predecessor node:

    pred = diagram.predecessorAt(c50node, 0)

We can then connect the model applier to it:

    diagram.link(pred, applyc50)

Finally we will create a new table node attached to the model applier, positioning it just after the model applier:

    y = applyc50.getYPosition()
    tablenode = diagram.createAt("table", "ScoreOutput", x+96, y)
    diagram.link(applyc50, tablenode)

The whole script looks like this:

diagram = modeler.script.diagram()
c50node = diagram.findByType("c50", None)
results = []
c50node.run(results)
if len(results) > 0:
    x = c50node.getXPosition()
    y = c50node.getYPosition()
    label = c50node.getLabel()
    applyc50 = diagram.createModelApplierAt(results[0], label, x, y+96)
    pred = diagram.predecessorAt(c50node, 0)
    diagram.link(pred, applyc50)
    y = applyc50.getYPosition()
    tablenode = diagram.createAt("table", "ScoreOutput", x+96, y)
    diagram.link(applyc50, tablenode)

After the script has been run, the resulting stream should look like Figure 15.

Figure 15: The Stream After Script Execution

Step 3: Aggregating The Predicted Values

In this example, we will focus on the C5.0 builder node. To begin with, we will delete the Neural Network node and move the C5.0 builder node after the Type node using the UI.

We are then going to modify the script so that the stream aggregates the mean, minimum and maximum confidences for each of the predicted values and sorts the results into a predictable order.

The stream we are aiming to produce is shown in Figure 16.

Figure 16: The Target Stream

Since we're going to create a sequence of nodes connected and positioned after each other, it makes sense to define functions to do that in order to make the script smaller and easier to maintain.

First we will define utility functions that provide a more concise way of looking up a node's predecessor, and for positioning and linking a node either below or after another node. The predecessor function looks like this:

def predecessor(node):
    return node.getProcessorDiagram().predecessorAt(node, 0)

As we saw previously, the predecessor node is accessed via the diagram that owns the node, and this can be accessed using the node's "getProcessorDiagram" function.

Next we will define a function that positions a node at an offset relative to some other node and links the nodes together:

def addRelative(pred, node, xdiff, ydiff):
    x = pred.getXPosition()
    y = pred.getYPosition()
    node.setXYPosition(x+xdiff, y+ydiff)
    pred.getProcessorDiagram().link(pred, node)

Then we create a couple of convenience functions that add nodes either after or below a node:

def addAfter(pred, node):
    addRelative(pred, node, 96, 0)

def addBelow(pred, node):
    addRelative(pred, node, 0, 96)
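
The helpers rely on only a handful of node and diagram functions, so their effect on positions and links can be checked outside Modeler. In the sketch below, StubDiagram and StubNode are hypothetical stand-ins that mimic just those functions, and the starting coordinates are made up. Inside Modeler the real diagram and node objects take their place.

```python
class StubDiagram(object):
    """Hypothetical stand-in for a diagram; it only records links."""
    def __init__(self):
        self.links = []          # list of (source, target) pairs

    def link(self, source, target):
        self.links.append((source, target))

    def predecessorAt(self, node, index):
        # Collect the sources linked into `node`, in link order
        sources = [s for (s, t) in self.links if t is node]
        return sources[index]


class StubNode(object):
    """Hypothetical stand-in for a node with a canvas position."""
    def __init__(self, diagram, x=0, y=0):
        self._diagram = diagram
        self._x = x
        self._y = y

    def getProcessorDiagram(self):
        return self._diagram

    def getXPosition(self):
        return self._x

    def getYPosition(self):
        return self._y

    def setXYPosition(self, x, y):
        self._x = x
        self._y = y


# The helper functions from the text, unchanged
def predecessor(node):
    return node.getProcessorDiagram().predecessorAt(node, 0)

def addRelative(pred, node, xdiff, ydiff):
    x = pred.getXPosition()
    y = pred.getYPosition()
    node.setXYPosition(x+xdiff, y+ydiff)
    pred.getProcessorDiagram().link(pred, node)

def addAfter(pred, node):
    addRelative(pred, node, 96, 0)

def addBelow(pred, node):
    addRelative(pred, node, 0, 96)


diagram = StubDiagram()
typenode = StubNode(diagram, 192, 96)   # made-up starting position
applier = StubNode(diagram)
table = StubNode(diagram)

addBelow(typenode, applier)   # applier is placed directly below the type node
addAfter(applier, table)      # table is placed to the right of the applier
```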

We now start on the main part of the script. The script that controls the initial model building and adding the model applier is largely the same as before except that we assign the model output to a variable and use the "predecessor" and "addBelow" functions we just defined:

diagram = modeler.script.diagram()
c50node = diagram.findByType("c50", None)
results = []
c50node.run(results)
if len(results) > 0:
    c50model = results[0]
    pred = predecessor(c50node)
    label = c50node.getLabel()
    applyc50 = diagram.createModelApplier(c50model, label)
    addBelow(pred, applyc50)

Note also that we are using "createModelApplier" rather than "createModelApplierAt" because the node position will be set by the "addBelow" function.

Next we will add the aggregate node which will be used to compute the mean, minimum and maximum confidences for each value the model is predicting. For C5.0 models, the predicted value will be in a field called "$C-Drug" and the confidence will be a field called "$CC-Drug":

    # Create the aggregate node.
    aggregate = diagram.create("aggregate", "Aggregate")
    # Group by the predicted values
    aggregate.setPropertyValue("keys", ["$C-Drug"])
    # Aggregate the confidence to give Mean, Min and Max
    aggregate.setKeyedPropertyValue("aggregates", "$CC-Drug", ["Mean", "Min", "Max"])
    # We don't need the record count column
    aggregate.setPropertyValue("inc_record_count", False)

We then add it after the model applier:

    addAfter(applyc50, aggregate)

Next we insert a sort node. This sorts the predicted values to ensure a consistent output order. While it isn't strictly necessary, it does help keep the ordering consistent across different data sets:

    # Create the sort node
    sort = diagram.create("sort", "Sort")
    # Sort the data by the predicted value
    sort.setPropertyValue("keys", [["$C-Drug", "Ascending"]])
    addAfter(aggregate, sort)
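
For readers more familiar with Python than with Modeler nodes, the calculation performed by the aggregate and sort nodes is roughly equivalent to the following plain Python. This is an illustration of the arithmetic only, using made-up records. The name aggregate_confidence is introduced here, and Modeler does not run this code.

```python
# Made-up (predicted value, confidence) pairs standing in for the
# "$C-Drug" and "$CC-Drug" fields produced by the model applier.
scored = [
    ("drugY", 0.95), ("drugX", 0.80), ("drugY", 0.85),
    ("drugA", 0.90), ("drugX", 1.00),
]

def aggregate_confidence(records):
    """Group confidences by predicted value and return rows of
    (value, mean, minimum, maximum) sorted by value."""
    groups = {}
    for value, confidence in records:
        groups.setdefault(value, []).append(confidence)
    rows = []
    for value in sorted(groups):            # ascending sort on the key
        confs = groups[value]
        rows.append((value, sum(confs) / len(confs), min(confs), max(confs)))
    return rows

rows = aggregate_confidence(scored)
```

Each row has the key first, followed by the mean, minimum and maximum confidence, and the rows come back in ascending key order regardless of the order of the input records.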

Then we add the table output:

    # Finally create the table node and run it
    tablenode = diagram.create("table", "AggregateOutput")
    addAfter(sort, tablenode)
    tablenode.run(None)

When we run the script, we get the table output object shown in Figure 17.

Figure 17: Table Output Showing The Aggregated Values

The whole script is shown below:

def predecessor(node):
    return node.getProcessorDiagram().predecessorAt(node, 0)

def addRelative(pred, node, xdiff, ydiff):
    x = pred.getXPosition()
    y = pred.getYPosition()
    node.setXYPosition(x+xdiff, y+ydiff)
    pred.getProcessorDiagram().link(pred, node)

def addAfter(pred, node):
    addRelative(pred, node, 96, 0)

def addBelow(pred, node):
    addRelative(pred, node, 0, 96)

diagram = modeler.script.diagram()
c50node = diagram.findByType("c50", None)
results = []
c50node.run(results)
if len(results) > 0:
    c50model = results[0]
    pred = predecessor(c50node)
    label = c50node.getLabel()
    applyc50 = diagram.createModelApplier(c50model, label)
    addBelow(pred, applyc50)
    # Create the aggregate node.
    aggregate = diagram.create("aggregate", "Aggregate")
    # Group by the predicted values
    aggregate.setPropertyValue("keys", ["$C-Drug"])
    # Aggregate the confidence to give Mean, Min and Max
    aggregate.setKeyedPropertyValue("aggregates", "$CC-Drug", ["Mean", "Min", "Max"])
    # We don't need the record count column
    aggregate.setPropertyValue("inc_record_count", False)
    addAfter(applyc50, aggregate)

    # Create the sort node
    sort = diagram.create("sort", "Sort")
    # Sort the data by the predicted value
    sort.setPropertyValue("keys", [["$C-Drug", "Ascending"]])
    addAfter(aggregate, sort)

    # Finally create the table node and run it
    tablenode = diagram.create("table", "AggregateOutput")
    addAfter(sort, tablenode)
    tablenode.run(None)

Step 4: Validating The Model

Having generated the table showing the various aggregation metrics, we will update the script to iterate through the values in the table to determine whether the various aggregated metrics meet our quality thresholds.

The first step is to capture the table output in the same way that we captured the original model, i.e. by passing a results list to the table node's "run" function. To do this, replace:

    tablenode.run(None)

with:

    results = []
    tablenode.run(results)

The result is a table output object. Table output objects include a row set data model that we can use to access the different values in the table. From the configuration of the aggregate node, we know that the table will contain four columns:

  • the predicted value (the aggregation key field)
  • the mean confidence
  • the minimum confidence
  • the maximum confidence

We also know that the simple data set we use has five distinct values that can be predicted. However, we can't guarantee that the data set will necessarily contain cases with all these values so we need to use the number of rows from the table output to limit the loop:

    results = []
    tablenode.run(results)
    # Now process the stream outputs
    tableoutput = results[0]    # assume a table was generated
    rowset = tableoutput.getRowSet()
    rowcount = rowset.getRowCount()

Note that the row set is stored as a distinct object within the table output.

Next we iterate through rows in the data set, extracting the values required.

    isvalid = True
    row = 0
    while row < rowcount:
        predicted = rowset.getValueAt(row, 0)
        meanval = rowset.getValueAt(row, 1)
        minval = rowset.getValueAt(row, 2)
        maxval = rowset.getValueAt(row, 3)

If the values do not meet our validation requirements then the script sets an indicator flag and exits the loop immediately:

        # Arbitrarily decide that the mean must be
        # greater than 0.85, otherwise we discard the model
        if meanval > 0.85:
            row += 1
        else:
            isvalid = False
            break

After the loop, the script should check the status flag and if the model is valid then the model gets saved:

    # If the model is valid, save it to file
    if isvalid:
        taskrunner = modeler.script.session().getTaskRunner()
        path = "c:/temp/models/" + c50model.getLabel() + ".gm"
        taskrunner.saveModelToFile(c50model, path)
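
The save step assumes that the target folder already exists. This document does not say how "saveModelToFile" behaves when the folder is missing, so a defensive script might create it first. The sketch below uses only the standard os and tempfile modules. The function model_path is a hypothetical helper, and a temporary folder with a made-up label stands in for the real location and model label.

```python
import os
import tempfile

def model_path(folder, label, extension=".gm"):
    """Create `folder` if it is missing and return the model file path."""
    if not os.path.isdir(folder):
        os.makedirs(folder)
    return os.path.join(folder, label + extension)

# Demonstration with a temporary folder and a made-up label
base = tempfile.mkdtemp()
path = model_path(os.path.join(base, "models"), "Drug")
```

Within the script, the returned path could then be passed as the second argument to "saveModelToFile".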

Using the standard Modeler DRUG1n demo data set, the mean confidence for a couple of the predicted values does not exceed 0.85, so the model will not be saved. If you want to ensure that the model gets saved, you can lower the threshold to, say, 0.80.
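
The validation loop can also be written as a function that returns as soon as a row fails, which avoids the separate flag and row counter. The sketch below is an alternative formulation, not the script used in this tutorial. StubRowSet is a hypothetical stand-in providing "getRowCount" and "getValueAt", means_pass is a name introduced here, and the confidence values are made up to show the effect of two different thresholds.

```python
class StubRowSet(object):
    """Hypothetical stand-in offering the two row set calls the script uses."""
    def __init__(self, rows):
        self._rows = rows

    def getRowCount(self):
        return len(self._rows)

    def getValueAt(self, row, column):
        return self._rows[row][column]


def means_pass(rowset, threshold):
    """Return True only if every row's mean confidence (column 1)
    is strictly greater than the threshold."""
    for row in range(rowset.getRowCount()):
        if rowset.getValueAt(row, 1) <= threshold:
            return False        # stop at the first failing row
    return True


# Made-up rows: predicted value, mean, minimum, maximum confidence
rowset = StubRowSet([
    ("drugA", 0.91, 0.75, 1.00),
    ("drugX", 0.83, 0.60, 1.00),
    ("drugY", 0.97, 0.88, 1.00),
])

strict = means_pass(rowset, 0.85)    # one mean is below 0.85
relaxed = means_pass(rowset, 0.80)   # every mean is above 0.80
```

With these sample values the stricter threshold rejects the model and the relaxed one accepts it. An empty row set passes trivially, which matches the behaviour of the original loop.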

The whole script is shown below:

def predecessor(node):
    return node.getProcessorDiagram().predecessorAt(node, 0)

def addRelative(pred, node, xdiff, ydiff):
    x = pred.getXPosition()
    y = pred.getYPosition()
    node.setXYPosition(x+xdiff, y+ydiff)
    pred.getProcessorDiagram().link(pred, node)

def addAfter(pred, node):
    addRelative(pred, node, 96, 0)

def addBelow(pred, node):
    addRelative(pred, node, 0, 96)

diagram = modeler.script.diagram()
c50node = diagram.findByType("c50", None)
results = []
c50node.run(results)
if len(results) > 0:
    c50model = results[0]
    pred = predecessor(c50node)
    label = c50node.getLabel()
    applyc50 = diagram.createModelApplier(c50model, label)
    addBelow(pred, applyc50)
    # Create the aggregate node.
    aggregate = diagram.create("aggregate", "Aggregate")
    # Group by the predicted values
    aggregate.setPropertyValue("keys", ["$C-Drug"])
    # Aggregate the confidence to give Mean, Min and Max
    aggregate.setKeyedPropertyValue("aggregates", "$CC-Drug", ["Mean", "Min", "Max"])
    # We don't need the record count column
    aggregate.setPropertyValue("inc_record_count", False)
    addAfter(applyc50, aggregate)

    # Create the sort node
    sort = diagram.create("sort", "Sort")
    # Sort the data by the predicted value
    sort.setPropertyValue("keys", [["$C-Drug", "Ascending"]])
    addAfter(aggregate, sort)

    # Finally create the table node and run it
    tablenode = diagram.create("table", "AggregateOutput")
    addAfter(sort, tablenode)
    results = []
    tablenode.run(results)
    # Now process the stream outputs
    tableoutput = results[0]    # assume a table was generated
    rowset = tableoutput.getRowSet()
    rowcount = rowset.getRowCount()
    isvalid = True
    row = 0
    while row < rowcount:
        predicted = rowset.getValueAt(row, 0)
        meanval = rowset.getValueAt(row, 1)
        minval = rowset.getValueAt(row, 2)
        maxval = rowset.getValueAt(row, 3)
        # Arbitrarily decide that the mean must be
        # greater than 0.85, otherwise we discard the model
        if meanval > 0.85:
            row += 1
        else:
            isvalid = False
            break

    # If the model is valid, save it to file
    if isvalid:
        taskrunner = modeler.script.session().getTaskRunner()
        path = "c:/temp/models/" + c50model.getLabel() + ".gm"
        taskrunner.saveModelToFile(c50model, path)

Conclusion

This has been a brief overview showing some of the features of Python scripting in Modeler. Hopefully you have a better understanding of both the functions that are available within a script and also about the tools within Modeler that will help you write those scripts.
