Integrating SPSS Model Scoring in InfoSphere Streams, Part 2: Using a generic operator

Mike Koranda (koranda@us.ibm.com), InfoSphere Streams Release Manager, IBM
Mike Koranda is a senior technical staff member in IBM's Software Group and has been working at IBM for more than 30 years. He has been working in the development of the InfoSphere Streams product for the past six years.

Summary:  Part 2 of this "Integrating SPSS Model Scoring in InfoSphere Streams" series shows how to create a generic operator to execute IBM® SPSS Modeler predictive models in an InfoSphere® Streams application. It builds off the work of the non-generic operator produced in Part 1, where we wrote and used an InfoSphere Streams operator to execute a predictive model in an InfoSphere Streams application using the IBM SPSS Modeler Solution Publisher Runtime library API.


Date:  06 Oct 2011
Level:  Intermediate

Overview

Introduction

This tutorial shows how to create a generic operator that executes SPSS predictive models in an InfoSphere Streams application. It builds on the non-generic operator from Part 1, which we wrote and used to execute an IBM SPSS Modeler predictive model in an InfoSphere Streams application through the IBM SPSS Modeler Solution Publisher Runtime library API.


Recap from Part 1

In Part 1, we developed an operator that wrapped a specific predictive model and worked only with the exact schema shown in that example. We also showed how that non-generic operator could be modified to accommodate different models and schemas, but doing so required at least some C++ programming skill. A generic operator that adjusts itself automatically to different inputs and different models lets a Streams application programmer handle the integration entirely, eliminating the work and skill needed to create an individual operator for each model.

Roles and terminology

The roles and terminology are provided in Part 1 and are not repeated here. Our focus here is on the Streams component developer role and how to write a generic operator that can be tailored by a Streams application developer to execute SPSS predictive models. To understand the other roles and the interaction needed for the overall solution, refer to Part 1.


The contract between data analyst and Streams component developer

Recall from Part 1 that in order to write the Streams operator, the Streams component developer needs to know certain information about the inputs and outputs of the predictive model produced by the data analyst. Specifically, the operator developer will require:

  • Install location of Solution Publisher
  • The .pim and .par files produced during the publish
  • The input source node key name. This can be found in the XML fragment:
    <inputDataSources>
        <inputDataSource name="file0" type="Delimited">
    

    NOTE: While there is no technical limitation, our example is limited to supporting a single input source for simplicity.
  • The input field names and storage and their order as found inside the <inputDataSource> tag.

    Listing 1. <inputDataSource> tag
    
    <fields>
      <field storage="string" type="discrete">
        <name>sex</name>
      </field>
      <field storage="integer" type="range">
        <name>income</name>
      </field>
    </fields>

  • The output node or terminal node key name. This can be found in the XML fragment:
    <outputDataSources>
        <outputDataSource name="file3" type="Delimited">
    

    NOTE: While there is no technical limitation, our example is limited to supporting a single output node for simplicity.
  • The output field names and storage and their order as found inside the <outputDataSource> tag.


    Listing 2. <outputDataSource> tag
    
    <fields>
      <field storage="string" type="discrete">
        <name>sex</name>
      </field>
      <field storage="integer" type="range">
        <name>income</name>
      </field>
      <field storage="string" type="flag">
        <name>$C-beer_beans_pizza</name>
        ...
      </field>
      <field storage="real" type="range">
        <name>$CC-beer_beans_pizza</name>
        ...
      </field>
    </fields>
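
The items in this contract can be read from the published XML metadata programmatically. The following is a minimal sketch in Python, not part of the tutorial's operator. It assumes a trimmed-down document that contains only the elements quoted above, wrapped in a hypothetical <metadata> root element, with no XML namespaces and with the elided ("...") content left out. The function name describe_source is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down metadata: only the elements quoted in this
# tutorial, wrapped in an assumed <metadata> root so that the text parses.
# The elided ("...") parts of the real file are deliberately left out.
SAMPLE_XML = """
<metadata>
  <inputDataSources>
    <inputDataSource name="file0" type="Delimited">
      <fields>
        <field storage="string" type="discrete"><name>sex</name></field>
        <field storage="integer" type="range"><name>income</name></field>
      </fields>
    </inputDataSource>
  </inputDataSources>
  <outputDataSources>
    <outputDataSource name="file3" type="Delimited">
      <fields>
        <field storage="string" type="discrete"><name>sex</name></field>
        <field storage="integer" type="range"><name>income</name></field>
        <field storage="string" type="flag"><name>$C-beer_beans_pizza</name></field>
        <field storage="real" type="range"><name>$CC-beer_beans_pizza</name></field>
      </fields>
    </outputDataSource>
  </outputDataSources>
</metadata>
"""


def describe_source(root, tag):
    """Return (key name, [(field name, storage), ...]) for the first <tag> element.

    Fields are returned in document order, which is the order the contract
    asks for.
    """
    source = root.find(f".//{tag}")
    fields = [(f.findtext("name"), f.get("storage")) for f in source.iter("field")]
    return source.get("name"), fields


root = ET.fromstring(SAMPLE_XML)
input_key, input_fields = describe_source(root, "inputDataSource")
output_key, output_fields = describe_source(root, "outputDataSource")
print("input :", input_key, input_fields)
print("output:", output_key, output_fields)
```

Because only the first matching element is read, the sketch shares the single-input, single-output limitation noted above.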

Also recall that to make the sample operator work with a different input tuple/model combination, you needed to adjust it in the following places:

  1. _h.cgt file
    Adjust the output structure
  2. _cpp.cgt file
    1. next_record_back
      1. Adjust the output structure
    2. constructor
      1. Adjust the .pim and .par filename and locations
      2. Adjust the input and output file tags
      3. Adjust the input field pointer array size
    3. process
      1. Adjust load input structure code
      2. Adjust load output tuple code

Next, we describe the design of our generic operator, which uses the model metadata and the operator's parameters to automatically generate the code that had to be adjusted manually in the non-generic operator.
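
To make the idea of metadata-driven generation concrete, here is a small illustrative sketch of how an ordered field list could be turned into an output structure declaration. It is not the tutorial's code generation template, which this section does not show; Python is used only to illustrate the idea. The storage-to-C++ type mapping, the f_ member prefix, the struct name OutputRecord, and the function names are all assumptions made for this example.

```python
import re

# Assumed mapping from model storage types to C++ member types.
# For illustration only; the actual operator may use different types.
CPP_TYPES = {"string": "const char*", "integer": "long long", "real": "double"}


def member_name(field_name):
    """Turn a model field name such as '$C-beer_beans_pizza' into a C++ identifier.

    Runs of non-word characters become a single underscore, leading and
    trailing underscores are dropped, and a hypothetical 'f_' prefix is added.
    Distinct field names could still collide after this step; a real
    generator would have to check for that.
    """
    return "f_" + re.sub(r"\W+", "_", field_name).strip("_")


def output_struct(fields, struct_name="OutputRecord"):
    """Build a C++ struct declaration from ordered (name, storage) pairs."""
    lines = [f"struct {struct_name} {{"]
    for name, storage in fields:
        lines.append(f"    {CPP_TYPES[storage]} {member_name(name)};")
    lines.append("};")
    return "\n".join(lines)


# Output fields in the order given by the model metadata (Listing 2).
OUTPUT_FIELDS = [
    ("sex", "string"),
    ("income", "integer"),
    ("$C-beer_beans_pizza", "string"),
    ("$CC-beer_beans_pizza", "real"),
]

print(output_struct(OUTPUT_FIELDS))
```

With a generator of this kind, a change in the model's output fields changes only the input list, not hand-written C++, which is the kind of manual adjustment the generic operator is meant to remove.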
