Integrating SPSS Model Scoring in InfoSphere Streams, Part 2: Using a generic operator

Part 2 of this "Integrating SPSS Model Scoring in InfoSphere Streams" series shows how to create a generic operator to execute IBM® SPSS Modeler predictive models in an InfoSphere® Streams application. It builds off the work of the non-generic operator produced in Part 1, where we wrote and used an InfoSphere Streams operator to execute a predictive model in an InfoSphere Streams application using the IBM SPSS Modeler Solution Publisher Runtime library API.


Mike Koranda (koranda@us.ibm.com), InfoSphere Streams Release Manager, IBM China

Mike Koranda is a senior technical staff member in IBM's Software Group and has been working at IBM for more than 30 years. He has been working in the development of the InfoSphere Streams product for the past six years.



06 October 2011


Before you start

This tutorial describes how to create a generic operator that can be used from InfoSphere Streams applications to execute SPSS predictive models. It also provides a sample operator that can be used directly with any appropriate SPSS model and a sample Streams application that demonstrates its use.

About this series

InfoSphere Streams is a platform that enables real-time analytics of data in motion. The IBM SPSS family of products provides the ability to build predictive analytic models. This "Integrating SPSS Model Scoring in InfoSphere Streams" series is for Streams developers who need to leverage these powerful predictive models in a real-time scoring environment.

About this tutorial

This tutorial extends the non-generic operator produced in Part 1, which presented a technique that is quite flexible but requires some C++ programming skill to customize.

Objectives

In this tutorial, you learn how the non-generic operator is extended to use the predictive model's XML metadata, allowing an SPSS predictive model to be used in Streams without C++ skills.

Prerequisites

This tutorial is written for Streams component developers and application programmers who have Streams programming language skills and C++ skills. You can use the tutorial as a reference, or examine and execute the samples to see the techniques described. To execute the samples, you should have a general familiarity with using a UNIX® command-line shell and working knowledge of Streams programming.

System requirements

To run the examples, you need a Red Hat Enterprise Linux® box with InfoSphere Streams V2.0 or later and IBM SPSS Modeler Solution Publisher 14.2 fixpack 1, plus the Solution Publisher hot fix, which is scheduled to be available 14 Oct 2011.


Overview

Introduction

This tutorial shows how to create a generic operator to execute SPSS predictive models in an InfoSphere Streams application. It builds off the work of the non-generic operator produced in Part 1, where we wrote and used an InfoSphere Streams operator to execute an IBM SPSS Modeler predictive model in an InfoSphere Streams application using the IBM SPSS Modeler Solution Publisher Runtime library API.

Recap from Part 1

In Part 1, we developed an operator that wrapped a specific predictive model and would only work with the exact schema as shown in the example developed. Recall that we showed how that non-generic operator could be modified to accommodate different models and schemas, but that required at least some C++ programming skill to accomplish the necessary adjustments. Creating a generic operator that can automatically adjust itself to different inputs and different models will allow integration to be totally accomplished by a Streams application programmer, eliminating the work and skill needed to create individual operators for each different model to be used.

Roles and terminology

The roles and terminology are provided in Part 1 and are not repeated here. Our focus here is on the Streams component developer role and how to write a generic operator that can be tailored by a Streams application developer to execute SPSS predictive models. For information about the other roles to understand the work and interaction necessary for the overall solution, refer to Part 1.

The contract between data analyst and Streams component developer

Recall from Part 1 that in order to write the Streams operator, the Streams component developer needs to know certain information about the inputs and outputs of the predictive model produced by the data analyst. Specifically, the operator developer will require:

  • Install location of Solution Publisher
  • The .pim and .par files produced during the publish
  • The input source node key name. This can be found in the XML fragment:
    <inputDataSources>
        <inputDataSource name="file0" type="Delimited">
    NOTE: While there is no technical limitation, our example is limited to supporting a single input source for simplicity.
  • The input field names and storage and their order as found inside the <inputDataSource> tag.
    Listing 1. <inputDataSource> tag
    <fields>
      <field storage="string" type="discrete">
        <name>sex</name>
      </field>
      <field storage="integer" type="range">
        <name>income</name>
      </field>
    </fields>
  • The output node or terminal node key name.
    <outputDataSources>
        <outputDataSource name="file3" type="Delimited">
    NOTE: While there is no technical limitation, our example is limited to supporting a single output node for simplicity.
  • The output field names and storage and their order as found inside the <outputDataSource> tag.
    Listing 2. <outputDataSource> tag
    <fields>
     <field storage="string" type="discrete">
       <name>sex</name>
     </field>
     <field storage="integer" type="range">
      <name>income</name>
     </field>
     <field storage="string" type="flag">
      <name>$C-beer_beans_pizza</name>
    ...
     </field>
     <field storage="real" type="range">
      <name>$CC-beer_beans_pizza</name>
    ...
     </field>
    </fields>

Also recall that to adjust the sample operator to work for a different input tuple/model combination, you needed to adjust it in the following spots:

  1. _h.cgt file
    Adjust the output structure
  2. _cpp.cgt file
    1. next_record_back
      1. Adjust the output structure
    2. constructor
      1. Adjust the .pim and .par filename and locations
      2. Adjust the input and output file tags
      3. Adjust the input field pointer array size
    3. process
      1. Adjust load input structure code
      2. Adjust load output tuple code

Next, we describe the design of our generic operator, which uses the metadata and the operator's parameters to automatically generate the code that needed manual adjustment in the non-generic operator.


Designing the operator

Designing the Streams generic scoring operator

At a conceptual level, the generic operator needs to provide parameters that a Streams application developer can use to tailor the operator's use with different predictive models with differing input and output tuple formats. In this section, we describe the parameters our generic operator provides and the design decisions that led to this set of parameters.

Our generic operator design starts with the decision to continue to only support a single input port and a single output port.

Recall that the .pim and .par filenames in our non-generic operator were hard coded in the .cgt file. To make our generic operator able to accept any user-specified model files, these need to be passed into the operator. Parameters will be used to allow their specification similar to how the SPInstall location was handled in the non-generic operator.

Recall that much of the information a Streams component developer needed to build the non-generic operator was obtained by examining the predictive model's XML metadata file. This file will be programmatically read and used to dynamically generate the necessary operator code when an application using the generic operator is compiled. We will use a parameter similar to those used for the .pim and .par files to allow the XML metadata file specification.

To allow a Streams application developer to use the generic operator, a key requirement is the ability to specify which input tuple data is used to feed the input fields of the model. In our non-generic operator, we hard-coded this and required that the tuple's attributes be compatible with the format expected by the model's input fields. For our generic operator, we need to allow the application developer to indicate how to populate these input fields from the appropriate input data. To accomplish this, we define two parameters, each a list. The first list contains SPL expressions that indicate the input values, derived from the input tuple data, to be used to populate the model's input fields. The second list indicates which model input field is populated from the corresponding expression in the first list.

By allowing SPL expressions, we allow greater flexibility in specifying the data to be used. In the simplest case, the expression is a tuple attribute in the proper format, the same as was used in the non-generic operator. But it could also be a simple type conversion from the input tuple data available to the data type needed by the model, a complex expression that uses several attributes to derive the needed input field value, or a literal value for cases where the model requires data that is not in the input tuple and a default value will suffice. While we could have gotten by with a single list and required the application developer to ensure that the expressions were listed in the exact order needed for the model, we thought that to be too error-prone. Instead, we chose to require the second parameter, which explicitly names the model input field that each expression populates.
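The benefit of the second list can be illustrated outside the code-generation templates. The following is a minimal C++ sketch, not the operator's actual code (which is Perl that runs at compile time). It assumes a hypothetical helper, resolveInputOrder, that takes the field order from the model's XML metadata and the modelFields list written by the application developer, and returns, for each model input field, the position of the matching streamAttributes expression.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the sample): for each model input field,
// in the order the model expects, return the index of the entry in
// modelFields (and therefore in streamAttributes) that feeds it.
std::vector<std::size_t> resolveInputOrder(
    const std::vector<std::string>& modelOrder,     // order from the XML metadata
    const std::vector<std::string>& modelFields) {  // order the developer wrote
  if (modelOrder.size() != modelFields.size()) {
    throw std::invalid_argument("modelFields must name every model input field");
  }
  std::vector<std::size_t> order;
  order.reserve(modelOrder.size());
  for (const std::string& field : modelOrder) {
    std::size_t index = modelFields.size();  // sentinel meaning "not found"
    for (std::size_t i = 0; i < modelFields.size(); ++i) {
      if (modelFields[i] == field) {
        index = i;
        break;
      }
    }
    if (index == modelFields.size()) {
      throw std::invalid_argument("model input field not listed: " + field);
    }
    order.push_back(index);
  }
  return order;
}
```

With this approach, an application developer can list the fields in any order. If the model expects sex followed by income but the developer writes the lists with income first, the lookup still pairs each expression with the right model field.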

A similar set of information is needed to map from the model's output fields to the output tuple. We felt the most natural Streams implementation would be to specify this through the SPL output clause and the use of a custom output function in the output port of the operator. By doing this, we allow the application developer to choose which output fields are used and which can be ignored. We chose to allow both a basic assign function to populate the data in the output tuple and one that allows for the specification of a default value in cases where the model doesn't produce an output (missing data) for a given set of inputs. This output value can take any SPL expression so it could be built from existing attributes or be a literal value.
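The intended behavior of the two assignment variations can be sketched in plain C++. This is only an illustration of the semantics under simplifying assumptions: ModelOutput is a hypothetical stand-in for one slot of the model's output buffer, and the functions here take that slot directly, whereas the operator's custom output functions take the model field name and are resolved at compile time.

```cpp
#include <string>

// Hypothetical stand-in for one slot of the model's output buffer: a value
// plus a flag recording whether the model actually produced it.
template <typename T>
struct ModelOutput {
  T value;
  bool missing;
};

// Variation with an explicit default: use the caller's fallback when the
// model produced no value for this field.
template <typename T>
T fromModel(const ModelOutput<T>& out, const T& fallback) {
  return out.missing ? fallback : out.value;
}

// Basic variation: when the value is missing, fall back to a
// value-initialized T (0 for numbers, an empty string for strings).
template <typename T>
T fromModel(const ModelOutput<T>& out) {
  return out.missing ? T() : out.value;
}
```

For example, a missing confidence value yields 0.0 with the basic form, or whatever value the application developer supplies with the two-argument form.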

Now that we have our design planned out, the next section shows how this design is implemented in the generic operator.


Writing the operator

Generic scoring operator implementation

Now that we have an idea of the overall design for our operator, we will look at the tasks necessary to implement it. First are the mechanics of specifying the interfaces Streams application developers will use. Then we describe the implementation code that uses that information at compile time to produce the necessary code.

Specifying the operator's public interface

To write the generic operator, we will describe the interface to Streams application developers, then describe how the Streams component developer uses this information in the generic operator implementation. A Streams application developer using this operator will need to have the ability to specify the following:

  1. Model files
  2. Model XML metadata
  3. Input mapping
  4. Output mapping

Specifying the model files

To allow the Streams application developer to use this operator with any published predictive model, we need to allow the specification of the published .pim and .par files. We do this by adding parameters and providing the implementation that uses the parameters in the code template. The parameters to specify the .pim and .par files are added to the parameter definition section in the operator XML file, SPSimple.xml (see Download). Listing 3 shows the definition for the .pim file. The .par file is similar.

Listing 3. Definition for the .pim file
<parameter>
 <name>pimfile</name>
 <description></description>
 <optional>false</optional>
 <rewriteAllowed>true</rewriteAllowed>
 <expressionMode>Expression</expressionMode>
 <type>rstring</type>
 <cardinality>1</cardinality>
</parameter>

Specifying the model XML metadata

Many of the characteristics needed by the code (field names, number, order, type) as well as the field tags used to indicate the input and output fields are described in the predictive model's XML metadata. For our generic operator, we use a parameter to pass the XML file.

Listing 4. XML metadata
<parameter>
 <name>xmlfile</name>
 <description></description>
 <optional>false</optional>
 <rewriteAllowed>true</rewriteAllowed>
 <expressionMode>Expression</expressionMode>
 <type>rstring</type>
 <cardinality>1</cardinality>
</parameter>

Specifying the input mappings

The parameters to specify the input stream attributes and the model input fields are added to the parameter definition section in the operator XML, as shown below.

Listing 5. Parameter definition section in operator XML
<parameter>
 <name>modelFields</name>
 <description docHref="" sampleUri="">a list of strings naming the model 
       input fields</description>
 <optional>false</optional>
 <rewriteAllowed>false</rewriteAllowed>
 <expressionMode>Expression</expressionMode>
 <type>rstring</type>
 <cardinality>-1</cardinality>
</parameter>
<parameter>
 <name>streamAttributes</name>
 <description docHref="" sampleUri="">a list of expressions that define the data
        derived from stream attributes to pass on each model input field.</description>
 <optional>false</optional>
 <rewriteAllowed>false</rewriteAllowed>
 <expressionMode>Expression</expressionMode>
 <cardinality>-1</cardinality>
</parameter>

Specifying the output mappings

The mapping of the output fields to output tuple attributes is done in the output clause of the SPL operator invocation. To support this, we need to define the custom output functions. We define three functions: a default that does not use the model outputs, and two variations that extract a model output field and populate the output tuple attribute. The first variation assumes that if the model's execution did not produce a value for this field, the output tuple will contain the default value for that attribute type (for example, 0 for an integer). The second takes both the field name and an additional SPL expression that produces the default value. Listing 6 shows the output function specifications.

Listing 6. Output function specifications
<customOutputFunctions>
 <customOutputFunction>
  <name>SPOutputs</name>
  <function>
   <description sampleUri="">Return the argument unchanged</description>
   <prototype><![CDATA[<any T> T defaultFromModel(T)]]></prototype>
  </function>
  <function>
   <description sampleUri="">return the field from the model output buffer</description>
   <prototype><![CDATA[<any T> T fromModel(rstring)]]></prototype>
  </function>
  <function>
   <description sampleUri="">return the field from the model output buffer if it 
         doesn't exist set a default</description>
   <prototype><![CDATA[<any T> T fromModel(rstring, T)]]></prototype>
  </function>
 </customOutputFunction>
</customOutputFunctions>

Implementing

Now that the interfaces have been specified, we will describe how the Streams component developer uses this information in the generic operator implementation. The generic operator implementation will be described for the following areas:

  1. Common: Perl code used in the header code-generation template and the CPP code-generation template
  2. Header file: The header code-generation template implementation area
  3. Functions: The CPP code-generation template implementation of the functions passed on the Solution Publisher interface
  4. Constructor: The CPP code-generation template implementation of the operator's constructor related to initializing the Solution Publisher interface
  5. Process: The CPP code-generation template implementation of the code called in the process method when an input tuple arrives on this operator's input port

Common use of the model XML metadata

Since the information we obtain from the XML metadata is used in both the header and CPP code-generation templates, we implemented a common routine in SP_Common.pm to validate and parse the XML metadata document. The SP_Common.pm file contains a function, processModelFile($model), that returns ($infilename, $outfilename, @infields, @infieldtypes, $numinfields, @outfields, @outfieldtypes, $numoutfields). It verifies the existence and format of the model file parameter value, and parses the model file to find the input and output filenames needed on the SP API calls. It also returns information for the input and output fields defined in the model: the number of each, and lists of field names and types. This common routine gets called during the header and CPP template processing steps. Below is an excerpt from the Perl code in this common function.

Listing 7. Perl excerpt
# Get the value for the model parameter, the name of the model file.
# Note that if this is a relative path name, it is rooted in the "data"
# subdirectory of this directory, so we will need to compensate.
  my $parm_modelfile = 
      $model->getParameterByName('xmlfile')->getValueAt(0)->getSPLExpression();
  my $compile_modelpath;
  ($compile_modelpath) = ($parm_modelfile=~m/"(.*)"$/);
  # Relative path case: root it in the data subdirectory
  if (substr($compile_modelpath, 0, 1) ne "/") {
    $compile_modelpath = "data/" . $compile_modelpath;
  }

  # Convert to an absolute path
  $compile_modelpath = realpath($compile_modelpath);

  # Check that the model file is readable
  -f $compile_modelpath or 
      SPL::CodeGen::exitln("Model file ". $compile_modelpath. 
        " does not exist or is not a regular file");
  -r $compile_modelpath or 
      SPL::CodeGen::exitln("Model file ". $compile_modelpath. " is not readable");

  # Parse model file to get predictive model information
  my $xs1 = new XML::Simple;
  my $doc = XMLin($compile_modelpath, forcearray => [ qw(field) ], keyattr => [] );
  
  my $infilename; 
  $infilename = $doc->{inputDataSources}->{inputDataSource}->{name};
  
  ... lines omitted

You can see that first, the parameter's filename value is adjusted and validated. Then we parse the XML contents to extract the necessary information in a usable form for the Perl processing in the code-generation templates. In the example code above, we only show the extraction of the first input data source filename that the model expects. Similar code for extracting the other necessary information is omitted, but can be found in the ZIP file.

The header template code SPSimple_h.cgt calls this function in the SP_Common.pm file to validate and populate the model variables as shown below.

Listing 8. Calling processModelFile in SP_Common.pm
# Process the model file, extract the field column names,
# and validate these with the operator parameter values.
my ( $infilename_ref, 
 $outfilename_ref, 
 $infields_ref,
 $infieldtypes_ref, 
 $numinfields_ref, 
 $outfields_ref, 
 $outfieldtypes_ref, 
 $numoutfields_ref) = SP_Common::processModelFile($model);

 my $infilename = ${$infilename_ref};
 my $outfilename = ${$outfilename_ref};
 my @infields = @{$infields_ref};
 my @infieldtypes = @{$infieldtypes_ref};
 my $numinfields = ${$numinfields_ref};
 my @outfields = @{$outfields_ref};
 my @outfieldtypes = @{$outfieldtypes_ref};
 my $numoutfields = ${$numoutfields_ref};

This results in local Perl variables that are available to the code-generation templates in an easy-to-use fashion.

Header implementation

In the header template, we use the output field related values to create the data structure needed for the output fields.

Listing 9. Header implementation
// Create a structure to match output row data 
typedef struct {
  void* next;
<%
  # Emit one member per model output field, plus a flag that records
  # whether the model produced a value for that field.
  for (my $j = 0; $j < $numoutfields; $j++) {
    if (($outfieldtypes[$j] eq 'long') or 
        ($outfieldtypes[$j] eq 'time') or 
        ($outfieldtypes[$j] eq 'integer') or 
        ($outfieldtypes[$j] eq 'date') or
        ($outfieldtypes[$j] eq 'timestamp')) {
      print "    long long $outfields[$j];\n";
    }
    elsif ($outfieldtypes[$j] eq 'real') {
      print "    double $outfields[$j];\n";
    }
    elsif ($outfieldtypes[$j] eq 'string') {
      print "    const char * $outfields[$j];\n";
    }
    else {
      SPL::CodeGen::errorln("Model output field $outfields[$j] of type: " .
        "$outfieldtypes[$j] is not a valid type");
    }
    print "    boolean _missing_$outfields[$j];\n";
  }
%>
} outBuffer;

Function implementation

In the SPSimple_cpp.cgt template, we use the output field information in the next record back function called from the Solution Publisher runtime to provide the addresses of the returned data to the operator so it can copy the data into the output tuple.

Listing 10. Function implementation
<%
  # For each output field, copy the value from the returned row into the
  # output buffer; a null row entry means the model produced no value.
  for (my $j = 0; $j < $numoutfields; $j++) {
    print "    if (row[$j]) {\n";
    if (($outfieldtypes[$j] eq 'long') or 
        ($outfieldtypes[$j] eq 'time') or 
        ($outfieldtypes[$j] eq 'integer') or 
        ($outfieldtypes[$j] eq 'date') or
        ($outfieldtypes[$j] eq 'timestamp')) {
      print "      obp->$outfields[$j] = *((long long *) row[$j]);\n";
    }
    elsif ($outfieldtypes[$j] eq 'real') {
      print "      obp->$outfields[$j] = *((double *) row[$j]);\n";
    }
    elsif ($outfieldtypes[$j] eq 'string') {
      print "      obp->$outfields[$j] = (const char *) row[$j];\n";
    }
    else {
      SPL::CodeGen::errorln("Unsupported output field type of: $outfieldtypes[$j]");
    }
    print "      obp->_missing_$outfields[$j] = false;\n";
    print "    } else {obp->_missing_$outfields[$j] = true;}\n";
  } 
%>
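To make the effect of the two templates concrete, here is a hand-written C++ sketch of roughly what they might expand to for three output fields. It is an approximation under stated assumptions, not actual generated output: the member name confidence is a hypothetical stand-in for the real-valued $CC-beer_beans_pizza field (whose C++ member name the excerpts do not show), bool is used in place of the boolean type in the template, and fillOutBuffer is a hypothetical name for the body of the next-record-back callback.

```cpp
// Approximation of the structure the header template would emit for the
// fields sex (string), income (integer), and one real-valued field.
struct OutBuffer {
  void* next;
  const char* sex;
  bool _missing_sex;
  long long income;
  bool _missing_income;
  double confidence;  // hypothetical stand-in name for $CC-beer_beans_pizza
  bool _missing_confidence;
};

// Approximation of the generated callback body: copy each returned value
// into the buffer, or flag it as missing when the row entry is null.
void fillOutBuffer(OutBuffer* obp, void** row) {
  if (row[0]) {
    obp->sex = static_cast<const char*>(row[0]);
    obp->_missing_sex = false;
  } else {
    obp->_missing_sex = true;
  }
  if (row[1]) {
    obp->income = *static_cast<long long*>(row[1]);
    obp->_missing_income = false;
  } else {
    obp->_missing_income = true;
  }
  if (row[2]) {
    obp->confidence = *static_cast<double*>(row[2]);
    obp->_missing_confidence = false;
  } else {
    obp->_missing_confidence = true;
  }
}
```

Note that string fields store the pointer itself, while numeric fields are copied by value, which matches the casts in Listing 10.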

Constructor implementation

The CPP template code SPSimple_cpp.cgt constructor directly uses some of the parameter values supplied (pimfile, for example), as well as the values obtained from the common routine's processing of the XML metadata. Listing 11 shows how the parameter information is used directly in the CPP's constructor.

Listing 11. Constructor implementation
<%
   my $pimfileParam = $model->getParameterByName("pimfile");
   my $pimfile = $pimfileParam->getValueAt(0)->getCppExpression();
%>
rstring pimFile = <%=$pimfile%>;

Later in the constructor, this pimfile value will be passed to one of the Solution Publisher API calls.

Listing 12. Passing the pimfile value
/* open the image */
     int res, status = EXIT_FAILURE;
     image_handle = 0;
     res = clemrtl_openImage(pimFile.c_str(),parFile.c_str(), &image_handle);
     if (res != CLEMRTL_OK) {
		status = EXIT_FAILURE;
		SPLLOG(L_ERROR, "Open Image Failed", "MPK");
		displayError(res, 0);
      }

Likewise, the values returned from the common routine are used in Solution Publisher API calls.

Listing 13. Solution Publisher API call
/* Get Input field count and types */
  	char* key="<%=$infilename%>";
  	SPLLOG(L_INFO, "About to get field count", "MPK");
  	res = clemrtl_getFieldCount(image_handle, key, &fld_cnt );

Process implementation

This implementation within the process method has two main areas:

  1. The code necessary to use the incoming tuple's values to populate the field structure for the Solution Publisher call to execute the predictive model
  2. The code to take the results from the execution of the model and move the appropriate output fields to the operator's output tuple and submit it on the output port

Processing the input tuple

Listing 14 shows how this information is used to populate the structure from the input tuple.

Listing 14. Processing the input tuple
<%
 my $oport = $model->getOutputPortAt(0);
 #### first check for match of # infield in model vs modelFields param 
 #### vs streamAttributes param 
 my $modelFieldsParam = $model->getParameterByName("modelFields");
  	  my $modelFieldsSize = $modelFieldsParam->getNumberOfValues();
  my $streamAttributesParam = $model->getParameterByName("streamAttributes");
  	  my $streamAttributesSize = $streamAttributesParam->getNumberOfValues();
  	  
 if ($numinfields != $streamAttributesSize) {
   SPL::CodeGen::errorln("Number of input fields required by model is: " .
       "$numinfields; Number of stream attributes provided is: " .
       "$streamAttributesSize", $oport->getSourceLocation());
 }
 if ($numinfields != $modelFieldsSize) {
   SPL::CodeGen::errorln("Number of input fields required by model is: " .
       "$numinfields; Number of model fields provided is: " .
       "$modelFieldsSize", $oport->getSourceLocation());
 }
 
 #### build up a searchable array of model fields  
 
 my @modelFieldValues;
 for (my $i=0; $i < $modelFieldsSize; $i++) {
   my $exp = $modelFieldsParam->getValueAt($i);
   my $splexp = $exp->getSPLExpression();
   $splexp =~ s/^"(.*)"$/$1/;
   $modelFieldValues[$i] = $splexp;     
 } 
 
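 #### go through each input field, find the modelField and its corresponding 
 #### streamAttribute expression value, check for compatibility and 
 #### generate the proper storage definition and assignment to input buffer pointer.   
 for(my $j=0; $j<$numinfields; $j++) {
   #### ... lines omitted: locate $index, the position of $infields[$j] in
   #### the modelFieldValues array, and report an error at
   #### $oport->getSourceLocation() if the field is not listed
   #### (see the ZIP file for the full code)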
   #### find the stream attribute and type
   my $sexp =  $streamAttributesParam->getValueAt($index);
   my $scppexp = $sexp->getCppExpression();
   my $sadaptedcpp = SPL::CodeGen::adaptCppExpression($scppexp, "tp");
   my $scpptyp = $sexp->getCppType();
   my $ssplexp = $sexp->getSPLExpression();
   
   
   if((@infieldtypes[$j] eq 'long') or 
   	 (@infieldtypes[$j] eq 'time') or 
   	 (@infieldtypes[$j] eq 'integer') or 
   	 (@infieldtypes[$j] eq 'date') or
   	 (@infieldtypes[$j] eq 'timestamp')) {
   	   if ($scpptyp ne 'SPL::int64') {
            SPL::CodeGen::errorln("stream attribute $ssplexp of type: 
              $scpptyp is not compatible with model input field 
              $infields[$j] of type: $infieldtypes[$j]",$oport->getSourceLocation());
       }
	}
	elsif (@infieldtypes[$j] eq 'real') {
	  print "    // real check if compatible streamtype = $scpptyp \n";
	  if ($scpptyp ne 'SPL::float64') {
            SPL::CodeGen::errorln("stream attribute $ssplexp of type: 
              $scpptyp is not compatible with model input field $infields[$j] of type: 
              @infieldtypes[$j]",$oport->getSourceLocation());
      }
	}
	elsif (@infieldtypes[$j] eq 'string') {
	  print "    // string check if compatible streamtype = $scpptyp \n";
	  if ($scpptyp ne 'SPL::rstring') { 
            SPL::CodeGen::errorln("stream attribute $ssplexp of type: 
              $scpptyp is not compatible with model input field $infields[$j] of type: 
              @infieldtypes[$j]",$oport->getSourceLocation());
      }
	}
	else {
	    # this should never occur
	    SPL::CodeGen::errorln("Model input field $infields[$j] of type: 
	      @infieldtypes[$j] is not a valid type",$oport->getSourceLocation());  
	}
	
  #### now generate the right code
  
  if ($scpptyp ne 'SPL::rstring') {
    print "const $scpptyp & local_$j = $sadaptedcpp ;\n";
  	print "myBuf.row[0] [$j] = (void*) &(local_$j);\n";
  } else {
   	print "const char* local_$j =  $sadaptedcpp.c_str();\n";
   	print "myBuf.row[0] [$j] = (void*) (local_$j);\n";
   }
 }   #end of for loop 
%>

You can see some validation of the parameter values taking place first: the number of input fields the model expects is compared with the sizes of the streamAttributes and modelFields lists. Then each model input field is mapped to its stream attribute expression, and the expression's type is checked for compatibility with the field's storage type. Finally, the C++ code is generated that loads the input buffer, which the model reads during execution, from the input tuple.
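The type-compatibility rule applied in Listing 14 can be summarized as a small lookup. The C++ sketch below restates that rule for illustration only. The function names requiredSplType and isCompatible are hypothetical, and in the operator this check is done in Perl at compile time rather than in C++ at run time.

```cpp
#include <string>

// Hypothetical helper: returns the SPL C++ type that a stream expression
// must have to feed a model input field of the given storage type, or an
// empty string if the storage type is not supported.
std::string requiredSplType(const std::string& storage) {
  if (storage == "integer" || storage == "long" || storage == "time" ||
      storage == "date" || storage == "timestamp") {
    return "SPL::int64";
  }
  if (storage == "real") {
    return "SPL::float64";
  }
  if (storage == "string") {
    return "SPL::rstring";
  }
  return "";
}

// Hypothetical helper: true when the expression's C++ type exactly matches
// the type required by the model field's storage.
bool isCompatible(const std::string& storage, const std::string& splCppType) {
  const std::string required = requiredSplType(storage);
  return !required.empty() && required == splCppType;
}
```

Because the check is an exact match, an expression of a narrower integer type would be rejected for an integer model field and would need an explicit conversion in the SPL expression.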

Processing the output fields

Listing 15 shows how the output fields are retrieved from the model and used to populate the output tuple.

Listing 15. Processing the output fields
<%
  my $oport = $model->getOutputPortAt(0);
  my $numAttrs = $oport->getNumberOfAttributes();
  for (my $i=0; $i < $numAttrs; $i++) {
    my $attribute = $oport->getAttributeAt($i);
    my $name = $attribute->getName();
    if ($attribute->hasAssignmentWithOutputFunction()) {
      my $of = $attribute->getAssignmentOutputFunctionName();
      if ($of eq 'fromModel') {
        my $exp = $attribute->getAssignmentOutputFunctionParameterValueAt(0);
        my $splexp = $exp->getSPLExpression();
        $splexp =~ s/^"(.*)"$/$1/; #strip off the enclosing double quotes 
        $splexp =~ tr/-$ ./_/;  # convert invalid chars to _ for valid c++ expression
        print "if (currentOutBuf->_missing_$splexp == true) {\n"; # if missing  
        # now check if optional default value expression was provided
        my $exp2 = $attribute->getAssignmentOutputFunctionParameterValueAt(1);
        my $cppexp2;
        	     
        if ($exp2 ne '') {
          $cppexp2 = $exp2->getCppExpression();
          my $adaptedcpp2 = SPL::CodeGen::adaptCppExpression($cppexp2, "tp");
          print "SPLLOG(L_INFO, \"setting $splexp\", \"MPK\");\n";
          print "SPLLOG(L_INFO, \"default expression is $adaptedcpp2\", \"MPK\");\n";
          print "  otuple.set_$name($adaptedcpp2);\n";
        } 
        print " } else {\n"; # not missing
        print "  otuple.set_$name(currentOutBuf->$splexp);\n"; 
        print " } \n";
      }  
    } 
  }
%>

You can see that the output tuple attribute assignments are evaluated and for those that used the custom output function to indicate they are populated from the model's output, the C++ assignment statement to the tuple attribute is generated.
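One detail in Listing 15 is easy to miss: model field names such as $C-beer_beans_pizza are not valid C++ identifiers, so the template's tr/-$ ./_/ step replaces each hyphen, dollar sign, space, and period with an underscore before the name is used as a buffer member. The C++ function below reproduces that character mapping for illustration. The name sanitizeFieldName is hypothetical, and the operator itself performs this step in Perl.

```cpp
#include <string>

// Hypothetical helper mirroring the Perl tr/-$ ./_/ step: replace every
// '-', '$', ' ' and '.' in a model field name with '_'.
std::string sanitizeFieldName(std::string name) {
  for (char& c : name) {
    if (c == '-' || c == '$' || c == ' ' || c == '.') {
      c = '_';
    }
  }
  return name;
}
```

For the sample model, $CC-beer_beans_pizza becomes _CC_beer_beans_pizza, while a plain name such as income is left unchanged.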

The implementation of the code-generation templates needed in the generic operator is complete. Next, we will show how the operator is used in an SPL application.


Using the operator

Using the scoring model operator in an InfoSphere Streams application

Again as in Part 1, we will use a simple SPL application to demonstrate integrating a predictive model into a Streams application. We use a file containing rows to be scored and use the InfoSphere Streams FileSource operator to read in the information and produce a stream of these tuples. We also write the scored tuples to an output file one tuple at a time using the InfoSphere Streams FileSink operator.

We use the same basket rule model from Part 1. It takes two inputs, gender (a string value of M or F) and income (an integer value), and produces a string output of M or F for whether this input indicates a preference for purchasing a combination of beer, beans, and pizza. It also produces a floating-point number representing the confidence of that prediction. There is one slight difference this time in the input data file: rather than having a single value for income, the file contains two values, a base salary and a bonus salary, that must be added together to produce the income input value required by the model.

Running the sample SPL application

Requirements and setup

  1. In order to build and run the sample application, you need a functioning InfoSphere Streams environment.
  2. You need the IBM SPSS Modeler Solution Publisher Runtime 14.2 fixpack 1, plus the Solution Publisher hot fix (scheduled to be available 14 Oct 2011) installed in this environment.
  3. You also need to ensure that LD_LIBRARY_PATH is set, on all systems to which the Streams operator will be deployed, so that it contains the necessary Solution Publisher libraries.

LD_LIBRARY_PATH requirement

Assuming Solution Publisher is installed in $INSTALL_PATH, the LD_LIBRARY_PATH needs to include:

  • $INSTALL_PATH
  • $INSTALL_PATH/ext/bin/*
  • $INSTALL_PATH/jre/bin/classic
  • $INSTALL_PATH/jre/bin

A script included in the ZIP file named ldlibrarypath.sh is provided to set the path up correctly. If Solution Publisher is not installed in the default location, change the first line of the script to point to your Solution Publisher install directory before using the script. For example, if Solution Publisher is installed in /homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2, then set:

CLEMRUNTIME=/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2
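
The path rules listed above can be sketched as a small shell function. This is only a minimal sketch, not the contents of the ldlibrarypath.sh script shipped in the ZIP file: the function name build_sp_library_path is hypothetical, and appending any existing LD_LIBRARY_PATH value is an assumption about what you would want.

```shell
# Hypothetical sketch; NOT the ldlibrarypath.sh script from the ZIP file.

# Print a colon-separated library path for a Solution Publisher install root ($1).
# The body runs in a subshell so its variables do not leak into the caller.
build_sp_library_path() (
  root=$1
  result=$root
  # Add every subdirectory of ext/bin (the "$INSTALL_PATH/ext/bin/*" entry).
  for dir in "$root"/ext/bin/*/; do
    [ -d "$dir" ] || continue        # skip the unexpanded pattern when nothing matches
    result=$result:${dir%/}          # drop the trailing slash left by the pattern
  done
  printf '%s\n' "$result:$root/jre/bin/classic:$root/jre/bin"
)

# Default install location named in this tutorial; set CLEMRUNTIME first if yours differs.
CLEMRUNTIME=${CLEMRUNTIME:-/opt/IBM/SPSS/ModelerSolutionPublisher/14.2}

# Assumption: keep any existing LD_LIBRARY_PATH entries after the new ones.
LD_LIBRARY_PATH=$(build_sp_library_path "$CLEMRUNTIME")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
```

As with the provided script, a file containing this sketch would need to be sourced rather than executed, so that the exported value is visible in the shell that later runs the application.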

Sample contents

The sample ZIP file contains the same .pim, .par, and XML files from the Market Basket Analysis sample with a scoring branch added, sample input and expected output files, a complete Streams Processing Language application (Main.spl), and the generic operator SPSimple, which the application uses to score the Market Basket Analysis model.

We provide a simple SPL application in com.ibm.mpk/SPGeneric/Main.spl that looks like Figure 1.

Figure 1. SPL application
Image shows SPL application with FileSource, SPSimple operator and FileSink

Adjusting and compiling the sample

To run the sample SPL application, unzip the SPGeneric.zip file (see Download) on a Linux system that has InfoSphere Streams and Solution Publisher installed. If the Solution Publisher install location is different from the default value of /opt/IBM/SPSS/ModelerSolutionPublisher/14.2, modify the operator XML file (SPSimple.xml) in the com.ibm.mpk/SPSimple directory. You need to change the libPath and includePath entries to match your Solution Publisher install location:

<cmn:libPath>/opt/IBM/SPSS/ModelerSolutionPublisher/14.2</cmn:libPath>
<cmn:includePath>/opt/IBM/SPSS/ModelerSolutionPublisher/14.2/clemrtl/include</cmn:includePath>
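
If you prefer not to edit the file by hand, the substitution could be sketched with sed. This is a hedged example: the function name retarget_sp_paths is hypothetical, and it assumes that each path sits on a single line in SPSimple.xml and that the new install root contains no `|`, `&`, or backslash characters.

```shell
# Hypothetical helper: print an operator XML file with the default
# Solution Publisher root replaced by a new install root.
#   $1 = new install root, $2 = path to the operator XML file
retarget_sp_paths() (
  # Default location named in this tutorial, with the dot escaped for sed.
  default_root='/opt/IBM/SPSS/ModelerSolutionPublisher/14\.2'
  # "|" is used as the sed delimiter because the paths contain "/".
  sed "s|$default_root|$1|g" "$2"
)
```

A cautious way to use it is to write the result to a new file, for example `retarget_sp_paths "$CLEMRUNTIME" SPSimple.xml > SPSimple.xml.new`, review the two changed entries, and only then replace the original.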

You also need to modify the Main.spl file by adding the SP_Install parameter on the operator's invocation.

Listing 16. Modifying Main.spl
stream<DataSchemaPlus> scorer = SPSimple(data){
  param
    pimfile: "baskrule.pim";
    parfile: "baskrule.par";
    xmlfile: "baskrule.xml";
    SP_Install: "/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2";
    modelFields: "sex","income";
    streamAttributes: s_sex, baseSalary+bonusSalary;

  output
    scorer:
      income = fromModel("income"),
      predLabel = fromModel("$C-beer_beans_pizza"),
      confidence = fromModel("$CC-beer_beans_pizza");
}

Take note of a few other things in the SPL program.

Listing 17. Input values to model
modelFields: "sex","income";
streamAttributes: s_sex, baseSalary+bonusSalary;

The s_sex attribute is used as the first input value to the model, and the sum of baseSalary and bonusSalary is used as the second input value. Also note the following.

Listing 18. Output tuple
output scorer:
  income = fromModel("income"),
  predLabel = fromModel("$C-beer_beans_pizza"),
  confidence = fromModel("$CC-beer_beans_pizza");

The income value produced in the model (based on the input value) is added to the output tuple, along with the predicted value and confidence.

Compiling the sample

To compile the sample as a stand-alone Streams application, change directory to where you unzipped the sample project (SPGeneric) and run the make command.

Listing 19. Compiling the sample
bash-3.2$ cd STSP2Test/SPGeneric/
bash-3.2$ make
/homes/hny1/koranda/InfoSphereStreams64/bin/sc 
  --output-directory="output/com.ibm.mpk.Main/Standalone" --data-directory="data" -T 
  -M com.ibm.mpk::Main  --no-toolkit-indexing --no-mixed-mode-preprocessing 
Creating types...
Creating functions...
Creating operators...
Creating PEs...
Creating standalone app...
Creating application model...
Building binaries...
 [CXX-type] tuple<string s_sex,int64 baseSalary,int64 bonusSalary,in...,float64 confidence>
 [CXX-operator] data
 [CXX-operator] scorer
 [CXX-operator] Writer
 [CXX-type] tuple<rstring s_sex,int64 baseSalary,int64 bonusSalary>
 [CXX-pe] pe0
 [CXX-standalone] standalone
 [LD-standalone] standalone
 [LN-standalone] standalone 
 [LD-pe] pe0
bash-3.2$

Executing the sample

To execute the application, first make sure that LD_LIBRARY_PATH is set. The following listing shows the command that sets the path, then echoes the path back so you can visually verify that it was set correctly.

Listing 20. Executing the sample
bash-3.2$ source ldlibrarypath.sh 
bash-3.2$ echo $LD_LIBRARY_PATH
/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2:/homes/hny1/koranda/IBM
/SPSS/ModelerSolutionPublisher64P/14.2/ext/bin/pasw.adp:/homes/hny1/koranda/IBM/SPSS/
ModelerSolutionPublisher64P/14.2/ext/bin/pasw.alm:/homes/hny1/koranda/IBM/SPSS/Modeler
SolutionPublisher64P/14.2/ext/bin/pasw.bagging:/homes/hny1/koranda/IBM/SPSS/ModelerSo
lutionPublisher64P/14.2/ext/bin/pasw.boosting:/homes/hny1/koranda/IBM/SPSS/ModelerSolu
tionPublisher64P/14.2/ext/bin/pasw.cognos:/homes/hny1/koranda/IBM/SPSS/ModelerSolutio
nPublisher64P/14.2/ext/bin/pasw.common:/homes/hny1/koranda/IBM/SPSS/ModelerSolution
Publisher64P/14.2/ext/bin/pasw.me:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublish
er64P/14.2/ext/bin/pasw.netezzaindb:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPubl
isher64P/14.2/ext/bin/pasw.neuralnet:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPubli
sher64P/14.2/ext/bin/pasw.outerpartition:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionP
ublisher64P/14.2/ext/bin/pasw.pmmlmerge:/homes/hny1/koranda/IBM/SPSS/ModelerSolutio
nPublisher64P/14.2/ext/bin/pasw.psm:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPub
lisher64P/14.2/ext/bin/pasw.scoring:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPubli
sher64P/14.2/ext/bin/pasw.split:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher
64P/14.2/ext/bin/pasw.transformation:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPub
lisher64P/14.2/ext/bin/pasw.tree:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher
64P/14.2/ext/bin/pasw.vi:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.
2/ext/bin/pasw.xmldata:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2
/ext/bin/spss.bayesiannetwork:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher
64P/14.2/ext/bin/spss.binning:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher6
4P/14.2/ext/bin/spss.C5:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.
2/ext/bin/spss.inlinecsp:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.
2/ext/bin/spss.knn:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2/ex
t/bin/spss.modelaccreditation:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher6
4P/14.2/ext/bin/spss.modelevaluation:/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPub
lisher64P/14.2/ext/bin/spss.predictoreffectiveness:/homes/hny1/koranda/IBM/SPSS/Model
erSolutionPublisher64P/14.2/ext/bin/spss.predictorstat:/homes/hny1/koranda/IBM/SPSS/M
odelerSolutionPublisher64P/14.2/ext/bin/spss.propensitymodelling:/homes/hny1/koranda/I
BM/SPSS/ModelerSolutionPublisher64P/14.2/ext/bin/spss.psmmodel:/homes/hny1/koranda
/IBM/SPSS/ModelerSolutionPublisher64P/14.2/ext/bin/spss.selflearning:/homes/hny1/kora
nda/IBM/SPSS/ModelerSolutionPublisher64P/14.2/ext/bin/spss.svm:/homes/hny1/koranda
/IBM/SPSS/ModelerSolutionPublisher64P/14.2/ext/bin/spss.xd:/homes/hny1/koranda/IBM/S
PSS/ModelerSolutionPublisher64P/14.2/jre/bin/classic:/homes/hny1/koranda/IBM/SPSS/Mo
delerSolutionPublisher64P/14.2/jre/bin
bash-3.2$
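
Because the echoed path is long, checking it by eye is error-prone. A minimal sketch of an automated check follows; the function name missing_path_entries is hypothetical, and it tests only that each entry is an existing directory, not that the directory holds the right libraries.

```shell
# Hypothetical helper: print every entry of a colon-separated path ($1)
# that is not an existing directory.
# Returns 0 when all entries exist, non-zero otherwise.
# The body runs in a subshell, so the IFS and globbing changes stay local.
missing_path_entries() (
  set -f                      # do not glob-expand entries while splitting
  IFS=:                       # split the argument on colons
  status=0
  for entry in $1; do
    [ -n "$entry" ] || continue
    if [ ! -d "$entry" ]; then
      printf 'missing: %s\n' "$entry"
      status=1
    fi
  done
  [ "$status" -eq 0 ]
)
```

After sourcing ldlibrarypath.sh, running `missing_path_entries "$LD_LIBRARY_PATH"` prints nothing and returns success when every directory on the path exists.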

Next, change directory to the data directory under the SPGeneric directory and execute the stand-alone program, as shown below:

bash-3.2$ cd data/
bash-3.2$ ../output/com.ibm.mpk.Main/Standalone/bin/standalone

Since the SPL program specified a verbosity of INFO, you should see a number of informational messages showing the processing similar to the following.

NOTE: The timestamp, process, class, method, and line information has been deleted from the listing for brevity and readability. With that information included, the first message looks like this:

22 Aug 2011 15:39:53.571 [21029] INFO spl_pe M[PEImpl.cpp:process:483]   - Start processing...

Listing 21. Informational messages
bash-3.2$ ../output/com.ibm.mpk.Main/Standalone/bin/standalone 
- Start processing...
- Using a pimfile value of: "baskrule.pim"
- Using a parfile value of: "baskrule.par"
- About to clemrtl initialise using SP_Install of: "/homes/hny1/koranda/IBM/
      SPSS/ModelerSolutionPublisher64P/14.2"
- After clemrtl initialise
- Major=14 Minor=2 Release=0 build=0
- Image Handle Retrieved: 1
- About to get field count
- Field Count is: 2
- About to get Field Types
- Field Type 0 is: STRING
- Field Type 1 is: LONG
- About to get Output Field Count
- Output Field Count is: 4
- About to get output field types
- Field Type 0 is: STRING
- Field Type 1 is: LONG
- Field Type 2 is: STRING
- Field Type 3 is: DOUBLE
- About to set alternative Input
- After Set Alternative input
- About to set alternative output
- After Set Alternative Output
 - About to prepare
- After Prepare
- Leaving Constructor
 - Opening ports...
- Opened all ports...
- Creating 1 operator threads
 - Created 1 operator threads
- Joining operator threads...
- Processing tuple from input port 0 {s_sex="F",baseSalary=5000,bonusSalary=5000}
 - About to execute the image
- In next_record iterator
- In next_record iterator
- After Execute
- Sending tuple to output port 0 {s_sex="F",baseSalary=5000,
     bonusSalary=5000,income=10000,predLabel="F",confidence=0.988327}
...
lines omitted
...
- Joined all operator threads...
- Joining window threads...
- Joined all window threads.
- Joining active queues...
- Joined active queues.
- Closing ports...
- Closed all ports...
- Notifying operators of termination...
 - Notified all operators of termination...
- Flushing operator profile metrics...
- Flushed all operator profile metrics...
- Deleting active queues...
- Deleted active queues.
- Deleting input port tuple cache...
- Deleted input port tuple cache.
- Deleting all operators...
- About to Close Image with handle1
- After Close Image
- Deleted all operators.
- Terminating...
- Completed the standalone application processing
- Leaving MyApplication::run()
- Shutdown request received by PE...
- shutdownRequested set to true...

To see all the results produced, look in the mpkoutput.csv file created by the FileSink. Note that the output file contains the input values for base salary and bonus salary, as well as the combined income value output from the model.

Listing 22. Results
bash-3.2$ cat mpkoutput.csv 
"F",5000,5000,10000,"F",0.988326848249027
"F",15000,5000,20000,"F",0.989645351835357
"F",10000,5000,15000,"F",0.988326848249027
"F",20000,5000,25000,"F",0.989645351835357
"F",5000,5000,10000,"F",0.988326848249027
"F",15000,5000,20000,"F",0.989645351835357
"F",10000,5000,15000,"F",0.988326848249027
"F",20000,5000,25000,"F",0.989645351835357
"M",5000,5000,10000,"T",0.838323353293413
"M",15000,5000,20000,"F",0.990963855421687
"M",10000,5000,15000,"T",0.838323353293413
"M",20000,5000,25000,"F",0.990963855421687
"M",5000,5000,10000,"T",0.838323353293413
"M",15000,5000,20000,"F",0.990963855421687
"M",10000,5000,15000,"T",0.838323353293413
"M",20000,5000,25000,"F",0.990963855421687
bash-3.2$
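
One way to sanity-check these results is to confirm that the income column the model returns equals the sum of the two salary columns on every row. The following is a minimal sketch, assuming the column order shown in Listing 22 and string fields that contain no embedded commas; the function name check_income_column is hypothetical.

```shell
# Hypothetical helper: verify that column 4 (income) equals
# column 2 (baseSalary) + column 3 (bonusSalary) on every row of a CSV file ($1).
# Prints each mismatching row number and returns non-zero if any row fails.
check_income_column() {
  awk -F, '
    $4 != $2 + $3 {
      printf "line %d: income %s != %s + %s\n", NR, $4, $2, $3
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

From the data directory, it could be run as `check_income_column mpkoutput.csv && echo "income column is consistent"`.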

Hooray!

You have now seen how to convert a single-use operator to a generic operator capable of being tailored by a Streams application developer to score many different SPSS predictive models.


Summary

Results

This tutorial has shown how you can use a generic operator to wrap the execution of any SPSS Modeler predictive model, provided that the model is compatible with the Solution Publisher API restrictions and that it makes sense to score it one tuple at a time. A Streams application developer can easily use this generic operator to execute the predictive model against streaming data.

Note that there are other ways to execute scoring models in InfoSphere Streams, through PMML and the Streams Mining Toolkit. The direct wrapping technique and integration with SPSS models through the Solution Publisher interface described here opens scoring up to a much larger set of models than is supported through the PMML integration of the Mining Toolkit.


Download

Description | Name | Size
Sample generic operator, program, models and data | SPGeneric.zip | 26KB

Resources

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
