This tutorial shows how to create a generic operator to execute SPSS predictive models in an InfoSphere Streams application. It builds on Part 1, where we wrote and used an InfoSphere Streams operator to execute an IBM SPSS Modeler predictive model in an InfoSphere Streams application using the IBM SPSS Modeler Solution Publisher Runtime library API.
In Part 1, we developed an operator that wrapped a specific predictive model and worked only with the exact schema shown in that example. Recall that we showed how the non-generic operator could be modified to accommodate different models and schemas, but that required at least some C++ programming skill. A generic operator that automatically adjusts itself to different inputs and different models allows the integration to be done entirely by a Streams application programmer, eliminating the work and skill needed to create an individual operator for each model.
The roles and terminology are provided in Part 1 and are not repeated here. Our focus here is on the Streams component developer role and how to write a generic operator that can be tailored by a Streams application developer to execute SPSS predictive models. For information about the other roles to understand the work and interaction necessary for the overall solution, refer to Part 1.
Recall from Part 1 that in order to write the Streams operator, the Streams component developer needs certain information about the inputs and outputs of the predictive model produced by the data analyst. Specifically, the operator developer requires:
- The installation location of Solution Publisher
- The .pim and .par files produced during publish
- The input source node key name, which can be found in the published metadata XML:

      <inputDataSources>
        <inputDataSource name="file0" type="Delimited">
NOTE: While there is no technical limitation, our example is limited to supporting a single input source for simplicity.
- The input field names, their storage types, and their order, as found inside the `<fields>` element:

      <fields>
        <field storage="string" type="discrete">
          <name>sex</name>
        </field>
        <field storage="integer" type="range">
          <name>income</name>
        </field>
      </fields>
- The output node or terminal node key name, as found in:

      <outputDataSources>
        <outputDataSource name="file3" type="Delimited">
NOTE: While there is no technical limitation, our example is limited to supporting a single output node for simplicity.
- The output field names, their storage types, and their order, as found inside the `<fields>` element:

      <fields>
        <field storage="string" type="discrete">
          <name>sex</name>
        </field>
        <field storage="integer" type="range">
          <name>income</name>
        </field>
        <field storage="string" type="flag">
          <name>$C-beer_beans_pizza</name>
          ...
        </field>
        <field storage="real" type="range">
          <name>$CC-beer_beans_pizza</name>
          ...
        </field>
      </fields>
Also recall that to adjust the sample operator to work for a different input tuple/model combination, you needed to change it in the following places:
- _h.cgt file
  - Adjust the output structure
- _cpp.cgt file
  - Adjust the next_record callback
  - Adjust the output structure
  - Adjust the .pim and .par file names and locations
  - Adjust the input and output file tags
  - Adjust the input field pointer array size
  - Adjust the code that loads the input structure
  - Adjust the code that loads the output tuple
Next, we describe the design of our generic operator, which uses the model metadata and the operator's parameters to automatically generate the code that required manual adjustment in the non-generic operator.