Building the predictive models
Determining what data is available, choosing appropriate models, and building those models are tasks typically performed by a data analyst.
This tutorial builds on the Market Basket Analysis demo and tutorial shipped with the SPSS Modeler Client. We will show how the model scoring stream built in Modeler can be published and used in an InfoSphere Streams application, enabling real-time scoring and decision-making that leverages an SPSS model's logic.
The Market Basket Analysis example deals with fictitious data describing the contents of supermarket baskets (collections of items bought together) plus the associated personal data of the purchaser, which might be acquired through a loyalty card program. The goal is to discover groups of customers who buy similar products and can be characterized demographically, such as by age, income, etc.
This example reveals how IBM SPSS Modeler can be used to discover affinities, or links, in a database, both by modeling and by visualization. These links correspond to groupings of cases in the data, and these groups can be investigated in detail and profiled by modeling. In the retail domain, such customer groupings might, for example, be used to target special offers to improve the response rates to direct mailings or to customize the range of products stocked by a branch to match the demands of its demographic base.
The building and analyzing of models is beyond the scope of this tutorial. Please refer to the material provided with the SPSS Modeler Client for more information on the data analyst role and modeling in general.
This tutorial starts from a working Modeler session as shown in Figure 1.
Figure 1. Basket Rule modeler client
In order to use the predictive model built in this session, you need to create a scoring branch by connecting an input node to the scoring nugget and connecting the scoring nugget to an output flat file, as shown below. Note that the branch in our example is quite trivial and not representative of the typical complexity of real modeling applications. It relies on two input fields, sex and income, to predict whether a customer is likely to purchase the combination of beer, beans, and pizza.
Figure 2. Basket Rule scoring branch
We have created a sample input file to test the execution of the model.
Figure 3. Basket Rule sample input file
Running the branch from within Modeler using the sample set of user inputs produces the following file. This sample data can be used to validate the scoring branch in the SPSS Modeler Workbench. It will also be used to validate the Streams operator and the application we will build later.
Figure 4. Basket Rule sample output file
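Once the operator exists, the same sample input and output files can drive an automated regression check. The sketch below is our own illustration, not part of the SPSS or Streams tooling: the `files_match` helper and the file paths in the test are assumptions, and you would substitute the reference output saved from the Modeler run and the output produced by the Streams application.

```python
# Sketch: validate a scoring run by comparing its output file with the
# reference output produced in the Modeler workbench. The helper and
# file paths are illustrative assumptions, not SPSS/Streams utilities.
from itertools import zip_longest

def files_match(expected_path, actual_path):
    """Return (True, "") if the two delimited files agree line by line,
    else (False, detail) describing the first difference."""
    with open(expected_path) as exp, open(actual_path) as act:
        for lineno, (e, a) in enumerate(zip_longest(exp, act), start=1):
            if e is None or a is None:
                return False, f"files differ in length at line {lineno}"
            if e.rstrip("\n") != a.rstrip("\n"):
                return False, f"line {lineno}: {e.strip()!r} != {a.strip()!r}"
    return True, ""
```

Run this after each change to the operator to confirm that scoring through Streams still reproduces the output Modeler produced for the sample inputs.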
It is probably best to save the modified stream after the scoring branch has been created, and then use a temporary Export node to publish the artifacts necessary to use the scoring model in the Streams operator. In the Modeler session, change the output file settings to something similar to what is shown below, and then publish.
NOTE: Be sure to specify that the metadata should be published. This produces an XML document that describes the inputs and outputs of the model, including their field names and field types. This is the information the Streams component developer needs in order to build the operator that calls this predictive model.
Figure 5. Basket Rule publish dialog
The completed Modeler workbench session is provided in the streamsbasketrule.str file (see Download).
In order to write the Streams operator, the Streams component developer needs to know certain information, such as the inputs and outputs of the predictive model. Specifically, the operator developer will require:
- The install location of Solution Publisher.
- The .pim and .par files produced during the publish.
- The input source node key name. This can be found in the XML fragment below:
<inputDataSources>
  <inputDataSource name="file0" type="Delimited">
NOTE: While there is no technical limitation, our example is limited to supporting a single input source for simplicity.
- The input field names and storage types, and their order, as found in Listing 1.
Listing 1. Input field names as found in the XML fragment
<fields>
  <field storage="string" type="flag">
    <name>sex</name>
  </field>
  <field storage="integer" type="range">
    <name>income</name>
  </field>
</fields>
- The output source node key name. This can be found in the XML fragment in Listing 2.
Listing 2. Output source node key name as found in XML fragment
<outputDataSources>
  <outputDataSource name="file3" type="Delimited">
NOTE: While there is no technical limitation, our example is limited to supporting a single output source for simplicity.
- The output field names and storage types, and their order, as found in Listing 3.
Listing 3. Output field names as found in the XML fragment
<fields>
  <field storage="string" type="flag">
    <name>sex</name>
  </field>
  <field storage="integer" type="range">
    <name>income</name>
  </field>
  <field storage="string" type="flag">
    <name>$C-beer_beans_pizza</name>
    <flag>
      <true>
        <value>T</value>
      </true>
      <false>
        <value>F</value>
      </false>
    </flag>
  </field>
  <field storage="real" type="range">
    <name>$CC-beer_beans_pizza</name>
    <range>
      <min>
        <value>0.0</value>
      </min>
      <max>
        <value>1.0</value>
      </max>
    </range>
  </field>
</fields>
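The published metadata can also be read programmatically rather than by hand. The following Python sketch pulls the source node names and the ordered field names and storage types out of the XML; the element names follow the fragments shown above, but the overall document layout (here wrapped in a placeholder root element) and the embedded sample string are assumptions — in practice you would parse the .xml file produced alongside the .pim and .par files.

```python
# Sketch: extract source node names and ordered (field, storage) pairs
# from the published metadata XML. Element names follow the fragments
# in the listings above; the wrapping root element is an assumption.
import xml.etree.ElementTree as ET

METADATA = """
<model>
  <inputDataSources>
    <inputDataSource name="file0" type="Delimited">
      <fields>
        <field storage="string" type="flag"><name>sex</name></field>
        <field storage="integer" type="range"><name>income</name></field>
      </fields>
    </inputDataSource>
  </inputDataSources>
</model>
"""

def describe_sources(root, tag):
    """Return [(source_name, [(field_name, storage), ...]), ...]
    for every element with the given tag (e.g. inputDataSource)."""
    sources = []
    for src in root.iter(tag):
        fields = [(f.findtext("name"), f.get("storage"))
                  for f in src.iter("field")]
        sources.append((src.get("name"), fields))
    return sources

root = ET.fromstring(METADATA)
print(describe_sources(root, "inputDataSource"))
# [('file0', [('sex', 'string'), ('income', 'integer')])]
```

The same function, called with "outputDataSource", would list the output node and its fields, giving the operator developer the key names and field order in one step.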
The data analyst has also informed the Streams component developer that the model does not modify the input fields, even though they are listed as outputs of the model. While not critical, this information allows the operator writer to optimize by not recopying those fields to the output tuple.
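To make the output fields concrete: the model appends a predicted flag ($C-beer_beans_pizza, T or F) and a confidence value ($CC-beer_beans_pizza, between 0.0 and 1.0). A downstream decision step consuming scored tuples might look like the sketch below; the `should_target_offer` helper, the 0.8 threshold, and the example rows are illustrative assumptions, not part of the published model.

```python
# Sketch of a downstream decision using the model's two appended
# output fields. The 0.8 confidence threshold and the helper name
# are illustrative assumptions.

def should_target_offer(score_row, min_confidence=0.8):
    """Decide whether to target one scored customer with an offer.

    score_row maps output field names (as published in the model
    metadata) to their values for a single scored record.
    """
    predicted = score_row["$C-beer_beans_pizza"] == "T"
    confident = score_row["$CC-beer_beans_pizza"] >= min_confidence
    return predicted and confident

# Example scored rows (values are illustrative, not from the demo data).
print(should_target_offer({"$C-beer_beans_pizza": "T",
                           "$CC-beer_beans_pizza": 0.92}))  # True
print(should_target_offer({"$C-beer_beans_pizza": "F",
                           "$CC-beer_beans_pizza": 0.95}))  # False
```

This is the kind of real-time decision logic the Streams application can layer on top of the operator's scored output.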
Now that we have the model published and the information necessary to write the operator, the next section covers how the Streams component developer goes about producing the operator.