Designing the operator
At a conceptual level, the generic operator needs to provide parameters that a Streams application developer can use to tailor the operator's use to different predictive models with differing input and output tuple formats. In this section, we describe the parameters our generic operator provides and the design decisions that led to this set of parameters.
Our generic operator design starts with the decision to continue supporting only a single input port and a single output port.
Recall that the .pim and .par filenames in our non-generic operator were hard-coded in the .cgt file. For our generic operator to accept any user-specified model files, these filenames need to be passed into the operator. We will use parameters to specify them, similar to how the SPInstall location was handled in the non-generic operator.
Recall that much of the information a Streams component developer needed to build the non-generic operator was obtained by examining the predictive model's XML metadata file. This file will be read programmatically and used to dynamically generate the necessary operator code when an application using the generic operator is compiled. We will use a parameter similar to those used for the .pim and .par files to specify the XML metadata file.
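To make the idea of reading the metadata programmatically more concrete, the following Python sketch extracts input and output field names from a small XML document. It is an illustration only: the element and attribute names (model, inputs, outputs, field, name, type) are a hypothetical layout invented for this example and are not the actual SPSS metadata schema, and the operator itself performs this step in its code generation template at compile time rather than in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout, invented for illustration only.
# The real model metadata file uses its own schema.
SAMPLE_METADATA = """
<model>
  <inputs>
    <field name="age" type="integer"/>
    <field name="income" type="double"/>
  </inputs>
  <outputs>
    <field name="predLabel" type="string"/>
    <field name="confidence" type="double"/>
  </outputs>
</model>
"""


def read_model_fields(xml_text):
    """Return (inputs, outputs), each a list of (name, type) pairs."""
    root = ET.fromstring(xml_text.strip())
    inputs = [(f.get("name"), f.get("type"))
              for f in root.findall("./inputs/field")]
    outputs = [(f.get("name"), f.get("type"))
               for f in root.findall("./outputs/field")]
    return inputs, outputs


inputs, outputs = read_model_fields(SAMPLE_METADATA)
print("model inputs: ", inputs)
print("model outputs:", outputs)
```

With the field names and types in hand, a code generator has what it needs to emit the code that populates each model input and reads each model output.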
A key requirement for a Streams application developer using the generic operator is the ability to specify which input tuple data feeds the input fields of the model. In our non-generic operator, we hard-coded this and required that the tuple's attributes be compatible with the format expected by the model's input fields. For our generic operator, we need to allow the application developer to indicate how to populate these input fields from the appropriate input data. To accomplish this, we define two parameters, each a list. The first list contains SPL expressions that produce, from the input tuple data, the values used to populate the model's input fields. The second list indicates which model input field is populated from the corresponding expression in the first list.
By allowing SPL expressions, we allow greater flexibility in specifying the data to be used. In the simplest case, the expression is a tuple attribute already in the proper format, the same as was used in the non-generic operator. But it could also be a simple type conversion from the input tuple data to the data type needed by the model, a complex expression that uses several attributes to derive the needed input field value, or a literal value for cases where the model requires data that is not in the input tuple and a default value will suffice. While we could have gotten by with a single list and required the application developer to ensure that the expressions were listed in the exact order needed for the model, we thought that to be too error-prone. Instead, we chose to require the second list parameter, which explicitly maps each expression to the name of the model input field it populates.
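The two-list design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the operator's implementation: the function name bind_inputs, the field names, and the sample expressions are all hypothetical, and the SPL expressions are treated as opaque strings. The point it shows is that pairing each expression with an explicitly named field makes the order of the lists independent of the model's own field order and lets mistakes be caught early.

```python
def bind_inputs(expressions, model_fields, known_fields):
    """Pair each expression with the model input field it populates.

    expressions  -- SPL expression texts (opaque strings in this sketch)
    model_fields -- model input field names, one per expression
    known_fields -- input field names the model actually declares
    """
    if len(expressions) != len(model_fields):
        raise ValueError("expression list and field list differ in length")
    unknown = [name for name in model_fields if name not in known_fields]
    if unknown:
        raise ValueError(f"not model input fields: {unknown}")
    if len(set(model_fields)) != len(model_fields):
        raise ValueError("a model field is listed more than once")
    return dict(zip(model_fields, expressions))


# Hypothetical model input fields, listed in the model's own order.
known = ["age", "income", "region"]

# The developer may list the pairs in any order: a derived value,
# a type conversion, and a literal default, respectively.
bindings = bind_inputs(
    expressions=["salary + bonus", "(int32)ageText", '"unknown"'],
    model_fields=["income", "age", "region"],
    known_fields=known,
)

for field in known:
    print(f"{field} <- {bindings[field]}")
```

With a single positional list, swapping two expressions of compatible types would go unnoticed; with the explicit field list, a misspelled or duplicated field name is rejected before any data is scored.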
A similar set of information is needed to map from the model's output fields to the output tuple. We felt the most natural Streams implementation would be to specify this through the SPL output clause and the use of custom output functions on the operator's output port. By doing this, we allow the application developer to choose which output fields are used and which are ignored. We chose to provide both a basic assign function to populate the data in the output tuple and one that allows a default value to be specified for cases where the model doesn't produce an output (missing data) for a given set of inputs. The default value can be any SPL expression, so it can be built from existing attributes or be a literal value.
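The behavior of the two output functions can be sketched in Python as follows. This is only an illustration under assumptions: the function names from_model and from_model_or_default, the field names, and the use of None to represent a missing model output are all hypothetical, and the sketch assumes the basic function passes a missing value through unchanged.

```python
def from_model(model_output, field):
    """Basic assign: return the model's value for the field as-is.

    In this sketch a missing output is represented by None and is
    passed through unchanged (an assumption made for illustration).
    """
    return model_output[field]


def from_model_or_default(model_output, field, default):
    """Assign with fallback: use default when the model gave no value."""
    value = model_output.get(field)
    return default if value is None else value


# Hypothetical scoring result: the model produced no confidence value.
scores = {"predLabel": "churn", "confidence": None, "rawScore": 0.73}

# The developer picks only the outputs of interest; rawScore is ignored.
output_tuple = {
    "label": from_model(scores, "predLabel"),
    "conf": from_model_or_default(scores, "confidence", 0.0),
}
print(output_tuple)
```

Because each output attribute names the model field it draws from, output fields that are not mentioned are simply never copied into the output tuple.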
Now that we have our design planned out, the next section shows how this design is implemented in the generic operator.