IBM Streams 4.2.1

Namespace com.ibm.streams.mining.scoring

The Mining Toolkit has four toolkit operators. They share a common namespace: com.ibm.streams.mining.scoring.

The four operators have the following characteristics:
  • They share a common structure.
  • They have similar parameters.
  • Each has one required and one optional input stream.
  • Each produces a single output stream.

Consider the following example, which uses the Classification operator:

type
  InputSchema = int64 a, int64 b;
graph
  stream<InputSchema> Input = FileSource(){
    param
      file : "testdata.csv";
  }
  stream<InputSchema, tuple<rstring predictedClass, float64 confidence>> 
  result = Classification(Input){
    param
      model : "../models/decisiontree.pmml";
      a: "m1";
      b: "m2";
  }

The input stream contains the data to score with the PMML model in the file that is specified by the model parameter. The mapping between input stream attributes and model parameters must be specified explicitly with operator parameters, because the attribute names in the learning data and the scoring data can differ. For example, the model might be built with an attribute named Sex while the input stream has an attribute named Gender, and both refer to the same value. In the listed example, input stream attribute a is mapped to model parameter m1, and input stream attribute b is mapped to model parameter m2. Note that the output stream contains two attributes, predictedClass and confidence, that are not input stream attributes and have no explicit assignments. The Classification operator generates the values for these two attributes and assigns them.

Each operator has a required input stream that contains the data to score using the PMML model, as shown in the preceding example. The optional input stream allows an application to replace the PMML model that it uses for scoring while the application is running. Each time an operator receives a tuple on this input stream, it treats the tuple's value as the file name of a new PMML model and begins to use that model to score the data stream. This input stream must have a single attribute of type rstring. If the file named by this attribute does not exist, is not readable, or is not a valid PMML document, the operator logs an error and continues to score with the current PMML model.
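
The following sketch shows one way the optional input stream might be wired up, extending the earlier Classification example. The composite name, the stream name ModelUpdates, the attribute name modelFile, and the directory /tmp/models are hypothetical and chosen only for illustration. The sketch assumes that the standard DirectoryScan operator is an acceptable source of model file names; any stream with a single rstring attribute would serve the same purpose.

```
use com.ibm.streams.mining.scoring::Classification;

composite ScoreWithModelUpdates {
  type
    InputSchema = int64 a, int64 b;
  graph
    // Required input stream: the data to score.
    stream<InputSchema> Input = FileSource() {
      param
        file : "testdata.csv";
    }

    // Optional input stream (hypothetical source): one tuple for each
    // .pmml file that appears in the assumed directory /tmp/models.
    stream<rstring modelFile> ModelUpdates = DirectoryScan() {
      param
        directory : "/tmp/models";
        pattern : ".*\\.pmml";
    }

    // Input ports are separated by ';': data first, model names second.
    stream<InputSchema, tuple<rstring predictedClass, float64 confidence>>
    result = Classification(Input; ModelUpdates) {
      param
        model : "../models/decisiontree.pmml";
        a : "m1";
        b : "m2";
    }
}
```

In this sketch, scoring starts with decisiontree.pmml. Each file name that arrives on ModelUpdates is then treated as a candidate replacement model. A replacement model presumably needs mining fields that match the existing mapping parameters, because those parameters are fixed at compile time.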

Mapping parameters

These operators use parameter specifications to establish the mapping between PMML mining model parameters and input stream attributes. For each MiningField element in the MiningSchema element of the given PMML document, if its usageType attribute is neither supplementary nor predicted, its name attribute must be the same as the value of an operator parameter, and the name of that operator parameter must be the same as that of an input stream attribute. Each mapping parameter must have exactly one value of type rstring that is an expression which can be evaluated at compile time.

Note: There might be input stream attributes that do not correspond to mining model fields.
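
As an illustration of these rules, the following fragment sketches the Sex and Gender case mentioned earlier. The schema, the file names, and the model customers.pmml are hypothetical. The sketch assumes a model whose MiningSchema declares the active fields Sex and Age.

```
use com.ibm.streams.mining.scoring::Classification;

composite MapGenderToSex {
  type
    // Hypothetical schema: customerId has no counterpart in the model.
    CustomerSchema = rstring customerId, rstring Gender, int64 Age;
  graph
    stream<CustomerSchema> Customers = FileSource() {
      param
        file : "customers.csv";
    }

    stream<CustomerSchema, tuple<rstring predictedClass, float64 confidence>>
    Scored = Classification(Customers) {
      param
        model : "../models/customers.pmml";
        // Parameter name = input stream attribute,
        // parameter value = MiningField name in the PMML document.
        Gender : "Sex";
        Age : "Age";
        // customerId is intentionally unmapped: it is not a mining field.
    }
}
```

The Age mapping is written out even though both names match, because each mining field that is neither supplementary nor predicted must appear as the value of an operator parameter.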

If a MiningField that requires a mapping has no corresponding input stream attribute, the sc compiler issues an error message that names the field. It issues one such message per sc invocation. Each of the toolkit operators calculates additional values for each input stream tuple it scores and automatically assigns them to output stream attributes.

Operators

  • Associations: Association rules are represented as [x] => [y] where [x] is the rule body or antecedent, and [y] is the rule head or consequent.
  • Classification: The Classification operator calculates the predicted class and the confidence for each tuple in the input stream and automatically assigns those values to output stream attributes.
  • Clustering: The Clustering operator calculates the cluster index and clustering score for each tuple in the input stream and automatically assigns those values to output stream attributes.
  • Regression: The Regression operator calculates the predicted value and the predicted standard deviation for each tuple in the input stream and automatically assigns those values to output stream attributes.