Representing predictive solutions in PMML

Move from raw data to predictions

PMML, the Predictive Model Markup Language, is the de facto standard used to represent a myriad of predictive modeling techniques, such as Association Rules, Cluster Models, Neural Networks, and Decision Trees. These techniques empower companies around the globe to extract hidden patterns from data and use them to forecast behavior. This article starts with a look at the predictive modeling techniques that are directly supported by the standard. Because a predictive solution is more than the statistical techniques it harbors, the article then dives deeper into the language and explores the transformations and functions used for data pre-processing, showing how PMML combines pre-processing and modeling to represent a complete predictive solution.


Alex Guazzelli (alex.guazzelli@zementis.com), VP of Analytics, Zementis, Inc.

Dr. Alex Guazzelli is the VP of Analytics at Zementis, Inc., where he is responsible for developing core technology and predictive solutions under ADAPA, a PMML-based decisioning platform. Dr. Guazzelli holds a Ph.D. in Computer Science from the University of Southern California and has recently co-authored the book "PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics".



28 September 2010


Introduction to PMML

Today, sensors are becoming ubiquitous, from smart meters in homes to the monitoring of equipment and structures such as deep water oil rigs. To make sense of all the data being gathered from these sensors, predictive analytics needs open standards, which allow for systems to communicate without the impediments of proprietary code and incompatibilities. PMML is the standard used to represent predictive analytic or data mining models. With PMML, a predictive solution can be built in one system and deployed in another where it can be put to work immediately.

For the Petroleum and Chemical industry, predictive maintenance is one application in which raw data captured from sensors can be pre-processed and used to build predictive solutions that can detect a machinery breakdown before it happens. In the wake of the Gulf of Mexico tragedy, predictive analytics and open standards can provide yet another tool for ensuring safety and process reliability.

As the de facto standard to represent predictive solutions, PMML allows model and data transformations to be represented together in a single and concise way. When used to represent all the computations that make up a predictive solution, PMML becomes the bridge not only between data analysis, model building, and deployment systems, but also between all the people and teams involved in the analytical process inside a company. This is extremely important, since it can be used to disseminate knowledge and best practices as well as ensure transparency.


Predictive modeling techniques

This section focuses on all the predictive modeling techniques covered by specific PMML elements. Although a myriad of different techniques are proposed every year, they need to be proven and adopted by a large community of data mining practitioners before becoming part of the standard. As of Version 4.0, released in 2009, PMML offers specific elements for the following modeling or statistical techniques:

  • Association Rules: AssociationModel element
  • Cluster Models: ClusteringModel element
  • Decision Trees: TreeModel element
  • Naïve Bayes Classifiers: NaiveBayesModel element
  • Neural Networks: NeuralNetwork element
  • Regression: RegressionModel and GeneralRegressionModel elements
  • Rulesets: RuleSetModel element
  • Sequences: SequenceModel element
  • Support Vector Machines: SupportVectorMachineModel element
  • Text Models: TextModel element
  • Time Series: TimeSeriesModel element

These techniques allow you to extract patterns from historical data that are not obvious to the human eye. Association Rules, for example, are frequently used to discover rules or relationships between products in large-scale transaction data. When presented with sales data from a supermarket, Association Rules might reveal that customers who purchased items A and B also purchased item C. The information conveyed by Association Rules can then be used to drive marketing activities as well as the placement of products inside a store.
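To make the supermarket example concrete, the following Python sketch computes support and confidence, the two standard measures attached to an association rule, for the rule "A and B imply C". The transaction data and the function name rule_metrics are hypothetical and serve only as an illustration.

```python
# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]

def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    # Baskets that contain every item of the rule (antecedent and consequent).
    both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    # Baskets that contain at least the antecedent items.
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / len(baskets)
    confidence = both / ante if ante else 0.0
    return support, confidence

support, confidence = rule_metrics({"A", "B"}, {"C"}, transactions)
```

Here two of the five baskets contain A, B, and C, and three contain A and B, so the rule has a support of 2/5 and a confidence of 2/3.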

On the other hand, as their name suggests, Cluster Models are used to cluster data into specific buckets based on a pre-determined similarity measure. Cluster Models can be center-based, where the cluster center is defined by a data vector, or distribution-based, where the center is defined by statistics. When put to work, a Cluster Model will assign incoming data to the cluster with the closest center.
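The assignment step of a center-based Cluster Model can be sketched as follows. The sketch assumes Euclidean distance as the similarity measure; the cluster names, center coordinates, and the function assign_cluster are made up for illustration.

```python
import math

# Hypothetical cluster centers for two input fields (pressure, temperature).
centers = {
    "normal": (35.0, 60.0),
    "overheated": (40.0, 95.0),
    "overpressure": (80.0, 65.0),
}

def assign_cluster(record, centers):
    """Return the name of the cluster whose center is closest to record."""
    # math.dist computes the Euclidean distance between two points.
    return min(centers, key=lambda name: math.dist(record, centers[name]))

cluster = assign_cluster((38.0, 90.0), centers)
```

With these centers, the record (38.0, 90.0) lands in the cluster named overheated, since that center is by far the nearest.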

Another commonly used modeling technique known as Decision Trees implements a tree-like structure in which the data is partitioned by a series of decision nodes. Leaf nodes define a particular class in case of classification trees. Decision Trees are a favorite for applications in which the rationale behind a predictive decision needs to be explained. The developerWorks article, "What is PMML?", focused on yet another technique, namely Neural Networks, which offers a non-linear way to extract relationships between data fields (see Resources for a link). Independent of the modeling technique used, however, the goal is clear: to be able to find patterns in data or to model complex relationships between the input data and the output they are trying to predict.

A recent trend in predictive analytics is to use a combination of several statistical techniques, also known as a model ensemble, to solve a single problem. In this scenario, each model produces a prediction which is then combined into an overall result. As with the old belief that two heads are better than one, given that different techniques look at data from different mathematical perspectives, their combination can boost predictive performance. To address the use of more than one technique or model to solve a single problem, PMML defines a multiple model element called MiningModel. It offers a series of methods that allow you to combine the output from different models. Examples include majority vote and weighted average.
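The two combination methods named above can be sketched in Python as follows. This only illustrates the arithmetic behind a majority vote and a weighted average; the function names, predictions, and weights are hypothetical and not taken from any PMML engine.

```python
from collections import Counter

def majority_vote(labels):
    """Return the class label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Combine numeric predictions using one weight per model."""
    total = sum(p * w for p, w in zip(predictions, weights))
    return total / sum(weights)

# Three hypothetical classification models vote on the same record.
label = majority_vote(["failure", "ok", "failure"])

# Three hypothetical regression models; the first is trusted twice as much.
score = weighted_average([0.80, 0.60, 0.70], [2.0, 1.0, 1.0])
```

A majority vote suits classification models, while a weighted average suits models that output a number, such as a score or a probability.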


Working with data in PMML

The model elements discussed above serve as the anchor in which a particular modeling technique is represented in PMML. In fact, each model element encapsulates all the attributes and sub-elements necessary to represent each modeling technique in detail, its parameters, and structure. But, as much as the brain of a predictive solution is its models, its eyes are the data that feed the model with raw and derived input fields. Since PMML is able to represent not only the brain, but also the eyes, it is capable of delivering the power necessary to define an entire predictive solution.

To accomplish this feat, PMML defines a number of elements and attributes, as well as a specific order governing their usage. A PMML file always starts with the elements used for data setup. Once data setup is accomplished, PMML allows for the definition of data pre-processing steps followed by the model itself. Let's take a look at all three of these steps, starting with data setup.

Data setup

PMML specifies a series of elements that are used to define the data fields of interest. The DataDictionary element is used to specify all the raw input data fields being used by a model. Listing 1 shows how a numerical field named pressure is represented in the DataDictionary element. Note that besides type information, it allows for specifying the interval of valid values. In this example, any values lower than 0 or greater than 100 are considered to be invalid.

Listing 1. DataDictionary element in PMML
<DataDictionary>
   <DataField name="pressure" dataType="double" optype="continuous" >
      <Interval closure="closedClosed" 
         leftMargin="0" rightMargin="100" />
   </DataField>
   <!-- Other DataFields -->            
</DataDictionary>

Another PMML element for representing input data is MiningSchema. This element is extremely important whenever a model is deployed and put to work, since it defines what to do in case any of the raw input fields defined in the DataDictionary element are missing or contain invalid values. This element also allows for the treatment of outliers—extreme values associated with a given input field.

In the real world, far from the system where the model was built, sensors might malfunction, providing distorted information or no information at all. For these situations, the MiningSchema element provides the exact process to be followed, which substantially increases the robustness of the overall solution. Listing 2 shows the MiningSchema representation for field pressure.

Listing 2. MiningSchema element in PMML
<MiningSchema>
   <MiningField name="pressure" usageType="active" 
      missingValueReplacement="35.32"
      missingValueTreatment="asMean" 
      invalidValueTreatment="asMissing"
      outliers="asExtremeValues"
      lowValue="10"
      highValue="70"/>
    <!-- Other MiningFields -->            
</MiningSchema>

According to this example, if the incoming value is missing, it is replaced with value 35.32, which represents the mean value for this field as computed from the historical data. Also note that any invalid values (lower than 0 or greater than 100—as defined in the DataDictionary in Listing 1) are treated as missing values. If valid values lower than 10 or greater than 70 are encountered though, these are treated as outliers and automatically replaced by the values 10 or 70, respectively.
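The treatment just described can be traced in a few lines of Python. This is only a sketch of the semantics for the field pressure, not the way any particular PMML engine is implemented; the function name prepare_pressure is hypothetical, and None stands in for a missing value.

```python
def prepare_pressure(value):
    """Apply the treatments from Listings 1 and 2 to one raw pressure value."""
    # invalidValueTreatment="asMissing": values outside the valid
    # interval [0, 100] from the DataDictionary are treated as missing.
    if value is not None and not (0.0 <= value <= 100.0):
        value = None
    # missingValueReplacement="35.32": missing values are replaced by the mean.
    if value is None:
        return 35.32
    # outliers="asExtremeValues": valid values below lowValue or above
    # highValue are replaced by 10 or 70, respectively.
    return min(max(value, 10.0), 70.0)

readings = [None, -5.0, 150.0, 5.0, 85.0, 42.0]
prepared = [prepare_pressure(r) for r in readings]
```

The six sample readings cover each case: a missing value, two invalid values, two outliers, and one ordinary value that passes through unchanged.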

Data pre-processing

Once the data setup is done, PMML allows for the definition of a vast array of data pre-processing steps. For that, it provides a set of elements for common data transformations as well as a list of built-in functions for the definition of arithmetic and logic operations. The data pre-processing computations are used to boost the predictive power of the raw input data or simply to prepare the data to be presented to the model itself. For example, many modeling techniques only take numerical fields as input. In this case, any categorical input will need to be transformed into numerical input before being used.

PMML provides these elements for pre-processing and data transformation:

  • Normalization: Map continuous and discrete values to numbers.
  • Discretization: Map continuous values to discrete values.
  • Value mapping: Map discrete values to discrete values.
  • Functions: Derive a value by applying a function to one or more parameters.

Listing 3 shows the PMML normalization element NormContinuous.

Listing 3. Normalization in PMML
<DerivedField name="normalized_pressure" 
   dataType="double" optype="continuous">
   <NormContinuous field="pressure">
      <LinearNorm norm="0" orig="10"/>
      <LinearNorm norm="1" orig="70"/>
   </NormContinuous>
</DerivedField>

In this example, PMML is used to transform the value of the input field pressure to a value between 0 and 1. Note that the new normalized value is further assigned to a new field, a derived field named normalized_pressure.
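The two LinearNorm points in Listing 3 define a straight line through (10, 0) and (70, 1). The Python sketch below works through that arithmetic. It assumes the incoming value already lies between 10 and 70, as the outlier treatment in Listing 2 ensures; the function name normalize_pressure is hypothetical.

```python
def normalize_pressure(pressure):
    """Linear interpolation between the two LinearNorm points of Listing 3."""
    orig_low, norm_low = 10.0, 0.0    # <LinearNorm norm="0" orig="10"/>
    orig_high, norm_high = 70.0, 1.0  # <LinearNorm norm="1" orig="70"/>
    fraction = (pressure - orig_low) / (orig_high - orig_low)
    return norm_low + fraction * (norm_high - norm_low)

normalized = [normalize_pressure(p) for p in (10.0, 40.0, 70.0)]
```

A pressure of 40, halfway between 10 and 70, maps to 0.5, while the two end points map to 0 and 1.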

Normalizations such as these are commonly applied to data fields used as input to a Neural Network model. When building your predictive analytic model using IBM® SPSS® Statistics, you automatically have the choice of exporting it as a PMML file. If you build a Neural Network model, all input fields used by the network will be normalized and the resulting PMML file will incorporate the element NormContinuous for all continuous input fields.

Listing 4 shows the PMML discretization element Discretize.

Listing 4. Discretization in PMML
<DerivedField name="categorical_pressure" 
   dataType="string" optype="categorical">
   <Discretize field="pressure">
      <DiscretizeBin binValue="low">
         <Interval closure="openClosed" rightMargin="25" />
      </DiscretizeBin>
      <DiscretizeBin binValue="normal">
         <Interval closure="openClosed" 
            leftMargin="25" rightMargin="50" />
      </DiscretizeBin>
      <DiscretizeBin binValue="high">
         <Interval closure="openOpen" leftMargin="50" />
      </DiscretizeBin>
   </Discretize>
</DerivedField>

In this example, the numerical input field pressure is binned into three categories (low, normal, and high), which are assigned to a new derived field named categorical_pressure. The first bin maps values up to 25 to low. The second maps values greater than 25 and less than or equal to 50 to normal. The third and last bin maps values greater than 50 to high.
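The bin boundaries are easy to get wrong, so here is the same logic as a small Python sketch. The function name discretize_pressure is hypothetical; the comparisons follow the closure attributes in Listing 4.

```python
def discretize_pressure(pressure):
    """Bin a pressure value the way the Discretize element in Listing 4 does."""
    if pressure <= 25:   # openClosed, rightMargin="25"
        return "low"
    if pressure <= 50:   # openClosed, leftMargin="25", rightMargin="50"
        return "normal"
    return "high"        # openOpen, leftMargin="50"

categories = [discretize_pressure(p) for p in (25, 25.1, 50, 50.1)]
```

Because the first two intervals are closed on the right, the values 25 and 50 fall into low and normal, not into the next bin up.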

Note that element Discretize defines a set of DiscretizeBin sub-elements that use the Interval element in the same way as the DataDictionary element in Listing 1. The reuse of generic elements inside specialized ones is a common theme in PMML. This makes the language more readable and, for analytical tools, easier to export and import.

Listing 5 shows the PMML mapping element MapValues. In this example, the derived field categorical_pressure created above is used as the input field to the mapping transformation, which creates a field named grouped_pressure. This is a great feature of PMML since it allows derived fields to be created from other derived fields.

Listing 5. Mapping in PMML
<DerivedField name="grouped_pressure" 
   dataType="integer" optype="categorical">
   <MapValues outputColumn="group">
      <FieldColumnPair column="C1" field="categorical_pressure" />
      <InlineTable>
         <row>
            <C1>low</C1>
            <group>1</group>
         </row>
         <row>
            <C1>normal</C1>
            <group>1</group>
         </row>
         <row>
            <C1>high</C1>
            <group>2</group>
         </row>
      </InlineTable>
   </MapValues>
</DerivedField>

Note that, in this case, the MapValues element groups input categories. It uses the element InlineTable to assign the categories low and normal to group 1 and category high to group 2.
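In programming terms, the InlineTable in Listing 5 behaves like a lookup table. A minimal Python sketch, with hypothetical names, might look like this. Returning None for a category that is not in the table mirrors a missing result, since Listing 5 defines no default value.

```python
# Lookup table equivalent to the rows of the InlineTable in Listing 5.
PRESSURE_GROUPS = {"low": 1, "normal": 1, "high": 2}

def grouped_pressure(categorical_pressure):
    """Map a pressure category to its group; None means no matching row."""
    return PRESSURE_GROUPS.get(categorical_pressure)

groups = [grouped_pressure(c) for c in ("low", "normal", "high")]
```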

PMML also defines many built-in functions for arithmetic and logic operations together with a generic IF-THEN-ELSE function. When combined with other functions, it provides a powerful representation medium for almost any kind of pre-processing task (see Listing 6).

Listing 6. IF-THEN-ELSE function
IF categorical_pressure = "high"
THEN system_pressure = 0.3 * pressure
ELSE system_pressure = 2 ^ (1 + log(0.34 * (pressure + 1)))

Listing 6a shows the PMML equivalent of the operation in Listing 6.

Listing 6a. Defining a generic transformation in PMML
<DerivedField name="system_pressure" 
   dataType="double" optype="continuous">
   <Apply function="if">
      <Apply function="equal">
         <FieldRef field="categorical_pressure" />
         <Constant>high</Constant>
      </Apply>
      <!-- THEN -->
      <Apply function="*">
         <Constant>0.3</Constant>
         <FieldRef field="pressure" />
      </Apply>
      <!-- ELSE -->
      <Apply function="pow">
         <Constant>2</Constant>
         <Apply function="+">
            <Constant>1</Constant>
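            <!-- Assumes log means the natural logarithm; PMML's built-in logarithm functions are named "ln" and "log10" -->
            <Apply function="ln">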
               <Apply function="*">
                  <Constant>0.34</Constant>
                  <Apply function="+">
                     <FieldRef field="pressure" />
                     <Constant>1</Constant>
                  </Apply>
               </Apply>
            </Apply>
         </Apply>
      </Apply>
   </Apply>
</DerivedField>
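For readers who find nested Apply elements hard to follow, the same computation can be written as an ordinary Python function. This sketch follows the nesting used in Listing 6a, where the constant 0.34 multiplies the sum of pressure and 1. It also assumes that log denotes the natural logarithm, which the article does not state; the function name system_pressure is borrowed from the derived field.

```python
import math

def system_pressure(categorical_pressure, pressure):
    """Python rendering of the nested Apply elements in Listing 6a."""
    if categorical_pressure == "high":
        # THEN branch: 0.3 * pressure
        return 0.3 * pressure
    # ELSE branch: 2 raised to (1 + log(0.34 * (pressure + 1))).
    # math.log is the natural logarithm (an assumption, see above).
    return 2 ** (1 + math.log(0.34 * (pressure + 1)))

high_case = system_pressure("high", 60.0)
other_case = system_pressure("normal", 20.0)
```

Reading the function from the inside out matches the way the Apply elements are nested in the PMML: the innermost operation is evaluated first and its result is handed to the enclosing function.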

Model representation

Once data transformations are fully defined, it is time to represent the brain of the predictive solution, the model itself. The PMML representation for each modeling technique is highly dependent on its own structure and set of parameters. As described earlier, PMML offers an extensive list of elements to represent the most widely used techniques in predictive analytics.

The example depicted in Listing 7 shows the setting up of a neural network element in PMML. Neural layers, neurons, and connection weights are not shown. (See the article "What is PMML?" for how to represent neural layers and neurons in PMML.)

Listing 7. Setting up a Neural Network element in PMML
<NeuralNetwork
   modelName="ElementAnalyzer" 
   functionName="classification" 
   activationFunction="tanh"
   numberOfLayers="2">

The NeuralNetwork element in Listing 7 sets four attributes. The first, modelName, is used to specify the model name (simple enough, right?). The second, functionName, identifies the purpose of the model, which, in this case, is classification, as opposed to regression. The third, activationFunction, specifies that the network neurons use tanh, a sigmoid-shaped function commonly used in Neural Networks, when processing incoming data. Finally, the fourth attribute, numberOfLayers, specifies that the network is defined by two layers, which implies the existence of a single hidden layer as well as an output layer. Note that the input layer is not counted.

As can be seen, PMML is not rocket science. From just inspecting this particular element, you can have a good idea of the model's structure and what it is about: a Neural Network used to classify different elements. Listing 8 shows the definition of a Decision Tree for the same problem.

Listing 8. Setting up a Decision Tree element in PMML
<TreeModel modelName="ElementAnalyzer" algorithmName="CART" 
functionName="classification">

Note that, from the attribute algorithmName, you learn that this particular tree was trained with CART (Classification and Regression Trees). IBM SPSS Statistics, for example, can use CART to build Decision Trees that you can easily export as PMML files.


Conclusion

The information age has come to us with a blessing: the availability of large volumes of data captured from transactions and sensors. This allows for the building of solutions that are able to predict malicious activities, failures, and accidents before they happen or cause harm. If you are to fully benefit from these solutions, though, they must be paired with open standards such as PMML. As applications and systems multiply, you must ensure they speak the same language. Because PMML is used to represent both data transformations and models, it becomes the conduit for the sharing of complete predictive solutions, from raw data to predictions.

Resources

Learn

Get products and technologies

  • IBM SPSS Statistics 18 (formerly SPSS Statistics): Put the power of advanced statistical analysis in your hands. Whether you are a beginner or an experienced statistician, its comprehensive set of tools will meet your needs.
  • ADAPA: Try a revolutionary predictive analytics decision management platform, available as a service on the cloud or on site. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.
  • IBM WebSphere Application Server: Build, deploy and manage robust, agile and reusable SOA business applications and services of all types while reducing application infrastructure costs with IBM WebSphere Application Server.
  • IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
