# Representing predictive solutions in PMML

Move from raw data to predictions

## Introduction to PMML

Today, sensors are becoming ubiquitous, from smart meters in homes to the monitoring of equipment and structures such as deep water oil rigs. To make sense of all the data being gathered from these sensors, predictive analytics needs open standards, which allow systems to communicate without the impediments of proprietary code and incompatibilities. PMML is the standard used to represent predictive analytic or data mining models. With PMML, a predictive solution can be built in one system and deployed in another where it can be put to work immediately.

For the Petroleum and Chemical industry, predictive maintenance is one application in which raw data captured from sensors can be pre-processed and used to build predictive solutions that can detect a machinery breakdown before it happens. In the wake of the Gulf of Mexico tragedy, predictive analytics and open standards can provide yet another tool for ensuring safety and process reliability.

As the de facto standard to represent predictive solutions, PMML allows model and data transformations to be represented together in a single and concise way. When used to represent all the computations that make up a predictive solution, PMML becomes the bridge not only between data analysis, model building, and deployment systems, but also between all the people and teams involved in the analytical process inside a company. This is extremely important, since it can be used to disseminate knowledge and best practices as well as ensure transparency.

## Predictive modeling techniques

This section focuses on all the predictive modeling techniques covered by specific PMML elements. Although a myriad of different techniques are proposed every year, they need to be proven and adopted by a large community of data mining practitioners before becoming part of the standard. As of Version 4.0, released in 2009, PMML offers specific elements for the following modeling or statistical techniques:

- Association Rules: `AssociationModel` element
- Cluster Models: `ClusteringModel` element
- Decision Trees: `TreeModel` element
- Naïve Bayes Classifiers: `NaiveBayesModel` element
- Neural Networks: `NeuralNetwork` element
- Regression: `RegressionModel` and `GeneralRegressionModel` elements
- Rulesets: `RuleSetModel` element
- Sequences: `SequenceModel` element
- Support Vector Machines: `SupportVectorMachineModel` element
- Text Models: `TextModel` element
- Time Series: `TimeSeriesModel` element

These techniques allow you to extract patterns from historical data that are not obvious to the human eye. *Association Rules*, for example, are frequently used to uncover relationships between products in large-scale transaction data. When presented with sales data from a supermarket, Association Rules might reveal that customers who purchased items A and B also purchased item C. The information conveyed by Association Rules can then be used to drive marketing activities as well as the placement of products inside a store.

On the other hand, as the name suggests, *Cluster Models* are used to cluster data into specific buckets based on a pre-determined similarity measure. Cluster Models can be center-based, where the cluster center is defined by a data vector, or distribution-based, where the center is defined by statistics. When put to work, a Cluster Model will assign incoming data to the cluster with the closest center.
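The assignment step for a center-based Cluster Model can be sketched in a few lines. This is an illustrative example, not PMML itself; the function name and the two cluster centers below are ours, and squared Euclidean distance stands in for whatever similarity measure the model defines.

```python
# Center-based cluster assignment: incoming data goes to the cluster
# whose center is nearest under squared Euclidean distance.
def assign_cluster(point, centers):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Return the index of the closest center.
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
print(assign_cluster((2.0, 1.0), centers))   # 0
print(assign_cluster((9.0, 12.0), centers))  # 1
```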

Another commonly used modeling technique known as *Decision Trees* implements a
tree-like structure in which the data is partitioned by a series of decision nodes.
Leaf nodes define a particular class in case of classification trees. Decision Trees
are a favorite for applications in which the rationale behind a predictive decision
needs to be explained. The developerWorks article, "What is PMML?", focused on yet
another technique, namely *Neural Networks*, which offers a non-linear way to extract relationships between data fields (see Related topics for a link). Independent of the modeling technique used, however, the goal is clear: to be able to find patterns in data or to model complex relationships between the input data and the output they are trying to predict.

A recent trend in predictive analytics is to use a combination of several statistical
techniques, also known as a *model ensemble*, to solve a single problem. In
this scenario, each model produces a prediction which is then combined into an overall
result. As with the old belief that two heads are better than one, given that
different techniques look at data from different mathematical perspectives, their
combination can boost predictive performance. To address the use of more than one
technique or model to solve a single problem, PMML defines a multiple model element
called `MiningModel`

. It offers a series of methods that
allow you to combine the output from different models. Examples include majority vote and weighted average.
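The two combination methods named above can be sketched as follows. This is a minimal illustration of the idea, not the PMML `MiningModel` machinery; the function names and the sample model outputs are ours.

```python
from collections import Counter

def majority_vote(predictions):
    """Classification ensemble: the class predicted by most models wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Regression ensemble: a weighted mean of the individual model outputs."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Three classifiers vote on a class label:
print(majority_vote(["high", "normal", "high"]))          # high

# Three regression models, with the second trusted twice as much:
print(weighted_average([0.2, 0.5, 0.8], [1.0, 2.0, 1.0]))  # 0.5
```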

## Working with data in PMML

The model elements discussed above serve as the anchors around which a particular modeling technique is represented in PMML. In fact, each model element encapsulates all the attributes and sub-elements necessary to represent that technique in detail, including its parameters and structure. But, as much as the brain of a predictive solution is its models, its eyes are the data that feed the model with raw and derived input fields. Since PMML is able to represent not only the brain, but also the eyes, it is capable of defining an entire predictive solution.

To accomplish this feat, PMML defines a number of elements and attributes, as well as a specific order governing their usage. A PMML file always starts with the elements used for data setup. Once data setup is accomplished, PMML allows for the definition of data pre-processing steps followed by the model itself. Let's take a look at all three of these steps, starting with data setup.

### Data setup

PMML specifies a series of elements that are used to define the data fields of interest. The `DataDictionary` element is used to specify all the raw input data fields being used by a model. Listing 1 shows how a numerical field named `pressure` is represented in the `DataDictionary` element. Note that besides type information, it allows for specifying the interval of valid values. In this example, any values lower than 0 or greater than 100 are considered invalid.

##### Listing 1. `DataDictionary` element in PMML

```xml
<DataDictionary>
  <DataField name="pressure" dataType="double" optype="continuous">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="100" />
  </DataField>
  <!-- Other DataFields -->
</DataDictionary>
```

Another PMML element for representing input data is `MiningSchema`. This element is extremely important whenever a model is deployed and put to work, since it defines what to do in case any of the raw input fields defined in the `DataDictionary` element are missing or contain invalid values. This element also allows for the treatment of outliers: extreme values associated with a given input field.

In the real world, far from the system where the model was built, sensors might malfunction, providing distorted information or no information at all. For these situations, the `MiningSchema` element provides the exact process to be followed, which substantially increases the robustness of the overall solution. Listing 2 shows the `MiningSchema` representation for field `pressure`.

##### Listing 2. Mining Schema element in PMML

```xml
<MiningSchema>
  <MiningField name="pressure" usageType="active"
               missingValueReplacement="35.32" missingValueTreatment="asMean"
               invalidValueTreatment="asMissing"
               outliers="asExtremeValues" lowValue="10" highValue="70"/>
  <!-- Other MiningFields -->
</MiningSchema>
```

According to this example, if the incoming value is missing, it is replaced with the value 35.32, which represents the mean value for this field as computed from the historical data. Also note that any invalid values (lower than 0 or greater than 100, as defined in the `DataDictionary` in Listing 1) are treated as missing values. If valid values lower than 10 or greater than 70 are encountered, though, these are treated as outliers and automatically replaced by the values 10 or 70, respectively.
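The treatment that Listing 2 prescribes for `pressure` can be sketched as plain code. The constants mirror the listing; the function name is ours, and a real PMML consumer would apply this logic generically to every `MiningField`.

```python
def treat_pressure(value):
    # asMean: replace a missing value with the field's historical mean.
    if value is None:
        return 35.32
    # asMissing: values outside the DataDictionary interval [0, 100]
    # are invalid and fall back to the missing-value replacement.
    if value < 0 or value > 100:
        return 35.32
    # asExtremeValues: clip outliers to [lowValue, highValue] = [10, 70].
    return min(max(value, 10.0), 70.0)

print(treat_pressure(None))   # 35.32 (missing)
print(treat_pressure(150.0))  # 35.32 (invalid, treated as missing)
print(treat_pressure(85.0))   # 70.0  (outlier, clipped)
print(treat_pressure(42.0))   # 42.0  (passed through unchanged)
```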

### Data pre-processing

Once the data setup is done, PMML allows for the definition of a vast array of data pre-processing steps. For that, it provides a set of elements for common data transformations as well as a list of built-in functions for the definition of arithmetic and logic operations. The data pre-processing computations are used to boost the predictive power of the raw input data or simply to prepare the data to be presented to the model itself. For example, many modeling techniques only take numerical fields as input. In this case, any categorical input will need to be transformed into numerical input before being used.

PMML provides these elements for pre-processing and data transformation:

- `Normalization`: Map continuous and discrete values to numbers.
- `Discretization`: Map continuous values to discrete values.
- `Value mapping`: Map discrete values to discrete values.
- `Functions`: Derive a value by applying a function to one or more parameters.

Listing 3 shows the PMML normalization element `NormContinuous`.

##### Listing 3. Normalization in PMML

```xml
<DerivedField name="normalized_pressure" dataType="double" optype="continuous">
  <NormContinuous field="pressure">
    <LinearNorm norm="0" orig="10"/>
    <LinearNorm norm="1" orig="70"/>
  </NormContinuous>
</DerivedField>
```

In this example, PMML is used to transform the value of the input field `pressure` to a value between 0 and 1. Note that the new normalized value is further assigned to a new field, a derived field named `normalized_pressure`.
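Numerically, the two `LinearNorm` pairs in Listing 3 define a linear interpolation between the points (orig=10, norm=0) and (orig=70, norm=1). A sketch of that computation, with a function name of our choosing:

```python
def normalize_pressure(pressure, orig0=10.0, orig1=70.0, norm0=0.0, norm1=1.0):
    # Linear interpolation between (orig0, norm0) and (orig1, norm1).
    return norm0 + (pressure - orig0) * (norm1 - norm0) / (orig1 - orig0)

print(normalize_pressure(10.0))  # 0.0
print(normalize_pressure(40.0))  # 0.5
print(normalize_pressure(70.0))  # 1.0
```

Recall from Listing 2 that outliers are clipped to [10, 70] before this step, so in this solution the normalized value stays within [0, 1].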

Normalizations such as these are commonly applied to data fields used as input to a Neural Network model. When building your predictive analytic model using IBM® SPSS® Statistics, you automatically have the choice of exporting it as a PMML file. If you build a Neural Network model, all input fields used by the network will be normalized and the resulting PMML file will incorporate the `NormContinuous` element for all continuous input fields.

Listing 4 shows the PMML discretization element `Discretize`.

##### Listing 4. Discretization in PMML

```xml
<DerivedField name="categorical_pressure" dataType="string" optype="categorical">
  <Discretize field="pressure">
    <DiscretizeBin binValue="low">
      <Interval closure="openClosed" rightMargin="25" />
    </DiscretizeBin>
    <DiscretizeBin binValue="normal">
      <Interval closure="openClosed" leftMargin="25" rightMargin="50" />
    </DiscretizeBin>
    <DiscretizeBin binValue="high">
      <Interval closure="openOpen" leftMargin="50" />
    </DiscretizeBin>
  </Discretize>
</DerivedField>
```

In this example, the numerical input field `pressure` is binned into three categories (`low`, `normal`, and `high`), which are assigned to a new derived field named `categorical_pressure`. The first bin maps values up to 25 to `low`. The second maps values greater than 25 and less than or equal to 50 to `normal`. The third and last bin maps values greater than 50 to `high`.

Note that the `Discretize` element defines a set of `DiscretizeBin` sub-elements that use the `Interval` element in the same way as the `DataDictionary` element in Listing 1. The reuse of generic elements inside specialized ones is a common theme in PMML. This makes the language more readable and, for analytical tools, easier to export and import.

Listing 5 shows the PMML mapping element `MapValues`. In this example, the derived field `categorical_pressure` created above is used as the input field to the mapping transformation, which creates a field named `grouped_pressure`. This is a great feature of PMML, since it allows derived fields to be created from other derived fields.

##### Listing 5. Mapping in PMML

```xml
<DerivedField name="grouped_pressure" dataType="integer" optype="categorical">
  <MapValues outputColumn="group">
    <FieldColumnPair column="C1" field="categorical_pressure" />
    <InlineTable>
      <row> <C1>low</C1>    <group>1</group> </row>
      <row> <C1>normal</C1> <group>1</group> </row>
      <row> <C1>high</C1>   <group>2</group> </row>
    </InlineTable>
  </MapValues>
</DerivedField>
```

Note that, in this case, the `MapValues` element groups input categories. It uses the `InlineTable` element to assign the categories `low` and `normal` to group 1 and the category `high` to group 2.
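Functionally, the `InlineTable` in Listing 5 is just a lookup table. A sketch, with a plain dictionary standing in for the table and names of our choosing:

```python
# Each InlineTable row becomes one entry: C1 value -> group value.
GROUP_TABLE = {"low": 1, "normal": 1, "high": 2}

def group_pressure(categorical_pressure):
    return GROUP_TABLE[categorical_pressure]

print(group_pressure("normal"))  # 1
print(group_pressure("high"))    # 2
```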

PMML also defines many built-in functions for arithmetic and logic operations, together with a generic `IF-THEN-ELSE` function. When combined with other functions, it provides a powerful representation medium for almost any kind of pre-processing task (see Listing 6).

##### Listing 6. `IF-THEN-ELSE` function

```
IF categorical_pressure = "high"
THEN system_pressure = 0.3 * pressure
ELSE system_pressure = 2 ^ (1 + log(0.34 * (pressure + 1)))
```

Listing 6a shows the PMML equivalent of the operation in Listing 6.

##### Listing 6a. Defining a generic transformation in PMML

```xml
<DerivedField name="system_pressure" dataType="double" optype="continuous">
  <Apply function="if">
    <Apply function="equal">
      <FieldRef field="categorical_pressure" />
      <Constant>high</Constant>
    </Apply>
    <!-- THEN -->
    <Apply function="*">
      <Constant>0.3</Constant>
      <FieldRef field="pressure" />
    </Apply>
    <!-- ELSE -->
    <Apply function="pow">
      <Constant>2</Constant>
      <Apply function="+">
        <Constant>1</Constant>
        <Apply function="log">
          <Apply function="*">
            <Constant>0.34</Constant>
            <Apply function="+">
              <FieldRef field="pressure" />
              <Constant>1</Constant>
            </Apply>
          </Apply>
        </Apply>
      </Apply>
    </Apply>
  </Apply>
</DerivedField>
```
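Read back as ordinary code, the nested `Apply` elements of Listing 6a compute the following. This sketch assumes `log` denotes the natural logarithm; the function name is ours.

```python
import math

def system_pressure(pressure, categorical_pressure):
    if categorical_pressure == "high":
        return 0.3 * pressure
    # pow(2, 1 + log(0.34 * (pressure + 1))), reading the Apply tree inside out.
    return 2 ** (1 + math.log(0.34 * (pressure + 1)))

print(system_pressure(60.0, "high"))  # 18.0
```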

### Model representation

Once data transformations are fully defined, it is time to represent the brain of the predictive solution, the model itself. The PMML representation for each modeling technique is highly dependent on its own structure and set of parameters. As described earlier, PMML offers an extensive list of elements to represent the most widely used techniques in predictive analytics.

The example depicted in Listing 7 shows the setting up of a neural network element in PMML. Neural layers, neurons, and connection weights are not shown. (See the article "What is PMML?" for how to represent neural layers and neurons in PMML).

##### Listing 7. Setting up a Neural Network element in PMML

```xml
<NeuralNetwork modelName="ElementAnalyzer" functionName="classification"
               activationFunction="tanh" numberOfLayers="2">
```

The `NeuralNetwork` element is composed of four attributes. The first, `modelName`, is used to specify the model name (simple enough, right?). The second, `functionName`, identifies the purpose of the model, which, in this case, is classification, as opposed to regression. The third, `activationFunction`, specifies that the activation function to be used by the network neurons when processing incoming data is `tanh`, a sigmoid function commonly used in Neural Networks. Finally, the fourth attribute, `numberOfLayers`, specifies that the network is defined by two layers, which implies the existence of a single hidden layer as well as an output layer. Note that the input layer is not counted.

As can be seen, PMML is not rocket science. From just inspecting this particular element, you can have a good idea of the model's structure and what it is about: a Neural Network used to classify different elements. Listing 8 shows the definition of a Decision Tree for the same problem.

##### Listing 8. Setting up a Decision Tree element in PMML

```xml
<TreeModel modelName="ElementAnalyzer" algorithmName="CART" functionName="classification">
```

Note that, from the attribute `algorithmName`, you learn that this particular tree was trained with CART (Classification And Regression Trees). Trees built in IBM SPSS Statistics, for example, can benefit from CART to produce Decision Trees that you can easily export as PMML files.

## Conclusion

The information age has come to us with a blessing: the availability of large volumes of data captured from transactions and sensors. This allows for the building of solutions that are able to predict malicious activities, failures, and accidents before they happen or cause harm. To fully benefit from these solutions, though, they must be paired with open standards such as PMML. As applications and systems multiply, you must ensure they speak the same language. Because PMML is used to represent data transformations and models, it becomes the conduit for sharing complete predictive solutions, from raw data to predictions.

#### Downloadable resources

#### Related topics

- What is PMML? (Alex Guazzelli, developerWorks, July 2010): Explore the power of predictive analytics and open standards in this introduction to PMML.
- PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics (Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena; CreateSpace, May 2010): Learn to represent your predictive models as you take a practical look at PMML.
- The Data Mining Group (DMG): Explore multiple resources from this independent, vendor-led consortium that develops data mining standards such as the Predictive Model Markup Language (PMML).
- Zementis PMML Resources page: Review complete PMML examples, including Cluster Models, Decision Trees, Naïve Bayes Classifiers, Neural Network models, Regression Models, scorecards, and Support Vector Machines.
- The PMML page in Wikipedia: Find an overview of PMML plus links to specifications and more.
- The Predictive analytics page in Wikipedia: Read about the types, applications, and statistical techniques common to this area of statistical analysis.
- The Data Mining page in Wikipedia: Visit and read more about the process of extracting patterns from data.
- IBM SPSS Statistics 18 (formerly SPSS Statistics): Put the power of advanced statistical analysis in your hands. Whether you are a beginner or an experienced statistician, its comprehensive set of tools will meet your needs.
- ADAPA: Try a revolutionary predictive analytics decision management platform, available as a service in the cloud or for on-site deployment. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.
- IBM WebSphere Application Server: Build, deploy and manage robust, agile and reusable SOA business applications and services of all types while reducing application infrastructure costs with IBM WebSphere Application Server.
- Industries library: See the developerWorks Industries library for technical articles and tips, tutorials, standards, and IBM Redbooks.
- IBM product evaluation versions: Get your hands on application development tools and middleware products.