Introduction to PMML
Today, sensors are becoming ubiquitous, from smart meters in homes to the monitoring of equipment and structures such as deep water oil rigs. To make sense of all the data being gathered from these sensors, predictive analytics needs open standards, which allow for systems to communicate without the impediments of proprietary code and incompatibilities. PMML is the standard used to represent predictive analytic or data mining models. With PMML, a predictive solution can be built in one system and deployed in another where it can be put to work immediately.
For the Petroleum and Chemical industry, predictive maintenance is one application in which raw data captured from sensors can be preprocessed and used to build predictive solutions that can detect a machinery breakdown before it happens. In the wake of the Gulf of Mexico tragedy, predictive analytics and open standards can provide yet another tool for ensuring safety and process reliability.
As the de facto standard to represent predictive solutions, PMML allows model and data transformations to be represented together in a single and concise way. When used to represent all the computations that make up a predictive solution, PMML becomes the bridge not only between data analysis, model building, and deployment systems, but also between all the people and teams involved in the analytical process inside a company. This is extremely important, since it can be used to disseminate knowledge and best practices as well as ensure transparency.
Predictive modeling techniques
This section focuses on all the predictive modeling techniques covered by specific PMML elements. Although a myriad of different techniques are proposed every year, they need to be proven and adopted by a large community of data mining practitioners before becoming part of the standard. As of Version 4.0, released in 2009, PMML offers specific elements for the following modeling or statistical techniques:
Association Rules: AssociationModel element
Cluster Models: ClusteringModel element
Decision Trees: TreeModel element
Naïve Bayes Classifiers: NaiveBayesModel element
Neural Networks: NeuralNetwork element
Regression: RegressionModel and GeneralRegressionModel elements
Rulesets: RuleSetModel element
Sequences: SequenceModel element
Support Vector Machines: SupportVectorMachineModel element
Text Models: TextModel element
Time Series: TimeSeriesModel element
These techniques allow you to extract patterns from historical data that are not obvious to the human eye. Association Rules, for example, are frequently used to discover relationships between products in large-scale transaction data. When presented with sales data from a supermarket, Association Rules might reveal that customers who purchased items A and B also purchased item C. The information conveyed by Association Rules can then be used to drive marketing activities as well as the placement of products inside a store.
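To make the supermarket example concrete, here is a minimal Python sketch of the two measures usually attached to a rule such as "A and B implies C": support and confidence. The transactions are invented for illustration, and the function names are hypothetical; this is not how any particular tool computes its rules.

```python
from typing import FrozenSet, List

# Hypothetical toy transactions; items A, B, and C follow the example in the text.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


rule_support = support(frozenset({"A", "B", "C"}), TRANSACTIONS)
rule_confidence = confidence(frozenset({"A", "B"}), frozenset({"C"}), TRANSACTIONS)
```

In this toy data, three baskets contain both A and B, and two of those also contain C, which is the kind of evidence a rule is built on.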
On the other hand, as their name suggests, Cluster Models are used to cluster data into specific buckets based on a predetermined similarity measure. Cluster Models can be center-based, where the cluster center is defined by a data vector, or distribution-based, where the center is defined by statistics. When put to work, a Cluster Model will assign incoming data to the cluster with the closest center.
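The scoring step of a center-based Cluster Model can be sketched in a few lines of Python. This sketch assumes Euclidean distance as the similarity measure and uses made-up cluster names and centers; a real model states its own measure and centers.

```python
import math
from typing import Dict, Sequence


def assign_cluster(point: Sequence[float], centers: Dict[str, Sequence[float]]) -> str:
    """Return the name of the cluster whose center is closest to point."""

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        # Euclidean distance is an assumption of this sketch.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centers, key=lambda name: distance(point, centers[name]))


# Hypothetical centers for two clusters of (pressure, vibration) readings.
CENTERS = {"calm": (10.0, 1.0), "stressed": (60.0, 5.0)}
label = assign_cluster((12.0, 1.5), CENTERS)
```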
Another commonly used modeling technique, known as Decision Trees, implements a tree-like structure in which the data is partitioned by a series of decision nodes. In classification trees, each leaf node assigns a particular class. Decision Trees are a favorite for applications in which the rationale behind a predictive decision needs to be explained. The developerWorks article, "What is PMML?", focused on yet another technique, namely Neural Networks, which offers a nonlinear way to extract relationships between data fields (see Resources for a link). Independent of the modeling technique used, however, the goal is clear: to find patterns in data or to model complex relationships between the input data and the output the model is trying to predict.
A recent trend in predictive analytics is to use a combination of several statistical techniques, also known as a model ensemble, to solve a single problem. In this scenario, each model produces a prediction, and the predictions are then combined into an overall result. As with the old belief that two heads are better than one, given that different techniques look at data from different mathematical perspectives, their combination can boost predictive performance. To address the use of more than one technique or model to solve a single problem, PMML defines a multiple model element called MiningModel. It offers a series of methods that allow you to combine the output from different models. Examples include majority vote and weighted average.
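The two combination methods named above can be sketched in Python as follows. This is only an illustration of the arithmetic, with hypothetical function names and inputs; it does not reproduce how a PMML scoring engine evaluates a MiningModel element.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the class predicted by the most models.

    Ties go to the class seen first, which is a choice made by this sketch.
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_average(predictions: Sequence[float], weights: Sequence[float]) -> float:
    """Combine numeric predictions, giving each model its own weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Three hypothetical classifiers vote on whether a machine will fail.
votes: List[str] = ["failure", "ok", "failure"]
winner = majority_vote(votes)

# Two hypothetical regression models, the second trusted three times as much.
combined = weighted_average([0.2, 0.8], [1.0, 3.0])
```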
Working with data in PMML
The model elements discussed above serve as the anchor in which a particular modeling technique is represented in PMML. In fact, each model element encapsulates all the attributes and subelements necessary to represent each modeling technique in detail, its parameters, and structure. But, as much as the brain of a predictive solution is its models, its eyes are the data that feed the model with raw and derived input fields. Since PMML is able to represent not only the brain, but also the eyes, it is capable of delivering the power necessary to define an entire predictive solution.
To accomplish this feat, PMML defines a number of elements and attributes, as well as a specific order governing their usage. A PMML file always starts with the elements used for data setup. Once data setup is accomplished, PMML allows for the definition of data preprocessing steps followed by the model itself. Let's take a look at all three of these steps, starting with data setup.
Data setup
PMML specifies a series of elements that are used to define the data fields of interest. The DataDictionary element is used to specify all the raw input data fields being used by a model. Listing 1 shows how a numerical field named pressure is represented in the DataDictionary element. Note that besides type information, it allows for specifying the interval of valid values. In this example, any values lower than 0 or greater than 100 are considered to be invalid.
Listing 1. DataDictionary element in PMML
<DataDictionary>
  <DataField name="pressure" dataType="double" optype="continuous">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="100"/>
  </DataField>
  <!-- Other DataFields -->
</DataDictionary>
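To see what the Interval element buys you, here is a rough Python sketch that reads a DataDictionary fragment and checks whether a value is valid for a field. The helper name is_valid is hypothetical, the sketch handles only the closedClosed closure used in Listing 1, and the fragment omits the XML namespace that a complete PMML file declares.

```python
import xml.etree.ElementTree as ET

# Stand-alone copy of the fragment from Listing 1 (no namespace, one field only).
DATA_DICTIONARY = """
<DataDictionary>
  <DataField name="pressure" dataType="double" optype="continuous">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="100"/>
  </DataField>
</DataDictionary>
"""


def is_valid(field_name: str, value: float, xml_text: str = DATA_DICTIONARY) -> bool:
    """Check value against the intervals declared for field_name."""
    root = ET.fromstring(xml_text)
    for field in root.findall("DataField"):
        if field.get("name") != field_name:
            continue
        intervals = field.findall("Interval")
        if not intervals:
            return True  # no interval declared: this sketch accepts any value
        for interval in intervals:
            if interval.get("closure") != "closedClosed":
                raise NotImplementedError("this sketch handles closedClosed only")
            low = float(interval.get("leftMargin", "-inf"))
            high = float(interval.get("rightMargin", "inf"))
            if low <= value <= high:
                return True
        return False
    raise KeyError(field_name)
```

With the fragment above, a reading of 50 is valid while readings of -1 or 100.5 are not.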
Another PMML element for representing input data is MiningSchema. This element is extremely important whenever a model is deployed and put to work, since it defines what to do when any of the raw input fields defined in the DataDictionary element is missing or contains invalid values. It also allows for the treatment of outliers, that is, extreme values associated with a given input field. In the real world, far from the system where the model was built, sensors might malfunction, providing distorted information or no information at all. For these situations, the MiningSchema element specifies the exact process to be followed, which substantially increases the robustness of the overall solution. Listing 2 shows the MiningSchema representation for the field pressure.
Listing 2. MiningSchema element in PMML
<MiningSchema>
  <MiningField name="pressure" usageType="active"
               missingValueReplacement="35.32" missingValueTreatment="asMean"
               invalidValueTreatment="asMissing"
               outliers="asExtremeValues" lowValue="10" highValue="70"/>
  <!-- Other MiningFields -->
</MiningSchema>
According to this example, if the incoming value is missing, it is replaced with the value 35.32, which represents the mean value for this field as computed from the historical data. Also note that any invalid values (lower than 0 or greater than 100, as defined in the DataDictionary in Listing 1) are treated as missing values. If valid values lower than 10 or greater than 70 are encountered, though, they are treated as outliers and automatically replaced by the values 10 or 70, respectively.
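The treatment just described can be written out as a small Python function. This is a sketch of the behavior for the single field pressure, with the constants copied from Listings 1 and 2; the function name prepare_pressure and the use of None for a missing reading are assumptions made for illustration.

```python
from typing import Optional

MEAN_PRESSURE = 35.32         # missingValueReplacement in Listing 2
VALID_RANGE = (0.0, 100.0)    # Interval in Listing 1
OUTLIER_RANGE = (10.0, 70.0)  # lowValue and highValue in Listing 2


def prepare_pressure(value: Optional[float]) -> float:
    """Apply invalid-value, missing-value, and outlier treatment, in that order."""
    # invalidValueTreatment="asMissing": out-of-interval readings become missing.
    if value is not None and not (VALID_RANGE[0] <= value <= VALID_RANGE[1]):
        value = None
    # Missing readings are replaced by the historical mean.
    if value is None:
        return MEAN_PRESSURE
    # outliers="asExtremeValues": clip valid readings to the outlier bounds.
    low, high = OUTLIER_RANGE
    return min(max(value, low), high)
```

A dead sensor (None) and an impossible reading such as 150 both yield 35.32, while a valid but extreme reading such as 85 is pulled back to 70.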
Data preprocessing
Once the data setup is done, PMML allows for the definition of a vast array of data preprocessing steps. For that, it provides a set of elements for common data transformations as well as a list of built-in functions for the definition of arithmetic and logic operations. The data preprocessing computations are used to boost the predictive power of the raw input data or simply to prepare the data to be presented to the model itself. For example, many modeling techniques only take numerical fields as input. In this case, any categorical input will need to be transformed into numerical input before being used.
PMML provides these elements for preprocessing and data transformation:
Normalization: Map continuous and discrete values to numbers.
Discretization: Map continuous values to discrete values.
Value mapping: Map discrete values to discrete values.
Functions: Derive a value by applying a function to one or more parameters.
Listing 3 shows the PMML normalization element NormContinuous. In this example, PMML is used to transform the value of the input field pressure to a value between 0 and 1. Note that the normalized value is assigned to a new derived field named normalized_pressure.
Listing 3. Normalization in PMML
<DerivedField name="normalized_pressure" dataType="double" optype="continuous">
  <NormContinuous field="pressure">
    <LinearNorm norm="0" orig="10"/>
    <LinearNorm norm="1" orig="70"/>
  </NormContinuous>
</DerivedField>
Normalizations such as these are commonly applied to data fields used as input to a Neural Network model. When building your predictive analytic model using IBM® SPSS® Statistics, you automatically have the choice of exporting it as a PMML file. If you build a Neural Network model, all input fields used by the network will be normalized and the resulting PMML file will incorporate the element NormContinuous for all continuous input fields.
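The arithmetic behind the two LinearNorm points in Listing 3 is a straight line through (10, 0) and (70, 1). The Python sketch below shows that computation; the function name is hypothetical, and extending the same line beyond the two points is a simplification of this sketch, since how out-of-range values are handled depends on the settings of the element. Because the MiningSchema in Listing 2 already clips pressure to the range 10 to 70, the result stays between 0 and 1 in this example.

```python
from typing import Tuple

Point = Tuple[float, float]  # (orig, norm) pair, as in a LinearNorm element


def norm_continuous(value: float,
                    first: Point = (10.0, 0.0),
                    second: Point = (70.0, 1.0)) -> float:
    """Linearly interpolate value between two (orig, norm) points."""
    (x0, y0), (x1, y1) = first, second
    return y0 + (value - x0) * (y1 - y0) / (x1 - x0)


normalized_pressure = norm_continuous(40.0)  # halfway between 10 and 70
```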
Listing 4 shows the PMML discretization element Discretize.
Listing 4. Discretization in PMML
<DerivedField name="categorical_pressure" dataType="string" optype="categorical">
  <Discretize field="pressure">
    <DiscretizeBin binValue="low">
      <Interval closure="openClosed" rightMargin="25"/>
    </DiscretizeBin>
    <DiscretizeBin binValue="normal">
      <Interval closure="openClosed" leftMargin="25" rightMargin="50"/>
    </DiscretizeBin>
    <DiscretizeBin binValue="high">
      <Interval closure="openOpen" leftMargin="50"/>
    </DiscretizeBin>
  </Discretize>
</DerivedField>
In this example, the numerical input field pressure is binned into three categories (low, normal, and high), which are assigned to a new derived field named categorical_pressure. The first bin maps values up to and including 25 to low. The second maps values greater than 25 and less than or equal to 50 to normal. The third and last bin maps values greater than 50 to high.
Note that the Discretize element defines a set of DiscretizeBin subelements that use the Interval element in the same way as the DataDictionary element in Listing 1. The reuse of generic elements inside specialized ones is a common theme in PMML. This makes the language more readable and, for analytical tools, easier to export and import.
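Read as ordinary code, the three bins of Listing 4 amount to a pair of comparisons. The following Python sketch mirrors the interval boundaries from the listing; the function name is hypothetical.

```python
def discretize_pressure(pressure: float) -> str:
    """Bin a pressure reading using the intervals from Listing 4."""
    if pressure <= 25:   # openClosed, rightMargin="25"
        return "low"
    if pressure <= 50:   # openClosed, leftMargin="25", rightMargin="50"
        return "normal"
    return "high"        # openOpen, leftMargin="50"
```

The boundary values matter: 25 still counts as low and 50 still counts as normal, because both of those intervals are closed on the right.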
Listing 5 shows the PMML mapping element MapValues. In this example, the derived field categorical_pressure created above is used as the input field to the mapping transformation, which creates a field named grouped_pressure. This is a great feature of PMML since it allows derived fields to be created from other derived fields.
Listing 5. Mapping in PMML
<DerivedField name="grouped_pressure" dataType="integer" optype="categorical">
  <MapValues outputColumn="group">
    <FieldColumnPair column="C1" field="categorical_pressure"/>
    <InlineTable>
      <row> <C1>low</C1> <group>1</group> </row>
      <row> <C1>normal</C1> <group>1</group> </row>
      <row> <C1>high</C1> <group>2</group> </row>
    </InlineTable>
  </MapValues>
</DerivedField>
Note that, in this case, the MapValues element groups input categories. It uses the InlineTable element to assign the categories low and normal to group 1 and the category high to group 2.
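In programming terms, the InlineTable in Listing 5 is a lookup table. A minimal Python sketch of the same mapping follows; the names are hypothetical, and returning None for a category that is not in the table is a choice made by this sketch.

```python
from typing import Optional

# One entry per <row> in Listing 5: C1 value -> group value.
GROUP_BY_CATEGORY = {"low": 1, "normal": 1, "high": 2}


def group_pressure(categorical_pressure: str) -> Optional[int]:
    """Look up the group for a pressure category; None if the category is unknown."""
    return GROUP_BY_CATEGORY.get(categorical_pressure)
```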
PMML also defines many built-in functions for arithmetic and logic operations together with a generic IF-THEN-ELSE function. When combined with other functions, it provides a powerful representation medium for almost any kind of preprocessing task (see Listing 6).
Listing 6. IF-THEN-ELSE function
IF categorical_pressure = "high"
THEN system_pressure = 0.3 * pressure
ELSE system_pressure = 2 ^ (1 + log(0.34 * (pressure + 1)))
Listing 6a shows the PMML equivalent of the operation in Listing 6.
Listing 6a. Defining a generic transformation in PMML
<DerivedField name="system_pressure" dataType="double" optype="continuous">
  <Apply function="if">
    <Apply function="equal">
      <FieldRef field="categorical_pressure"/>
      <Constant>high</Constant>
    </Apply>
    <!-- THEN -->
    <Apply function="*">
      <Constant>0.3</Constant>
      <FieldRef field="pressure"/>
    </Apply>
    <!-- ELSE -->
    <Apply function="pow">
      <Constant>2</Constant>
      <Apply function="+">
        <Constant>1</Constant>
        <Apply function="log">
          <Apply function="*">
            <Constant>0.34</Constant>
            <Apply function="+">
              <FieldRef field="pressure"/>
              <Constant>1</Constant>
            </Apply>
          </Apply>
        </Apply>
      </Apply>
    </Apply>
  </Apply>
</DerivedField>
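For comparison, here is the nested Apply structure of Listing 6a written as a Python function. The nesting follows the listing element by element. The listing's log does not say which base is meant, so the natural logarithm used here is an assumption, and the function name is hypothetical.

```python
import math


def system_pressure(categorical_pressure: str, pressure: float) -> float:
    """Evaluate the IF-THEN-ELSE transformation from Listing 6a."""
    if categorical_pressure == "high":
        # THEN branch: Apply "*" on the constant 0.3 and the pressure field.
        return 0.3 * pressure
    # ELSE branch: pow(2, 1 + log(0.34 * (pressure + 1))).
    # math.log is the natural logarithm; the base is an assumption of this sketch.
    return 2 ** (1 + math.log(0.34 * (pressure + 1)))
```

Each Apply element becomes one operator or function call, which is why deeply nested PMML expressions are easier to check when you read them from the innermost element outward.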
Model representation
Once data transformations are fully defined, it is time to represent the brain of the predictive solution, the model itself. The PMML representation for each modeling technique is highly dependent on its own structure and set of parameters. As described earlier, PMML offers an extensive list of elements to represent the most widely used techniques in predictive analytics.
Listing 7 shows how a neural network element is set up in PMML. Neural layers, neurons, and connection weights are not shown. (See the article "What is PMML?" for how to represent neural layers and neurons in PMML.)
Listing 7. Setting up a Neural Network element in PMML
<NeuralNetwork modelName="ElementAnalyzer" functionName="classification" activationFunction="tanh" numberOfLayers="2">
In this listing, the NeuralNetwork element carries four attributes. The first, modelName, is used to specify the model name (simple enough, right?). The second, functionName, identifies the purpose of the model, which, in this case, is classification, as opposed to regression. The third, activationFunction, specifies that the activation function to be used by the network neurons when processing incoming data is tanh, a sigmoid function commonly used in Neural Networks. Finally, the fourth attribute, numberOfLayers, specifies that the network is defined by two layers, which implies the existence of a single hidden layer as well as an output layer. Note that the input layer is not counted.
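Because these settings are plain XML attributes, any XML parser can read them. The Python sketch below uses the standard library's ElementTree to pull the four attributes out of the opening tag from Listing 7. The tag is written as an empty element here so that it parses on its own, and the function name and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# The opening tag from Listing 7, closed (layers omitted) so it is well-formed by itself.
NEURAL_NETWORK = (
    '<NeuralNetwork modelName="ElementAnalyzer" functionName="classification" '
    'activationFunction="tanh" numberOfLayers="2"/>'
)


def describe_network(xml_text: str) -> dict:
    """Summarize the attributes of a NeuralNetwork element."""
    element = ET.fromstring(xml_text)
    layers = int(element.get("numberOfLayers"))
    return {
        "name": element.get("modelName"),
        "task": element.get("functionName"),
        "activation": element.get("activationFunction"),
        "layers": layers,
        # The input layer is not counted, and the last counted layer is the output layer.
        "hidden_layers": layers - 1,
    }


summary = describe_network(NEURAL_NETWORK)
```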
As can be seen, PMML is not rocket science. From just inspecting this particular element, you can have a good idea of the model's structure and what it is about: a Neural Network used to classify different elements. Listing 8 shows the definition of a Decision Tree for the same problem.
Listing 8. Setting up a Decision Tree element in PMML
<TreeModel modelName="ElementAnalyzer" algorithmName="CART" functionName="classification">
Note that, from the attribute algorithmName, you learn that this particular tree was trained with CART (Classification And Regression Tree). Trees built in IBM SPSS Statistics, for example, can benefit from CART to produce Decision Trees that you can easily export as PMML files.
Conclusion
The information age has come to us with a blessing: the availability of large volumes of data captured from transactions and sensors. This allows for the building of solutions that are able to predict malicious activities, failures, and accidents before they happen or cause harm. If you are to fully benefit from these solutions, though, they must be paired with open standards such as PMML. As applications and systems multiply, you must ensure they speak the same language. Because PMML is used to represent both data transformations and models, it becomes the conduit for the sharing of complete predictive solutions, from raw data to predictions.
Resources
Learn
 What is PMML? (Alex Guazzelli, developerWorks, July 2010): Explore the power of predictive analytics and open standards in this introduction to PMML.
 PMML: Read Alex Guazzelli's knol about PMML modeling techniques.
 PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics (Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena; CreateSpace, May 2010): Learn to represent your predictive models as you take a practical look at PMML.
 The Data Mining Group (DMG): Explore multiple resources from this independent, vendor-led consortium that develops data mining standards such as the Predictive Model Markup Language (PMML).
 Zementis PMML Resources page: Review complete PMML examples, including Cluster Models, Decision Trees, Naïve Bayes Classifiers, Neural Network models, Regression Models, scorecards, and Support Vector Machines.
 The PMML page in Wikipedia: Find an overview of PMML plus links to specifications and more.
 The Predictive analytics page in Wikipedia: Read about the types, applications, and statistical techniques common to this area of statistical analysis.
 The Data Mining page in Wikipedia: Visit and read more about the process of extracting patterns from data.
 Industries zone on developerWorks: Get all the latest industry-specific technical resources for developers.
 Industries library: See the developerWorks Industries library for technical articles and tips, tutorials, standards, and IBM Redbooks.
 My developerWorks: Personalize your developerWorks experience.
 developerWorks technical events and webcasts: Stay current with technology in these sessions.
 developerWorks on Twitter: Join today to follow developerWorks tweets.
 developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
 IBM SPSS Statistics 18 (formerly SPSS Statistics): Put the power of advanced statistical analysis in your hands. Whether you are a beginner or an experienced statistician, its comprehensive set of tools will meet your needs.
 ADAPA: Try a revolutionary predictive analytics decision management platform, available as a service on the cloud or on site. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.
 IBM WebSphere Application Server: Build, deploy and manage robust, agile and reusable SOA business applications and services of all types while reducing application infrastructure costs with IBM WebSphere Application Server.
 IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
 The Predictive Model Markup Language (PMML) group: Join the discussion about PMML on LinkedIn.
 developerWorks blogs: Check out these blogs and get involved.