The terms Big Data, predictive analytics, and cloud computing seem to be everywhere these days:
- Big Data, as the term suggests, refers to lots and lots of data captured from different sources and obtained in different formats. It may come from people or from sensors, and it can be structured or unstructured. For example, transactional data generated by people is structured; Tweet feeds are unstructured. The big questions pertaining to Big Data are "how to extract insights and value from it" and "how to extract those insights more efficiently". The answer to both questions involves advanced analytics.
- Analytics is a broad term referring to descriptive analytics as well as predictive analytics. While the first lets you know what happened in the past, the second tells you what will happen next. Predictive analytics uses descriptive analytics as a stepping stone to make decisions in an ever more precise and consistent manner. Predictive analytic techniques are able to learn relevant patterns from historical data and use this knowledge to forecast behavior. They do that by combining data with clever mathematics.
However, data and mathematics are not the entire answer; you also need an infrastructure in place that is able to handle the data and complex algorithms. In the past, predictive solutions were bound to very specific problems and limited in scope, mostly due to the unavailability of cost-efficient processing power. Not anymore. Cloud computing has turned this equation upside down by providing virtually unlimited power at low cost.
At its core, cloud computing is a set of services that provides computing resources via the Internet. Large data centers deliver scalable, on-demand, and often virtualized resources as a service, eliminating the need to invest in dedicated hardware, software, or your own data center infrastructure.
Cloud computing allows for a variety of services, including storage capacity, processing power, and business applications. Accessing services on the cloud is not a new concept, but only recently did it become available as a secure and reliable infrastructure. IBM® SmartCloud Enterprise is a prime example of a generic cloud infrastructure. Powered by IBM, it provides dynamic compute capacity in the cloud through several data centers spread throughout the world.
This article first covers predictive analytics basics, beginning with PMML, the common language for representing data mining models: what it is and what its components are. It then introduces a real-world PMML engine and discusses how such an engine deploys and executes predictive solutions. Finally, it shows an example of how the engine can be used on IBM SmartCloud Enterprise.
Whenever a predictive analytic technique is trained on solving a specific problem, the result will be a predictive model. A predictive solution encompasses not only the model itself, but also all the data transformations that go into preparing the data for consumption by the model.
Data pre-processing is used to take care of any flaws present in the original raw data, such as missing values and outliers. Its ultimate goal, however, is to augment the predictive power of the raw input fields, transforming them into features.
Data is also pre-processed to make it suitable for "training" (optimized through experience). For example, neural networks, a classic predictive analytic technique, will only take numerical values as inputs. In this case, a categorical field will need to be converted to a continuous field before being presented to the network.
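The two pre-processing steps described above, handling missing values and converting a categorical field into numerical inputs, can be sketched in plain Python. This is a minimal illustration, not any tool's actual implementation; the field names and values are made up.

```python
# Minimal sketch of two common data pre-processing steps.
# Field names and values below are purely illustrative.

def impute_missing(values, default):
    """Replace missing (None) entries in a raw field with a default value."""
    return [default if v is None else v for v in values]

def one_hot(value, categories):
    """Convert a categorical value into 0/1 indicator inputs, the kind of
    numerical representation a neural network expects."""
    return [1.0 if value == c else 0.0 for c in categories]

ages = impute_missing([34, None, 51], default=40)          # [34, 40, 51]
color_inputs = one_hot("green", ["red", "green", "blue"])  # [0.0, 1.0, 0.0]
```

In a real solution these transformations would be captured alongside the model, as discussed next, so that the deployment environment applies exactly the same preparation to new data.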
A predictive solution is usually built in a model development environment. Examples include IBM SPSS Modeler and IBM SPSS Statistics, as well as the open source R programming language and software environment for statistical computing. These environments offer great flexibility for data scientists to analyze and massage historical data in order to train a predictive model.
Once built and validated, a predictive solution is then easily exported into PMML (Predictive Model Markup Language) for model deployment. PMML is the de facto standard used to represent predictive analytic solutions. With PMML, model deployment is a breeze since no custom coding is necessary to move a solution from the scientist's desktop to the deployment environment where it will be put to use.
PMML is the brainchild of the Data Mining Group, a vendor-led committee composed of commercial and open source analytics companies. As a consequence, most of the top model development environments can export PMML. A mature and refined standard which has evolved over the past 10 years, PMML can represent not only the predictive techniques used to learn patterns from the data, but also pre-processing of raw input data and post-processing of model outputs.
PMML is based on XML (it is human- and machine-readable). The structure of a PMML file reflects the predictive solution it implements (see Figure 1).
Figure 1. A single PMML file contains several elements that reflect the predictive solution it implements
Different language elements are responsible for describing:
- The raw input data.
- Appropriate treatments for outliers, as well as missing and invalid values.
- Pre-processing of model inputs, including normalization, mapping, discretization, as well as a host of functions for logical and arithmetic manipulations.
- Specific model elements for representing predictive techniques.
- Post-processing of model output including scaling and business decisions.
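The language elements listed above can be seen in a skeletal PMML document. The snippet below embeds a deliberately trivial example (a bare-bones regression model with one derived field; all field names are illustrative) and uses Python's standard XML parser to list its top-level elements:

```python
# Parse a skeletal PMML document and list its top-level language elements.
# The model itself is a trivial, illustrative regression.
import xml.etree.ElementTree as ET

PMML = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Illustrative example"/>
  <DataDictionary numberOfFields="1">
    <DataField name="income" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="income_k" optype="continuous" dataType="double">
      <Apply function="/"><FieldRef field="income"/><Constant>1000</Constant></Apply>
    </DerivedField>
  </TransformationDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema><MiningField name="income"/></MiningSchema>
    <Output><OutputField name="predicted_risk"/></Output>
    <RegressionTable intercept="0.5"/>
  </RegressionModel>
</PMML>"""

root = ET.fromstring(PMML)
ns = "{http://www.dmg.org/PMML-4_1}"
elements = [child.tag.replace(ns, "") for child in root]
print(elements)
# ['Header', 'DataDictionary', 'TransformationDictionary', 'RegressionModel']
```

Note how the data dictionary, pre-processing (the derived field), the model element, and post-processing (the output field) each map to a distinct part of the file.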
PMML is also remarkable for its ability to represent multiple models with a single language element. In this way, a single PMML file can contain a model ensemble, segmentation, chaining, or composition.
IBM SPSS Modeler and IBM SPSS Statistics allow for a myriad of models to be exported in PMML. These environments are also outstanding in their ability to output data transformations into PMML. IBM SPSS Statistics, for example, allows for automatic data augmentation which can be exported into a stream of PMML derived fields. R, on the other hand, is remarkable for allowing users to export model ensembles into PMML. For example, a user can build a random forest model in R and export its hundreds of trees into PMML which can then be easily moved into the deployment environment and put to work in minutes.
Saving a solution as a PMML file is good practice even if it is not final. This allows data scientists to keep a PMML record of all the attempts taken before reaching the final solution. The data analytics team can then use this record to determine the best choice of parameters and practices.
To add to your knowledge on PMML, read What is PMML?, the author's article on the PMML standard. See Resources.
Now let's look at a real-world example.
Zementis Inc. provides a PMML-based predictive analytics decision management platform called ADAPA. It is able to consume predictive solutions expressed in PMML and execute them in real time. Since ADAPA lives on the operational side, it frees IT resources from the burden of custom-coding the predictive solution to fit the operational environment. It also gives data scientists the opportunity to deploy predictive solutions on their own.
The ADAPA Decision Engine is a great example of a deployment platform for predictive solutions in which PMML takes center stage. In this regard, it boasts two important features:
- It is a universal PMML consumer since it accepts not only PMML files generated by any PMML-compliant application, but also PMML files specified in older versions of the standard.
- Besides supporting the modeling techniques themselves, ADAPA also supports all that PMML offers in terms of pre- and post-processing. In fact, it takes this a step further: if a predictive solution implements functions that are not part of the PMML standard, ADAPA allows them to be implemented in Java™ (see Figure 2). The resulting JAR file can then be uploaded into the engine as a resource, and any of the functions it contains can be instantiated directly from PMML.
Figure 2. Extend the PMML standard by allowing custom functions to be embedded as a resource coded in Java
As shown in Figure 2, in addition to its PMML-based predictive analytics engine, ADAPA also incorporates the full functionality of a rules engine. In fact, it provides seamless integration of predictive analytics and business rules. In this way, it allows data-driven insight and expert knowledge to be combined into a single decision strategy.
Next is an outline of how to deploy and execute predictive solutions, using ADAPA as the example.
Given PMML and ADAPA, the process of deploying a predictive model is equivalent to uploading a corresponding PMML file into the engine. Once uploaded successfully, a model is ready to be executed, either through web services or through the ADAPA Web Console. Users can also access models in ADAPA directly from within Excel (see Figure 3).
Figure 3. Models can be deployed and tested in ADAPA through its Web Console
In effect, web services allow applications throughout the enterprise to access models and their predictions in real time. On-demand and batch-mode execution can be accomplished the same way, and both also benefit from the ADAPA Web Console, which serves as an interactive admin portal where models can be manually managed and verified.
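To make the web-services route concrete, the sketch below shows how an application might construct a REST scoring call. The endpoint URL, payload layout, model name, and field names are all hypothetical; a real integration would follow the scoring engine's own API documentation and authentication scheme.

```python
# Sketch of building a REST scoring request to a hypothetical engine.
# URL, path, and payload format are assumptions, not a documented API.
import json
import urllib.request

def build_scoring_request(base_url, model_name, record):
    """Assemble (but do not send) an HTTP POST request that asks the
    engine to apply the named model to one input record."""
    payload = json.dumps({"record": record}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/model/{model_name}/apply",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_scoring_request(
    "https://scoring.example.com", "churn_model",
    {"income": 52000, "tenure_months": 18},
)
print(req.full_url)  # https://scoring.example.com/model/churn_model/apply
# Actually sending it (urllib.request.urlopen(req)) requires valid
# credentials and a live engine, so it is omitted here.
```

The same pattern covers on-demand single-record scoring and, with a loop or a batched payload, batch-mode execution.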
Business users also benefit from being able to access models and score data directly from within Microsoft® Office Excel® by using the ADAPA Add-in. It allows for complex predictive solutions to be used without the complications involved in model building and deployment. With the add-in, users simply select the data they want to score in Excel, choose the appropriate model from a list of available models, and click Score.
Note that you now operate in a true cross-platform, multi-vendor environment. Since models can be developed in a variety of PMML-compliant tools, a prudent step in the deployment process is model verification, which ensures that the scoring engine and the model development environment produce exactly the same results. ADAPA provides an integrated testing process to make sure a model was uploaded successfully and works as expected: a test file, containing any number of records with all the necessary input variables and the expected result for each record, can be uploaded for score matching. The same process may also be embedded into the PMML file itself, which in that case contains an extra element specifically for model verification.
Once model verification is completed, statistics are returned on the total number and percentage of matched and unmatched records. If any records failed the matching test, a list of failed records is displayed. One can then trace through the computed information for each record to locate where expected and computed values differ and thus pinpoint the source of the problem.
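The score-matching logic behind model verification can be sketched as a simple comparison of expected and computed scores within a numeric tolerance. The records and tolerance below are made up for illustration:

```python
# Sketch of score matching for model verification: compare expected and
# computed scores record by record within a tolerance. Data is illustrative.

def verify_scores(expected, computed, tol=1e-6):
    """Return matched/unmatched counts, the match rate, and the indices
    of records whose scores disagreed beyond the tolerance."""
    failed = [i for i, (e, c) in enumerate(zip(expected, computed))
              if abs(e - c) > tol]
    matched = len(expected) - len(failed)
    return {
        "matched": matched,
        "unmatched": len(failed),
        "match_rate": matched / len(expected),
        "failed_records": failed,
    }

stats = verify_scores([0.12, 0.87, 0.45], [0.12, 0.87, 0.99])
# 2 matched, 1 unmatched; the record at index 2 failed the matching test
```

The list of failed indices is what lets you trace back to the individual records where expected and computed values diverge.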
As mentioned before, ADAPA uses web service calls to allow for automatic decisions to be virtually embedded into systems and applications throughout the enterprise. To minimize total cost of ownership, model execution in ADAPA is available as a service through SmartCloud Enterprise (see Resources).
The SaaS license model lets vendors like Zementis deliver software as a cost-effective service that scales with the user's demand and is paid for based on actual consumption, much like a utility bill. ADAPA on SmartCloud is a fully hosted SaaS solution: users pay only for the service and the capacity they use on a monthly basis, eliminating the need for expensive software licenses and in-house hardware resources. The SaaS model also removes the burden of managing a scalable, on-demand computing infrastructure.
The process of launching a virtual ADAPA server in the IBM SmartCloud corresponds to the traditional scenario of buying hardware and installing it in a server room. The only difference is that the server in this case sits in the cloud, comes with a pre-installed version of ADAPA, and launches in just a few minutes, on-demand and ready to be used. At any given time, you can have one or more instances running.
Independent of processing power, each instance type provides a single-tenant architecture. The service is implemented as a private, dedicated instance that encapsulates predictive models and business rules. In this way, access (via HTTPS) to any instance is private. As a consequence, decision files and data never share the same engine with other clients.
Predictive analytics is revolutionizing the way companies do business today. As a discipline, it allows for predictive solutions to be built that peer into the ever-increasing data that we as a society are collecting from people (through transactional data) and sensors. After the data is analyzed and transformed, it is used as input to a predictive technique which is responsible for learning the important patterns hidden in it. Whenever this happens, a predictive model is born.
However, to be put to work, a predictive model needs to be moved from the scientist's desktop to the operational environment. For that, we benefit from the PMML standard, which allows a predictive solution to be built in one tool and easily moved to another for execution.
A mature and refined standard, PMML is supported by all the top data-mining tools, both commercial and open source. As support for PMML increases, different tools are getting even more elaborate in how they use PMML to represent their predictive solutions. These range from comprehensive data pre-processing all the way to model ensembles in which hundreds of models are represented in a single PMML file and the output is the weighted average of all models.
Universal PMML consumers, such as the Zementis ADAPA Decision Engine, allow for predictive solutions to be put to work instantly. PMML frees IT resources since there is no need for recoding or custom implementations. In this way, the same PMML file generated by the model development environment can be directly uploaded to the consumer where it is readily available for execution.
Once deployed and put to use, a predictive solution is able to apply its knowledge to new situations and therefore generate predictions that can significantly change the business landscape. When combined with business rules, such predictions can be used to drive automatic decisions which benefit not only from data-driven knowledge, but expert knowledge expressed as rules.
Powered by the cloud, Big Data is making it possible for predictive solutions to break out of the box by providing a more complete picture of the problems they can address. When these solutions are combined with open standards, predictive analytics achieves its full potential. In our ever-faster world, it becomes as agile as it ought to be.
Resources related to the topics in this article:
- What is PMML? Read Alex Guazzelli's article on the PMML standard used by analytics companies to represent and move predictive solutions between systems.
- The book PMML in Action demonstrates how to unleash the power of open standards for data mining and predictive analytics.
- IBM SPSS Statistics and SPSS Modeler show you how a predictive approach can offer deep insights.
- Try the ADAPA predictive analytics decision management platform on SmartCloud Enterprise.
- In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
- Find out how to access IBM SmartCloud Enterprise.
Get products and technologies
- See the product images available for IBM SmartCloud Enterprise.
- Join a cloud computing group on developerWorks.
- Read all the great cloud blogs on developerWorks.
- Join the developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.
Dr. Alex Guazzelli is the VP of Analytics at Zementis Inc., where he is responsible for developing the company's core technology and predictive solutions, including ADAPA, a PMML-based decisioning platform. With more than 20 years of experience in predictive analytics, Dr. Guazzelli holds a PhD in Computer Science from the University of Southern California and has co-authored the book PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics, now in its second edition.