Predictive analytics solutions comprise techniques such as artificial neural networks and decision trees (among a myriad of other statistical techniques) that learn patterns present in historical data. They can subsequently apply the obtained knowledge to detect or predict trends in new data. Today, predictive analytics permeates our everyday lives, from fraud detection in financial transactions (every time you use your credit card to buy something at a store or online, the transaction is analyzed for its fraud potential) to marketing and recommender systems. In this article, we discuss not only how these techniques can be applied in healthcare, but also how the Predictive Model Markup Language (PMML) standard can substantially ease the operational deployment of any predictive solution in the healthcare space.
In the early nineties, I was fortunate enough to work with the late Ricardo Machado, one of the top Artificial Intelligence (AI) researchers at the IBM Scientific Research Center in Rio de Janeiro, Brazil. Ricardo and his collaborators published many papers on neural networks and a predictive expert system named Next. The power of this system stemmed from its ability to use "knowledge graphs" obtained from interviews with medical experts to form the basis of a model capable of altering these graphs when presented with data, thus transforming them into an artificial neural network. Next was successfully used to diagnose and classify kidney diseases. Inspired by the results obtained with Next, Beatriz Leao, who first proposed the knowledge graph methodology used by Ricardo, developed a system called HYCONES, which also combined symbolic knowledge and neural networks. Working with Beatriz at the Cardiology Institute in Brazil, we were able to use HYCONES to successfully detect and classify congenital heart diseases. The results of our work were published in M.D. Computing in 1994.
Given that research in predictive analytics and healthcare goes back a number of years, you may be wondering why it has taken so long to move the early scientific successes into our everyday lives. The answer is fairly simple: the healthcare industry has been slow to embrace the digital age. Even if you see a doctor today in the US, chances are that most of the information gathered during your visit is still written down by hand in your medical record, and x-rays are still printed and appended to your file. Therefore, making this data available for data mining and predictive analytics remains a challenge even today.
However, we also know that more and more information about patients and providers is now being stored digitally. In the U.S., for example, Kaiser Permanente, together with other major healthcare organizations, has been at the forefront of embracing electronic medical records. There is even a big push for that to happen in emerging economies and developing countries. Beatriz Leao, who founded the Brazilian Health Informatics Association in 1986, understands all the benefits associated with standards and electronic health records. Over the years, she has been working relentlessly to develop the much-needed health informatics infrastructure in African countries, first as a consultant for the World Health Organization in Mozambique and later for Jhpiego, a non-profit health organization affiliated with the Johns Hopkins University, in Rwanda (see Resources).
Healthcare and predictive analytics
When large amounts of data are available digitally, they can readily be mined. Through data mining and predictive analytics, historical data can reveal patterns that are used to predict trends. Historically, predictive analytics, together with expert knowledge, has been used to assist in the diagnosis and treatment of numerous diseases. Systems such as Next and HYCONES are early examples. Predictive solutions in this field can make an enormous impact in areas where medical expertise is sparse or non-existent. As online data and predictive systems become pervasive, they allow for faster and more precise decision-aid tools for healthcare providers. Lately, predictive systems are proving to be even more resourceful. As I reported late last year in another article about predictive analytics and standards (see Resources), IBM and the University of Ontario Institute of Technology are currently working together to implement a data analysis and predictive solution to monitor premature babies, in which biomedical readings can be used to detect life-threatening infections up to 24 hours before they would normally be observed.
By knowing in advance that a group of patients is at low or high risk for a disease or condition, healthcare providers can use data mining and predictive analytics to create targeted treatment measures for different populations. For example, in the case of cardiovascular disease, providers can work hand in hand with patients identified by a predictive solution as high risk to implement simple preventive measures, such as cutting down the intake of trans fats, losing weight, and quitting smoking, which can substantially reduce the risk of a heart attack. In this way, healthcare providers can devise different strategies to keep low-risk patients at low risk while mitigating the risk associated with high-risk patients.
Under the U.S. federal health law, hospitals with higher than expected readmission rates will now receive less Medicare reimbursement. The Medicare Payment Advisory Commission estimated that in 2005 readmissions cost the Medicare program $15 billion, $12 billion of which could have been avoided (see Resources). Given that a large percentage of readmissions are preventable, predictive analytics is already helping hospitals cut down readmission rates. Although a simple follow-up appointment goes a long way toward preventing hospital readmissions, predictive analytics can pinpoint exactly which patients need to be followed up closely. It can also assist hospitals in identifying populations who may need further assistance with regimens as simple as understanding dietary restrictions.
Predictive systems have been used for many years in the financial industry for fraud detection. Today, the majority of credit card transactions are evaluated for their fraud risk by a predictive solution in real time. If a transaction is deemed high risk, these solutions can even decline it and therefore prevent fraud from ever happening. Given that the cost associated with Medicare fraud is much larger than the cost associated with readmissions, fraud is bound to become the main focus of predictive solutions. The proven success achieved with predictive techniques such as neural networks in detecting fraud in the financial industry can and should be used to detect fraud and abuse in healthcare.
If you have reviewed an explanation of benefits from your health insurance company, you know all too well that every single treatment, disease, or condition is paired with a code. Although all the detailed coding can help in the building of fraud and abuse detection models, it also represents a challenge, since claims data needs to be heavily pre-processed and simplified before serving as input to a predictive system. Unfortunately, in terms of assisted diagnosis or preventive care, claims data is notoriously poor because it gives no indication of how severe the disease or condition is. And so, better data may be necessary to obtain better predictions.
The use of predictive analytics in healthcare will benefit from the merging of different data repositories. The more we know about an individual or population, that is, the bigger the picture, the more precise the predictions will be. With more data points, models can be tailored to a specific patient or group of patients, which ultimately leads to more precise and effective treatments that are bound to improve the overall efficacy of the healthcare system while at the same time reducing costs.
The PMML language
Predictive analytic solutions are usually built and validated by a team of data mining scientists. The actual operational deployment of these solutions is usually a task performed by a team of engineers. On the one hand, data mining scientists are experts in statistics and statistical packages that they use to create the best predictive models. On the other hand, engineers specialize in programming languages, databases, and IT systems. For this reason, the traditional deployment of a predictive solution, that is, the process of moving it from the scientist's desktop to the environment where it will be put to work, may get lost in translation. In this scenario, once a predictive model leaves the scientist's domain, it needs to be recoded so that it will work in production. This process is laborious, prone to errors, and can take months.
To avoid such a scenario, the use of a standard that can represent data mining and predictive analytic solutions is paramount. PMML is just such a standard. PMML is the brainchild of the Data Mining Group, a consortium of commercial and open-source data mining companies (see Resources). It allows for a solution to be built in one system and easily visualized or deployed in another. For example, PMML can be automatically exported from IBM SPSS Statistics or Modeler and imported into KNIME, a data mining tool used for building data workflows. It can also be easily moved and deployed in ADAPA, the Zementis scoring engine, where it can be put to work in minutes in any production environment.
PMML — What is new in Version 4.1
PMML is the de facto standard to represent predictive solutions, including the pre-processing of raw input data as well as the predictive technique itself. As a standard, PMML has been around for more than 10 years. Version 4.1 is to be released in December 2011. It builds upon Version 4.0, which provided extended support for multiple models. PMML 4.1 takes multiple models to a new level and makes it easier for the expression of model ensembles and segmentation. Multiple models usually combine different predictive techniques to generate a single prediction. Decision trees and neural networks are some well-known techniques used in data mining and predictive analytics and so have been supported by PMML since its inception. As the language matured, more and more techniques were incorporated into its structure. PMML 4.1 is no exception. It provides new language elements for representing Scorecards and K-Nearest Neighbors.
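To make the idea of combining multiple models into a single prediction more concrete, here is a minimal Python sketch of two common combination rules: a majority vote over class labels and a weighted average over numeric scores. The function names, labels, scores, and weights are all made up for illustration; this shows the general technique only and is not a rendering of how PMML itself defines these methods.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the most models.

    Ties go to the label that appears first in `labels`.
    """
    return Counter(labels).most_common(1)[0][0]


def weighted_average(scores, weights):
    """Combine numeric scores, giving some models more say than others."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical outputs from three classifiers and two regression-style models.
combined_label = majority_vote(["high risk", "low risk", "high risk"])
combined_score = weighted_average([60, 80], [1, 3])
```

In this sketch, the second model's score counts three times as much as the first, which is one simple way of favoring a model that has proven more reliable.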
Perhaps the most famous scorecard in use today is the one behind the FICO score, which is used to assess an individual's risk of default in the financial arena. Besides being capable of detecting trends, scorecards are famous for being able to explain the reasoning behind their output or score. In healthcare, this becomes an important feature since there is a need to know why a patient is being classified as high or low risk. Traditional neural networks, on the other hand, are known for being a "black box" simply because it is very difficult to extract the reasoning behind their output. That's because neural networks, as their name implies, try to mimic the way we learn. As Beatriz Leao found out when trying to construct knowledge graphs from her interviews with medical experts, experts have a hard time explaining the rationale behind a diagnosis. When pressed, they tended to identify very few findings leading to a particular diagnosis. Knowledge graphs obtained from medical experts tend to be lean. The graphs obtained from doctors in residency, on the other hand, are large and broad and consider every single detail in the patient's medical record before one or a few diagnoses are reached. The rationale in the latter group was closely tied to the knowledge obtained from a medical encyclopedia. As Ricardo Machado found out, once these novice knowledge graphs were submitted to neural network training, they ended up resembling the knowledge graphs obtained from experts.
In PMML, the reasons behind a prediction are expressed through an attribute called reasonCode. PMML is an XML-based language, and so one can understand not only the reasons behind the score, but also the model itself. For example, the PMML code shown in Listing 1 was taken from inside a PMML "Scorecard" element. From a quick inspection, one can readily see that it contains the derivation of points for the input data field "age". If, for example, the age is greater than 59 and less than or equal to 69, the model dictates that 12 points are to be assigned to "agePoints".
In a scorecard, the final score is computed from the sum of the partial scores obtained from all its characteristics. In the case of hospital readmissions, the final score can be computed from a number of risk factors or characteristics. These vary from age and number of previous readmissions to specifics such as blood creatine and ammonia levels. When all partial scores are computed, the number of points contributed by "age" is compared to the points obtained from all other characteristics (not shown in Listing 1). The result of this comparison will dictate which reason codes will be output. The more a characteristic influences the final score, the more important it is in explaining it. If age is chosen to be an important factor, reason code "RC3" will be output, which can subsequently be translated into a pertinent explanation.
Listing 1. Representing a scorecard characteristic in PMML
<Characteristic name="agePoints" reasonCode="RC3" baselineScore="18">
  <Attribute partialScore="-1">
    <SimplePredicate field="age" operator="isMissing"/>
  </Attribute>
  <Attribute partialScore="-3">
    <SimplePredicate field="age" operator="lessOrEqual" value="38"/>
  </Attribute>
  <Attribute partialScore="0">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="age" operator="greaterThan" value="38"/>
      <SimplePredicate field="age" operator="lessOrEqual" value="59"/>
    </CompoundPredicate>
  </Attribute>
  <Attribute partialScore="12">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="age" operator="greaterThan" value="59"/>
      <SimplePredicate field="age" operator="lessOrEqual" value="69"/>
    </CompoundPredicate>
  </Attribute>
  <Attribute partialScore="18">
    <SimplePredicate field="age" operator="greaterThan" value="69"/>
  </Attribute>
</Characteristic>
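For readers who prefer code to markup, the following Python sketch mirrors the logic of Listing 1 and shows one way the reason-code comparison could work. It is an illustration under stated assumptions, not the scoring engine's implementation: the ranking assumes that a characteristic matters more the further its partial score falls below its baseline score, and the second characteristic ("RC1", with baseline 20 and partial score 5) is invented purely for the example.

```python
from typing import Optional

AGE_BASELINE_SCORE = 18   # baselineScore in Listing 1
AGE_REASON_CODE = "RC3"   # reasonCode in Listing 1


def age_points(age: Optional[float]) -> int:
    """Partial score that Listing 1 assigns to the input field "age"."""
    if age is None:   # operator="isMissing"
        return -1
    if age <= 38:
        return -3
    if age <= 59:     # greater than 38 and less than or equal to 59
        return 0
    if age <= 69:     # greater than 59 and less than or equal to 69
        return 12
    return 18         # greater than 69


def rank_reason_codes(characteristics):
    """Order reason codes by how far each partial score falls below its baseline.

    `characteristics` is a list of (reason_code, baseline_score, partial_score).
    This ranking rule is an assumption made for the sketch.
    """
    ranked = sorted(characteristics, key=lambda c: c[1] - c[2], reverse=True)
    return [code for code, _, _ in ranked]


# Hypothetical patient aged 65, plus one invented characteristic for illustration.
partials = [
    (AGE_REASON_CODE, AGE_BASELINE_SCORE, age_points(65)),  # 12 points, 6 below baseline
    ("RC1", 20, 5),                                         # invented: 15 below baseline
]

# The final score is the sum of all partial scores.
final_score = sum(partial for _, _, partial in partials)
reason_codes = rank_reason_codes(partials)
```

With these made-up numbers, the invented characteristic outranks age, so "RC1" would be reported ahead of "RC3".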
PMML 4.1 also allows for decisions to be incorporated into a predictive solution as part of the post-processing of the prediction itself. For example, when a predictive model generates a score, PMML now allows for this score to be compared against one or more thresholds. The result of such a comparison can be used to divide patients into a number of operational buckets that may consist of different diagnoses, follow-up strategies, or treatment plans. In the PMML code shown in Listing 2, the final score is compared with a threshold of 67. If FinalScore is greater than 67, then, as defined in the second "OutputField" element, the model outcome will be "Yes", which implies that a follow-up appointment needs to be scheduled. If it is less than or equal to 67, the outcome will be "No", which implies that a follow-up appointment is not necessary.
Listing 2. Post-processing in PMML, from scores to decisions
<OutputField dataType="double" feature="predictedValue" name="FinalScore" optype="continuous"/>
<OutputField dataType="string" feature="decision" name="Outcome" optype="categorical">
  <Decisions businessProblem="Should a follow-up appointment be scheduled?" description="The decision depends on the likelihood of readmission.">
    <Decision value="Yes" description="Follow-up appointment is necessary."/>
    <Decision value="No" description="No need for follow-up appointment."/>
  </Decisions>
  <Apply function="if">
    <Apply function="greaterThan">
      <FieldRef field="FinalScore"/>
      <Constant>67</Constant>
    </Apply>
    <!--THEN-->
    <Constant>Yes</Constant>
    <!--ELSE-->
    <Constant>No</Constant>
  </Apply>
</OutputField>
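The post-processing rule in Listing 2 amounts to a single comparison. A minimal Python equivalent, in which the function name and the sample scores are assumptions made for illustration, might look like this:

```python
FOLLOW_UP_THRESHOLD = 67  # the constant used in Listing 2


def follow_up_decision(final_score: float) -> str:
    """Return "Yes" when a follow-up appointment should be scheduled."""
    # Mirrors the rule in Listing 2: strictly greater than the threshold means "Yes".
    return "Yes" if final_score > FOLLOW_UP_THRESHOLD else "No"


# Hypothetical scores for three patients.
decisions = [follow_up_decision(score) for score in (42, 67, 81)]
```

Note that a score of exactly 67 falls on the "No" side, since the comparison in the listing is greaterThan rather than greaterOrEqual.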
PMML is already being used to express predictive solutions that are helping hospitals reduce readmission rates. It is also being used to express fraud detection models. Since a PMML file is by itself a document explaining the predictive solution, it can be used to log all the decisions made in building not only the strategies around the score, but the score itself. As with any other industry or segment, PMML makes the use of predictive analytics in healthcare transparent. Given that it is a standard, it can be easily understood by all systems and people involved in the healthcare process. Therefore, it can be used to disseminate best practices as well as enforce compliance with laws and regulations. For example, one can easily make sure a solution does not use any personal identification data just by inspecting the resulting PMML file for that solution.
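As a rough sketch of what such an inspection could look like, the Python snippet below lists the DataField names declared in a PMML document and flags any that appear on a watch list. The watch list, the helper names, and the tiny sample document are all assumptions made for illustration; a real compliance check would need a far more careful definition of what counts as personal identification data.

```python
import xml.etree.ElementTree as ET

# Hypothetical watch list of field names that suggest personal identifiers.
SENSITIVE_NAMES = {"ssn", "name", "patient_name", "address", "phone"}


def local_name(tag: str) -> str:
    """Strip the XML namespace, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def data_field_names(pmml_text: str) -> list:
    """Return the name of every DataField element in the document."""
    root = ET.fromstring(pmml_text)
    return [el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"]


def flagged_fields(pmml_text: str) -> list:
    """Return the declared fields whose names are on the watch list."""
    return [n for n in data_field_names(pmml_text) if n and n.lower() in SENSITIVE_NAMES]


# Made-up fragment for the example, not a complete or valid model.
SAMPLE_PMML = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="ssn" optype="categorical" dataType="string"/>
    <DataField name="priorReadmissions" optype="continuous" dataType="integer"/>
  </DataDictionary>
</PMML>"""

suspicious = flagged_fields(SAMPLE_PMML)
```

Matching on field names alone is a simplification; it shows only that, because a PMML file is plain XML, this kind of audit can be automated with standard tools.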
From model building to model deployment
PMML allows for predictive solutions to be shared between PMML-compliant applications and systems. In this way, for example, a model may be built using IBM SPSS Statistics, exported in PMML, and easily deployed into ADAPA, the Zementis scoring engine. Once deployed, it can be put to work right away. In this scenario, the beauty of representing predictive solutions via a standard such as PMML lies in the ability to instantaneously move a model from the scientist's desktop to the production environment. Whenever the data changes and an existing predictive solution needs to be refreshed, a term that usually implies that the model needs to be rebuilt, it can be deployed again in minutes. This sounds obvious and straightforward, but without a standard such as PMML, the deployment of a predictive solution can take months, since once a model is built, it needs to be described, usually in textual format, and subsequently custom coded into the production environment. As mentioned earlier, besides being error prone, this process takes up valuable resources and has no place in a healthcare system that needs to be agile, adaptable, and cost-effective.
Intelligent systems have historically been applied to the classification and diagnosis of different diseases. However, healthcare providers and patients are just beginning to benefit from predictive analytics. As more and more data moves online, we are bound to see many more predictive solutions, from the monitoring of patients in an ICU to the detection of fraud and abuse. All of these solutions now have the ability to become ever more precise, not only due to the availability of large volumes of digital data, but also due to cost-effective storage and the enormous processing power available through different IT solutions, including Cloud Computing and Hadoop environments.
The availability of a standard such as PMML increases transparency, fosters best practices, lowers cost, saves time, and could ultimately save lives. With PMML, the entire healthcare industry benefits from a single standard to represent all its predictive needs, from data pre-processing and predictive technique, to post-processing of scores into meaningful operational practices. Embracing a standard has never felt better.
Resources
- Read the book PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics (May 2010).
- What is PMML? Explore the power of predictive analytics and open standards (Alex Guazzelli, developerWorks, September 2010): Review the basics. PMML enables the instant deployment of predictive solutions. It is the de facto standard to represent predictive analytic models and is supported by the top commercial and open source statistical tools.
- Representing predictive solutions in PMML: Move from raw data to predictions (Alex Guazzelli, developerWorks, September 2010): Learn how PMML represents predictive modeling techniques. Dive even deeper into the language and explore the data representations, transformations, and functions that represent a complete predictive solution.
- The Data Mining Group (DMG) is an independent, vendor-led consortium that develops data mining standards, such as the Predictive Model Markup Language (PMML).
- Visit the Zementis PMML Resources page to explore complete PMML examples.
- Visit the PMML page in Wikipedia.
- Visit the Predictive Analytics page in Wikipedia.
- Visit the Data Mining page in Wikipedia.
- Join the PMML discussion group in LinkedIn.
- Visit the IBM developerWorks Industry Zone for all the latest industry-specific technical resources for developers.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
Get products and technologies
- IBM SPSS Statistics 20 (formerly SPSS Statistics) puts the power of advanced statistical analysis in your hands. Whether you are a beginner or an experienced statistician, its comprehensive set of tools will meet your needs.
- ADAPA is a revolutionary predictive analytics decision management platform, available as a service on the cloud or on site. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.
- IBM WebSphere Application Server: Build, deploy and manage robust, agile and reusable SOA business applications and services of all types while reducing application infrastructure costs with IBM WebSphere Application Server.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Participate in developerWorks blogs and get involved in the developerWorks community.