I am a data scientist and lead development engineer working in IBM's Cloud and Smarter Infrastructure division. I'm proud to be part of the team that has developed a brand new product, Smart Cloud Analytics - Predictive Insights (SCA-PI), which is being featured at IBM Information On Demand 2013.
A feature of Predictive Insights is the simplicity of setup of our data-mining algorithms, which flies in the face of a problem known as “No Free Lunch”, and this is the subject of my post.
A well-known problem in data mining is the No Free Lunch theorem. In a nutshell, it states that all algorithms are created equal when their performance is averaged over all possible problems. What this implies is that if you perform better in one domain, you pay the price in others. This might appear to rule out the possibility of a low-configuration, data-agnostic data mining solution, so I will describe why this is not the case with Predictive Insights.
We continually strive to tune and improve the consumability of our product so that what you get out of the box provides maximum value as soon as possible. A big part of this is a data-mining system that is light on configuration and whose algorithms work out of the box, allowing you and your team to work on what you care about: proactively monitoring your systems. We say we are data agnostic in that we can absorb any performance metric and make decisions based on its historic data.
This all sounds too good to be true, right? Surely No Free Lunch scuppers our intentions? In data mining almost all algorithms have “magic parameters” that are tweaked and tuned to make an algorithm perform at its best. Surely that implies massive configuration?
In the problem domain of performance management, IBM has the dream team to develop this solution. We are a company of people, IBMers, who have the right mix of experience in development, systems, research and analytics to monitor systems at scale.
A nightmare for any data scientist is a dump of data with the task of “go find stuff” and no contextual information. The corresponding dream is a close partnership with a broad group of domain experts, learning from the data together while developing algorithms that make the data sing or, in the case of anomaly detection, squeal when it sees trouble ahead.
IBM has been in the electronic computer business from the very early days, announcing the commercial 701 electronic computer in 1952. IBM's Cloud and Smarter Infrastructure division has an unparalleled history in, and domain knowledge of, monitoring systems and applications. To go with this, we at IBM are, in my opinion, the biggest analytics company on the planet; we have an incredible living history in analytics, with headline showcases such as Deep Blue and Watson, and the software to go with it.
The Ultimate Recipe
The real data mining problem we are addressing here is that while we have expertise in the metrics we monitor, we are engulfed by their volume and desperately need to translate our domain knowledge into a set of relevant algorithms and features to be our eyes and ears. To complicate things, when we are given a metric and its associated label, we don't necessarily know the type of metric (for example CPU %, memory utilization) in a directly parseable way, because every monitoring system labels its metrics in its own way (and we can absorb metrics from any system). So what we are faced with is an incoming stream of metrics with associated tags (CPU, came from IP address X, etc.), and we need to unravel this and work out how to automatically apply our domain insights to these metrics.
A first important step has already occurred: we have narrowed our vision to metrics that are read in a time-ordered fashion. This is a vital piece of information, and it's the start of our path. One observation of a KPI is followed in time by another observation, and this domain knowledge is baked into the heart of our algorithms.
Another vital part is that we are monitoring large-scale performance management systems. This is another crucial piece of the puzzle. We have all experienced these systems, and examples abound. Systems that live behind load balancers can exhibit correlation. Systems that are used by end users exhibit the beat and hum of human activity: our seven-day, 24-hour working cycle. Batch processes are run by cron jobs at specified times. When data is transferred from one machine to another, the bytes out of one machine directly affect the bytes in on the receiving machines. The list goes on, but behind every story we know from our history of monitoring performance metrics is a pattern we can mine.
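To make the "beat and hum" concrete, here is a minimal sketch of how a daily cycle can be spotted in a metric with plain autocorrelation. It is an illustration only, not the algorithm used in Predictive Insights; the series `hourly_load` is synthetic and every name in it is hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def autocorrelation(series, lag):
    """Correlation of a series with itself shifted by `lag` samples."""
    return pearson(series[:-lag], series[lag:])

# Hypothetical hourly metric: a clean 24-hour cycle observed for 14 days.
hourly_load = [50 + 30 * math.sin(2 * math.pi * t / 24) for t in range(24 * 14)]

daily = autocorrelation(hourly_load, 24)     # samples one full cycle apart
half_day = autocorrelation(hourly_load, 12)  # samples half a cycle apart

print(f"lag 24h: {daily:+.2f}, lag 12h: {half_day:+.2f}")
```

A strong positive value at the 24-hour lag and a strong negative value at the 12-hour lag is the signature of a daily rhythm. Real metrics are noisier, so the values would be less extreme, but the same idea applies to weekly cycles with a lag of 168 hours.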
Lunch is served
Suddenly we have gone from a data-blind domain with no information about what we monitor to one where we have a rich depth of information with contextualized patterns and signals. We are now hunting with x-ray vision. The years of experience that IBM's Cloud and Smarter Infrastructure division has in monitoring systems and applications have been married with data mining to drive features and algorithms. This combination enables IBM's experience in monitoring to be encoded into an automated insight engine.
One of the many examples of this extraction of domain knowledge is that our algorithms use the insight that, in computer systems, a change in the value of one metric can affect or cause a later change in the behavior of another metric. Clive Granger, who won the Nobel Prize in economics, devised a statistical test to determine whether one time series helps to predict the future values of another. This test is known as Granger causality. Using our domain knowledge that metrics can have a causal effect on one another, we have worked with the IBM Watson research team to incorporate causality detection into the way we look at metrics and uncover faults.
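The idea behind a Granger test can be sketched in a few lines. The code below is a deliberately simplified, single-lag version written with only the Python standard library: it asks whether adding the previous value of one metric reduces the prediction error for another metric, beyond what that metric's own previous value already explains. It is not the implementation used in Predictive Insights, it reports only an F-statistic (no p-value), and the `bytes_out` and `bytes_in` series are synthetic, hypothetical data.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(rows, targets):
    """Residual sum of squares of an ordinary least squares fit (normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(bi * ri for bi, ri in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(cause, effect):
    """F-statistic: does cause[t] improve the prediction of effect[t+1]?"""
    n = len(effect) - 1
    targets = effect[1:]
    restricted = [[1.0, effect[t]] for t in range(n)]              # own lag only
    unrestricted = [[1.0, effect[t], cause[t]] for t in range(n)]  # plus the cause's lag
    rss_r, rss_u = rss(restricted, targets), rss(unrestricted, targets)
    return (rss_r - rss_u) / (rss_u / (n - 3))  # one added regressor, three fitted terms

# Hypothetical data: bytes received follow bytes sent one interval earlier.
rng = random.Random(0)
bytes_out = [rng.gauss(100, 10) for _ in range(300)]
bytes_in = [100.0] + [0.8 * bytes_out[t - 1] + rng.gauss(0, 1) for t in range(1, 300)]

forward = granger_f(bytes_out, bytes_in)   # sender -> receiver
backward = granger_f(bytes_in, bytes_out)  # receiver -> sender
print(f"out -> in: F = {forward:.1f}, in -> out: F = {backward:.1f}")
```

A large F-statistic in one direction and a small one in the other suggests that the first metric leads the second. A production system would need to choose lag lengths, handle trends and seasonality, and correct for the very large number of metric pairs being tested.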
This enables IBM's experience in monitoring to be encoded into an automated insight engine, which is equivalent to a master chef using their experience to cook your data mining free lunch so you can focus on proactively monitoring your environment.
We hope you enjoy your proactive data-mining “free lunch”, our new product IBM Smart Cloud Analytics - Predictive Insights (SCA-PI). The final piece of the puzzle is you, our customers. No solution is ever perfect, so we want to work with you in an agile partnership to make our product better and find the problems that you feel are most pressing. Please tip the chef by using our agile feedback loop.