November 17, 2011 | Written by: Alfredo Olivieri
In my previous life as a systems management specialist, I often warned my customers about “infrastructure management noise,” an issue that arises when too many monitoring thresholds are set and large volumes of events are generated as a result, which makes it difficult to identify real problems. You often end up with so many alarms that people simply start to ignore them.
The next step, therefore, is often to try to correlate events so that you can isolate actual problems and identify their root causes. This step can be particularly difficult in complex environments, and it can require significant expertise in the relationships between the metrics you are monitoring, especially if you want to proactively identify the issues that are likely to affect your business.
So, what does event correlation have to do with the cloud? Well, you probably agree that monitoring your cloud resources is vital if your cloud services are to be provided successfully. However, although virtualization delivers a number of benefits, it also increases the complexity of the environment you have to monitor; it is therefore more difficult to understand whether a performance issue is caused by the network, the storage, the hypervisor, the physical machines, the virtual machines, the middleware, or the applications installed on top of them. Moreover, whether you are a cloud service provider with tight service level agreements (SLAs) for the services you offer to your customers, or you manage a private cloud where any disruption can directly impact your business, you are probably interested both in effectively correlating the alarms coming from all the monitored components and in detecting potential performance issues as proactively as you can.
Can standard correlation help you? Probably, if you have enough expertise to understand the relationships between metrics and enough people to effectively correlate the related events. Is it the best approach? Probably not, according to what I recently learned about IBM Tivoli Analytics for Service Performance, announced as part of the IBM cloud launch and expected to be available sometime in the first half of 2012. Of course, because the product has not been officially released yet, this blog entry cannot describe its features in detail. Still, let’s try to understand at least its philosophy, keeping in mind that nothing is set in stone until the product is generally available.
Instead of only correlating events, Tivoli Analytics for Service Performance takes the performance and metric data collected by the IT monitoring and performance management solutions in which the customer has already invested, and applies advanced analytics technologies that can discover complex relationships between metrics and learn the normal operational behavior of the IT environment. Based on this learned behavior, it is able to detect anomalous conditions indicative of faults and can predict when problems will become service-impacting.
Let’s try to understand it better with a simple example. Assume you want to monitor the health of a car engine using basic metrics, such as engine temperature, oil pressure, battery charge, and so on. With standard correlation, also called the univariate approach, each metric is considered in isolation. For example, the following picture shows how two problems that occur simultaneously (such as a blown oil gasket and a battery losing charge) can generate three alerts that you will then need to correlate: a threshold violation for engine temperature, a violation for oil pressure, and a violation for the battery.
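To make the univariate case concrete, here is a minimal sketch in Python of per-metric static thresholds. The metric names, limits, and sample values are all made up for illustration; they do not come from any real product or vehicle.

```python
# Hypothetical static thresholds: each metric is checked in isolation.
# "max" means alert when the value rises above the limit,
# "min" means alert when it falls below. All numbers are illustrative.
THRESHOLDS = {
    "engine_temp_c": ("max", 110.0),
    "oil_pressure_psi": ("min", 20.0),
    "battery_volts": ("min", 11.8),
}


def univariate_alerts(sample):
    """Return one alert string per metric that violates its own threshold."""
    alerts = []
    for metric, (kind, limit) in THRESHOLDS.items():
        value = sample[metric]
        violated = value > limit if kind == "max" else value < limit
        if violated:
            alerts.append(f"{metric}={value} violates {kind} threshold {limit}")
    return alerts


# Two simultaneous problems (oil gasket, battery) show up as three alerts,
# and nothing in the output says which alerts belong together.
sample = {"engine_temp_c": 118.0, "oil_pressure_psi": 12.0, "battery_volts": 11.2}
for alert in univariate_alerts(sample):
    print(alert)
```

The point of the sketch is what is missing: the checks know nothing about how the metrics relate, so grouping the three alerts into two root causes is left to whoever reads them.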
With the multivariate approach, the relationships between these metrics are discovered in advance by analyzing how they behave; based on that knowledge, any deviation from the expected relationships between metrics generates an alert. In this case, correlation is performed automatically at the metrics level, and alerts are generated only for real problems, as shown in the following picture.
Furthermore, by detecting deviations between metrics that normally move together, the multivariate approach helps to detect problems sooner. For instance, engine temperature and engine revolutions normally move together, which is healthy system behavior. However, when engine temperature deviates from engine revolutions, as can happen with a coolant leak, this indicates a problem and an alert is generated. The following picture shows the different behaviors of the univariate and multivariate approaches. In the first case, on the left, a static threshold is set on the engine temperature and an alert is generated only when the temperature goes above it. In the second case, on the right, the relationship between temperature and revolutions is monitored: when the relationship differs from what is normally expected (that is, when the temperature starts rising while engine revolutions decrease), an anomalous situation is detected and an alert is generated much earlier than with the static threshold of the first case.
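The temperature and revolutions example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "learned relationship" is a straight-line fit on made-up healthy data, and an alert fires when a reading strays more than a chosen number of standard deviations from that line. The product's actual algorithms are not public in this post, so nothing here should be read as how it works internally.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical healthy readings: temperature rises with revolutions.
train_rpm = [1000, 1500, 2000, 2500, 3000, 3500, 4000]
train_temp = [71.0, 75.5, 79.0, 85.5, 90.0, 94.0, 100.5]

slope, intercept = fit_line(train_rpm, train_temp)
residuals = [t - (slope * r + intercept) for r, t in zip(train_rpm, train_temp)]
sigma = pstdev(residuals)  # typical scatter around the learned relationship

STATIC_LIMIT = 110.0  # illustrative univariate threshold on temperature
K = 4.0               # illustrative tolerance, in standard deviations


def first_alert(readings):
    """Return (index of first static alert, index of first relationship alert)."""
    static_at = relation_at = None
    for i, (rpm, temp) in enumerate(readings):
        if static_at is None and temp > STATIC_LIMIT:
            static_at = i
        expected = slope * rpm + intercept
        if relation_at is None and abs(temp - expected) > K * sigma:
            relation_at = i
    return static_at, relation_at


# Simulated coolant leak: revolutions fall while temperature keeps rising.
leak = [(3000, 90.0), (2800, 92.0), (2600, 96.0),
        (2400, 101.0), (2200, 106.0), (2000, 112.0)]
print(first_alert(leak))
```

With these made-up numbers, the relationship check fires at the second reading, while the static threshold is crossed only at the last one. How much earlier a real system would react depends entirely on the data and on the model used.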
Now, let’s look at the components that make such an approach feasible, by analyzing the architecture of Tivoli Analytics for Service Performance.
The architecture shown in the following picture includes:
- A highly scalable and flexible data mediation layer, which provides turn-key integrations and easily extensible capabilities to collect both real-time and historical data through a large framework of connectors
- A set of powerful analytic algorithms, combining univariate and multivariate approaches, based on IBM InfoSphere Streams (a streaming analytics engine resulting from five years of IBM research), which provides a highly scalable engine (millions of events a second) and can perform incremental analysis on each data element as it arrives, while at the same time learning and refining its model accordingly (one instance learning, one running)
- A presentation layer that displays analytics data in Tivoli Integrated Portal (TIP in the picture) and shows predictive events in Tivoli Netcool OMNIbus or in third-party event consoles through SNMP
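The idea of analyzing each data element as it arrives, while refining the model at the same time, can be illustrated with a small Python sketch. It does not use InfoSphere Streams or any IBM API: the model is just a running mean and standard deviation maintained with Welford's online algorithm, and the class name, warm-up length, and limit are assumptions chosen for the example.

```python
import math


class RunningStats:
    """Running mean and standard deviation via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the model without storing past values."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def std(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0

    def zscore(self, x):
        s = self.std()
        return 0.0 if s == 0.0 else (x - self.mean) / s


def process(stream, warmup=5, limit=3.0):
    """Score each element against the model learned so far, then learn from it.

    Returns the indexes of elements flagged as anomalous. Flagged elements are
    not folded into the model, so a single spike does not shift the baseline.
    """
    stats = RunningStats()
    anomalies = []
    for i, x in enumerate(stream):
        if stats.n >= warmup and abs(stats.zscore(x)) > limit:
            anomalies.append(i)
        else:
            stats.update(x)
    return anomalies


# Made-up metric stream with one obvious spike at index 6.
print(process([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1]))
```

Because each update touches only three numbers, the cost per element is constant regardless of how much history has been seen, which is the property that makes this style of analysis suitable for high event rates.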
You probably agree with me that this technology promises to be interesting, and that it could find applications in several areas. But let’s focus again on the cloud for a final example.
The following picture shows a simple architecture supporting two cloud-hosted services (online statement and payment transfer) in a virtualized environment.
Imagine that the default settings for virtual machines have been used across the hypervisor and that the interdependencies between virtual machines running on the same hypervisor are not modeled. In this case, if the load on one virtual machine increases, it is likely to interfere with the others, and there might be no related alarms unless you are able to model all the possible interdependencies (which is not easy in an environment as dynamic as a virtualized one). An alarm might be generated for poor “Payment Transfer” performance (highlighted by a red circle in the picture), but the relationship to its root cause (for instance, WebSphere load) might not be obvious. Tivoli Analytics for Service Performance can instead identify such an anomaly in advance, highlighting that the poor “Payment Transfer” performance is related to anomalous metric relationships between the WebSphere, VM 12, VM 22, and Hyper 1 nodes. As this case makes clear, anomaly information and relationships can be used to augment Business Service Management models and help diagnose complex problems quickly, exposing relationships that fall outside the service model definitions during anomalous conditions.
As soon as Tivoli Analytics for Service Performance is available, I will take a deeper look to understand whether, and how well, it keeps its promises. 😉 In the meantime, I am very interested to hear your opinion about its approach. Do you think it can really help? In which other areas could it be used? Do you see any drawbacks? Leave a comment on this blog entry and we will explore the topic together.