Improving Service Availability with Analytics
EdieStern
The state of the art in managing IT has left us increasingly rich in data. We monitor physical and virtual infrastructures. We monitor middleware, applications and end-user responses. We even sniff packets. Additionally, we have events, log files and core data to deal with. A typical 5,000-server data center generates about 1 TB of unstructured data and over 30 GB of structured data every day. There's far more data than anyone can manually interrogate. Luckily, analytics gives us the tools to leverage this sea of data to improve the way we manage IT.
Using analytics for operations is not a new idea. We've been using analytics for years for capacity planning, for performance management, and for the creation of dynamic thresholds. What is new is that sophisticated analytics can provide more and deeper insights, operating only on the data already created as part of operations management in the data center. It's a perfect match with one of the biggest pain points in operations: detecting and predicting problems, and quickly isolating their root causes.
Think about the history of data center problem detection and isolation. It has continually evolved and become more and more sophisticated. We started with "management by trouble ticket", responding to irate calls from users. Event management came next, allowing operations to get a head start on problems by receiving and correlating events and traps that signalled issues. The next challenge was performance. We monitored metrics such as CPU utilization and network throughput and created alerts when the metrics exceeded a preset threshold. Over time, monitoring of these Key Performance Indicators, or KPIs, became more useful, with thresholds set dynamically based on the historical performance of each metric. That's where most companies are with their management today. In addition to event management, they monitor performance metrics and use threshold crossings to understand when problems are developing. It's time for another step. Analytics is taking us to the next stage.
Let's take just one example. The historical approaches look at each metric in isolation. In addition to monitoring and analyzing each metric individually, analytics today lets us look at the performance KPIs together and understand what normal patterns of behavior look like. That is, we can derive mathematical relationships between KPIs. For example, perhaps under normal conditions KPI 1 and KPI 2 go up and down together, or KPI 1 goes down when KPI 2 goes up. Because behavioral learning works out what normal patterns look like, no human has to model the incredible complexity of a large production data center busily running a large number of complex applications. Once analytics provides this kind of model, the next step is to score new performance data against it and understand when those patterns are broken. When a pattern is broken, we've identified an anomaly. This is a very effective way to detect emerging problems in the data center. Once the anomaly is detected, remediating action can be taken, thereby preventing a disruption in service.
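To make the idea concrete, here is a minimal sketch in Python of learning a relationship between two KPIs and scoring new data against it. Everything in it is an assumption for illustration: the KPI names (`requests`, `cpu`), the sample numbers, the choice of a simple straight-line fit, and the three-standard-deviation cut-off. Real behavioral-learning analytics would typically use richer models across many KPIs; this only shows the shape of the approach.

```python
from statistics import mean, stdev

def fit_relationship(x, y):
    """Learn a least-squares line y ~ slope * x + intercept from normal data,
    plus the spread (standard deviation) of the residuals around that line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    return slope, intercept, stdev(residuals)

def is_anomaly(model, x_new, y_new, k=3.0):
    """Flag a new (x, y) pair whose deviation from the learned line exceeds
    k residual standard deviations (k=3.0 is an assumed cut-off)."""
    slope, intercept, sigma = model
    return abs(y_new - (slope * x_new + intercept)) > k * sigma

# Hypothetical history gathered under normal conditions:
# requests per second and CPU utilization (%) rise and fall together.
requests = [100, 120, 140, 160, 180, 200, 220, 240]
cpu = [21, 24, 27, 33, 35, 41, 43, 48]

model = fit_relationship(requests, cpu)

# 30% CPU at 150 requests/s fits the learned pattern.
print(is_anomaly(model, 150, 30))
# 45% CPU is within the range CPU has shown before, so a per-metric
# threshold would stay quiet, but at only 150 requests/s it breaks
# the learned relationship between the two KPIs.
print(is_anomaly(model, 150, 45))
```

The second reading is the interesting one: neither KPI is unusual on its own, yet the pair is, which is exactly the kind of problem that looking at each metric in isolation would miss.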
Increasingly, management is not just about the data you collect. It's what you do with it.