A look at managing an incident both with and without Watson AIOps.
Chapter 5 of "The Art of Automation" presented the vision of AI in IT Operations management and the techniques that are developed at IBM as part of the IBM Cloud Pak® for Watson AIOps product. In this article, we will illustrate some of the capabilities of AIOps and the way they can (and will) change the way we solve production problems.
Incident Management is the practice of restoring a damaged service as quickly as possible. In this post, we’ll describe the Incident Management process twice, first without AIOps and then with AIOps, to show the differences and the advantages that AIOps brings.
The process will flow along a timeline from left to right.
Without AIOps: Before the incident occurs
The first sign that a problem is going to occur can often be found in logs. Unfortunately, these early signals are usually used for diagnostics and investigation after incidents have occurred and not to avoid them in the first place.
Another problem with log collection is that it takes significant effort to configure log collection solutions to understand the meaning of the logs: which fields represent what, which messages represent correct behaviour and which logs signify real problems. This parsing and analysis is usually a manual task.
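To make the manual effort concrete, here is a minimal sketch of the kind of hand-written parsing rule described above. The log format, the field names and the list of "problem" levels are all assumptions invented for illustration; every real log source would need its own pattern, and each one has to be maintained by hand.

```python
import re

# Hypothetical log format, invented for illustration:
#   2021-06-01T12:00:00Z payment-service ERROR Connection refused
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<service>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)

# Hand-maintained rule deciding which levels count as real problems.
PROBLEM_LEVELS = {"ERROR", "FATAL"}


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["is_problem"] = record["level"] in PROBLEM_LEVELS
    return record


sample = "2021-06-01T12:00:00Z payment-service ERROR Connection refused"
parsed = parse_line(sample)
print(parsed["service"], parsed["is_problem"])
```

Any log line that deviates from the expected layout returns `None` and is silently lost to analysis, which is one reason rules like this are costly to keep up to date.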
The next sign that a problem is about to occur will come from the various metrics that are collected from monitoring systems. The problem here is that most organizations have many monitoring solutions, each one responsible for collecting information from a specific aspect of the environment: from infrastructure through networking and storage to virtualization and orchestration, and from operating systems through middleware to application performance management. Traces are included here as a special type of metric, a kind of combination of metric and log. In any case, a tremendous amount of information is collected, and it’s often siloed into different problem domains instead of being combined into a holistic view.
Often, so much information is collected that first responders end up drowning in a sea of false-positive and false-negative signals. These have to be adjusted manually, which is time consuming and lowers the trust in the solution. This is one of the reasons that, while many metrics are collected before a problem arises, very few of them are actually used to avoid problems occurring.
Without AIOps: Once the incident occurs
When a problem does occur, after we’ve missed the logs and metric signals, it almost always involves a topology change of some kind — a component going up, a component going down or moving from one state to another. There are many topology tools that collect information, but again, most of them are tied to a specific level — perhaps the network layer, maybe the virtualization layer, perhaps the application layer and so on. There are also traditional Configuration Management Databases (CMDBs) that add the business relevance layer on top of the technology, but these are usually difficult to keep up to date.
Regardless, it’s a major effort to keep track of the entire end-to-end, top-to-bottom topology of the environment, let alone be sure that all the changes that represent the problem are caught by the system.
These three categories of data (logs, metrics with their traces, and topology) together make up the “Observability” of the service or application that is monitored. Through the collection of logs, metrics, topology and traces distributed across the many technology domains, we can understand what’s going on within the service, why it’s behaving the way it’s behaving and where the weak spots are.
This is in theory, of course; in reality, we only have a partial view of the service since it’s so difficult to collect and cross-reference all this information.
Once the problem occurs, the collected data moves from the domain of Observability (what’s going on inside the service) and into that of Availability (information about how the service is currently affecting other services and end users).
Availability information is simpler than Observability, but by the time it comes into action, the problem has already occurred.
Once the problem occurs, a notification event or alert is created and sent to first responders. Since there are so many potential data sources — many logs, multiple metric collectors, topology mappers — many events will be generated that notify us about a single incident from different view points. It can be overwhelming to try to solve so many problems simultaneously without understanding that they are actually aspects of the same one.
Without AIOps: After the incident has occurred
The documented “source of truth” regarding an incident is a ticketing system that records what the problem is and who’s responsible for solving it. So, while the first signs of the problem showed up at T-2 and the problem actually occurred at T0, humans only start to solve it somewhere between T1 and T2. Even then, it’s probably a single on-call engineer, no matter how complex the issue turns out to be.
At this point, at T3, the engineer is finally starting to investigate the problem and is referencing the multiple topology, metric and logging solutions to try to find information. The engineer may need to pull in more people with more expertise or access to specific observability solutions depending on the complexity of the problem.
AIOps to the rescue!
However, there’s a better way. While the humans only get there late in the process, at T3, Watson AIOps can start addressing the problem much earlier and prepare a virtual war room (or story) that will have everything the engineers need once they get involved.
In fact, Watson AIOps might be able to solve the problem earlier, without involving people at all. This means better availability for the monitored services, happier end users and less work for the engineers involved.
Watson AIOps collects data from all the different sources there may be in the organization, aggregates it and makes sense of it. Using AI models that detect anomalies early, Watson AIOps can find signals that represent potential future problems even earlier than thought possible. Before the problem arises at T0, before the clear metric signals at T-1, before the warning log messages at T-2, Watson AIOps can detect the subtle signs of potential problems.
Upon detecting a potential problem, Watson AIOps will leverage historical incident information and decide whether there’s an automated solution or runbook that can be used to solve the problem to nip it in the bud.
Even if there isn’t one, Watson AIOps will decide whether a human needs to be notified at this stage. If so, Watson AIOps can recommend the next best actions to perform to make sure that the potential problem doesn’t turn into an actual one.
Watson AIOps will also use its AI models to make sense out of the multitude of logs that the system is generating. There is no need for manual parsing of different log types to get Watson AIOps to understand what a “good” log looks like versus a “bad” one. Watson AIOps uses a number of machine learning models to automatically recognize logs that are reporting service errors. Once it detects the new problems, Watson AIOps will update its recommendations on automatic runbooks, whom to notify and what the next best actions are.
By now, Watson AIOps is detecting more signals about a potential problem and using more AI models to extract more insights. Multivariate analysis finds behavioural patterns across multiple components and alerts us when they are no longer synchronized. This kind of model helps us understand which signals are symptoms and which point at causes. Watson AIOps is constantly building and updating behavioural baselines, and whenever a component diverges from its baseline, we get an automatic alert based on the dynamic threshold that the baseline defines.
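The idea of a dynamic threshold derived from a baseline can be illustrated with a very small sketch. This is not the model Watson AIOps uses; it is a simple rolling mean and standard deviation, and the function name, window size, multiplier `k` and sample latencies are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=5, k=3.0):
    """Return indices of values that fall outside mean +/- k * stdev
    of the `window` values that immediately precede them."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        # Only judge a value once a full window of history exists.
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            if abs(value - baseline) > k * spread:
                anomalies.append(index)
        # Simplification: anomalous values also enter the baseline.
        history.append(value)
    return anomalies


# Hypothetical response times in milliseconds, with one spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 180, 101]
print(find_anomalies(latencies_ms))
```

Because the threshold is recomputed from recent behaviour, nobody has to pick a fixed limit per metric by hand. A production system would also need to handle seasonality and keep outliers from polluting the baseline, which this sketch deliberately ignores.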
And again, Watson AIOps is constantly updating the most correct runbook and next best actions in the story or “virtual war room” that it has created.
By including the end-to-end topology of the environment in its models, Watson AIOps can map the blast radius or “area of effect” of the incident. This can help with weighted analysis to prioritise between different problems in different components or to understand that there is overlap between different incidents (or stories) and that they should be merged.
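As a rough illustration of how a blast radius can be derived from topology, the sketch below walks a dependency graph breadth-first to find everything that transitively depends on a failed component. The component names and the `DEPENDS_ON` structure are hypothetical, and this is only one simple way to approach the idea, not a description of the product’s internals.

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "web-frontend": ["checkout-api", "catalog-api"],
    "checkout-api": ["payments-db"],
    "catalog-api": ["catalog-db"],
    "payments-db": [],
    "catalog-db": [],
}


def blast_radius(depends_on, failed):
    """Return the set of components that directly or transitively
    depend on `failed`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(component)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


payments_impact = blast_radius(DEPENDS_ON, "payments-db")
catalog_impact = blast_radius(DEPENDS_ON, "catalog-db")
# Components hit by both incidents hint that the stories may be related.
overlap = payments_impact & catalog_impact
print(sorted(payments_impact), sorted(overlap))
```

The intersection of two blast radii is a cheap signal that two incidents share affected components and might be candidates for merging.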
At this point, Watson AIOps is not just saying “there is a problem,” but rather pin-pointing where it is and what it’s affecting.
If the problem is truly unique (or does not have a pre-existing solution) and Watson AIOps has not been able to resolve it before it impacts the service, then alarms start being generated by the monitoring solutions. Without Watson AIOps handling the events, operators and engineers will be inundated by the multiple events generated by the different monitoring solutions, each looking at the same problem from a different angle. Not only does Watson AIOps aggregate and group all the events into a manageable set, it also helps make sense of them by doing things like detecting recurring patterns (e.g., this problem occurs every Tuesday at 3pm or that problem occurs every second Wednesday of the month). It also places all the events in a hierarchy of probable causes so it is easy to know where to start troubleshooting.
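Two of the ideas in this paragraph, grouping related events and spotting recurring patterns, can be sketched in a few lines. The example below is a deliberately naive stand-in: it groups events purely by time proximity and finds recurring weekday/hour slots by counting. The function names, the five-minute gap, the event tuples and the dates are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def group_events(events, max_gap=timedelta(minutes=5)):
    """Group (timestamp, source, message) events; a new group starts
    whenever the gap to the previous event exceeds `max_gap`."""
    groups = []
    for event in sorted(events, key=lambda e: e[0]):
        if groups and event[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def recurring_slots(start_times, min_count=3):
    """Return (weekday, hour) slots seen at least `min_count` times.
    Weekday follows datetime.weekday(): Monday is 0, Tuesday is 1."""
    counts = Counter((t.weekday(), t.hour) for t in start_times)
    return sorted(slot for slot, count in counts.items() if count >= min_count)


# Hypothetical events reported by different monitoring tools.
events = [
    (datetime(2021, 6, 1, 15, 0), "metrics", "High latency on checkout-api"),
    (datetime(2021, 6, 1, 15, 2), "logs", "Connection refused to payments-db"),
    (datetime(2021, 6, 1, 15, 3), "topology", "payments-db pod restarted"),
    (datetime(2021, 6, 1, 18, 30), "metrics", "Disk usage high on catalog-db"),
]
groups = group_events(events)

# Hypothetical incident start times: three Tuesdays around 3pm, one Thursday.
starts = [
    datetime(2021, 6, 1, 15, 0),
    datetime(2021, 6, 8, 15, 5),
    datetime(2021, 6, 15, 15, 1),
    datetime(2021, 6, 3, 9, 0),
]
slots = recurring_slots(starts)
print(len(groups), slots)
```

Time proximity alone will both over-group and under-group in practice; combining it with topology and message content is what makes the grouped set trustworthy.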
And again, as the timeline advances, Watson AIOps keeps a constant record of what it has detected and what has been affected and presents a set of recommendations for how to solve the problem.
Once the new incident results in a new ticket being opened, Watson AIOps cross-references it with the historical incidents to continue to improve its recommendations regarding the solution of the problem. While the ticket is allocated to the first responder, Watson AIOps starts inviting more and more people to view the story of the incident that it has been constantly updating.
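Cross-referencing a new ticket with historical incidents can be pictured as a similarity search. The sketch below uses plain word overlap (Jaccard similarity) between ticket summaries; it is a toy stand-in for whatever models the product actually applies, and the ticket texts, resolutions and function names are invented for the example.

```python
def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(new_ticket, history):
    """Return (score, summary, resolution) for the closest past incident.
    `history` is a non-empty list of (summary, resolution) pairs."""
    new_tokens = tokenize(new_ticket)
    return max(
        (
            (jaccard(new_tokens, tokenize(summary)), summary, resolution)
            for summary, resolution in history
        ),
        key=lambda item: item[0],
    )


# Hypothetical resolved incidents.
history = [
    (
        "checkout api timeout connecting to payments database",
        "Restart the payments database connection pool",
    ),
    ("catalog search returns stale results", "Rebuild the search index"),
]

best = most_similar("timeout connecting to payments database from checkout", history)
print(best[0], best[2])
```

The resolution of the closest past incident becomes a candidate recommendation, which the engineers can accept or reject once they join the story.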
At this point, Watson AIOps will have built the entire story of the incident from T-3, well before there was an obvious sign, all the way to T3, where humans are collaborating to solve it.
At every step along the way, Watson AIOps could have avoided or solved the problem if there had been a relevant runbook. If there was no runbook available, Watson AIOps would have informed the relevant humans much earlier than they would have known about the problem using traditional means — and with more tools at their fingertips to help solve the problem.
Even if none of these things solved the problem by the time we reach point T3 in the timeline, the fact of the matter is that Watson AIOps has already collected all the information the humans might need, aggregated and prioritised it, teased out the important insights and made it available in a simple consumable manner. This means that solving the problem, even if Watson AIOps hasn’t done so by itself, will be that much simpler.
The IBM Cloud Pak for Watson AIOps detects problems earlier, tries to solve them earlier, notifies people earlier, generates insights out of the masses of available information, cross-references with historical information and presents all this in a simple and easy manner — allowing the humans to concentrate on solving the problems instead of trying to make sense of the information.