Using operational policies to implement a sophisticated and heavily automated operations management workflow.
Whether you have a fully automated IT operations environment that is complemented with advanced AI or rely entirely on manual triage and remediation, a large part of operations management is answering questions and making decisions. For example:
- What does an event represent?
- Does it indicate a fault?
- If it does, what is the fault?
- What action should be taken to fix it?
The faster and more accurately you can answer such questions and make decisions, the more effective you will be at ensuring your customers are not impacted by service disruptions.
Making decisions at super-human speed
With traditional IT Operations, many of these decisions are manual; events are received and then triaged by operators in order of severity. Then it is up to first-line operators to quickly understand each event, determine its priority and resolve it themselves, if possible. Where this is not possible, they must determine who is best placed to investigate it and raise a ticket against the appropriate team.
This process becomes increasingly difficult to manage as an organisation’s IT infrastructure expands and the number of events grows exponentially.
The logical response to this is to try answering questions and making decisions ahead of time. For example, if you know that events from a given router are owned by a specific team, tag those events with the name of that team so the operator knows where to route them. Going one step further, start automatically raising tickets against that team, so the operator doesn’t have to triage the event at all. By capturing this knowledge ahead of time, you reduce the time your team needs to spend gathering information after a problem has already occurred. This results in a much shorter mean time to resolution (MTTR).
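As a rough illustration of the idea, the sketch below tags each incoming event with its owning team and then raises a ticket automatically. It is a minimal sketch, not how any IBM product implements this: the `Event` fields, the `OWNERS` table, the device and team names, and the in-memory ticket list are all hypothetical stand-ins for a real event feed and ticketing integration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership table captured ahead of time:
# which team owns events from which device.
OWNERS = {
    "router-edge-01": "network-ops",
    "db-primary-02": "database-ops",
}


@dataclass
class Event:
    source: str                      # device that emitted the event
    summary: str                     # short description of the event
    severity: int                    # higher means more severe (assumed scale)
    owner: Optional[str] = None      # filled in by tag_owner
    ticket_id: Optional[str] = None  # filled in by raise_ticket


def tag_owner(event, owners=OWNERS):
    """Record the owning team on the event, if it is known ahead of time."""
    event.owner = owners.get(event.source)
    return event


def raise_ticket(event, tickets):
    """Stand-in for a ticketing integration: append to an in-memory list."""
    ticket_id = f"TCK-{len(tickets) + 1:04d}"
    tickets.append({"id": ticket_id, "team": event.owner, "summary": event.summary})
    event.ticket_id = ticket_id
    return ticket_id


def process(event, tickets):
    """Tag the event, then raise a ticket only when an owner is known."""
    tag_owner(event)
    if event.owner is not None:
        raise_ticket(event, tickets)
    return event


tickets = []
known = process(Event("router-edge-01", "Link down", 5), tickets)
unknown = process(Event("unknown-host", "CPU high", 3), tickets)
print(known.owner, known.ticket_id)
print(unknown.owner, unknown.ticket_id)
```

Events from a device with no known owner are left untagged and without a ticket, so they still fall through to an operator for manual triage.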
Bringing artificial intelligence (AI) into the picture means that not only can some decisions be made ahead of time, but the system can also determine possible solutions without any manual input or configuration. The decisions made are the same as before — the only difference is that a human doesn’t need to make them.
Capturing those decisions in an understandable way
When your business depends on it, being able to understand the decisions being made in your operations environment is incredibly important. This holds true for decisions made manually and for those made by AI. There is, therefore, a need for a central location where all these decisions can be understood, reviewed and, in some cases, modified.
Within IBM Cloud Pak® for Watson AIOps, we call a set of these decision points an Operational Policy. Regardless of whether these decisions are AI-generated or defined by users, Operational Policies should capture and explain these decisions in the same way. One major advantage of this approach is that you don’t have to have data science expertise to understand what the AI is doing.
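To make the idea of capturing decisions "in the same way" more concrete, here is a minimal sketch of a single record type that describes a decision identically whether a user or an AI produced it. The `Policy` class, its fields, the origin labels and the example policies are illustrative assumptions, not the Cloud Pak for Watson AIOps policy format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    origin: str     # "user" or "ai" (hypothetical labels)
    condition: str  # plain-language description of when the policy applies
    action: str     # plain-language description of what the policy does

    def explain(self) -> str:
        """Describe the decision the same way regardless of who created it."""
        return f"[{self.origin}] {self.name}: when {self.condition}, {self.action}."


policies = [
    Policy("route-router-events", "user",
           "an event comes from an edge router", "assign it to network-ops"),
    Policy("group-related-alerts", "ai",
           "alerts repeatedly occur together", "group them into one incident"),
]

for policy in policies:
    print(policy.explain())
```

Because both kinds of policy share one shape, a reviewer can read an AI-generated decision exactly as they would read a hand-written one.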
What capabilities are available now
Policies are not a new concept; similar capabilities are available today within the Event Manager module of Cloud Pak for Watson AIOps and within Netcool Operations Insight. These existing capabilities, listed below, provide powerful ways to automate your operations management decisions:
- Probe rules files allow you to express decisions about which events are important enough to show to operators and what information should be presented to them at the point of collection.
- Object server automations allow you to express decisions about alerts and how an operator interacts with them, including decisions that require information from multiple alerts.
- Impact policies allow you to gather information from a diverse set of data sources and use this to make automated decisions about your operations environment. They also allow you to make use of outbound integrations (e.g., automatically raising tickets or running automations).
- Cloud Native Event Analytics policies (and their earlier incarnation, On-premises Event Analytics policies) are automatically created by AI and make decisions about which alerts are caused by the same issue (correlation) and which alerts follow a seasonal pattern.
Making use of these capabilities, it is already possible to implement a sophisticated and heavily automated operations management workflow.
Looking to the future — centralised, easy-to-use policy management
While the existing capabilities are very powerful, there is still room for improvement. There is currently no central place to manage and review these decisions; doing so requires you to switch between several different tools.
However, over the next few releases of IBM Cloud Pak® for Watson AIOps, we are building a brand-new policy system that takes the best parts of the existing technologies described in the previous section and puts them all in one place. In future blogs, we will describe how this policy system can resolve real-world operations management scenarios.