Site Reliability Engineers (SREs) face an ever-increasing wave of complexity as environments continue to grow in both size and inter-connectedness.
Each environment has its own technologies and its own monitoring tools.
Watson AIOps already monitors both cloud native and traditional environments, where it gives the SRE a holistic diagnosis using artificial intelligence (AI) to intelligently interpret structured data (e.g., metrics), semi-structured data (e.g., application logs), and unstructured data (e.g., ticket data). It detects anomalies in both metrics and logs as they occur, groups multiple signals together to avoid alert storms, uses a single topology to show both the cause and impact areas, and recommends the next best actions from existing tickets.
The mainframe, and its Z ecosystem, is a common blind spot for many organizations that don’t have the ability to understand what’s happening in their IBM Z environment and how it’s related to workloads running in their hybrid enterprise. The new Watson AIOps for IBM Z solution leverages AI to pull back the veil on the mainframe, providing the SRE with visibility into the Z world. SREs can now use alerts and events on metric and log anomalies from Z and across their enterprise, and correlate them for incident and impact analysis. This enables rapid service restoration and the ability to proactively remediate incidents before they cause a service disruption.
Mainframe data crucial to achieving end-to-end visibility
Below is an architectural view of Watson AIOps for IBM Z and how the IBM Z ecosystem integrates into Watson AIOps to enable end-to-end visibility for incident management:
To understand how Z participates in AIOps, look at two areas of the architecture: connectivity and AI.
Each of the components in Watson AIOps for IBM Z was designed with open data connectivity in mind. As such, this is simply a matter of connecting different components from the enterprise ecosystem into Watson AIOps. Logs are aggregated from multiple sources: your cloud native and traditional applications are likely already feeding their logs into a common log store (e.g., Humio or Elastic), and your IBM Z logs can be streamed by IBM Z Operations Analytics into Watson AIOps. Vital events/alerts from IBM Z components are easily consumed by Watson AIOps for Z to provide crucial information as part of incident analysis. For example, when key Z metrics become anomalous, IBM Z Operations Analytics sends alerts as events to Watson AIOps. Similarly, OMEGAMON alerts on threshold breaches are sent as events and consumed by Watson AIOps.
Flexibility and open data connectors: it’s a beautiful thing.
Complete the picture with AI
The other side of the coin is the AI. Watson AIOps for IBM Z brings together leading AI technologies from Watson AIOps and IBM Z Operations Analytics. Combined, they produce a real multiplier effect. The algorithms from IBM Z Operations Analytics find anomalies on key Z metrics using machine learning and AI.
The system learns what normal operations look like for your Z environment (by understanding how key Z metrics behave on days when the system is running smoothly) and can detect when something on Z starts to move away from that baseline (one or more Z metrics are acting differently than is normal for that timeframe). These insights are then consumed by Watson AIOps, where they are combined with other events and log anomalies.
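To make the idea of a per-timeframe baseline concrete, here is a minimal sketch. It is not the algorithm used by IBM Z Operations Analytics, which the article does not describe; it assumes a simple mean and standard deviation per hour of day, and the function names, sample values, and threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(samples):
    """Build a per-hour baseline from (hour_of_day, value) pairs
    collected on days known to be healthy (an assumption of this sketch)."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two data points, so skip sparse hours.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_anomalous(baseline, hour, value, threshold=3.0):
    """Flag a value that sits more than `threshold` standard deviations
    from what is normal for that hour."""
    if hour not in baseline:
        return False  # no baseline for this hour, so no verdict
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical metric history: (hour of day, transactions per second).
history = [(9, 100.0), (9, 104.0), (9, 98.0), (9, 102.0),
           (14, 250.0), (14, 245.0), (14, 255.0)]
baseline = build_baseline(history)
```

With this data, a reading of 250 is ordinary at 14:00 but would be flagged at 09:00, which is the point of judging a metric against what is normal for that timeframe rather than against one fixed threshold.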
For the log anomalies, the AI in Watson AIOps first classifies each log message into clusters specific for that given application using various natural language processing (NLP) techniques, then builds a baseline by learning which sequences to expect in normal and abnormal situations. This allows it to work with any application, both off-the-shelf products and custom-developed ones. There is even a separate AI model that learns how to describe these given clusters so as to better explain its findings.
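The two-step idea (cluster messages, then learn expected sequences) can be illustrated with a deliberately simple sketch. It swaps the NLP clustering for regex masking of variable tokens and learns only the pairs of consecutive clusters seen in normal operation; the actual Watson AIOps models are not described in this article, and every name and log line below is hypothetical.

```python
import re

def to_template(message):
    """Collapse variable parts (hex ids, numbers) so that similar
    messages fall into the same cluster key."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    message = re.sub(r"\d+", "<NUM>", message)
    return message.strip()

def learn_transitions(normal_logs):
    """Record which cluster follows which during normal operation."""
    templates = [to_template(m) for m in normal_logs]
    return set(zip(templates, templates[1:]))

def unseen_transitions(known, new_logs):
    """Return consecutive cluster pairs never observed in the baseline."""
    templates = [to_template(m) for m in new_logs]
    return [pair for pair in zip(templates, templates[1:]) if pair not in known]

# Hypothetical application log from a healthy period.
normal = [
    "Request 101 received", "Transaction 101 started", "Transaction 101 committed",
    "Request 102 received", "Transaction 102 started", "Transaction 102 committed",
]
known = learn_transitions(normal)

incoming = ["Request 203 received", "Transaction 203 started",
            "Transaction 203 rolled back"]
anomalies = unseen_transitions(known, incoming)
```

Here `anomalies` contains the single pair ending in `Transaction <NUM> rolled back`, a sequence the baseline never saw. Because nothing in the sketch is tied to a particular message format, it hints at why a learned baseline can apply to both off-the-shelf and custom applications.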
Add in the topology of Z resources and relationships feeding into Watson AIOps to be knitted together with discovered topology across the rest of the enterprise, and the SRE has a complete picture to understand the impact radius for an incident (i.e., what are all the components across your enterprise that may be impacted if the incident is not resolved).
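One way to picture the impact radius is as a walk over a dependency graph: starting from the failing component, collect everything that depends on it, directly or indirectly. The sketch below assumes the merged topology can be expressed as a "depends on" mapping; the component names and the `impact_radius` function are hypothetical and do not reflect the Watson AIOps topology API.

```python
from collections import deque

def impact_radius(depends_on, failing):
    """Return every component that transitively depends on `failing`."""
    # Invert the mapping: for each component, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    impacted = set()
    queue = deque([failing])
    while queue:  # breadth-first walk up the dependency chain
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted and parent != failing:
                impacted.add(parent)
                queue.append(parent)
    return impacted

# Hypothetical merged topology: distributed components and Z subsystems.
topology = {
    "web-frontend": ["order-service"],
    "order-service": ["cics-region"],
    "cics-region": ["db2-subsystem"],
    "reporting-batch": ["db2-subsystem"],
    "static-site": [],
}
impacted = impact_radius(topology, "db2-subsystem")
```

In this example a problem in `db2-subsystem` reaches `cics-region`, `reporting-batch`, `order-service`, and `web-frontend`, while `static-site` is unaffected. Without the Z half of the graph, the walk would stop at the boundary of the distributed topology.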
Once the SRE understands the problem, they next need to identify the best way to resolve it and get the application back online. Turning everything off and then back on isn’t always the best approach here. Watson AIOps uses the information that is locked away in your historical incident tickets to assist with the best course of action. Instead of doing a normal keyword search, Watson AIOps uses NLP techniques to understand the intent of each ticket and then searches these to find the best set of similar incidents. It then proposes the best actions to take next by distilling the resolution for each ticket into an entity/action. The result is that the SRE can leverage how the same or similar issues have been resolved in the past, and they can follow the same runbook or set of steps to resolve the issue.
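The retrieval step can be sketched as ranking past tickets by similarity to the new incident and surfacing the resolution of the closest match. Note the swap: the sketch uses plain token-overlap cosine similarity, which is much closer to keyword matching than the intent-aware NLP described above, and it serves only to show the shape of the workflow. The ticket data and function names are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(query, tickets, top_n=1):
    """Rank historical tickets by similarity of their summary to the query."""
    q = vectorize(query)
    ranked = sorted(tickets,
                    key=lambda t: cosine(q, vectorize(t["summary"])),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical historical tickets with their recorded resolutions.
tickets = [
    {"summary": "Db2 lock timeout causing slow transactions",
     "resolution": "restart stuck batch job"},
    {"summary": "Web server certificate expired",
     "resolution": "renew certificate"},
    {"summary": "Disk full on log volume",
     "resolution": "archive old logs"},
]
best = most_similar("transactions slow because of Db2 lock contention", tickets)[0]
```

For this query, `best["resolution"]` is the action recorded on the Db2 lock ticket. An intent-aware model would also match tickets that describe the same problem in different words, which this token-based stand-in cannot do.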
Bring it all together
Let’s take a concrete example that puts it all together. You have a distributed application that uses a web server frontend and leverages components on IBM Z to provide backend transaction processing. Watson AIOps for Z has visibility into the supporting Z components, such as the Db2 and CICS subsystems. Alerts/events for anomalies on logs and key metrics for these Z subsystems participate in the incident analysis along with the alerts/events on anomalies on the distributed side, such as anomalies identified in the application log. The SRE can look at all of the anomalies and topology included in the ticket to understand what’s happening across their enterprise and determine where the issue lies, allowing them to resolve the issue quickly to restore service or remediate the issue before service is even impacted.
The result is that Watson AIOps for IBM Z puts advanced AI and incident analytics together, saving Site Reliability Engineers time and money by diagnosing and resolving problems more quickly across a continuously evolving hybrid landscape.
For more information about this exciting new solution that gives you visibility across your enterprise, see our Watson AIOps page.