We envision fully instrumented, observable, self-aware, automated and autonomic IT operations environments in the future. AI can help us get there.
Covered in this chapter
- The vision of an autonomic (self-aware, self-healing, self-managing) IT system
- An introduction and holistic approach to IT Operations management
- How artificial intelligence is used in ITOps management to transform it to AIOps
- How the future of AIOps "shifts left," extending operational efficiencies into the software delivery lifecycle
Progressing toward an autonomic IT system
The vision of self-aware, self-healing and self-managing Information Technology (IT) systems remained elusive until recently. Advancements in cloud computing, natural language processing (NLP), machine learning (ML), and artificial intelligence (AI) in general are now making it possible to realize this vision. AI can optimize IT Operations management processes by increasing application availability, predicting and detecting problems early, reducing the time to resolve problems, proactively avoiding problems and optimizing the resources and cost of running business applications on hybrid clouds.
In this chapter, we detail the opportunity for AI in IT Operations management and the techniques that we are developing at IBM as part of the IBM Cloud Pak® for Watson AIOps product. We will describe how semi-structured application and infrastructure logs are analyzed to predict anomalies early, how entities are extracted and linked from logs, alerts and events to reduce alert noise for IT operations admins, how NLP is put to use on unstructured content in prior incident tickets to extract next-best-action recommendations to resolve problems and how deployment change request descriptions are analyzed in combination with past incident root-cause information to predict risks of deployment changes to prevent issues from happening in the first place.
IT Operations management
Information Technology (IT) Operations management is a vexing problem for most companies that rely on IT systems for mission-critical business applications. Despite the best intentions of engineers, good designs and solid development practices, the software and hardware systems deployed in service of critical business applications are susceptible to outages, resulting each year in millions of dollars in labor, lost revenue and customer satisfaction issues. According to a recent Forbes article, based on a survey of 200 companies across North America and Europe, IT downtime costs an estimated $26.5 billion in lost revenue every year.
Even the best analytical tools fall short of detecting incidents early, predicting when they may occur, offering timely and relevant guidance on resolving them quickly and efficiently, and helping prevent them from recurring. This shortfall can be attributed to the complexity of the problem at hand.
Data volumes continue to grow rapidly as companies move to modular microservices-based architectures, further compounding the problem. Gartner estimates that the data volumes generated by IT infrastructure alone are increasing two-to-three-fold every year. Furthermore, the heterogeneous nature of environments — where companies’ IT applications can run on a mix of traditional bare metal, virtual machines and public or private clouds operated by different parties — adds to the complexity and scale that IT Operations management solutions must deal with.
To add to this, IT applications, the infrastructure that they run on and the networking systems that support that infrastructure all produce large amounts of structured and unstructured data in the form of logs, traces and metrics. The volume and variety of data generated in real time pose significant challenges for analytical tools: detecting genuine anomalies, correlating disparate signals from multiple sources and raising only those alerts that need IT Operations management teams' attention. Having best-of-breed IT Operations management tools is necessary, but not sufficient, for effective problem resolution. Such complex and dynamic environments demand a new approach to IT Operations management that is intelligent, real-time, adaptive, customizable and scalable.
Artificial intelligence (AI) can help solve these problems. AI can help IT Operations administrators, also known as Site Reliability Engineers (SREs), in detecting issues early, predicting them before they occur, reducing event and alert noise, locating the specific application or infrastructure component that is the source of the issue, determining the scope of incident impact and recommending relevant and timely actions. All these analytics help reduce the mean times to detect (MTTD), identify/isolate (MTTI) and resolve (MTTR) an incident. This, in turn, saves millions of dollars by preventing direct costs (e.g., lost revenue, penalties, opportunity costs, etc.) and indirect costs (e.g., customer dissatisfaction, lost customers, lost references, etc.).
The rest of the chapter is organized as follows. First, we describe our holistic approach to IT Operations management, elaborating on the broader opportunity for AI to optimize various problems in the domain. We then focus specifically on what we at IBM are doing to achieve the vision of self-healing, self-managing and self-monitoring IT systems. Finally, we conclude by reiterating our vision and the opportunity at hand.
A holistic approach to IT Operations management
AI enables us to take a holistic approach to IT Operations and service management [Figure 1]. Below, we elaborate our vision for applying AI to optimize IT Operations management, an approach referred to as AIOps:
From structured data alone to structured, semi-structured and unstructured data
Traditionally, the primary approach to addressing IT Operations issues has been monitoring metrics, which are structured data. However, semi-structured and unstructured data, such as logs and prior incident tickets, can help detect issues early and resolve problems based on prior resolutions. The rise of artificial intelligence (AI), powered by advancements in hardware architectures, cloud computing, natural language processing (via language models like BERT [J. Devlin et al., 2018] and fastText [R. Seyed et al., 2017]), machine learning (via deep learning (DL) algorithms and frameworks like TensorFlow and PyTorch) and deep neural network architecture optimization frameworks (like Katib), has opened up new opportunities for processing unstructured data.
We can now pre-train features in multiple languages using language models [X. Liu et al., 2020]. We can extract noun-verb phrases from prior incident tickets to identify resolutions [L. Chiticariu et al., 2010]. We can apply semantic parsing techniques to extract key terms and phrases to derive runbooks [P. Mohapatra et al., 2018] [A. Gupta et al., 2018]. Using these latest NLP techniques, we can now tap the potential of logs and tickets in the IT Operations domain to detect signals and problem resolutions.
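To make the noun-verb idea concrete, here is a minimal rule-based sketch of extracting entity-action pairs from a ticket resolution note. The word lists are hypothetical stand-ins for a trained POS tagger or the rule-based systems cited above, not the actual product implementation.

```python
import re

# Tiny hand-rolled lexicon standing in for a real POS tagger (illustrative
# words only; a production system would use a trained NLP model instead).
ACTION_VERBS = {"restart", "restarted", "clear", "cleared", "increase", "increased"}
ENTITY_NOUNS = {"pod", "service", "cache", "heap", "node"}

def extract_entity_actions(resolution_text: str):
    """Return (verb, noun) pairs found in a ticket resolution note."""
    tokens = re.findall(r"[a-z]+", resolution_text.lower())
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in ACTION_VERBS:
            # Look a few tokens ahead for the object of the action.
            for nxt in tokens[i + 1:i + 4]:
                if nxt in ENTITY_NOUNS:
                    pairs.append((tok, nxt))
                    break
    return pairs

print(extract_entity_actions("Restarted the payment pod and cleared the cache."))
# -> [('restarted', 'pod'), ('cleared', 'cache')]
```

Even this toy version shows why such pairs are useful to an SRE: "restarted pod" is a far quicker read than the full resolution narrative.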
From siloed signals to integrated context
By extracting mentions of the problem components (e.g., entities like application names, server names, pod ids, node ids, etc.) from various structured and unstructured data, we can connect the dots across IT data and create a holistic problem context. When combined with topology and causality reasoning, this correlation of data across various signals can help us create a full picture of the context around a problem, thereby facilitating better problem resolution.
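As a sketch of how shared entity mentions connect the dots across signals, the snippet below extracts resource identifiers from a log line and an alert with regular expressions. The patterns and identifiers are illustrative assumptions; real extractors are trained models.

```python
import re

# Hypothetical patterns for common resource identifiers; real extractors
# are trained models, but regexes illustrate the linking idea.
PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "pod": re.compile(r"\bpod[-/][\w-]+\b"),
}

def extract_entities(text: str):
    """Pull resource mentions out of a log line, alert, or event payload."""
    return {kind: pat.findall(text) for kind, pat in PATTERNS.items()}

log_line = "ERROR pod/checkout-7f9c failed health check from 10.0.3.17"
alert = "High latency on pod/checkout-7f9c"

# The shared 'pod/checkout-7f9c' mention is what lets us link the two
# signals into one problem context.
shared = set(extract_entities(log_line)["pod"]) & set(extract_entities(alert)["pod"])
print(shared)  # {'pod/checkout-7f9c'}
```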
From reactive to predictive and proactive
“Prevention is better than cure,” according to an old proverbial saying. In our view, IT Operations management and service management must include not only monitoring of business applications and optimizing incident management and problem resolution but also designing IT systems, applications and services, building them, testing them and deploying them with highest quality possible so as to avoid problems in the first place. In essence, design to operate better from the get-go.
We envision various stages of IT application development processes and tools for coding, building, testing, deploying and monitoring to be equipped with AI-infused smarts to guide developers, testers, deployment engineers and IT Operations engineers (also referred to as Site Reliability Engineers/SREs) to write secure, stable and scalable software to start with.
If problems were to still trickle through, which they might, as it may not be possible to catch every problem during code, build, test and deploy, we envision catching them at the end of each stage via risk prediction models that prevent poor-quality artifacts from being promoted to the next stage when they fail preset quality criteria. For example, smart checks and gates block code with risky security vulnerabilities from reaching deployment, stop under-tested code modules from entering the deployment phase, prevent risky deployments from being pushed to production, and so on.
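The gate idea can be sketched as a handful of pass/fail checks on artifact metrics. The field names and thresholds below are illustrative assumptions, not a real DevSecOps policy schema.

```python
# Hypothetical quality gates: each maps a metric name to a pass/fail check.
GATES = {
    "test_coverage_pct": lambda v: v >= 80,
    "critical_vulnerabilities": lambda v: v == 0,
    "change_risk_score": lambda v: v < 0.7,
}

def can_promote(artifact_metrics: dict) -> bool:
    """Block promotion to the next stage if any gate fails."""
    return all(check(artifact_metrics[name]) for name, check in GATES.items())

risky = {"test_coverage_pct": 62, "critical_vulnerabilities": 1, "change_risk_score": 0.9}
clean = {"test_coverage_pct": 91, "critical_vulnerabilities": 0, "change_risk_score": 0.2}
print(can_promote(risky), can_promote(clean))  # False True
```

In practice the risk score itself would come from a prediction model; the gate simply enforces the policy on top of it.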
We envision our AIOps solution correlating past incidents with root causes that could be traced to security vulnerabilities, poor code test coverage and under-tested deployment changes. This information, when fed back, serves as critical input to reinforcing the checks and gates in the earlier stages of the DevSecOps lifecycle, as shown in Figure 2:
In Table 2‑1, we describe some of the analytics that AI can drive in incident management use cases:
From first-generation to advanced AI model management
Prediction models built using machine learning are bound to make mistakes. They need to learn on the job and keep improving. It's one thing to build AI models and deploy them in production, but it's a whole different thing to build them so that they learn continuously and improve from fresh, fair, balanced and unbiased data, taking in user feedback as it comes. This requires disciplined error analysis after each iteration. Having an AI platform that supports managing the lifecycle of AI models is critical to keeping those models fresh and relevant.
Such a platform should support both the data scientists who build the initial models and the AIOps product and IT Operations tool administrators who have to maintain these AIOps tools in production. These IT Operations tool administrators are not data scientists, so care must be taken to ensure that the part of the AI model lifecycle management platform exposed to them doesn't expect them to be.
An AIOps platform must be set up to learn continuously from up-to-date data in your environment and to improve based on user feedback. In addition, AIOps products can't be black-box solutions. Companies deploying AI models demand full transparency into their inner workings for various reasons, including legal and regulatory concerns. IT Operations products employing AI models should give IT Operations administrators on-demand access to trigger retraining and examine model performance, even while providing for automatic retraining on a regular basis.
From discrete human handoffs to natural human-AI collaboration
As we discussed in the motivating scenario, we believe that delivering insights where people do their work avoids multiple unnecessary tool hops and disruptions for users. Believing in this principle that has been validated with user testing, we deliver insights both in a dashboard as well as in ChatOps environments (such as Slack and Microsoft Teams). Users can switch back and forth seamlessly from dashboards to ChatOps environments without tool hopping.
Compliance by design
We understand that companies need to be able to set their policies and preferences and have the AI and automation honor them. We envision AIOps products having a flexible policy management framework with which users can specify policies, rules and preferences to guide the AI and its insights. For example, if certain types of events don't need to be raised as alerts, users can specify those policies in the system. Similarly, certain types of self-resolving alerts don't need to be escalated to the level of an incident.
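A policy framework of this kind can be sketched as a rule table consulted before an event is surfaced. The rule format, event types and action names here are assumptions for illustration, not the product's policy schema.

```python
# Illustrative user-defined suppression rules (hypothetical schema).
SUPPRESSION_RULES = [
    {"event_type": "heartbeat_missed", "action": "drop"},
    {"event_type": "disk_cleanup_done", "action": "no_incident"},
]

def apply_policies(event: dict) -> str:
    """Return how an event should be handled under the configured policies."""
    for rule in SUPPRESSION_RULES:
        if event.get("event_type") == rule["event_type"]:
            return rule["action"]
    return "raise_alert"  # default: surface the event to users

print(apply_policies({"event_type": "heartbeat_missed"}))  # drop
print(apply_policies({"event_type": "cpu_saturation"}))    # raise_alert
```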
Journey to automation
AI-powered automation doesn't have to be an all-or-nothing phenomenon. Some things can be automated fully; others may need a human in the loop until trust in the automation is established. Whatever the case may be, we believe a solid platform foundation that supports automation is an essential part of AIOps. Automation functions, including runbook automation, process mining and analysis, and robotic process automation (RPA), are all integral parts of this platform. Building on top of such a platform enables us to elevate and connect AIOps insights with the business processes and applications that they monitor and support.
In the rest of the chapter, we give a brief summary of the AI that we are building into IBM’s product, IBM Cloud Pak® for Watson AIOps.
The AI in Watson AIOps
We are on a journey to realize the vision of AIOps mentioned in this chapter for solving the vexing operations management problems for IT operations engineers. This includes building the various AI analytics described in Table 2‑1. Our journey includes the development of a product called Watson AIOps for bringing what AI can offer to the forefront in optimizing IT Operations management.
AI pipelines in Watson AIOps [Figure 3] are designed to help SREs in detecting issues early, predicting them before they occur, reducing event and alert noise, locating the specific application or infrastructure component that is the source of the issue, determining the scope of incident impact and recommending relevant and timely actions. All these analytics help reduce the mean times to detect (MTTD), identify/isolate (MTTI) and resolve (MTTR) an incident.
Anomalies are predicted from logs and metrics using log and metric anomaly prediction AI models. The predicted anomalies and the other events and alerts generated in an IT environment are grouped into their corresponding incident buckets by Event Grouping AI models, which leverage various techniques, including entity linking and spatial, temporal and topological algorithms, to reduce event noise. Faults are diagnosed and localized by Fault Localization AI models. The set of impacted components is identified by Blast Radius AI models. Similar incidents from past incident records are identified and next-best actions are derived by Incident Similarity AI models. Finally, problems are avoided by predicting the risks associated with deployment and configuration changes via a Change Risk Prediction AI model. We present a brief glimpse of how these AI analytics can be realized:
Log anomaly prediction
An anomaly is something that deviates from normal, standard or expected behavior. Typically, organizations set either static thresholds or manual rules to define and manage deviations from normal behavior. The problem with static thresholds is twofold:
- It takes a long time for subject matter experts (SME) to distill them from their experience and to create them.
- They don’t adapt to changes and, therefore, tend to get outdated and irrelevant quickly.
If not updated or deleted, these manual rule-based anomaly definitions can start to flood SREs with irrelevant alerts. We use deep learning algorithms both to prepare features from logs during log parsing and to make anomaly predictions, so users don't have to set static thresholds or manual rules to detect anomalies.
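To illustrate what "learned" rather than static detection means, here is a deliberately simple statistical sketch: flag a log template whose count in the current window deviates sharply from its historical mean. This stands in for the deep learning models described above; the threshold and data are illustrative.

```python
import math
from collections import Counter

def zscore_anomalies(window_counts, history):
    """Flag log templates whose count in the current window deviates
    more than 3 standard deviations from their historical mean
    (a simplified stand-in for a learned anomaly model)."""
    anomalies = []
    for template, count in window_counts.items():
        past = history.get(template, [0])
        mean = sum(past) / len(past)
        std = math.sqrt(sum((x - mean) ** 2 for x in past) / len(past)) or 1.0
        if abs(count - mean) / std > 3:
            anomalies.append(template)
    return anomalies

# Per-window counts of two parsed log templates over past windows.
history = {"connection refused": [1, 0, 2, 1], "request served": [100, 98, 102, 99]}
current = Counter({"connection refused": 40, "request served": 101})
print(zscore_anomalies(current, history))  # ['connection refused']
```

The key point is that the "threshold" adapts to each template's own history instead of being hand-set once and left to rot.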
Metric anomaly prediction
Watson AIOps’ metric-based anomaly detection analyzes metrics data from various systems (e.g., New Relic, AppDynamics and SolarWinds) to automatically learn the normal behavior of metrics in your company and detect anomalies from those metrics. It employs a set of time-tested time-series algorithms (e.g., Granger Causality, Robust Bounds, Variant/Invariant, Finite Domain and Predominant Range) to capture seasonality and significant trends and to perform forecasting.
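As a toy stand-in for those time-series algorithms, the sketch below flags metric points that fall outside mean ± k·stdev of a preceding rolling window. The window size, multiplier and data are illustrative assumptions; the product's algorithms are far more sophisticated (seasonality, trends, forecasting).

```python
import statistics

def robust_bounds(series, window=5, k=3.0):
    """Flag points outside mean +/- k*stdev of the preceding window
    (a toy rolling-bounds detector, not the product's algorithms)."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mean = statistics.mean(past)
        stdev = statistics.pstdev(past) or 1e-9  # avoid division by zero
        if abs(series[i] - mean) > k * stdev:
            anomalies.append(i)
    return anomalies

cpu = [41, 43, 42, 44, 42, 43, 95, 42]  # spike at index 6
print(robust_bounds(cpu))  # [6]
```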
Event grouping and classification
An event indicates that something noteworthy has happened in an IT Operations environment; for example, an application has become unavailable or a disk is reaching capacity. The goal of event grouping and classification is to reduce the noise for IT Operations personnel and help them focus on the few important events that need their immediate attention. Anomalies detected from metrics, logs and tickets are grouped in Watson AIOps using multiple algorithms (e.g., temporal, spatial and association rule mining).
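The simplest of these, temporal grouping, can be sketched as clustering events whose timestamps fall close together. The gap threshold and event fields below are illustrative assumptions.

```python
def group_by_time(events, gap_seconds=60):
    """Group events whose timestamps fall within `gap_seconds` of the
    previous event in the same group (simple temporal clustering)."""
    groups = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if groups and event["ts"] - groups[-1][-1]["ts"] <= gap_seconds:
            groups[-1].append(event)  # close in time: same incident bucket
        else:
            groups.append([event])    # gap too large: start a new bucket
    return groups

events = [
    {"ts": 100, "msg": "disk 90% full"},
    {"ts": 130, "msg": "write latency high"},
    {"ts": 500, "msg": "app restarted"},
]
print(len(group_by_time(events)))  # 2 groups
```

Instead of three separate alerts, the SRE sees two buckets, one of which already pairs the disk event with its likely latency symptom.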
Static and dynamic topology management
Application and network topology refers to a map or diagram that lays out the connections between the mission-critical applications in an enterprise. Static topology refers to a map constructed from build and deployment information for applications and infrastructure components. Dynamic topology, on the other hand, refers to a map that captures resources and their relationships as the environment changes at run-time, providing near-real-time visibility into them.
With Topology Manager in Watson AIOps, you can compare the current topology with a historical one to answer questions such as “What happened?” and “What’s happening now?” It helps you investigate the details that led up to an incident and see the topology (and status) changes over time. In addition, faults are localized on topology.
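Answering "what changed?" amounts to diffing two topology snapshots. The sketch below uses a minimal node-to-neighbors dictionary as the snapshot format, an assumption for illustration only.

```python
def topology_diff(before: dict, after: dict):
    """Compare two topology snapshots (node -> set of neighbors) and
    report what changed between them."""
    added = {n for n in after if n not in before}
    removed = {n for n in before if n not in after}
    rewired = {n for n in before.keys() & after.keys() if before[n] != after[n]}
    return {"added": added, "removed": removed, "rewired": rewired}

# Hypothetical snapshots before and after an incident.
snapshot_t0 = {"web": {"api"}, "api": {"db"}, "db": set()}
snapshot_t1 = {"web": {"api"}, "api": {"db", "cache"}, "cache": set()}
print(topology_diff(snapshot_t0, snapshot_t1))
```

Here the diff immediately surfaces that a `cache` node appeared and `api` was rewired, exactly the kind of detail an SRE investigates when asking "what happened?"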
Fault localization and blast radius
Entity mentions are the names of resources (e.g., service or application component names, server names, server IP addresses, pod IDs, node IDs, etc.) referenced in anomalous logs, alerts, tickets and events. Once events are grouped, the entity mentions in anomalous logs, metrics, alerts and events are extracted. These entities are resolved against topological resources to isolate the problem and to place the identified entities on the dynamic topology instances that match the time at which the mentions were observed. Traversing the topological graph across the application, infrastructure and network layers enables us to map out the impacted components, known as the blast radius.
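The traversal itself can be sketched as a breadth-first search over a dependency graph, starting at the localized fault. The graph below is a hypothetical example; real topologies span many layers and resource types.

```python
from collections import deque

def blast_radius(topology: dict, fault_node: str):
    """Breadth-first traversal over a dependency graph (node -> dependents)
    to collect every component reachable from the faulty one."""
    impacted, queue = set(), deque([fault_node])
    while queue:
        node = queue.popleft()
        for dependent in topology.get(node, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

# Hypothetical dependency graph: an edge A -> B means B depends on A.
topology = {"db": ["api"], "api": ["web", "batch"], "web": [], "batch": []}
print(blast_radius(topology, "db"))
```

A fault localized to `db` thus yields a blast radius of `api`, `web` and `batch`, telling the SRE which services' users may be affected.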
Incident similarity and next-best-action recommendation
Watson AIOps ingests and mines prior incident ticket data by connecting to tools like ServiceNow to provide timely and relevant next-best-action recommendations for the problem currently being diagnosed. Current incident symptoms are framed as a query against the indexed ticket data, not only to search and retrieve the top-k relevant prior incident records, but also to extract important entity-action (noun-verb) phrases from each relevant record so that SREs get a quick glimpse of the suggested action. We apply various natural language processing techniques, including rule-based systems, to extract entity and action phrases.
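The retrieval step can be sketched with classic TF-IDF and cosine similarity over a tiny ticket corpus. This is a stand-in for the product's indexed search, with made-up tickets; a real system would use a proper search index and richer representations.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a small corpus (toy stand-in for an
    indexed ticket store)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    return [
        {t: c * math.log((1 + n) / (1 + df[t])) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

tickets = [
    "payment service timeout after deploy",
    "disk full on database node",
    "payment gateway timeout errors",
]
query = "timeout in payment service"

# Vectorize tickets and query together so document frequencies are shared.
vecs = tfidf_vectors(tickets + [query])
scores = [(cosine(vecs[-1], v), t) for v, t in zip(vecs[:-1], tickets)]
best = max(scores)[1]
print(best)  # payment service timeout after deploy
```

The top-ranked ticket would then be mined for its entity-action phrases, as described above, to surface a suggested next action.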
Insight delivery and action implementation
In Watson AIOps, all of the insights described above are delivered via both ChatOps and dashboards. Real-time, in-the-moment insights are delivered via ChatOps to SREs directly in the place where they work. Within ChatOps, there is functionality to interact with and share selected incident resolution suggestions with other collaborators, in addition to exploring the evidence behind the insights. From ChatOps, SREs can launch log, metric and ticket monitoring tools to explore further details. Similarly, SREs can launch interactive dashboards for detailed exploration of events, event groups, metric anomalies and topology. Applicable actions/runbooks can then be run automatically via runbook execution.
What’s next for AIOps?
As noted at the start of this chapter, we envision fully instrumented, observable, self-aware, automated and autonomic IT operations environments in the future. AI can help us get there.
We envision that AIOps solutions will not only help resolve issues in a reactive mode but also help prevent issues from happening in the first place by designing Development-Security-Operations (DevSecOps) lifecycle activities for efficient operations right from the get-go. For example, smart checks and gates block code with risky security vulnerabilities early in the lifecycle, stop under-tested code modules from reaching the deployment phase, prevent risky deployments from being pushed to production, and so on. Thus, by instituting feedback and feedforward loops in software development lifecycles [Figure 4], we can develop full end-to-end visibility and manage IT systems better. We can't wait to shape the future and take you all with us on this journey!
Acknowledgment of contributors
Our sincere thanks to the entire engineering, product management, design, research, sales, and engagement services teams who have helped shape our ideas for AI operations and the product.
Make sure you check out The Art of Automation podcast, especially Episode 1, in which I sit down with Jerry Cuomo to discuss RPA.
Check out the other chapters in the ongoing series, The Art of Automation:
The Art of Automation: Landing Page