AIOps: A Path to Reliability at Cloud Scale


How to address service reliability pain points, accelerate incident resolution and enhance service reliability with AIOps.

Building cloud native applications as a collection of smaller, self-contained microservices has helped organizations become more agile and deliver new features at higher velocity. Deployed to Kubernetes, these independent units are easier to update and scale than traditional monolithic applications.

But with this app development flexibility comes increased complexity for the operations team when problems arise: 

  • How do they narrow down the list of components responsible for an outage? 
  • What tools are best for investigating the cause? 
  • How do they correlate logs and telemetry data, emitted by loosely coupled components and collected by disparate tools, into a single source of truth? 
  • Which components depend on which others, and where do they fit in the overall topology?

This is where AI applied to IT operations, or AIOps, enters the picture.

Quick history: How movement from traditional to cloud native development impacts IT operations

The adoption of cloud native architecture has a significant impact on the granularity of service components versus traditional monolithic architectures, but the high-level software development lifecycle remains largely the same, namely: Build, Deploy, Run, Manage. 

Below are the four primary attributes (the "4D" attributes) of a cloud native architecture, each mapped to a phase of the software development lifecycle:

  • Diverse [Build]: Supports multiple programming models, language stacks and data models.
  • Distributed [Deploy]: Composed of many microservices packaged in containers and communicating over the network.
  • Dynamic [Run]: Deployed to an ephemeral cloud-based infrastructure. 
  • Decentralized [Manage]: Governed by autonomous teams who use a wide array of tools of their own choice.

The diagram below depicts the software artifacts affected by the proliferation of cloud native architectures, before and after:

Figure 1: Evolving IT landscape.  


A developer can certainly appreciate why these qualities make their job easier. Polyglot programming means they can pick the best language for the task. The de facto assumption of distributed microservices makes it easier for independently working teams to develop and update services, increasing the velocity of new feature development. Similarly, the expectation that the deployment environment is both dynamic and decentralized means developers are less likely to be delayed waiting for other teams to deliver.

While the above "4D" attributes help developer productivity and velocity in the build/deploy phases of the software development lifecycle, they introduce potential complexities in the subsequent run/manage phases. Let's consider the same 4D attributes, this time with the corresponding demands from an operational perspective for cloud native apps:

  • Diverse: Many technology stacks, hence more interoperability and integration issues. 
  • Distributed: More logs/traces, more opportunities for failure and more things to investigate. 
  • Dynamic: Topology keeps changing with time; going back in time to analyze issues is hard. 
  • Decentralized: Getting an aggregated view across loosely integrated tools is challenging.

As summarized in the diagram below, decreasing app development friction early in the build/deploy phases of the software development lifecycle has shifted work to later run/manage phases by increasing operational complexity:

Figure 2: Shifts in complexity from Dev to Ops.


In other words, some of the work of DevOps has shifted from "Dev" to "Ops." So how does a smart operations team address this potential burden? Following the time-honored tradition of IT, we introduce a smarter, more modern tool for the job: AIOps.

AIOps?! What, another tool? But we already have too many!

During the early days of large-scale enterprise Java deployments, more than a decade ago, there was a dearth of tools. Application performance was truly opaque. Problems were solved by physically collecting and visually scanning logs to look for errors and then manually correlating events across related trace files. The environments were simple: cluster sizes were small and deployments were relatively static.

Over time, deployments grew in size and shape, morphing from physical to virtualized environments. Operational complexity grew, too. Incident resolution was no longer isolated to well-known components.

Specialized tools in the IT service management (ITSM) domain arose to address the pain points around distributed application management. Tools ranged from application performance management, log aggregators and monitoring tools to IT operational analytics and service desk management. As depicted below, these tools — plus all the native product-specific monitoring mechanisms — were collectively stitched together to form an ad hoc monitoring suite: 

Figure 3: Tool sprawl and operational silos.


These best-of-breed tools, although very good at solving specific problems, led to narrowly scoped operational silos and a fragmented view of the enterprise infrastructure. They failed to provide the end-to-end visibility necessary to solve the complex problems of rapidly evolving enterprise workloads. Many of the existing tools employ a rules-based approach to monitoring and alerting. In an increasingly complex and dynamic IT environment, a rules-based approach is fragile, costly to maintain and difficult to scale.

Addressing the impact on incident resolution times and service reliability

Digital services have become increasingly important for businesses and consumers alike. Service reliability matters now more than ever; it's not exaggerating to say it's critical for business success. If not managed proactively, lapses in service can pose an existential competitive threat to an organization. 

Beyond the technical considerations noted above, several organizational factors can slow incident resolution. If they are not addressed as part of an organization's adoption of cloud native architectures, service reliability will suffer: 

  • Ratio of IT Dev vs. Ops professionals: Although many organizations are beginning to embrace DevOps methodologies, most organizations have far more developers than operations engineers. Popular community surveys put the Dev:Ops ratio between 15:1 and 25:1. 
  • IT budgets are relatively static: Investments favor developing new features while smooth operationalization of these services is treated as an afterthought. Too often, issues are resolved through expensive finger-pointing war rooms that are exacerbated by hidden pockets of tribal knowledge. 
  • Interoperability between new and traditional systems: Newer cloud native systems-of-engagement apps (~20%) still need to interoperate with legacy systems of record (~80%), where most of the enterprise institutional knowledge resides. The friction between these two very different operational models imposes an additional burden on the Site Reliability Engineering (SRE) and ITOps personnel.

Given all the above, it's not surprising Ops professionals struggle to find time for skills development. 

These factors point to a gap in the current tools and confirm the need for a more modern approach to drive operational efficiency. We need to equip our ITOps and SREs with a more effective toolset for them to be productive in this rapidly evolving IT landscape.

Top 5 technical pain points that impact service reliability

Based on the discussion above and the various contributing factors, listed below are the top 5 technical operational pain points that impact service reliability (i.e., availability, performance and serviceability of an application). 

  1. Increasingly complex deployments: Because legacy systems of record may not evolve at the same pace as newer cloud native apps, the expectation of seamless interoperability between them can be difficult to achieve.
  2. Too many alerts: More dynamic, loosely coupled, distributed components lead to more incidents, uncorrelated events, false positives and ultimately "alert fatigue" that may tempt operations to just ignore warning signs.
  3. Lack of visibility and very reactive: Too many niche tools, good at what they do, lack the ability to correlate related events across silos, leading to a fragmented and "rear-view-mirror" perspective of the operational environment. 
  4. Tedious root cause analysis: Operations teams need cross-tier contextualization of events to identify the root cause, and without it the time to resolution suffers. Many existing tools lack meaningful problem determination assistance.
  5. Manual remediation: The problem remediation workflow involves too many manual hand-offs. Numerous and often redundant tickets can be distracting and inefficient.
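
To make the "too many alerts" pain point concrete, here is a minimal sketch of alert grouping. It is illustrative only and does not describe how any particular product works: the `Alert` fields, the `group_alerts` helper, the component names and the 300-second window are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point (hypothetical)
    component: str    # hypothetical component name
    symptom: str      # e.g. "high_latency"


def group_alerts(alerts, window_seconds=300.0):
    """Collapse alerts that share a component and symptom within a time window.

    Returns a list of groups; each group is a list of alerts that an
    operator could treat as a single notification.
    """
    groups = []
    open_group = {}  # (component, symptom) -> index of the latest group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.component, alert.symptom)
        idx = open_group.get(key)
        # Extend the latest group if this alert follows its last member closely.
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            open_group[key] = len(groups)
            groups.append([alert])
    return groups


alerts = [
    Alert(0, "checkout", "high_latency"),
    Alert(60, "checkout", "high_latency"),
    Alert(90, "payments", "error_rate"),
    Alert(120, "checkout", "high_latency"),
    Alert(1000, "checkout", "high_latency"),
]
groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "notifications")
```

With this made-up input, five raw alerts collapse into three notifications. A fixed window keyed on exact field matches is the kind of simple rule that becomes hard to maintain as environments grow, which is part of the motivation for learned correlation.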

In the next section, I'll cover how the IBM Cloud Pak® for Watson AIOps sets out to address these pain points. 

AI and automation to deliver greater reliability with less risk

IBM Cloud Pak for Watson AIOps enables organizations to predict, communicate and resolve IT events before they become serious or impact the end-user. 

IBM Cloud Pak for Watson AIOps applies artificial intelligence (AI) and machine learning to structured and unstructured data from application logs and telemetry data generated by the disparate set of IT management tools. It can then analyze, prioritize and provide insights into IT incidents as they emerge in near real-time. In addition, by pinpointing faulty components and the root cause of failures, it derives a likely impact assessment on related components. Finally, it recommends short-term remedies or long-term resolutions, based on past incident history, that can often be applied in an automated fashion:

Figure 4: The IBM Cloud Pak for Watson AIOps — a high-level view.


IBM Cloud Pak for Watson AIOps is composed of six broad functional components, as depicted below:

Figure 5: The IBM Cloud Pak for Watson AIOps — a functional view.


Below is a brief explanation of each functional component:

  1. Data ingestion: IBM Cloud Pak for Watson AIOps supports connectors that are able to connect, observe and ingest relevant data from a variety of application and infrastructure components, in large volumes and at high velocity. These various data types form the key input to IBM Cloud Pak for Watson AIOps. The data can be structured (e.g., metrics and topology), unstructured (e.g., tickets and chats) or semi-structured (e.g., logs and traces). 
  2. Anomaly detection: IBM Cloud Pak for Watson AIOps provides algorithms and machine learning models based on unsupervised learning methods, such as clustering and principal component analysis, to detect abnormal patterns in log and metric input streams. This log anomaly and metric anomaly detection serves as an early warning indicator of developing issues, a sort of "check engine light" that can enable the SRE to take proactive remedial action.
  3. Correlation and contextualization: This is the core of IBM Cloud Pak for Watson AIOps. It ingests log anomalies, metric anomalies, external alerts and events, along with real-time topological information. It then constructs a holistic understanding of a potential or ongoing incident based on AI-driven reasoning. The dynamic topology information establishes point-in-time software-to-infrastructure mapping, which is especially helpful when dealing with ephemeral infrastructures such as Kubernetes and cloud platforms.
  4. Visualization and resolution: The emerging incidents that need an SRE’s attention are extracted and visualized in near real-time via a ChatOps interface like Slack or Microsoft Teams. From ChatOps, SREs can launch in context into the originating tool to further analyze issues in logs, metrics or tickets. IBM Cloud Pak for Watson AIOps points out the originating faulty component and the set of dependent components potentially impacted by the incident. 
  5. Collaboration and automation: Based on a historical analysis of similar incidents, IBM Cloud Pak for Watson AIOps suggests possible next-best-actions to remedy the incident at hand. It also points to runbooks or other pre-defined remedial actions that can be executed or, when necessary, drives an intelligent workflow to facilitate collaborative resolution. 
  6. AI and SDLC governance: IBM Cloud Pak for Watson AIOps provides rich tooling to manage all facets of the AIOps lifecycle, from model training to execution. It leverages AI to analyze the impact of change requests and proactively assess the potential risk of an impending change before actually deploying it to production. In addition, it provides transparency into the rationale of AI-driven decisions through explainability, which is helpful to SREs and often required for enterprise audit and compliance purposes. 
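
The "check engine light" idea behind metric anomaly detection can be illustrated with a deliberately simple stand-in. The sketch below uses a rolling z-score baseline rather than the clustering or principal component analysis mentioned above, and it is not the product's algorithm. The `zscore_anomalies` function, its parameters and the sample data are hypothetical.

```python
from statistics import mean, stdev


def zscore_anomalies(values, baseline_size=10, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding baseline.

    Each point is compared with the mean and sample standard deviation of
    the `baseline_size` points before it, so no labelled data is needed.
    """
    anomalies = []
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: flag any change at all.
            is_anomaly = values[i] != mu
        else:
            is_anomaly = abs(values[i] - mu) / sigma > threshold
        if is_anomaly:
            anomalies.append(i)
    return anomalies


# Synthetic response times in milliseconds, with one spike at index 12.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 250, 100, 102]
print(zscore_anomalies(latency_ms))
```

On this synthetic series only the spike at index 12 is flagged. Unlike a fixed threshold, the baseline adapts to whatever level the metric normally runs at, which is the property that matters in a dynamic environment.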

How the IBM Cloud Pak for Watson AIOps accelerates incident resolution and enhances service reliability

From the viewpoint of today's typical user, any outage is critical. Faster incident resolution improves overall service reliability. Service reliability is represented and quantified via the popular mean time to repair (MTTR) metric. MTTR is an aggregate metric composed of five other mean times (MTT*): 

  1. Detect 
  2. Acknowledge
  3. Identify
  4. Fix
  5. Verify

These five steps reflect the sequence of an incident resolution process.  
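
As a rough sketch of how the aggregate relates to its parts, the example below computes a mean time for each of the five steps and sums them. It assumes the steps are sequential and do not overlap, as described above. The incident durations are invented for illustration, and the names `PHASES`, `phase_means` and `mttr` are hypothetical.

```python
from statistics import mean

PHASES = ("detect", "acknowledge", "identify", "fix", "verify")

# Hypothetical incidents: minutes spent in each phase.
incidents = [
    {"detect": 5, "acknowledge": 3, "identify": 40, "fix": 10, "verify": 7},
    {"detect": 8, "acknowledge": 2, "identify": 90, "fix": 10, "verify": 5},
    {"detect": 2, "acknowledge": 4, "identify": 20, "fix": 10, "verify": 9},
]


def phase_means(records):
    """Mean minutes per phase across all incidents."""
    return {phase: mean(r[phase] for r in records) for phase in PHASES}


def mttr(records):
    """MTTR as the sum of the per-phase means."""
    return sum(phase_means(records).values())


means = phase_means(incidents)
print(means)
print("MTTR (minutes):", mttr(incidents))
```

In this made-up data the identify phase averages 50 of the 75 minutes and ranges from 20 to 90, while the fix phase is constant at 10.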

From the operations team's viewpoint, pinpointing the cause is the main focus. The mean time to identify (MTTI) can vary dramatically from one incident to the next, whereas other steps, such as the mean time to fix (MTTF), are nearly constant. (Here MTTF denotes mean time to fix, not mean time to failure, the more common expansion of the abbreviation.)

This is how AI and machine learning can be best put to use in operations — cutting down the elapsed incident resolution time by providing insights into the cause, as depicted in Figure 6 below:

Figure 6: Reducing incident resolution time with AIOps.


Below is a summary of how IBM Cloud Pak for Watson AIOps accelerates each step of the incident resolution process:

  1. Detect [MTTD]: Data Ingestion from all relevant and disparate data sources improves observability and saves the SRE from having to manually aggregate various pieces of relevant telemetry data.
  2. Acknowledge [MTTA]: The Anomaly Detection capabilities serve as leading indicators of problems.  
  3. Identify [MTTI]: Event Correlation reduces noise and the Contextualization helps to point to root cause, helping SREs with their primary challenge of establishing the source of a failure.
  4. Fix [MTTF]: Once the root cause is established, applying the resolution takes a relatively fixed amount of time, but IBM Cloud Pak for Watson AIOps provides suggestions on how to resolve the current incident.
  5. Verify [MTTV]: IBM Cloud Pak for Watson AIOps provides two key capabilities: (1) it can help with automating the remedial actions and associated verification testing, and (2) it assesses the impact of a fix before it is actually applied. This saves validation time and reduces the risk of regressions.
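
To illustrate how topology can help with the Identify step, here is a minimal sketch over a small dependency graph. It is a simplification under stated assumptions, not the product's AI-driven reasoning: the service names, the `DEPENDS_ON` map and both helper functions are hypothetical, and the heuristic simply treats an alerting component with no alerting dependencies as a likely origin.

```python
# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}


def root_cause_candidates(depends_on, alerting):
    """Alerting components none of whose direct dependencies are alerting."""
    return sorted(
        c for c in alerting
        if not any(dep in alerting for dep in depends_on.get(c, []))
    )


def impacted_by(depends_on, component):
    """All components that depend, directly or transitively, on `component`."""
    # Invert the graph: dependency -> set of direct dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    impacted, stack = set(), [component]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return sorted(impacted)


alerting = {"web", "checkout", "payments"}
print(root_cause_candidates(DEPENDS_ON, alerting))
print(impacted_by(DEPENDS_ON, "payments"))
```

Here three alerting services reduce to a single candidate, `payments`, with `checkout` and `web` reported as the components it may affect. In an ephemeral environment the dependency map would need to reflect the topology as it was when the alerts fired, which is why point-in-time topology matters.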

Collectively, these benefits will help drive operational efficiency and ITOps productivity and, most importantly, enhance the reliability posture of an organization. They can help your business be more successful and competitive, contributing quantifiable benefits to the bottom line while enhancing your client experience, brand image and employee satisfaction.

Closing thoughts and next steps 

As described, AIOps addresses an emerging gap in capabilities required to manage the increasingly complex IT landscape. The sheer volume of data and the dynamic nature of deployments necessitate a modern approach to IT operations. 

The IBM Cloud Pak for Watson AIOps is a platform that is specifically designed to address the pain points highlighted in this post. By ingesting telemetry data from a broad collection of sources, detecting abnormal patterns across logs and metrics early, correlating and contextualizing events, recognizing topology data across operational silos and delivering actionable insights, IBM Cloud Pak for Watson AIOps can help SREs accelerate root cause analysis and resolve incidents as they emerge in real-time.

To learn more about the specifics of IBM Cloud Pak for Watson AIOps, check out this video collection: 

They're short explanations of how the IBM Cloud Pak for Watson AIOps can reduce incident resolution time for your operations team and even detect/resolve problems before they are reported by end users. And, of course, if you have specific questions about the IBM Cloud Pak for Watson AIOps for your organization, you can schedule a call with one of our IBM automation experts.
