It is an established fact in the IT industry that change is one of the biggest contributors to service outages.

As more enterprises migrate their applications to cloud-native deployments and adopt automated build and deployment pipelines, the volume and rate of change have increased significantly. That makes it difficult for Site Reliability Engineers (SREs) to keep assessing the risk of each change manually, as they traditionally have.

The Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps complements SREs’ skills and knowledge by alerting them to a possible problematic change and presenting historical evidence from their own or someone else’s prior experience. This capability helps SREs increase their efficiency and enables them to maintain high service quality in this fast-moving environment.

Building an automated system for change risk assessment is challenging. While many specific techniques for risk evaluation have been proposed, these generic methods cannot be applied directly to change risk assessment. In this article, we describe the data used for our experiments, highlight the challenges and provide a methodology for addressing them.

Data

We use historical change and incident records that have the following information:

  • A single change record typically captures attributes like change number, change title, change description, change purpose, change environment, change team, closure code, backout plan, close notes and configuration items.
  • An incident record typically captures information like incident number, incident title, incident description, opened date, incident severity, impacted configuration item(s), outage start time, outage end time, incident state, resolution description and caused by change (change ID and details of change if incident is induced by change).

Challenge 1

Change record datasets are highly imbalanced and extremely noisy, which is a problem for most machine learning methods. Although many major incidents are caused by changes, most changes do not cause incidents. In general, changes happen frequently but incidents are rare, so the percentage of all changes that cause incidents is very low. We consider a change problematic if the SRE was not able to deploy it (e.g., the change failed), if it induced an incident during deployment or if it was deployed successfully but subsequently caused an incident. This study is based on 227.7K change records gathered over a period of seven months. Only 2.1% of the changes are marked as problematic.
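To make the labeling rule and the scale of the imbalance concrete, here is a minimal sketch. The `ChangeOutcome` class and its field names are hypothetical and only mirror the three conditions listed above; the two figures are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class ChangeOutcome:
    """Hypothetical summary of how one change turned out."""
    deploy_failed: bool           # the SRE could not deploy the change
    incident_during_deploy: bool  # an incident was induced during deployment
    incident_after_deploy: bool   # deployed successfully, later caused an incident


def is_problematic(outcome: ChangeOutcome) -> int:
    """Return 1 if any of the three conditions holds, else 0."""
    return int(
        outcome.deploy_failed
        or outcome.incident_during_deploy
        or outcome.incident_after_deploy
    )


# Scale of the imbalance, using the figures quoted in the text.
total_changes = 227_700
problematic_share = 0.021
approx_problematic = round(total_changes * problematic_share)
approx_successful = total_changes - approx_problematic
imbalance_ratio = approx_successful / approx_problematic

print(approx_problematic, round(imbalance_ratio, 1))
```

Under these figures, fewer than 5,000 of the 227.7K changes carry the positive label, which is roughly one problematic change for every 47 successful ones.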

Furthermore, only a small number of incidents caused by change include explicit references to inducing changes, making the set of problematic changes incomplete and unusable “as is” as the ground truth unless it is extended with implicit linkages.

To address this challenge and create the set of problematic changes used for training the change risk model, we need to identify the implicit linkages between change and incident records. Because change records and incident records capture different kinds of information, standard similarity measures like cosine similarity do not help to discover implicit linkages between the two sets of records. We implement a four-step, semi-supervised, learning-based approach that leverages the explicit linkages to discover additional implicit linkages.

These are the four steps:

  1. Identifying explicit linkages between change and incident records.
  2. Generating all possible candidate change-incident pairs (implicit linkages).
  3. Computing linkage strength.
  4. Determining optimal linkage strength cutoff.
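The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the method used in the product: the article does not say how linkage strength is computed or how the cutoff is chosen. Here the strength is a placeholder score (configuration-item overlap discounted by the time gap), candidates are limited to an assumed two-day window, and the cutoff is simply the weakest strength observed among the explicit linkages. All class, function and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(days=2)  # assumed look-back window, not from the article


@dataclass(frozen=True)
class Change:
    change_id: str
    closed_at: datetime
    config_items: frozenset


@dataclass(frozen=True)
class Incident:
    incident_id: str
    opened_at: datetime
    config_items: frozenset
    caused_by_change: Optional[str] = None  # explicit reference, often missing


def explicit_links(incidents):
    """Step 1: (change_id, incident_id) pairs recorded on the incident itself."""
    return {(i.caused_by_change, i.incident_id) for i in incidents if i.caused_by_change}


def candidate_pairs(changes, incidents):
    """Step 2: every change closed within WINDOW before an incident opened."""
    return [(c, i) for c in changes for i in incidents
            if timedelta(0) <= i.opened_at - c.closed_at <= WINDOW]


def linkage_strength(change, incident):
    """Step 3 (placeholder score): configuration-item overlap, discounted by time gap."""
    union = change.config_items | incident.config_items
    overlap = len(change.config_items & incident.config_items) / len(union) if union else 0.0
    gap = (incident.opened_at - change.closed_at) / WINDOW  # between 0.0 and 1.0
    return overlap * (1.0 - gap)


def choose_cutoff(scores, explicit):
    """Step 4 (placeholder rule): weakest strength seen among the explicit links."""
    known = [s for pair, s in scores.items() if pair in explicit]
    return min(known) if known else 1.0


def problematic_changes(changes, incidents):
    """Explicit links plus implicit links whose strength exceeds the cutoff."""
    explicit = explicit_links(incidents)
    scores = {(c.change_id, i.incident_id): linkage_strength(c, i)
              for c, i in candidate_pairs(changes, incidents)}
    cutoff = choose_cutoff(scores, explicit)
    implicit = {pair for pair, s in scores.items() if pair not in explicit and s > cutoff}
    return {change_id for change_id, _ in explicit | implicit}


changes = [
    Change("CHG1", datetime(2021, 1, 10, 10), frozenset({"db01", "app01"})),
    Change("CHG2", datetime(2021, 1, 10, 12), frozenset({"app02"})),
    Change("CHG3", datetime(2021, 1, 11, 9), frozenset({"db01"})),
]
incidents = [
    Incident("INC1", datetime(2021, 1, 10, 16), frozenset({"db01", "app01"}), "CHG1"),
    Incident("INC2", datetime(2021, 1, 11, 10), frozenset({"db01"})),
    Incident("INC3", datetime(2021, 1, 10, 18), frozenset({"app02", "app03"})),
]
print(sorted(problematic_changes(changes, incidents)))  # ['CHG1', 'CHG3']
```

In this toy data only INC1 names its inducing change. CHG3 is added as an implicit linkage because it touched the same configuration item one hour before INC2 opened, which scores above the cutoff set by the known CHG1 to INC1 link.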

The summary of results is shown in the figure below. The implicit linkages with strength higher than the value of the cutoff (the dotted line) are added to the explicit linkages to complete the set of problematic changes to be used for training the change risk model:

Challenge 2

If a change is identified as potentially problematic, that information alone has limited value. To make it actionable, we need to provide a reason. We are determining risk level based on a model that is trained with past changes, so one would reasonably expect that some subset of the problematic changes in that set must be similar enough to the newly identified “risky” change. These similar changes together with their Root Cause Analysis (RCA) reports would be useful in creating an explanation.
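One simple way to surface such similar changes is to rank past problematic changes by textual similarity to the new change and return their RCA summaries. The sketch below assumes cosine similarity over token counts, which is our choice for illustration; the article does not state which similarity measure is used for explanations. The record layout and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(new_text, past_problematic, k=2):
    """Return the k past records most similar to new_text.

    past_problematic is a list of (change_id, change_text, rca_summary) tuples.
    """
    query = Counter(tokenize(new_text))
    return sorted(
        past_problematic,
        key=lambda rec: cosine(query, Counter(tokenize(rec[1]))),
        reverse=True,
    )[:k]


past = [
    ("CHG100", "Upgrade database schema on production cluster",
     "Migration script locked tables"),
    ("CHG200", "Rotate TLS certificates on load balancer",
     "Expired intermediate certificate"),
    ("CHG300", "Patch operating system on build agents",
     "Agents entered a reboot loop"),
]
for change_id, _, rca in most_similar("Apply database schema migration to production", past, k=1):
    print(change_id, "-", rca)
```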

With the ground truth completed by the implicit change-incident linkages, and with model performance metrics chosen to match the deployment setting, we train multiple classification models that separate problematic changes from successful ones. We extract the features for training the change risk classifier from the pre-processed change text. In addition to the change text, we also use structured fields like change environment and change team. The dependent variable for the binary classification is “problematic”, which takes a value of 0 or 1.

We use three methods for feature extraction from the change text to train different classifiers, as listed below:

  • Bag of words representation
  • Sequential representation using pre-trained word embeddings for training
  • Concatenated representation
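As an illustration of the first method, the sketch below builds a bag-of-words vector from the change text and appends one-hot encodings of the structured fields mentioned earlier (change environment and change team). It is a simplified stand-in written with the standard library only; the dictionary keys, helper names and sample changes are hypothetical, and the actual pre-processing and the other two representations are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Sorted list of every distinct token seen in the training texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def bag_of_words(text, vocabulary):
    """Token counts laid out in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


def one_hot(value, categories):
    """1 in the position of the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


def featurize(change, vocabulary, environments, teams):
    """Text features followed by the encoded structured fields."""
    return (bag_of_words(change["text"], vocabulary)
            + one_hot(change["environment"], environments)
            + one_hot(change["team"], teams))


sample_changes = [
    {"text": "Restart payment service", "environment": "production", "team": "payments"},
    {"text": "Restart cache service in staging", "environment": "staging", "team": "platform"},
]
vocabulary = build_vocabulary(c["text"] for c in sample_changes)
environments = sorted({c["environment"] for c in sample_changes})
teams = sorted({c["team"] for c in sample_changes})

for change in sample_changes:
    print(featurize(change, vocabulary, environments, teams))
```

Each resulting vector has one position per vocabulary token plus one per environment and team value, and can be fed to any standard classifier together with the 0 or 1 label.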

Evaluation

Table 1 compares the performance of the classifiers without handling the class imbalance. Precision is high, but recall is low, with the best value at around 0.69. This approach suits an automated deployment setting, where an alert is sent to an SRE and the decision to deploy the change is deferred to that SRE. The SVM-based classifier outperforms the other classifiers, with the highest F0.5 score of 0.88 and near-perfect precision of 0.98.

Table 2 compares the performance of the classifiers when algorithm-level methods are used to handle the class imbalance. Recall increases significantly. This approach suits a manual deployment setting: although precision drops significantly, the change-risk alerts add information to the due diligence that SREs already perform during the manual change deployment process. The LSTM3 classifier outperforms all other classifiers in this scenario, with the highest F2.0 score of 0.72 and a recall of 0.82.
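The F0.5 and F2.0 scores used above are instances of the general F-beta score, which weights recall beta times as heavily as precision. A minimal sketch of the standard formula follows; the precision and recall values in the example are made up for illustration and are not taken from the tables.

```python
def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta < 1 favors precision (F0.5); beta > 1 favors recall (F2.0).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier with high precision and modest recall.
precision, recall = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta}: {f_beta(precision, recall, beta):.3f}")
```

For a classifier whose precision exceeds its recall, as in this example, the F0.5 score is the highest of the three and the F2.0 score the lowest. That is why F0.5 fits the automated setting, where false alerts are costly, and F2.0 fits the manual setting, where missing a risky change is the larger concern.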

Table 1: Change Risk model performance without class weights balancing.

Table 2: Change Risk model performance using class weights balancing.
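Class weights balancing, as named in the Table 2 caption, makes errors on the rare class count more during training. The article does not state the exact weighting scheme, so the sketch below assumes a common heuristic, n_samples / (n_classes * class_count), which is the same rule scikit-learn applies for `class_weight="balanced"`. The label distribution in the example is synthetic and only mirrors the 2.1% share of problematic changes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight[c] = n_samples / (n_classes * count[c])
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Synthetic labels: 21 problematic changes out of 1,000 (2.1%).
labels = [0] * 979 + [1] * 21
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})
```

With this distribution, a misclassified problematic change weighs roughly 47 times as much as a misclassified successful one, which pushes the classifier toward higher recall at the expense of precision.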

The combination of a semi-supervised learning technique for discovering implicit linkages between change and incident records and a set of supervised learning techniques for change risk assessment has yielded good prediction performance, as shown in the results above.

Learn more

In this article, we described a novel methodology that underpins the Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps. Automated change risk assessment allows SREs to focus on the changes that truly require their attention, improving reliability, performance or utilization while reducing the time they spend on toil.

Be sure to check out this capability in tech preview coming on March 31, 2021.

Want to know what other capabilities are included in the IBM Cloud Pak for Watson AIOps? Join this webinar to learn more.
