Steps to Build an Automated System for Change Risk Assessment

It is an established fact in the IT industry that change is one of the biggest contributors to service outages.

With more enterprises migrating their applications to cloud-native deployment and using automated build and deployment pipelines, the volume and rate of change have increased significantly. That makes it difficult for Site Reliability Engineers (SREs) to assess the risk of each change manually, as traditional methods require.

The Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps complements SREs’ skills and knowledge by alerting them to a possible problematic change and presenting historical evidence from their own or someone else’s prior experience. This capability helps SREs increase their efficiency and enables them to maintain high service quality in this fast-moving environment.

Building an automated system for change risk assessment is challenging. While many specific techniques for risk evaluation have been proposed, these generic methods cannot be applied directly to change risk assessment. In this article, we describe the data used for experiments, highlight the challenges and provide a methodology for addressing them.

Data

We use historical change and incident records that have the following information:

A single change record typically captures attributes like change number, change title, change description, change purpose, change environment, change team, closure code, backout plan, close notes and configuration items.
An incident record typically captures information like incident number, incident title, incident description, opened date, incident severity, impacted configuration item(s), outage start time, outage end time, incident state, resolution description and caused by change (change ID and details of change if incident is induced by change).

Challenge 1

Change record datasets are highly imbalanced and too noisy for most machine learning methods. Although many major incidents are caused by changes, most changes do not cause incidents. In general, changes happen frequently but incidents are rare, so the percentage of all changes that cause incidents is very low. We consider a change problematic if the SRE was not able to deploy it (e.g., the change failed), if it induced an incident during deployment, or if it was deployed successfully but subsequently caused an incident. This study is based on 227.7K change records gathered over a period of seven months. Only 2.1% of the changes are marked as problematic.
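
To make the labeling rule and the scale of the imbalance concrete, here is a minimal Python sketch. The field names in ChangeOutcome are hypothetical and are not taken from any real change-management schema; only the record count and the 2.1% rate come from the figures above.

```python
from dataclasses import dataclass


@dataclass
class ChangeOutcome:
    # Hypothetical fields; real change records use their own schema.
    deployed: bool                # False if the change failed to deploy
    incident_during_deploy: bool  # an incident was induced during deployment
    incident_after_deploy: bool   # deployed fine, but caused an incident later


def is_problematic(change: ChangeOutcome) -> bool:
    """A change is problematic if any of the three conditions holds."""
    return (
        not change.deployed
        or change.incident_during_deploy
        or change.incident_after_deploy
    )


total_changes = 227_700      # 227.7K records, as reported above
problematic_rate = 0.021     # 2.1% marked as problematic

approx_problematic = round(total_changes * problematic_rate)
successful_per_problematic = (1 - problematic_rate) / problematic_rate

print(approx_problematic)                    # roughly 4.8K problematic changes
print(round(successful_per_problematic, 1))  # successful changes per problematic one
```

At this rate there are roughly 47 successful changes for every problematic one, so a classifier that always predicts "successful" would look accurate while flagging nothing.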

Furthermore, only a small number of incidents caused by change include explicit references to the inducing changes. The set of problematic changes is therefore incomplete and unusable “as is” as the ground truth unless it is extended with implicit linkages.

To address this challenge and create the set of problematic changes used to train the change risk model, we need to identify the implicit linkages between change and incident records. Because change and incident records capture different kinds of information, standard similarity measures such as cosine similarity do not help discover implicit linkages between the two sets of records. We implement a four-step, semi-supervised learning approach that leverages the explicit linkages to discover additional implicit ones.

These are the four steps:

Identifying explicit linkages between change and incident records.
Generating all possible candidate change-incident pairs (implicit linkages).
Computing linkage strength.
Determining optimal linkage strength cutoff.
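
The four steps can be outlined in Python as follows. This is an illustrative sketch only: the record fields, the 48-hour window, the toy_strength score and the cutoff heuristic are all assumptions made for this example. They stand in for the learned linkage strength and the cutoff selection, whose details are not given in this article.

```python
import math
from datetime import datetime, timedelta

WINDOW = timedelta(hours=48)  # assumed look-ahead window, not from the article


def explicit_links(incidents):
    """Step 1: pairs already recorded in the incident's caused-by-change field."""
    return {(i["caused_by"], i["id"]) for i in incidents if i.get("caused_by")}


def candidate_pairs(changes, incidents):
    """Step 2: pair each change with every incident opened within WINDOW after it."""
    return [(c, i) for c in changes for i in incidents
            if timedelta(0) <= i["opened"] - c["deployed"] <= WINDOW]


def toy_strength(change, incident):
    """Step 3 stand-in: shared configuration item plus time proximity, in [0, 1]."""
    shared_ci = 1.0 if set(change["cis"]) & set(incident["cis"]) else 0.0
    delay = (incident["opened"] - change["deployed"]) / WINDOW
    return 0.5 * shared_ci + 0.5 * (1.0 - delay)


def cutoff_from_explicit(scores, explicit, keep=0.9):
    """Step 4 stand-in: highest cutoff that still retains `keep` of explicit links."""
    known = sorted((s for pair, s in scores.items() if pair in explicit), reverse=True)
    if not known:
        raise ValueError("no explicit linkage among the candidates")
    return known[math.ceil(keep * len(known)) - 1]


# Made-up records for illustration.
t0 = datetime(2021, 1, 4, 9, 0)
changes = [
    {"id": "CHG1", "deployed": t0, "cis": ["db-01"]},
    {"id": "CHG2", "deployed": t0, "cis": ["web-07"]},
    {"id": "CHG3", "deployed": t0 + timedelta(hours=1), "cis": ["cache-02"]},
]
incidents = [
    {"id": "INC1", "opened": t0 + timedelta(hours=12), "cis": ["db-01"], "caused_by": "CHG1"},
    {"id": "INC2", "opened": t0 + timedelta(hours=3), "cis": ["web-07"], "caused_by": None},
]

explicit = explicit_links(incidents)
scores = {(c["id"], i["id"]): toy_strength(c, i)
          for c, i in candidate_pairs(changes, incidents)}
cutoff = cutoff_from_explicit(scores, explicit)

# Only incidents without an explicit link can gain an implicit one.
linked_incidents = {inc for _, inc in explicit}
implicit = {pair for pair, s in scores.items()
            if pair[1] not in linked_incidents and s >= cutoff}
problematic_changes = {chg for chg, _ in explicit | implicit}
```

In this toy data the only explicit linkage sets the cutoff, and the pair (CHG2, INC2) clears it, so CHG2 joins CHG1 in the set of problematic changes. A real system would replace toy_strength with a score learned from the explicit linkages.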

The summary of results is shown in the figure below. The implicit linkages with strength higher than the value of the cutoff (the dotted line) are added to the explicit linkages to complete the set of problematic changes to be used for training the change risk model:

Challenge 2

If a change is identified as potentially problematic, that information alone has limited value. To make it actionable, we need to provide a reason. Because we determine the risk level with a model trained on past changes, one would reasonably expect some subset of the problematic changes in the training set to be similar to the newly identified “risky” change. These similar changes, together with their Root Cause Analysis (RCA) reports, are useful in creating an explanation.

With the ground truth completed by adding implicit change-incident linkages, and with suitable model performance metrics chosen, we train multiple classification models to separate problematic changes from successful ones. We extract the features for training the change risk classifier from the pre-processed change text. In addition to the change text, we also use structured fields such as change environment and change team. The dependent variable for the binary classification is “problematic”, which takes values {0, 1}.

We use three methods for feature extraction from the change text to train different classifiers, as listed below:

Bag of words representation
Sequential representation using pre-trained word embeddings for training
Concatenated representation
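
As a rough illustration of the first two representations, the Python sketch below builds a bag-of-words vector and a padded token-id sequence from change text. The tokenizer, vocabulary handling, sample texts and category list are assumptions for this example. The article does not specify what the concatenated representation joins, so the final lines simply append a one-hot structured field to the bag-of-words vector as one plausible reading.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocab):
    """Token counts in vocabulary order; word order is discarded."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


def to_sequence(text, vocab, max_len):
    """Token ids in reading order (1-based, 0 = padding), truncated to max_len.

    An embedding layer would map each id to a pre-trained word vector.
    """
    ids = [vocab[tok] + 1 for tok in tokenize(text) if tok in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


def one_hot(value, categories):
    """Encode a structured field such as change environment."""
    return [1 if value == c else 0 for c in categories]


# Made-up change titles for illustration.
training_texts = ["restart database node", "patch web server", "restart web server"]
vocab = build_vocab(training_texts)

new_text = "Restart web server restart"
bow = bag_of_words(new_text, vocab)     # counts per vocabulary token
seq = to_sequence(new_text, vocab, 6)   # ordered ids, zero-padded
env = one_hot("prod", ["dev", "test", "prod"])

features = bow + env  # text features followed by the structured field
```

The bag-of-words vector suits classifiers such as an SVM, while the ordered id sequence is the kind of input a recurrent model such as an LSTM consumes through an embedding layer.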

Evaluation

Table 1 compares the performance of classifiers without handling the class imbalance. The results show high precision, but the recall values are low, with the best at around 0.69. This approach suits automated deployment, where alerts are sent to an SRE and deployment of the flagged change is deferred to the SRE. The SVM-based classifier outperforms the other classifiers, giving the highest F0.5 score of 0.88 with an almost perfect precision of 0.98.
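
The F-beta score used here combines precision and recall with a weight beta: values below 1 (such as 0.5) favor precision, and values above 1 (such as 2.0) favor recall. The Python sketch below implements the standard formula. The input numbers are illustrative and are not taken from Table 1 or Table 2.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values: a precise classifier that misses half the positives.
p, r = 0.9, 0.5
f_half = f_beta(p, r, 0.5)  # rewards the high precision
f_one = f_beta(p, r, 1.0)   # balanced F1
f_two = f_beta(p, r, 2.0)   # penalizes the low recall
```

For the same precision and recall, the score falls as beta grows, which is why a precision-oriented setting is judged on F0.5 and a recall-oriented one on F2.0.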

Table 2 compares the performance of classifiers using algorithm-level methods for handling the class imbalance. The results show a significant increase in the recall values. This approach is used in a manual deployment setting. Although the precision values drop significantly, the change-risk alerts in this setting add information to the due diligence that SREs perform during the manual change deployment process. The LSTM3 classifier outperforms all other classifiers in this scenario, giving the highest F2.0 score of 0.72 and a recall of 0.82.
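
One common algorithm-level method is to weight each class inversely to its frequency, so that errors on the rare class cost more during training. The article does not state which weighting scheme was used; the Python sketch below shows the widely used "balanced" heuristic, n_samples / (n_classes * class_count), on made-up labels with a 2.1% positive rate.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up labels: 21 problematic changes out of 1,000 (2.1%).
labels = [1] * 21 + [0] * 979
weights = balanced_class_weights(labels)

# A misclassified problematic change now costs this many times more
# than a misclassified successful one.
relative_cost = weights[1] / weights[0]
```

With these weights the rare class contributes as much to the training loss as the majority class, which typically raises recall at the expense of precision.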

Table 1: Change Risk model performance without class weights balancing.

Table 2: Change Risk model performance using class weights balancing.

Combining a semi-supervised learning technique for discovering implicit linkages between change and incident records with a set of supervised learning techniques for change risk assessment has yielded good prediction performance, as shown in the results above.

Learn more

In this article, we described a novel methodology that is the basis of the Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps. Automated change risk assessment allows SREs to focus on the changes that truly require their attention, improving reliability, performance or utilization while reducing the time they spend on toil.

Be sure to check out this capability in tech preview coming on March 31, 2021.

Want to know what other capabilities are included in the IBM Cloud Pak for Watson AIOps? Join this webinar to learn more.

Larisa Shwartz

Chief Data Scientists, IBM Consulting Hybrid Cloud Services

Raghav Batta

Senior Developer, Hybrid Cloud Services

Michael Nidd

RSM, Systems Management

Pritam Gundecha

Senior Data Scientist, AI for IT Operations