About AI models
IBM Cloud Pak® for AIOps provides a wide range of AI algorithms that gather data from your environment, learn from it, and provide insights to support the resolution of health issues across your application, service, and network infrastructure.
For each of the following AI algorithms, this topic describes the algorithm, explains how it helps you, and, where configuration is necessary, points to the relevant configuration tasks.
- Alert seasonality detection
- Change risk
- Event grouping
- Log anomaly detection
- Metric anomaly detection
- Probable cause
- Similar tickets
Alert seasonality detection
This AI algorithm groups alerts that tend to occur within a seasonal time window. Periodicity in alert occurrences often indicates a common underlying cause, which can help you diagnose the alert. The model also enables you to take different actions on alerts depending on whether they occur within or outside the seasonal window. If a seasonal alert is considered benign when it occurs within its time window, you can also suppress it.
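To illustrate the idea behind seasonality, the following minimal Python sketch checks whether the occurrences of a single alert cluster in one hour of the day. The timestamps, the hour-of-day window, and the 70% threshold are hypothetical and are not the product's actual model.

```
from collections import Counter
from datetime import datetime

# Hypothetical occurrence timestamps for one recurring alert.
occurrences = [
    "2024-03-04 02:05", "2024-03-05 02:10", "2024-03-06 02:58",
    "2024-03-07 02:03", "2024-03-08 02:07", "2024-03-08 14:30",
]

def seasonal_hour(timestamps, min_fraction=0.7):
    """Return the dominant hour of day if enough occurrences fall in it, else None."""
    hours = [datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps]
    hour, count = Counter(hours).most_common(1)[0]
    return hour if count / len(hours) >= min_fraction else None

hour = seasonal_hour(occurrences)
if hour is not None:
    print(f"Alert appears seasonal: most occurrences fall around {hour:02d}:00")
```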
To train this algorithm, you must initially configure it using the console, as described in Configuring alert seasonality detection.
When this algorithm is successfully trained and the resulting AI model deployed, alerts that occur within a seasonal time window are grouped together. Site reliability engineers (SREs) and other users responsible for application and service availability can then view the details in the Alert Viewer, as described in Displaying alert seasonality.

Change risk
This AI algorithm provides an assessment of the risk of implementing a proposed change; for example, a code change or a new software version.
Hundreds of changes can affect an application during its lifecycle. Change risk takes historical data about all of those changes and helps you determine how likely it is that a specific change will cause a problem, based on how successfully similar changes were deployed in the past. Using the assessment score provided by this AI model, you can decide how safe it is to proceed with the change.
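As a rough illustration of this kind of assessment, the following Python sketch estimates risk as the failure rate among the most similar historical changes, using a crude token-overlap similarity. The change records, the similarity measure, and the value of k are hypothetical and only illustrate the general approach.

```
# Hypothetical historical change records: (description, failed?)
history = [
    ("upgrade payment service to v2.3", False),
    ("upgrade payment service to v2.4", True),
    ("rotate database credentials", False),
    ("upgrade inventory service to v1.9", False),
]

def similarity(a, b):
    """Crude token-overlap (Jaccard) similarity between two change descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def change_risk(description, k=3):
    """Estimate risk as the failure rate among the k most similar past changes."""
    ranked = sorted(history, key=lambda rec: similarity(description, rec[0]), reverse=True)
    nearest = ranked[:k]
    return sum(1 for _, failed in nearest if failed) / len(nearest)

# One of the three most similar past changes failed, so the score is about 0.33.
print(change_risk("upgrade payment service to v2.5"))
```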
Training this AI algorithm helps you ensure that risky changes for an application are assessed before deployment. To train this algorithm, you must initially configure it using the console, as described in Configuring change risk.
After this algorithm is successfully trained and the resulting AI model is deployed, a Change risk assessment appears in the proactive ChatOps channel whenever someone opens a change request ticket. For more details, see Change risk.

Event grouping
Three AI algorithms are provided that group events together and present the groups within a single incident.
The three algorithms are the following:
Temporal grouping
This AI algorithm groups events that are discovered to co-occur over time. When a problem arises, there are typically multiple parts of a system or environment that are impacted. When events in different areas co-occur, it makes sense to look at them together and treat them as one problem to try to determine what might have happened.
Grouping co-occurring events together reduces the number of tickets and incidents opened and the number of people looking at the same problem, thereby significantly reducing noise in your monitoring systems. It helps you to understand the context of an issue so you can prioritize, triage, and resolve it more quickly.
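A minimal sketch of time-based grouping, assuming events arrive with timestamps and a fixed co-occurrence window (both invented here, not taken from the product), might look like the following Python example.

```
# Hypothetical events: (timestamp in seconds, description).
events = [
    (0, "disk latency high on db-01"),
    (12, "query timeouts on orders-api"),
    (18, "5xx rate spike on web frontend"),
    (600, "certificate expiring on mail gateway"),
]

def group_by_time(events, window=60):
    """Group events whose start times fall within `window` seconds of the previous event."""
    groups, current = [], []
    for ts, desc in sorted(events):
        if current and ts - current[-1][0] > window:
            groups.append(current)
            current = []
        current.append((ts, desc))
    if current:
        groups.append(current)
    return groups

# The first three events land in one group; the late certificate event stands alone.
for group in group_by_time(events):
    print([desc for _, desc in group])
```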
To train this algorithm, you must initially configure it using the console, as described in Configuring temporal grouping.
When this algorithm is enabled, related alerts are grouped based on when they occur. Site reliability engineers (SREs) and other users responsible for application and service availability can view the details in the Alert Viewer, as described in About alerts.

Topological grouping
This AI algorithm groups your events based on the resource groups in which the events occur. For example, if you have a resource group made up of all the resources within a given Kubernetes namespace, then any events on pods, microservices, or other resources in that namespace are grouped together in a single topological group.
Topological grouping helps you understand when events are connected based on their topology, providing valuable context for why related events might occur together.
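Conceptually, this amounts to keying each event by the resource group of the resource it occurred on. The following Python sketch uses a hypothetical resource-to-namespace mapping to show the idea; the mapping and events are invented for illustration.

```
# Hypothetical mapping of resources to resource groups (for example, Kubernetes namespaces).
resource_group = {
    "orders-pod-1": "namespace:orders",
    "orders-svc": "namespace:orders",
    "billing-pod-7": "namespace:billing",
}

events = [
    {"resource": "orders-pod-1", "summary": "pod restart loop"},
    {"resource": "orders-svc", "summary": "service latency high"},
    {"resource": "billing-pod-7", "summary": "OOMKilled"},
]

groups = {}
for event in events:
    group = resource_group.get(event["resource"], "ungrouped")
    groups.setdefault(group, []).append(event["summary"])

# The two orders-namespace events form one topological group; billing forms another.
print(groups)
```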
This algorithm is enabled so that related events are grouped based on their topology. Site reliability engineers (SREs) and other users responsible for application and service availability can view the details in the Alert Viewer, as described in About alerts.
Scope-based grouping
This AI algorithm automatically groups events relating to an incident if they have the same defined scope and occur during the same period of time. A scope can be used to identify where events originate based on a common attribute, for example, the location of a server room.
By understanding when events are related based on both time and location, you can more quickly diagnose incidents.
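The sketch below groups hypothetical events by a scope attribute (a server room) and a fixed time bucket. The attribute, bucket size, and events are invented for illustration and do not reflect the product's actual windowing.

```
# Hypothetical events with a timestamp (minutes) and a scope attribute (server room).
events = [
    {"minute": 0,  "room": "DC1-RoomA", "summary": "PDU fault"},
    {"minute": 3,  "room": "DC1-RoomA", "summary": "switch port down"},
    {"minute": 5,  "room": "DC2-RoomC", "summary": "fan failure"},
    {"minute": 45, "room": "DC1-RoomA", "summary": "temperature warning"},
]

def scope_groups(events, window_minutes=15):
    """Group events that share a scope value and fall in the same time bucket."""
    groups = {}
    for event in events:
        key = (event["room"], event["minute"] // window_minutes)
        groups.setdefault(key, []).append(event["summary"])
    return groups

# Events from the same room within the same 15-minute bucket end up together.
for key, members in scope_groups(events).items():
    print(key, members)
```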
This algorithm is enabled so that related events are grouped based on their scope. Site reliability engineers (SREs) and other users responsible for application and service availability can view the details in the Alert Viewer, as described in About alerts.
Log anomaly detection
This pair of AI algorithms gathers log data from one or more components in the application architecture, identifies a baseline of expected log message types, and uses this baseline to discover abnormal behavior in your live log data. For more information on log anomaly detection algorithms, see About log anomaly detection algorithms.
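As a simplified illustration of the baseline idea, the following Python sketch reduces log lines to crude templates, builds a baseline of templates seen during training, and flags live lines whose template was never seen. The log lines and the number-masking template function are hypothetical stand-ins for real log templating.

```
from collections import Counter
import re

def template(line):
    """Reduce a log line to a rough template by masking numbers (a crude stand-in
    for real log templating)."""
    return re.sub(r"\d+", "<num>", line)

baseline_logs = [
    "GET /orders 200 12ms", "GET /orders 200 9ms",
    "GET /orders 200 15ms", "connection pool size 20",
]
live_logs = ["GET /orders 200 11ms", "OutOfMemoryError in worker 3"]

# Learn which message types are normal, then flag live lines that do not match.
baseline = Counter(template(l) for l in baseline_logs)
for line in live_logs:
    if template(line) not in baseline:
        print("anomalous log line:", line)
```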
Metric anomaly detection
Metric anomaly detection first learns normal patterns of metric behavior by analyzing metric values at regular intervals. If that behavior significantly changes, it raises anomalies or alerts.
For more information, see About metric detection algorithms.
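A minimal sketch of the underlying idea, assuming a single metric and invented sample values, is to learn the mean and standard deviation of historical samples and flag new values whose z-score is too large.

```
import statistics

# Hypothetical training values for one metric, sampled at regular intervals.
training = [100, 98, 103, 101, 99, 102, 100, 97]
mean = statistics.mean(training)
std = statistics.stdev(training)

def is_anomalous(value, threshold=3.0):
    """Flag a new sample whose z-score exceeds the threshold."""
    return abs(value - mean) / std > threshold

print(is_anomalous(101))  # False: within the learned normal range
print(is_anomalous(180))  # True: a significant deviation raises an anomaly
```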

Probable cause
This AI algorithm identifies the event with the greatest probability of being the cause of an incident.
Probable cause analyzes event and topology data to understand how events are related to each other. It then uses that understanding to try to determine the root cause of a problem.
All events in an incident are given a probability score that indicates the likelihood of that event being the cause of the incident. Events that occur on a resource that is lower down in the topology are given a higher probability score. Events are also categorized and are scored based on the category that they fall into. This ensures that if two events occur on the same resource in the topology, the event with the more significant category has a higher probability score.
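To make the scoring idea concrete, the following Python sketch combines a topology-depth score with an invented category weight into a single probable-cause score. The categories, weights, and the 50/50 combination are hypothetical; they only illustrate how depth and category can both contribute to the ranking.

```
# Hypothetical scoring: deeper resources in the topology and more significant
# event categories receive higher probable-cause scores.
category_weight = {"capacity": 0.9, "performance": 0.6, "availability": 0.4}

events = [
    {"summary": "frontend latency high", "depth": 1, "category": "performance"},
    {"summary": "database disk full",    "depth": 3, "category": "capacity"},
    {"summary": "api pod restarted",     "depth": 2, "category": "availability"},
]

max_depth = max(e["depth"] for e in events)

def probable_cause_score(event):
    """Combine topology depth and category significance into a 0-1 score."""
    depth_score = event["depth"] / max_depth
    return round(0.5 * depth_score + 0.5 * category_weight[event["category"]], 2)

# The deep, capacity-related database event ranks as the most probable cause.
for event in sorted(events, key=probable_cause_score, reverse=True):
    print(probable_cause_score(event), event["summary"])
```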
Once the SRE has identified the probable cause event, they can take action on the resource that is generating that event and quickly resolve the incident.
The probable cause AI algorithm is enabled by default, so that information about the probable cause of a problem is automatically included in your incident details, which are surfaced in your ChatOps interface, as described in Managing incidents.
Similar tickets
This AI algorithm discovers details about similar messages, anomalies, and events within your past tickets that relate to the currently impacted application.
When an incident occurs, it can be helpful to review details for similar tickets to help determine a resolution. This model aggregates information about similar messages, anomalies, and events for a component or application. This model can also extract the steps used to fix previous incidents, if documented.
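A minimal sketch of similarity-based retrieval, using invented tickets and a simple word-count cosine similarity rather than the product's actual matching, might look like this.

```
from collections import Counter
import math

# Hypothetical historical tickets with their documented resolutions.
tickets = [
    ("checkout service returns 500 after deploy", "roll back release 4.2"),
    ("database connection pool exhausted",        "increase pool size and restart"),
    ("login page slow during peak hours",         "scale auth service to 5 replicas"),
]

def cosine(a, b):
    """Cosine similarity of two texts over word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

incident = "checkout service 500 errors after the latest deploy"
best = max(tickets, key=lambda t: cosine(incident, t[0]))
print("most similar past ticket:", best[0])
print("documented resolution:", best[1])
```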
Training this AI model helps you discover historical incidents to aid in the remediation of current problems. To train this algorithm, you must initially configure it using the console, as described in Setting up training for similar tickets.
Site reliability engineers (SREs) and other users responsible for application and service availability can access incidents in their ChatOps interface and, within each incident, view similar historical tickets, as described in Managing incidents.
