Troubleshooting alert management

No alert or incident created when an event is sent

Symptom A: An event is sent, but no alert is received by IBM Cloud Pak for AIOps and no incident is created.

Running waiops-mustgather.sh -V flink-jobs with the must-gather tool returns "RESTARTING" or "FAILED" under the [FLINK = aiops-ir-lifecycle] entry. For example:

[FLINK = cp4aiops]
{
  "jobs": []
}

[FLINK = aiops-ir-lifecycle]
{
  "jobs": [
    {
      "id": "4e442c7880239bede340e953e6b920b3",
      "status": "RESTARTING"
    }
  ]
}
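A healthy lifecycle job reports a status of "RUNNING". The status can be pulled out of the saved must-gather JSON with standard shell tools. The following is a sketch that reproduces the example output above; the file path /tmp/flink-jobs.json is illustrative.

```shell
# Save the [FLINK = aiops-ir-lifecycle] section of the must-gather
# output to a file. The content below reproduces the example above;
# the path /tmp/flink-jobs.json is illustrative.
cat > /tmp/flink-jobs.json <<'EOF'
{
  "jobs": [
    {
      "id": "4e442c7880239bede340e953e6b920b3",
      "status": "RESTARTING"
    }
  ]
}
EOF

# Extract the job status. Anything other than RUNNING (for example,
# RESTARTING or FAILED) indicates an unhealthy lifecycle policy engine.
grep -o '"status": *"[A-Z]*"' /tmp/flink-jobs.json | sed 's/.*"\([A-Z]*\)"$/\1/'
# prints: RESTARTING
```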

This issue might be caused by a volume of event data that is too large for the given installation size and configured policy set. This can cause the policy engine to run out of available memory.

Solution: To resolve this issue, complete the following steps:

  1. Evaluate the sizing of the installation compared to the current event load. If there is a mismatch, change the installation sizing or reconfigure the event sources.

  2. Ensure that incident creation policies are configured to raise incidents only for alerts that require immediate resolution. Incidents are more resource-intensive to process than alerts, so creating incidents for all alerts requires a larger installation size.

  3. After you reconfigure the system, the policy engine might still be in a failure state as it tries to reprocess the already-inserted events. You can reset the policy engine so that it processes only new events by using the must-gather tool:

    waiops-mustgather.sh -DR -C clear-lifecycle-state.sh
    
    Note: Resetting the policy engine causes all policy state to be lost. As a result, events might create new alerts and incidents rather than being deduplicated into existing ones. Also, any events that were inserted between the failure and the running of this command are not processed by the system.
  4. Restart the Flink pods:

    1. Search for the pods that match the string aiops-ir-lifecycle-flink

      This should return four pods.

    2. Right-click each pod and select Delete.

      The pods should restart immediately.
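As an alternative to the console steps above, the pods can be found and restarted with the oc CLI. The pod names in this sketch are illustrative placeholders; on a live cluster, use the real output of oc get pods -o name instead.

```shell
# Illustrative pod list standing in for "oc get pods -o name";
# the real names and suffixes vary per installation.
pods=$(cat <<'EOF'
pod/aiops-ir-lifecycle-flink-jobmanager-0
pod/aiops-ir-lifecycle-flink-taskmanager-0
pod/aiops-ir-lifecycle-flink-taskmanager-1
pod/aiops-ir-lifecycle-flink-taskmanager-2
EOF
)

# Count the matching pods; four are expected.
echo "$pods" | grep -c aiops-ir-lifecycle-flink
# prints: 4

# On a live cluster, delete the pods so that they restart immediately:
#   oc delete pods $(oc get pods -o name | grep aiops-ir-lifecycle-flink)
```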

Symptom B: An event is sent, but no alert is received by IBM Cloud Pak for AIOps and no incident is created. The log of the lifecycle task manager pod contains the following error:

WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNAVAILABLE: ssl exception

This issue might be caused by a certificate mismatch between the lifecycle operator and the policy gRPC server, which causes the connection to fail with SSL exceptions.

Solution: To resolve this issue, complete the following steps:

  1. Run the following command to delete the certificate:

    oc delete secret aiops-ir-lifecycle-policy-grpc-clt-svc aiops-ir-lifecycle-policy-grpc-svc
    
  2. Get the lifecycle-operator pod name:

    oc get pods | grep ir-lifecycle-operator-controller-manager
    
  3. Restart the lifecycle operator by deleting its pod:

    oc delete pod <lifecycle-operator-pod-name>

    Where <lifecycle-operator-pod-name> is the full pod name that is returned in the previous step.
    

No alert ranking with large installations

On larger production scale installations of IBM Cloud Pak for AIOps, no ranking of alerts is displayed in the Alert Viewer.

Solution: To resolve this issue, complete the following steps:

  1. Run the following command to find the classifier and probablecause pods:

    oc get pods | egrep -i "classifier|probablecause"
    

    For example:

    oc get pods | egrep -i "classifier|probablecause"
    
    aiops-ir-analytics-classifier-6477fd7788-4rzwx                    1/1     Running     0
    aiops-ir-analytics-classifier-6477fd7788-8jd55                    1/1     Running     0
    aiops-ir-analytics-probablecause-5b89d56c44-b752h                 1/1     Running     1
    aiops-ir-analytics-probablecause-5b89d56c44-kpqcb                 1/1     Running     1
    
  2. Restart the pods:

    oc delete pods aiops-ir-analytics-classifier-6477fd7788-4rzwx aiops-ir-analytics-classifier-6477fd7788-8jd55 aiops-ir-analytics-probablecause-5b89d56c44-b752h aiops-ir-analytics-probablecause-5b89d56c44-kpqcb
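The two steps above can be combined into a single pass over the pod list. The sketch below reuses the sample pod names from the example output; the one-line restart in the final comment assumes a live cluster.

```shell
# Sample pod names from the example output above; on a live cluster,
# feed the real "oc get pods" output into the filter instead.
pods=$(cat <<'EOF'
aiops-ir-analytics-classifier-6477fd7788-4rzwx
aiops-ir-analytics-classifier-6477fd7788-8jd55
aiops-ir-analytics-probablecause-5b89d56c44-b752h
aiops-ir-analytics-probablecause-5b89d56c44-kpqcb
EOF
)

# Both deployments match the same filter; four pods are expected.
echo "$pods" | egrep -ci "classifier|probablecause"
# prints: 4

# One-line restart on a live cluster:
#   oc delete pods $(oc get pods -o name | egrep -i "classifier|probablecause")
```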
    

Closed alerts are not being removed from Alert Viewer

You might notice an issue where closed alerts are not being removed from the Alert Viewer.

If this issue occurs, you might see an error that is related to timeouts for requests to Postgres. This error can be similar to the following example:

[2023-12-12T17:30:30.022Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query
    Error: Query timed out
        at Timeout._onTimeout (/app/lib/helpers/dbservice/lib/helpers/dbservice/postgres.js:178:20)
        at listOnTimeout (node:internal/timers:559:17)
        at processTimers (node:internal/timers:502:7)
[2023-12-12T17:30:30.022Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on client connection. Ending query stream (reqid=13778660-9914-11ee-844d-e573891ebc4b)
    Error: Query timed out
        at Timeout._onTimeout (/app/lib/helpers/dbservice/lib/helpers/dbservice/postgres.js:178:20)
        at listOnTimeout (node:internal/timers:559:17)
        at processTimers (node:internal/timers:502:7)
[2023-12-12T17:30:30.024Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query cursor
    Error: Connection terminated
        at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:132:36)
        at Object.onceWrapper (node:events:627:28)
        at Connection.emit (node:events:525:35)
        at Socket.<anonymous> (/app/node_modules/pg/lib/connection.js:63:12)
        at Socket.emit (node:events:525:35)
        at TCP.<anonymous> (node:net:301:12)

This issue might be caused by a failure of the datalayer job manager. Because the job is not running, there is no automated process for clearing the closed alerts.
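The timeout signature can be confirmed by searching the job manager logs. This sketch greps a saved excerpt of the errors shown above; on a live cluster, grep the output of oc logs for the job manager pod instead. The path /tmp/jobmgr.log is illustrative.

```shell
# A saved excerpt of the errors above stands in for the real logs;
# the path /tmp/jobmgr.log is illustrative.
cat > /tmp/jobmgr.log <<'EOF'
[2023-12-12T17:30:30.022Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query
    Error: Query timed out
[2023-12-12T17:30:30.024Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query cursor
    Error: Connection terminated
EOF

# Count the timeout errors; a nonzero count confirms the signature.
grep -c "Query timed out" /tmp/jobmgr.log
# prints: 1
```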

Solution: Use the following steps to resolve the closed alerts issue:

  1. Export an environment variable for your project.

    For a deployment of IBM Cloud Pak for AIOps on Linux, commands can be run on a control plane node.

    export PROJECT_CP4AIOPS=<project>
    oc project $PROJECT_CP4AIOPS
    

    Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed. This is usually cp4aiops for a deployment on Red Hat OpenShift Container Platform, or aiops for a deployment on Linux.

  2. Determine the name of the datalayer job manager pod.

    oc get pod -l "app.kubernetes.io/component=ncodl-jobmgr"
    

    The output from the preceding command can look like the following example:

    [root@6c025b05c0a7 /]# oc get pod -l "app.kubernetes.io/component=ncodl-jobmgr"
    NAME                                          READY   STATUS    RESTARTS   AGE
    aiops-ir-core-ncodl-jobmgr-85458b4979-gqgj8   1/1     Running   0          5h11m
    
  3. Use the pod name to describe the job manager pod. If the pod is in a crash loop or is not scheduled, use the following command to see the failure:

    oc describe pod <pod name>
    

    Where <pod name> is the name of the datalayer job manager pod.

    If the failure reason from the describe command is not clear, or if the pod appears to be running normally, check the pod logs for potential failure conditions. Use the following command to check the logs:

    oc logs <pod name>
    

    Where <pod name> is the name of the datalayer job manager pod.

    You can, for example, search the logs for "Alert delete job", which must be running. The following example shows the logs for the delete job.

    [2023-12-12T17:38:00.002Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Running: Alert delete job
    [2023-12-12T17:38:30.120Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Complete: Alert delete job
     metrics: {
       "fetched": 0,
       "deleted": 0,
       "failed": 0
     }
     --
     psql: [
       {
         "name": "deletes",
         "mean": 0,
         "min": 0,
         "max": 0
       },
       {
         "name": "relatedDeletes",
         "mean": 0,
         "min": 0,
         "max": 0
       }
     ]
    
  4. Check the status of the aiops-ir-core-postgres cluster.

    oc get clusters.postgresql.k8s.enterprisedb.io aiops-ir-core-postgres
    
    Example output:
    NAME                     AGE    INSTANCES   READY   STATUS                     PRIMARY
    aiops-ir-core-postgres   160m   3           3       Cluster in healthy state   aiops-ir-core-postgres-1
    

    If the Postgres cluster is healthy, proceed to the next step. Otherwise, contact IBM Support.

  5. If the Postgres instance is running fine, it might still be busy and overloaded. Whether you see the preceding "Query timed out" errors or the Postgres instance appears healthy, use the following command to manually purge the closed alerts and recover the system:

    oc exec $(oc get clusters.postgresql.k8s.enterprisedb.io aiops-ir-core-postgres -o jsonpath='{.status.currentPrimary}') -- psql -U postgres -c \
    "\connect aiops_irb" -c "DELETE FROM aiops_irb.alerts_base_tp WHERE state='closed';"
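After the purge, you can confirm the recovery by repeating the delete-job check from step 3: a healthy run logs both a Running and a Complete entry. The sketch below counts those entries in a saved excerpt that stands in for the real oc logs output; the path /tmp/jobmgr-delete.log is illustrative.

```shell
# A saved excerpt standing in for "oc logs <pod name>"; the path
# /tmp/jobmgr-delete.log is illustrative.
cat > /tmp/jobmgr-delete.log <<'EOF'
[2023-12-12T17:38:00.002Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Running: Alert delete job
[2023-12-12T17:38:30.120Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Complete: Alert delete job
EOF

# A healthy delete job logs both a Running and a Complete entry.
grep -c "Alert delete job" /tmp/jobmgr-delete.log
# prints: 2
```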
    

"Unknown error" message in Alert Viewer and no alerts displayed

Symptom: No alerts are shown in the Cloud Pak for AIOps Alert Viewer, and an "Unknown error" message is displayed in the UI.

Cause: The problem might be that the number of alerts exceeds what the Alert Viewer can process. A production deployment of Cloud Pak for AIOps can process a standing alert count of up to 200,000 alerts, with a chosen subset displayed in the Alert Viewer. The precise number of alerts that can be shown depends on different permutations of variables that affect the performance of the Alert Viewer. For example, standing alert count, filtered alert count, and view used (such as simple columns with alert data, or complex columns with insights and custom user enriched columns).

Solution: At first (during deployment or creation of a new view), try excluding columns with insights or columns based on custom properties. You can then add these columns sequentially to better gauge how many alerts a view can handle. For more information, see Creating views.

You can also restrict the number of alerts that are returned by the data layer to the Alert Viewer. This can improve performance issues on the backend. For more information, see Restricting the number of alerts returned by the data layer to the Alert Viewer.

Alert viewer not updating due to checkpoint failures

Symptom: The alert viewer might stop updating and fail to display new or updated alerts. When this occurs, the aiops-ir-lifecycle-flink-* pods show checkpoint failure warnings with read timeout errors in their logs:
[WARN] Failed to trigger or complete checkpoint...
org.apache.flink.runtime.checkpoint.CheckpointException: Asynchronous task checkpoint failed.
...
Caused by: java.net.SocketTimeoutException: Read timed out

Solution: Contact IBM support for assistance.

Alert list ignores saved view sort order

Symptom: The sort order that you apply to the alerts table is not saved as expected.

Cause: This might be related to a similar issue in which an edited alert view does not show the correct configuration when it is first launched, because users and groups must be fetched before the correct view configuration can be shown. A fix is planned, so this should no longer be an issue in future releases.

Solution: If the sort order of the alerts table was not saved as expected, apply the sort order again after the alerts table loads.

"Unknown error" message in Alert Viewer when you click the alert view without selecting the ID and Summary columns

Symptom: You might get an unknown error when you click an alert view that does not include the ID and SUMMARY columns.

Cause: The ID and SUMMARY columns must be part of the view. If these columns are not included, the view is unusable, and the only way to delete it is by using the API.

Solution: Include the ID and SUMMARY columns in the alert view. If an unusable view without these columns was already created, delete it by using the API.