Troubleshooting alert management

Learn how to isolate and resolve problems with alert management.

No alert or incident created when an event is sent

Symptom A: An event is sent, but no alert is received by IBM Cloud Pak® for AIOps and no incident is created.

Running waiops-mustgather.sh -V flink-jobs with the must-gather tool returns "RESTARTING" or "FAILED" under the [FLINK = aiops-ir-lifecycle] entry. For example:

[FLINK = cp4aiops]
{
  "jobs": []
}

[FLINK = aiops-ir-lifecycle]
{
  "jobs": [
    {
      "id": "4e442c7880239bede340e953e6b920b3",
      "status": "RESTARTING"
    }
  ]
}

This issue might be caused by a volume of event data that is too large for the given installation size and configured policy set, which can cause the policy engine to run out of available memory.

Solution: To resolve this issue, complete the following steps:

  1. Evaluate the sizing of the installation compared to the current event load. If there is a mismatch, change the installation sizing or the configured event sources.

  2. Ensure that incident creation policies are configured to raise incidents only for alerts that require immediate resolution. Incidents are more resource intensive to process than alerts, so creating incidents for all alerts requires a larger installation size.

  3. After you reconfigure the system, the policy engine might still be in a failure state while it tries to reprocess the already inserted events. You can reset the policy engine so that it processes only new events by using the must-gather tool:

    waiops-mustgather.sh -DR -C clear-lifecycle-state.sh
    

    NOTE: Resetting the policy engine causes all policy state to be lost. Events might then create new alerts and incidents rather than being deduplicated into existing ones. Also, any events that are inserted between the time of the failure and the time that you run this command are not processed by the system.

  4. Restart the Flink pods. (A command-line alternative follows this procedure.)

    1. Search for the pods that match the string aiops-ir-lifecycle-eventprocessor-ep.
      The search should return four pods.
    2. Right-click and Delete the pods.
      The pods restart immediately.
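
If you prefer the command line, the following commands are an equivalent way to restart the pods. This is only a sketch: it assumes that you are logged in to the project (namespace) where IBM Cloud Pak for AIOps is installed, and it matches the pods by name rather than by label.

    # List the lifecycle event processor pods (the same four pods as in the console search)
    oc get pods --no-headers -o custom-columns=":metadata.name" | grep aiops-ir-lifecycle-eventprocessor-ep

    # Delete the pods; their controllers recreate them immediately
    oc get pods --no-headers -o custom-columns=":metadata.name" | grep aiops-ir-lifecycle-eventprocessor-ep | xargs oc delete pod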

Symptom B: An event is sent, but no alert is received by IBM Cloud Pak® for AIOps and no incident is created. The log of the lifecycle task manager pod contains the following error:

WARNING: Exception thrown during asynchronous load
java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNAVAILABLE: ssl exception

The issue might be a certificate mismatch between the lifecycle operator and the policy GRPC server, causing the connection to fail with SSL exceptions.

Solution: To resolve this issue, complete the following steps:

  1. Run the following command to delete the certificate secrets:

    oc delete secret aiops-ir-lifecycle-policy-grpc-clt-svc aiops-ir-lifecycle-policy-grpc-svc
    
  2. Get the lifecycle-operator pod name:

    oc get pods |grep ir-lifecycle-operator-controller-manager
    
  3. Restart the lifecycle operator by deleting the pod that you found in the previous step:

    oc delete pod ir-lifecycle-operator-controller-manager-<lifecycle-operator-pod-name>
    
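
You can also combine steps 2 and 3 into a single command. This is only a sketch, and it assumes that exactly one ir-lifecycle-operator-controller-manager pod is running in the current project:

    # Find the operator pod and delete it in one step; the operator pod is recreated automatically
    oc get pods -o name | grep ir-lifecycle-operator-controller-manager | xargs oc delete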

No alert ranking with large installations

On larger production scale installations of IBM Cloud Pak® for AIOps, no ranking of alerts is displayed in the Alert Viewer.

Solution: To resolve this issue, complete the following steps:

  1. Run the following command to find the classifier and probablecause pods:

    oc get pods | egrep -i "classifier|probablecause"
    

    For example:

    oc get pods | egrep -i "classifier|probablecause"
    
    aiops-ir-analytics-classifier-6477fd7788-4rzwx                    1/1     Running     0
    aiops-ir-analytics-classifier-6477fd7788-8jd55                    1/1     Running     0
    aiops-ir-analytics-probablecause-5b89d56c44-b752h                 1/1     Running     1
    aiops-ir-analytics-probablecause-5b89d56c44-kpqcb                 1/1     Running     1
    
  2. Restart the pods by deleting them (a piped alternative follows this procedure):

    oc delete pods aiops-ir-analytics-classifier-6477fd7788-4rzwx aiops-ir-analytics-classifier-6477fd7788-8jd55 aiops-ir-analytics-probablecause-5b89d56c44-b752h aiops-ir-analytics-probablecause-5b89d56c44-kpqcb
    
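
Instead of copying the pod names by hand, you can pipe the same filter into the delete command. This is only a sketch and assumes that the pods run in the current project; their deployments recreate the pods automatically:

    # Delete every classifier and probablecause pod that the filter matches
    oc get pods --no-headers -o custom-columns=":metadata.name" | egrep -i "classifier|probablecause" | xargs oc delete pods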

Closed alerts are not being removed from Alert Viewer

You might notice an issue where closed alerts are not being removed from the Alert Viewer.

If this issue occurs, you might see an error that is related to timeouts for requests to Postgres. This error can be similar to the following example:

[2023-12-12T17:30:30.022Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query
    Error: Query timed out
        at Timeout._onTimeout (/app/lib/helpers/dbservice/lib/helpers/dbservice/postgres.js:178:20)
        at listOnTimeout (node:internal/timers:559:17)
        at processTimers (node:internal/timers:502:7)
[2023-12-12T17:30:30.022Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on client connection. Ending query stream (reqid=13778660-9914-11ee-844d-e573891ebc4b)
    Error: Query timed out
        at Timeout._onTimeout (/app/lib/helpers/dbservice/lib/helpers/dbservice/postgres.js:178:20)
        at listOnTimeout (node:internal/timers:559:17)
        at processTimers (node:internal/timers:502:7)
[2023-12-12T17:30:30.024Z] ERROR: datalayer.postgres/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Error detected on streaming query cursor
    Error: Connection terminated
        at Connection.<anonymous> (/app/node_modules/pg/lib/client.js:132:36)
        at Object.onceWrapper (node:events:627:28)
        at Connection.emit (node:events:525:35)
        at Socket.<anonymous> (/app/node_modules/pg/lib/connection.js:63:12)
        at Socket.emit (node:events:525:35)
        at TCP.<anonymous> (node:net:301:12)

This issue might be caused by a failure of the datalayer job manager. Because the job is not running, there is no automated process to clear the closed alerts.

Solution: Use the following steps to resolve the closed alerts issue:

  1. Determine the name of the datalayer job manager pod.

    oc get pod -l "app.kubernetes.io/component=ncodl-jobmgr"
    

    The output from the preceding command can look like the following example:

    [root@6c025b05c0a7 /]# oc get pod -l "app.kubernetes.io/component=ncodl-jobmgr"
    NAME                                          READY   STATUS    RESTARTS   AGE
    aiops-ir-core-ncodl-jobmgr-85458b4979-gqgj8   1/1     Running   0          5h11m
    
  2. Use the pod name to describe the job manager pod. If the pod is in a crash loop or is not scheduled, use the following command to see the failure:

    oc describe pod <pod name>
    

    Where <pod name> is the name of the datalayer job manager pod.

    If the failure reason from the describe command is not clear, or if the pod appears to be running normally, check the pod logs for potential runtime failure conditions. Use the following command to check the logs:

    oc logs <pod name>
    

    where <pod name> is the name of the datalayer job manager pod.

    For example, you can search for "Alert delete job" in the logs. The "Alert delete job" must be running. The log entries for the delete job look like the following example:

    [2023-12-12T17:38:00.002Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Running: Alert delete job
    [2023-12-12T17:38:30.120Z]  INFO: datalayer.jobmgr/14 on aiops-ir-core-ncodl-jobmgr-7f4c64dcdd-d7xjz: Complete: Alert delete job
     metrics: {
       "fetched": 0,
       "deleted": 0,
       "failed": 0
     }
     --
     psql: [
       {
         "name": "deletes",
         "mean": 0,
         "min": 0,
         "max": 0
       },
       {
         "name": "relatedDeletes",
         "mean": 0,
         "min": 0,
         "max": 0
       }
     ]
    
  3. Check the status of the primary PostgreSQL instance by using the following command:

    oc get clusters.postgresql.k8s.enterprisedb.io ibm-cp-aiops-edb-postgres
    
    

    Sample output:

    NAME                        AGE   INSTANCES   READY   STATUS                     PRIMARY
    ibm-cp-aiops-edb-postgres   41d   3           3       Cluster in healthy state   ibm-cp-aiops-edb-postgres-1
    

    The PostgreSQL pod names are similar to ibm-cp-aiops-edb-postgres-*, where the asterisk (*) represents the multiple pods that might be running for PostgreSQL.

    If the PostgreSQL cluster is healthy, note the pod name that is listed under the PRIMARY column (for example, ibm-cp-aiops-edb-postgres-1) and proceed to the next step.

    If the cluster is not healthy, contact IBM Support for assistance.

  4. If the PostgreSQL instance is healthy and the issue persists, the instance might be busy or overloaded. You might also notice a preceding error such as the Postgres request timeouts shown earlier. Use the following script to manually purge the closed alerts and recover the system:

    #!/bin/bash
    
    EDBPOD=<POSTGRES_PODNAME>
    EDBSECRET=aiops-ir-core-postgresql
    
    # Get the credentials from the required secret
    USERNAME=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.username}' | base64 -d)
    PASSWORD=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.password}' | base64 -d)
    DBNAME=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.dbname}' | base64 -d)
    SCHEMA=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.schema}' | base64 -d)
    
    # rsh into the EDB container; psql prompts for the password, which is supplied first on stdin
    oc rsh ${EDBPOD} psql -h localhost -p 5432 -U ${USERNAME} "dbname=${DBNAME} sslmode=require" -- <<EOF
    ${PASSWORD}
    
    DELETE FROM aiops_irb.alerts WHERE state=2;
    DELETE FROM aiops_irb.insights insight WHERE NOT EXISTS(SELECT 1 FROM aiops_irb.alerts alert WHERE insight.parentid = alert.alertid) AND NOT EXISTS(SELECT 1 FROM aiops_irb.stories story WHERE insight.parentid = story.storyid);
    DELETE FROM aiops_irb.links link WHERE NOT EXISTS(SELECT 1 FROM aiops_irb.alerts alert WHERE link.parentid = alert.alertid) AND NOT EXISTS(SELECT 1 FROM aiops_irb.stories story WHERE link.parentid = story.storyid);
    DELETE FROM aiops_irb.timeline timeline WHERE NOT EXISTS(SELECT 1 FROM aiops_irb.alerts alert WHERE timeline.parentid = alert.alertid) AND NOT EXISTS(SELECT 1 FROM aiops_irb.stories story WHERE timeline.parentid = story.storyid);
    DELETE FROM relationships relation WHERE NOT EXISTS(SELECT 1 FROM aiops_irb.alerts alert WHERE relation.childid = alert.alertid) AND NOT EXISTS(SELECT 1 FROM aiops_irb.stories story WHERE relation.parentid = story.storyid);
    DELETE FROM aiops_irb.alerts_details detail WHERE NOT EXISTS(SELECT 1 FROM aiops_irb.alerts alert WHERE detail.identifier = alert.identifier);
    EOF
    

    where <POSTGRES_PODNAME> is the PostgreSQL primary pod name that you noted in step 3.

    The script deletes all closed alerts, which removes them from the Alert Viewer. The job manager can then remove future closed alerts automatically. (A verification sketch follows this procedure.)

    A sample output from successful execution of the script:

    Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
    Password for user aiops_irb:
    DELETE 0
    DELETE 0
    DELETE 0
    DELETE 0
    DELETE 0
    DELETE 0
    
    

    The number after the DELETE keyword is the number of rows that are deleted.
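
To confirm that the cleanup worked, you can check the job manager log for the "Alert delete job" and count the closed alerts that remain in PostgreSQL. The following script is only a verification sketch: it reuses the secret (aiops-ir-core-postgresql), schema (aiops_irb), and closed-alert state value (state=2) from the purge script above, and psql prompts for the password from that secret.

    #!/bin/bash
    
    # Confirm that the alert delete job is being scheduled by the job manager
    JOBMGR=$(oc get pod -l "app.kubernetes.io/component=ncodl-jobmgr" -o name | head -n 1)
    oc logs ${JOBMGR} | grep "Alert delete job" | tail -n 5
    
    # Count the closed alerts (state=2) that remain in the database
    EDBPOD=<POSTGRES_PODNAME>
    EDBSECRET=aiops-ir-core-postgresql
    USERNAME=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.username}' | base64 -d)
    DBNAME=$(oc get secret ${EDBSECRET} -o=jsonpath='{.data.dbname}' | base64 -d)
    oc rsh ${EDBPOD} psql -h localhost -p 5432 -U ${USERNAME} "dbname=${DBNAME} sslmode=require" -c "SELECT count(*) FROM aiops_irb.alerts WHERE state=2;"

A count of 0 closed alerts indicates that the purge completed and that the job manager has nothing left to delete.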

"Unknown error" message in Alert Viewer and no alerts displayed

Symptom: No alerts are shown in the Cloud Pak for AIOps Alert Viewer, and an "Unknown error" message is displayed in the UI.

Cause: The problem might be that the number of alerts exceeds what the Alert Viewer can process. A production deployment of Cloud Pak for AIOps can process a standing alert count of up to 200,000 alerts, with a chosen subset displayed in the Alert Viewer. The precise number of alerts that can be shown depends on several variables that affect Alert Viewer performance, for example, the standing alert count, the filtered alert count, and the view that is used (such as simple columns with alert data, or complex columns with insights and custom user-enriched columns).

Solution: Initially (during deployment or when you create a new view), try excluding columns with insights or columns that are based on custom properties. You can then add these columns back one at a time to gauge how many alerts a view can handle. For more information, see Creating views.

You can also restrict the number of alerts that are returned by the data layer to the Alert Viewer, which can reduce load on the backend. For more information, see Restricting the number of alerts returned by the data layer to the Alert Viewer.

Alert list ignores saved view sort order

Symptom: The sort order that you apply to the alerts table is not saved as expected.

Cause: This behavior might be related to a similar issue in which an edited alert view does not show the correct configuration when it is first opened, because users and groups must be fetched before the correct view configuration can be shown. This is being fixed, so it should no longer be an issue in future releases.

Solution: If the original sort order of the alerts table was not saved as expected, apply the sort order again after the alerts table is loaded.

"Unknown error" message in Alert Viewer when you click the alert view without selecting the ID and Summary columns

Symptom: An "Unknown error" message is displayed when you open an alert view that does not include the ID and Summary columns.

Cause: A view requires the ID and Summary columns. If these columns are not included, the view is unusable, and the only way to delete it is by using the API.

Solution: Include the ID and Summary columns in the alert view. If an unusable view was already created without these columns, delete that view by using the API.