Troubleshooting integrations

Review frequently encountered issues with creating and managing outgoing and incoming integrations.

Multiple integrations

Integrations do not initialize

If an integration is stuck in a state other than Running, there might be an issue with the creation of its secrets.

To check, log in to Red Hat OpenShift and inspect the pods. For example, a Netcool integration pod might be stuck as follows:

NAME                                                              READY   STATUS              RESTARTS        AGE
netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km   0/1     ContainerCreating   0               27m

Describe the pod with oc describe pod netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km. If you see a mount error in the events log such as the following:

34s (x20 over 25m)         Warning   FailedMount                       Pod/netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km             MountVolume.SetUp failed for volume "grpc-bridge-service-binding" : secret "connector-cfe182e0-d7eb-42f6-9677-fb86b180bd0f" not found

then the secrets failed to create. To fix this issue, first find the connector-manager pod:

oc get pods | grep connector-manager

Then delete the connector-manager pod so that it restarts. When the pod restarts, the secrets are re-created for all integrations.
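
A minimal sketch of deleting the pod so that its deployment re-creates it, assuming the pod name contains connector-manager as in the command above:

oc delete pod $(oc get pods | grep connector-manager | awk '{print $1}')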

Integrations page

On viewing the Integrations page, you might encounter the following issues:

Deleting an integration does not remove the PVC for the integration

For StatefulSet type integrations, when the integration is deleted the Persistent Volume Claim (PVC) resource for the integration is not automatically deleted. You must manually delete the resource with the Red Hat OpenShift oc CLI or with the Red Hat OpenShift UI.

This behavior occurs because the StatefulSet PVC is created by specifying the volumeClaimTemplate, but the created PVC is not automatically deleted by Red Hat OpenShift when the StatefulSet resource is deleted.

To delete a PVC resource with the Red Hat OpenShift oc CLI, complete the following steps:

  1. Get the list of PVC resources for your current integrations:

    oc get pvc | grep 'conn'
    
  2. Select the PVC resource that you want to delete and record the name.

  3. Delete the PVC resource:

    oc delete pvc <pvc-name>
    

    Where <pvc-name> is the name of the PVC resource that you want to delete.

Error displays when you activate Kafka, Humio, Splunk, Mezmo, ELK, PagerDuty, or ServiceNow integrations

When you activate a Kafka, Falcon LogScale (formerly Humio), Splunk, Mezmo, ELK, PagerDuty, or ServiceNow integration in the console, an error can occur if the Flink task slots that are required to run the integration are not available. When this error occurs, the following message is displayed:

Failure: Not enough task slots to activate the connection.

This error occurs because a finite number of Flink task slots is available for creating integrations to remote resources. The error message displays the number of slots that are needed to support the current active integration set, along with the total number of slots available.

Solution: To activate the integration, you must either reduce the number of slots that are needed by this (or another) integration, or allocate more slots. To decrease the number of slots needed by an integration, consider reducing the "base parallelism" or "logs per second" value for the integration. The reduced value decreases the rate at which data flows through the integration, making more slots available to other integrations. To allocate more slots, see Increasing data streaming capacity.

Integrations on the Integrations page show a red 'Unknown' status

After you create a new integration, such as a Mezmo integration, all integrations might show a red 'Unknown' status and data might stop flowing. If you enable 'Data collection' for the new integration, the data does not flow, and the integration shows the same red 'Unknown' status in the Data collection status field instead of the expected 'Running' status. The most common cause is unstable Kafka in the background.

Solution: If this issue occurs, restart the bridge and operator pods.
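
A minimal sketch of restarting those pods by deleting them so that they are re-created; the grep patterns (connector-bridge and aiopsedge-operator) are assumptions that you might need to adjust for your deployment:

oc delete pod $(oc get pods | grep connector-bridge | awk '{print $1}')
oc delete pod $(oc get pods | grep aiopsedge-operator | awk '{print $1}')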

Integrations UI pod in CrashLoopBackOff for the Integrations page

You can encounter an intermittent issue that causes the Integrations page to fail and become unavailable for viewing.

Solution: If the page does not automatically recover, you can increase the timeout setting for the aimanager-aio-controller deployment to help prevent the issue. To increase this setting, edit the aimanager-aio-controller deployment (liveness and readiness probes) and set a longer timeout, for example 15 seconds, up from the default 5 seconds. This change can stop the Connections-ui pod from going into CrashLoopBackOff.

  1. Check the aimanager-aio-controller pod:

    oc get pods | grep aio-controller
    aimanager-aio-controller-6d9f68d5f5-8dkk9                         1/1     Running     0               27h
    
  2. Edit the aimanager-aio-controller deployment:

    oc edit deployment aimanager-aio-controller
    
  3. Set the timeoutSeconds value to 15 seconds for both the livenessProbe and the readinessProbe:

    livenessProbe
    - timeoutSeconds
    readinessProbe
    - timeoutSeconds
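
If you prefer a non-interactive change, a patch command along these lines can set both timeouts; it assumes that the probes are defined on the first container in the deployment:

oc patch deployment aimanager-aio-controller --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 15},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 15}
]'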
    

Integrations page becomes unresponsive after some time or shows an error

If you are using IBM Cloud Pak for AIOps, you might encounter an unresponsive Integrations page around 24 hours after you install your cluster. The Integrations page might take a long time to load and then display no integrations, even though you configured integrations earlier. Another symptom of this problem is that you might fail to create an integration and see the following error:

Failed submitting creating request

Solution: To ensure the Integrations page works normally, delete the aimanager-aio-controller pod:

for n in $(oc get pods |grep aimanager-aio-controller |awk '{print $1}')
do
oc delete pod $n
done

Within one to two minutes of running the script, the pod will automatically restart and the Integrations page will work as usual.

Incorrect integration status on the Integrations page

You might encounter an incorrect integration status for all gRPC integrations (Instana, Netcool, AppDynamics, Dynatrace, AWS CloudWatch, Splunk, Zabbix, and New Relic) on the Integrations page. The integration status might appear as Unknown when it should be Running. The incorrect status might be the result of the connector-bridge pod being in a CrashLoopBackOff state and restarting multiple times because of unstable Kafka.

Solution: If this issue occurs, check the Kafka and ZooKeeper pods for restarts. Kafka becomes unstable and restarts when the PVC that is used for Kafka is full. Review the current capacity of the PVC and increase the capacity as needed to resolve the issue. For more information about adjusting the size of the PVC, see Increasing Kafka PVC.
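
The following commands can help confirm the symptom; the grep patterns assume that the Kafka and ZooKeeper pod and PVC names contain kafka and zookeeper, so adjust them for your cluster:

# Check the RESTARTS column for the Kafka and ZooKeeper pods
oc get pods | grep -e kafka -e zookeeper

# Review the requested capacity and status of the Kafka PVCs
oc get pvc | grep kafka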

The integration keeps showing Error or Retrying and the integration pod is stuck in Pending

On the Integrations UI page, an integration keeps showing Error or Retrying and the integration pod is stuck in Pending. This issue occurs because the Persistent Volume Claim (PVC) is requesting a volume from a storage provider that does not support the ReadWriteOnce (RWO) access mode.
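
To confirm the cause, you can describe the pending PVC and review its events for access mode or provisioning errors; <pvc-name> is a placeholder for the PVC of the stuck integration:

oc describe pvc <pvc-name>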

Solution: You can use either one of the following methods to unblock the stuck integration pod:

  • Change the default storage class

Or

  • Re-create the Persistent Volume Claim with a different storage class

Method 1: Change the default storage class. Review the default storage class in the cluster and change it to a storage class that supports the ReadWriteOnce (RWO) access mode, such as a file storage type.

Note: This method resolves the integration PVC issue, but because it changes the global default storage class in the cluster, it might also affect other workloads that rely on the default storage class settings when requesting a volume.

  1. Run the following command to determine the current default storage class:

    oc get storageclass
    
  2. Note the name of the default storage class (default).

  3. Change the default storage class to one that supports the ReadWriteOnce (RWO) access mode, such as a file storage type. The following commands set the storageclass.kubernetes.io/is-default-class annotation to true on the new storage class and to false on the old storage class. (Replace <new-default-storageclass-name> and <old-default-storageclass-name> with the actual storage class names in your cluster.)

    oc patch storageclass <old-default-storageclass-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    oc patch storageclass <new-default-storageclass-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
    
  4. Verify that the default storage class is updated by running the following command:

    oc get storageclass
    
  5. Log in to the Cloud Pak for AIOps console and navigate to the Integrations page.

  6. Delete the existing integration, then reinstall it.

  7. Verify that the integration enters a Running state.

Method 2: Re-create the Persistent Volume Claim with a different storage class. Delete the existing PVC and re-create a new one with the same name that uses a storage class that supports the ReadWriteOnce (RWO) access mode.

Note: This method updates only a single integration instance. Repeat these steps for every integration that is unable to request a persistent volume because of the incorrect storage class type.

  1. Get the PVC name of the integration. You can list all PVCs by using oc get pvc. Look for the one that is still in Pending state with your integration name.

  2. Delete the existing PVC and re-create a new one with the same name. Review the following sample command as a reference. Replace <connector-pvc-name> with the PVC name from the previous step and set <new-storage-class> to a storage class that supports the ReadWriteOnce (RWO) access mode.

    # Set the PVC name
    CONNECTOR_PVC_NAME=<connector-pvc-name>
    
    # Set new storage class
    STORAGE_CLASS=<new-storage-class>
    
    CONNECTOR_APP=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.app}')
    CONNECTOR_ID=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.instance}')
    PVC_SIZE=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.spec.resources.requests.storage}')
    
    oc delete pvc $CONNECTOR_PVC_NAME && \
    cat << EOF | tee >(oc apply -f -) | cat
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
        name: $CONNECTOR_PVC_NAME
        labels:
            app: $CONNECTOR_APP
            app.kubernetes.io/instance: aiopsedge
            app.kubernetes.io/managed-by: aiopsedge-operator
            app.kubernetes.io/name: ibm-aiops-edge
            app.kubernetes.io/part-of: ibm-aiops-edge
            instance: $CONNECTOR_ID
    
    spec:
        accessModes:
        - ReadWriteOnce
        resources:
            requests:
                storage: $PVC_SIZE
        storageClassName: $STORAGE_CLASS
        volumeMode: Filesystem
    EOF
    

    For example,

    # Set the PVC name
    CONNECTOR_PVC_NAME=netcool-vct-netcool-conn-325fba02-440f-4135-bd00-8560bcf5eda1-0
    
    # Set new storage class
    STORAGE_CLASS=rook-cephfs
    
    CONNECTOR_APP=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.app}')
    CONNECTOR_ID=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.instance}')
    
    oc delete pvc $CONNECTOR_PVC_NAME && \
    cat << EOF | tee >(oc apply -f -) | cat
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
        name: $CONNECTOR_PVC_NAME
        labels:
            app: $CONNECTOR_APP
            app.kubernetes.io/instance: aiopsedge
            app.kubernetes.io/managed-by: aiopsedge-operator
            app.kubernetes.io/name: ibm-aiops-edge
            app.kubernetes.io/part-of: ibm-aiops-edge
            instance: $CONNECTOR_ID
    
    spec:
        accessModes:
        - ReadWriteOnce
        resources:
            requests:
                storage: 50Mi
        storageClassName: $STORAGE_CLASS
        volumeMode: Filesystem
    EOF
    
  3. Verify that the integration pod transitions to Running state.

  4. If necessary, repeat the steps above on any other integration that is producing the same problem.

Metrics integrations

If you are creating, or have, a metrics integration, you might encounter the following issue:

Metrics connection is unable to pull further historical data when historical mode is completed

When a metrics integration (for example, Dynatrace, AppDynamics, Zabbix, New Relic, Splunk, or AWS) has completed pulling its historical data, historical mode cannot be restarted, even if the integration configuration is edited or updated from the UI to specify a different start date, end date, or set of metrics.

Solution: If you need to pull historical data again for another set of metrics or different dates, you can create a new connection.

Instana integrations

If you are creating, or have, an Instana integration, you might encounter the following issues:

Integration is unable to update the metric list in the UI

The Instana integration UI page displays a list of metrics from Instana that are updated when the Instana integration pod runs. Some Instana hosts can have many technology plug-ins installed, which collect many built-in and custom metrics. These metrics can cause the Instana integration to exceed the allowed message limit when the integration sends an update to the UI. If this limit is exceeded, the Instana integration pod logs the following message:

0000004d Connector     I   UI configuration patch exceeds the safe message limit, skip sending configuration patch. The metrics list in the UI could not be updated with the actual list of technologies and metrics. Please configure more filter patterns to reduce the number of metrics to query.

The integration logs each technology plug-in and the number of metrics that are included or filtered.

Solution: Apply additional filter patterns in the Instana configmap for the integration to exclude when querying the metric list from Instana.

  1. Get the Instana pod logs to determine which technology plug-in contains many metrics

    oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep -e "Plugin \'.*\' \(.*\)"
    

    The following example log indicates that the aceMessageFlow technology plug-in contains a large list of metrics and is a candidate to filter.

    Tip: Look for any technology plug-ins that contain more than 100 metrics.

    0000004d InstanaInstan I Plugin 'ACE Message Flow' (aceMessageFlow) metrics: 2,666 included, 0 filtered.
    
  2. Determine the available metrics names by technology from the Instana API or from your Instana administrator. To use the Instana API, you can use a command similar to the following sample command, which queries aceMessageFlow metrics:

    INSTANA_ENDPOINT=<your Instana host>
    INSTANA_TOKEN=<your Instana API token>
    curl --request GET -s  --url ${INSTANA_ENDPOINT}/api/infrastructure-monitoring/catalog/metrics/aceMessageFlow \
    --header "authorization: apiToken ${INSTANA_TOKEN}" \
    | jq 'group_by(.pluginId)[]|  {(.[0].pluginId): [.[] | .metricId] | sort}'
    
  3. Create a pattern to filter metrics from the output of the previous command. For example, use flowNodes\\..* to filter all metrics with the flowNodes. prefix.

  4. Get the Instana integration configmap name to add the filter pattern from the previous step.

    Note: If multiple Instana integrations exist, select the configmap with the UID that matches the pod name.

    oc get configmap | grep instana-connector
    
  5. Extract the exclude-filter-patterns from the configmap. This command saves the content of the configmap into the local file:

    oc extract configmap/ibm-grpc-instana-connector-d59eea9f-501d-4bd1-b3a3-6f98b49b0930 --keys=exclude-filter-patterns --to=.
    
  6. Edit the local exclude-filter-patterns file to add the metric pattern in addition to the default ones that are defined in the filters array.

    Note: To escape a backslash (\) character, you must use a double backslash (\\). For example:

    {"filters":["fs\\./dev/.*","fs_mount\\./.*","metrics\\.counters\\..*\\..*\\..*","(.*\\.)?metrics\\.gauges\\..*\\..*\\..*","metrics\\.meters\\..*","(.*\\.)?metrics\\.timers\\..*","prometheus\\.metrics.*","flowNodes\\..*"]}
    
  7. Apply the change to the configmap:

    oc set data configmap/ibm-grpc-instana-connector-d59eea9f-501d-4bd1-b3a3-6f98b49b0930 --from-file=exclude-filter-patterns
    
  8. Verify that the new filter pattern is loaded by the pod, and that the number of metrics that are filtered for the technology plug-in is changed.

    oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep 'Loaded filter patterns from ConfigMap'
    
  9. Verify that the Instana integration can send an update to the UI successfully:

    oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep -e 'Status message .* write with configuration patch .* successful.'
    
  10. In the Edit Instana integration UI page, verify that the list of technologies and metrics is now updated.

IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists

When an integration with Instana exists, IBM Cloud Pak for AIOps might not be able to close some alerts. Instead, you might need to check the status for the associated event within the Instana UI and clear the alert manually.

Solution: To find the alert URL for clearing the alert, complete the following actions:

  1. From the IBM Cloud Pak for AIOps console, open the Incidents and alerts tool.

  2. Select the alert in the table to display the Alert details side panel for that alert. The Information section is open by default in the side panel.

  3. Copy the URL from the links section and paste it into your browser.

  4. If you are an Instana authorized user, you see the associated event in the Instana UI.

If the event is not active and has an end time in the Instana UI, clear the alert:

  1. Click Alert to see the alert details.
  2. Click Clear Alert to clear the alert.

After a repair of Instana, alerts in the Instana application do not associate with the corresponding IBM Cloud Pak for AIOps application

When you restart Instana after a repair, alerts in the Instana application do not map to the corresponding application in IBM Cloud Pak for AIOps. The reason is that Instana creates new identifiers for the resources in its application on restart, without updating the topology in IBM Cloud Pak for AIOps. As a result, the resource IDs do not match, which causes the alerts to become disassociated.

Solution: To map the alerts, you must complete the following steps:

  1. Clean up the application in Instana.
  2. Re-create the application after a few minutes.

To clean up and recreate applications in Instana, see Getting started with Instana.

The Instana integration pod keeps restarting due to an out-of-memory error when retrieving data from Instana

When the Instana integration receives a large amount of data from one or more Instana APIs while the integration is observing many resources, and there is insufficient memory to process that data, an out-of-memory error can occur.

Solution: If this error occurs, you can allocate more memory to the Instana integration pod by increasing its memory limit. This change can be made only after installation because the integration must exist before the custom setting can be applied. To apply custom resource settings, complete the following steps to patch the Instana integration:

Prerequisite: Ensure you have the yq and jq packages installed on your system before running the commands below. To get installation instructions for these packages, refer to the documentation or website for yq and jq.

  1. From a command line, change to your IBM Cloud Pak for AIOps project (namespace):

    namespace=<aiops-namespace>
    oc project $namespace
    
  2. Get the display name of the Instana integration that you want to increase resources for:

    oc get connectorconfiguration
    
  3. Set the required variables:

    connectionname=<instana-connection-name>
    connconfig=$(oc get connectorconfiguration --no-headers | grep $connectionname | awk '{print $1}')
    connconfiguid=$(oc get connectorconfiguration $connconfig -o jsonpath='{.metadata.uid}')
    gitappname=$(oc get gitapp -l connectors.aiops.ibm.com/connection-id==$connconfiguid --no-headers | awk '{print $1}')
    

    Note: Update the connectionname variable with your Instana display name from the command above.

  4. Run the following command to create a variable with a patch specification for the Instana integration. The patch applies to only one Instana integration at a time, and the command produces no output. Copy the command directly from the documentation and do not paste it into a text editor first, because doing so can affect the format.

    jsonpatch=$(yq -o json <<EOF
        patch: |-
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: ibm-grpc-instana-connector
          spec:
            template:
              spec:
                containers:
                - name: ibm-grpc-instana-connector
                  resources:
                    requests:
                      memory: 1536Mi
                      ephemeral-storage: 2Gi
                    limits:
                      memory: 8Gi
                      ephemeral-storage: 4Gi
        target:
          group: apps
          kind: Deployment
          name: ibm-grpc-instana-connector
          version: v1
    EOF
    )
    
  5. Run the following command to apply the custom resource setting that you created in the previous step:

    oc get gitapp $gitappname  -n $namespace -o json | jq ".spec.components[0].kustomization.patches += [$jsonpatch]" | oc apply -f -
    

    Note: If you see an error message when running this command that states there was a configuration conflict, try running the command a second time.

  6. It takes a moment for the changes to show in the integration pod.

Note: If the memory limit is not updated for the Instana integration pod after a few minutes, run oc get gitapp $gitappname and check for errors. If there is an error stating that the YAML is invalid, run the following command to revert the patch and then go through all the steps again. This error typically occurs when the format of the commands was altered by a text editor; copy the commands directly from the documentation and run them directly in the terminal.

Revert the custom resource settings to default resource settings:

To revert from custom to default resource settings, run the following command:

oc patch gitapp $gitappname -n $namespace --type='json' -p="[{'op': 'remove', 'path': '/spec/components/0/kustomization/patches'}]"

Kafka integrations

If you are creating, or have, a Kafka integration, you might encounter the following issues:

Kafka events not displayed in IBM Cloud Pak for AIOps

You cannot see Kafka events in IBM Cloud Pak for AIOps after successfully creating a Kafka integration. This failure occurs when the Kafka key or message violates the format requirements. The MismatchedInputException error in the task manager pod log indicates this formatting error.

Solution: To ensure that the Kafka events display in IBM Cloud Pak for AIOps, check whether the following conditions are met:

  1. The key in the Kafka payload is null.

    Note: The key is determined by the Kafka producer. Most producer tools use a null object as the key unless the producer is programmed to assign a value.

  2. The message in the Kafka payload is a JSON message that satisfies the following requirements:

    • It is in the format of the selected data source.
    • It contains sufficient data for the mapping.
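
To confirm that the failure is the formatting problem, you can search the Flink task manager logs for the MismatchedInputException error that is mentioned above; this sketch assumes the task manager pod names contain taskmanager:

for pod in $(oc get pods | grep taskmanager | awk '{print $1}')
do
oc logs $pod | grep MismatchedInputException
done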

After copying log data into a Kafka cluster, verify that the data is imported into Elasticsearch

To publish log data, you can copy it into a Kafka cluster. For more information, see Copying log data into the Kafka cluster. Then, you can check whether the log data is imported into Elasticsearch.

Solution: Verify that the log data is imported into Elasticsearch by getting shell access to the ai-platform-api-server pod. Run the following commands from the namespace where IBM Cloud Pak for AIOps is installed:

oc exec -it $(oc get pods | grep ai-platform-api-server | awk '{print $1}') -- /bin/bash

Then, from inside the pod, list the Elasticsearch indices. For example:

curl -X GET -u $ES_USERNAME:$ES_PASSWORD $ES_URL/_cat/indices -k | sort

Example output:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                              Dload  Upload   Total   Spent    Left  Speed
100  4710  100  4710    0     0  28373      0 --:--:-- --:--:-- --:--:-- 28373
green  open .security                       L0Gou7AXSC6PyuZ8-skGRQ 1 0    1 0   4.6kb   4.6kb
green  open aiops-searchservice_v10         Ymls6nL3S-KP4vC4ROdRRw 5 0    0 0     1kb     1kb
yellow open 1000-1000-20200803-logtrain     QImZoOd7QiuM3mfxUIdFAA 8 2   24 0  34.5kb  34.5kb
yellow open 1000-1000-20200804-logtrain     54m5F8K6Q3iIbSZ4KvMlYQ 8 2  286 0  80.8kb  80.8kb
yellow open 1000-1000-20200805-logtrain     I_GhHFz9Th62KV_7wGwyMA 8 2  137 0  61.1kb  61.1kb
yellow open 1000-1000-20200806-logtrain     7HZcyCaLRa2VoOwJrH7Ftg 8 2  292 0  76.2kb  76.2kb
yellow open 1000-1000-20200807-logtrain     LzUFhGcGSs-Rp_uJtw17sw 8 2  686 0 126.1kb 126.1kb
yellow open 1000-1000-20200808-logtrain     HR7n9NIATnm9MOjgnq2K_Q 8 2  268 0    77kb    77kb
yellow open 1000-1000-20200809-logtrain     NSU1X31OTq2coxdo-_I1rQ 8 2   48 0  48.7kb  48.7kb
yellow open 1000-1000-20200810-logtrain     2M54aDMXRyOp_06XnInTEg 8 2  885 0 140.3kb 140.3kb
yellow open 1000-1000-20200811-logtrain     pQjyF1pQTX6GT70WY7T7ZQ 8 2  797 0 132.2kb 132.2kb
yellow open 1000-1000-20200812-logtrain     U0YasxsqTpuricpyyfQ62Q 8 2 1294 0 201.6kb 201.6kb
yellow open 1000-1000-20200813-logtrain     pZA79rSSTYeTrsRuT8K44g 8 2 1155 0 178.3kb 178.3kb
yellow open 1000-1000-20200814-logtrain     eu5qqE31TcGtz5yjVKxlnQ 8 2  720 0   134kb   134kb
yellow open 1000-1000-20200815-logtrain     rKHrLUooSP-lx6r_ettalQ 8 2   42 0  46.2kb  46.2kb
yellow open 1000-1000-20200816-logtrain     0bRKhqgdQviTBc0FnJ4pvg 8 2   58 0  49.9kb  49.9kb
yellow open 1000-1000-20200817-logtrain     PMrlHzjZQymb5hXiZanhOw 8 2  287 0  99.3kb  99.3kb
yellow open 1000-1000-20200818-logtrain     ItLXqZIxQWy7pirLCPcfKw 8 2  483 0 103.7kb 103.7kb
yellow open 1000-1000-20200819-logtrain     W2ti9RlVTRiW3ZpynEZ8wQ 8 2  879 0 141.8kb 141.8kb
yellow open 1000-1000-20200820-logtrain     Zsz7Vd0rQpSuVwbUi4AfQw 8 2  488 0  98.7kb  98.7kb
yellow open 1000-1000-20200821-logtrain     VVcsZ1HEQJi08WrjX6CJ_Q 8 2  268 0  43.4kb  43.4kb
yellow open algorithmregistry               wLjGMGL4SI-weK666yr8pA 1 1    8 0  26.2kb  26.2kb
yellow open cp4waiops-cartridge-poli.....   6dQclBWSQA6FR3yRurpxVQ 1 1   17 0  66.8kb  66.8kb
yellow open dataset                         TA3TB-MNTkScyxDNZxuEcw 1 1    0 0    208b    208b
yellow open postchecktrainingdetails        G8lHdh9WQ8OxIHhOAwBUnw 1 1    0 0    208b    208b
yellow open prechecktrainingdetails         WA00PRwuTgWLsQNvcBcoug 1 1    0 0    208b    208b
yellow open trainingdefinition              dC2UuOLiR_qGiax-8tEfEg 1 1    0 0    208b    208b
yellow open trainingrun                     fRqCLpQ1TCi_kt4zzNvwDA 1 1    0 0    208b    208b
yellow open trainingsrunning                gzyQzqP7SJuy9dbUZI1dVA 1 1    0 0    208b    208b
yellow open trainingstatus                  AiLkAg0yRQKzzIpbsk0CCw 1 1    0 0    208b    208b

Netcool integrations

If you are creating, or have, a Netcool integration, you might encounter the following issues:

Runbook automation pod "rba-rbs" crashes during an alert burst, for example, while connecting to Netcool

The "rba-rbs" pod crashes and gets restarted automatically, and most of the fully automated runbooks that are associated with the new alerts are not run. This restart behavior might be observed multiple times until all the events from the event burst have been ingested, and the system reaches its regular state of operation again.

Solution: During an event burst like the one that occurs when connecting to a Netcool server, all enabled policies are evaluated, and for the matching events, the associated fully automated runbooks are started. If many matching events exist, these events might result in a high load of concurrent runbook invocations that the runbook service rba-rbs is not able to handle. The system recovers automatically. However, some runbooks will not be run during the burst.

ObjectServer alerts missing in IBM Cloud Pak for AIOps due to a No object found with id error in the data layer

If a new alert is rejected by the data layer, subsequent update alerts might also be rejected by the data layer, with a No object found with id error appearing in the cp4waiops-cartridge.irdatalayer.errors Kafka topic.

Solution: The AIOpsAlertId column must be cleared so that, when updates for the alert reach the connector, the data is perceived as a fresh alert by IBM Cloud Pak for AIOps.

To do this, complete the following steps (see the sketch after these steps for an example of clearing the column):

  1. Disable the connector's data collection (this will trigger the gateway to shut down).
  2. In the connector pod, remove the gateway cache file and the store-and-forward (SAF) file under the following directory:
    /bindings/netcool-connector/omnibus/var/G_CONNECTOR
    
  3. In the ObjectServer:
    1. Rectify the data columns that had resulted in wrongly mapped values.
    2. Clear the AIOpsAlertId column of the alert.
  4. Enable the connector's data collection.

When processing the next update event to that alert, the connector will see an empty AIOpsAlertId and will generate "type": "update" in the northbound CloudEvent payload.
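
A minimal sketch of clearing the column with nco_sql; the server name, credentials, and the where clause that selects the affected alert are placeholders that you must adapt:

$OMNIHOME/bin/nco_sql -server <objectserver-name> -user <user> -password <password> <<EOF
update alerts.status set AIOpsAlertId = '' where Identifier = '<alert-identifier>';
go
EOF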

Setting values for mandatory fields for events sent to the embedded ObjectServer

If you are manually inserting events into the embedded ObjectServer using nco_sql, you must set values for the following fields so that the events can be processed correctly:

  • Node
  • AlertKey
  • AlertGroup
  • Manager

Note: This is only a concern if you are creating an event manually using an insert statement. All IBM Netcool Operations Insight probes will set all required columns.
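
For reference, a manual insert that sets the mandatory fields might look like the following sketch; the server name, credentials, and all field values are illustrative placeholders:

$OMNIHOME/bin/nco_sql -server <objectserver-name> -user <user> -password <password> <<EOF
insert into alerts.status (Identifier, Node, AlertKey, AlertGroup, Manager, Summary, Severity)
values ('example.identifier.1', 'example-node', 'example-key', 'example-group', 'manual nco_sql insert', 'Test event', 3);
go
EOF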

Events from IBM Tivoli Netcool/OMNIbus are not updated in Alert Viewer

You might notice an issue where events from IBM Tivoli Netcool/OMNIbus reach Cloud Pak for AIOps, but the updates are not reflected in the Alert Viewer. The reason for this issue is that the Netcool connector might have restarted. The connector restart causes existing alerts in Cloud Pak for AIOps to dissociate from the origin alerts in the ObjectServer.

Solution: Upgrade Cloud Pak for AIOps to version 4.5.1 or newer. After the upgrade, the new alerts are not impacted by the Netcool connector restart.

Test connection fails in the Netcool connector UI configuration page

You might notice that the test connection fails in the Netcool connector UI configuration page.

Solution: Use the following steps to diagnose and resolve the connection failure.

  1. Set the IBM Netcool Operations Insight ObjectServer log message level to debug by updating the MessageLevel property in the $OMNIHOME/etc/{ObjectServer_name}.props file on the Netcool host, where ObjectServer_name is the name of the IBM Netcool Operations Insight ObjectServer (see the example after these steps).

  2. Click Test connection in the Netcool UI and check whether the login attempt is received by the IBM Netcool Operations Insight ObjectServer.

    • If no login attempt is seen in the IBM Netcool Operations Insight ObjectServer log, check whether the IBM Netcool Operations Insight ObjectServer host IP that is resolved on Cloud Pak for AIOps matches the IBM Netcool Operations Insight ObjectServer host IP.

    • If a login attempt is seen but is rejected by the ObjectServer authentication, check the Transport Layer Security (TLS) certificate, especially the Common Name (CN) value, which must match the fully qualified domain name (FQDN) of the server.

  3. If the Connection manager is down or not responsive to the request to connect to the IBM Netcool Operations Insight ObjectServer, then restart the Connection manager pod.

  4. For an on-premises IBM Netcool Operations Insight ObjectServer integration, you might notice these issues:

    • If the CN value in the TLS certificate does not match the host server FQDN, create a TLS certificate that contains the host FQDN.

    • If a firewall restriction exists on the IBM Netcool Operations Insight ObjectServer port, then update the firewall rules to unblock the port.
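
For step 1, after you edit the $OMNIHOME/etc/{ObjectServer_name}.props file, the property might look like the following sketch:

# Increase the ObjectServer log message level
MessageLevel: 'debug'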

Fresh alerts are missing in Alert Viewer

You might notice an issue where the most recent alerts are not showing in Alert Viewer. You need to inspect the data in different stages of the pipeline to diagnose the problem.

When an alert is missing, you can scan the north-bound Kafka topics for the alert message. The following example shows the direction of the north-bound data flow:

Netcool connector -> cp4waiops-cartridge.lifecycle.input.alerts [topic] -> cp4waiops-cartridge.irdatalayer.requests.alerts [topic] -> Alert Viewer

Where:

  • [topic] refers to a Kafka topic.
  • cp4waiops-cartridge.lifecycle.input.alerts is a Kafka topic where all the new alerts are collected.
  • cp4waiops-cartridge.irdatalayer.requests.alerts is a Kafka topic that is one of the outputs of the lifecycle process.

Solution: Use the following steps to diagnose and resolve the missing alerts issue.

Note: Download and extract the Ext_IAFKafkaTools.tar.gz file from Download Ext_IAFKafkaTools.tar.gz. After you extract the file, follow the steps in the README.txt file.

  1. Check whether an alert message appears in the cp4waiops-cartridge.irdatalayer.errors topic. Look for the alert message and run the following command.

    consumer.sh | grep <pattern>
    

    Inspect the cause of the error in the message.

  2. Check whether the alert message appears in the cp4waiops-cartridge.lifecycle.input.alerts topic.

    The following example shows an alert message with type: update.

    {"tenantid":"cfd95b7e-3bc7-4006-a4a8-a73a79c71255","requestid":"b6079c67-8b2c-4a84-ae1d-9e50df05bfa1","requestTime":"2024-04-16T03:23:37.443484418Z","type":"update","entityType":"alert","entity":{"insights":[{"details":{"lastProcessedEventOccurrenceTime":"2024-04-16T03:23:33Z"},"id":"event-occurrence","type":"[aiops.ibm.com/insight-type/deduplication-details](https://aiops.ibm.com/insight-type/deduplication-details)"}],"deduplicationKey":"{hostname=null, name=null}-UserLogoutSessionEvent-","eventCount":84341,"state":"open","id":"9e96637d-15cf-4618-bdf7-ef97e6016128","type":{"classification":"UserLogoutSessionEvent","eventType":"problem"},"lastOccurrenceTime":"2024-04-16T03:23:33.000Z"}}
    
  3. If a new alert message with type: create appears in cp4waiops-cartridge.lifecycle.input.alerts, check whether the message is delivered to cp4waiops-cartridge.irdatalayer.requests.alerts.

    If the message is not found in the requests.alerts topic, the message is either blocked or rejected by the lifecycle process or not yet consumed from the input.alerts topic.

  4. If the message passes through cp4waiops-cartridge.irdatalayer.requests.alerts, and is not reflected in the Alert Viewer, check the data layer process.

  5. Restart the pods for the processes that are not working, based on the error in the message. Depending on the error, the pods that might need a restart include the gRPC server, the lifecycle processor, and the Kafka broker.

  6. If Kafka topic storage limits are an issue, update the Kafka storage size.

ServiceNow integrations

Incidents aren't created

Incidents might not be created for various reasons.

Solution: Complete the following checks to help ensure that incidents can be created.

  1. Validate the ServiceNow URL and credentials. Go to the ServiceNow integration page, select the integration, and select Test connection. If the test fails, check that the URL and credentials are correct.
  2. Make sure that the versions of Cloud Pak for AIOps and the IBM ServiceNow marketplace applications are compatible. For more information about which version of Cloud Pak for AIOps is in use, see Prerequisites in Creating a ServiceNow integration.
  3. Make sure that a policy is configured to use the existing ServiceNow connection. For more information about creating policies, see Promote alerts to an incident.
  4. Make sure that the ServiceNow integration user has the correct permissions. For more information, see Managing roles.
  5. Incident tables can have business rules that prevent incident creation. Check business rules and change them so that they don't affect incident tables.

Splunk integrations

If you are creating, or have, a Splunk integration, you might encounter the following issues while creating or editing the integration:

All fields in the Splunk form reset if edited while loading metrics

If you edit the fields in the Splunk form while the metrics are loading in the integration console, then you cannot save the changes made to the form.

Solution: To avoid this issue, do not modify the fields in the Splunk form while metrics are loading. If you do edit fields during loading, redo the changes after the metrics are loaded so that the edits can be saved.

Splunk integrations that are created before upgrade fail to update after upgrade

After upgrading from version 3.5.0, you might not be able to enable or disable log or metric data flows on Splunk integrations that were created before the upgrade. This issue can occur intermittently.

Solution: When this issue occurs, you can refresh the console and try again after some time. Alternatively, you can create a new integration and later delete the Splunk integration that was defined before upgrade.

IBM Tivoli Netcool/Impact integrations

If you are creating, or have, an IBM Tivoli Netcool/Impact integration, you might encounter the following issues:

IBM Tivoli Netcool/Impact integration status shows "Retrying"

After Creating IBM Tivoli Netcool/Impact integrations, you might see a Connection Orchestration status of "Done", but the IBM Tivoli Netcool/Impact integration shows an error with a status of "Retrying".

The reason for this error might be a locked IBM Tivoli Netcool/Impact file that is preventing the creation of the IBM Cloud Pak for AIOps data model in IBM Tivoli Netcool/Impact. If this is the issue, the impactserver-errors logs in IBM Tivoli Netcool/Impact will display an error similar to the following:

   30 Jun 2023 03:27:21,137 ERROR [DataModelUIResource] File etc/NCIP_datasourcelist is locked by user: root.

Solution: Locate and delete a file called <servername>_versioncontrol.locks in IBM Tivoli Netcool/Impact and restart the IBM Tivoli Netcool/Impact server and GUI.

  1. Log in to the IBM Tivoli Netcool/Impact host and navigate to the $IMPACT_HOME/etc directory. Locate a file called <servername>_versioncontrol.locks. For example, NCIP_versioncontrol.locks.

  2. Delete the file.

  3. Restart the Impact server and GUI:

    $IMPACT_HOME/bin/stopImpactServer.sh
    $IMPACT_HOME/bin/stopGUIServer.sh
    
    $IMPACT_HOME/bin/startImpactServer.sh
    $IMPACT_HOME/bin/startGUIServer.sh
    

The status of the IBM Tivoli Netcool/Impact integration should switch to Running in Cloud Pak for AIOps.

IBM Tivoli Netcool/Impact stops sending data to widgets in a JazzSM dashboard

After Creating IBM Tivoli Netcool/Impact integrations, the IBM Tivoli Netcool/Impact server stops sending data to widgets in a JazzSM dashboard. The widget will report an ATKRST132E or 404 error code.

Solution: Re-create the parameter files for the AIOPS_HandleAction and AIOPS_ExecJS Impact policies.

For more information, refer to the troubleshooting article Netcool/Impact datasets for DASH stop working after deploying the AIOps to Impact integration.

AIOps data source in IBM Tivoli Netcool/Impact is missing required headers

When an integration is made to IBM Tivoli Netcool/Impact, a RESTful data source is created in Netcool/Impact during the initialization stage.

However, the data source might be missing one or more fields, including required headers. Attempting to edit or delete such a data source fails with the following error message:

Error : Version Control System failed. Check system error logs for further details

Solution:

To remedy this, you can add the required connection details and header values using the Netcool/Impact GUI.

  1. Log in to the Netcool/Impact GUI. Open the Data Model tab and make a note of the data source name. For example, AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.

  2. Open an SSH terminal to the primary Netcool/Impact server.

  3. Change to the etc directory of the application:

    cd /opt/IBM/tivoli/impact/etc
    
  4. Locate the data source file with the .ds extension. For example, NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds

    Note: If the file does not exist, create an empty file with the name of the missing data source.

  5. Use the svn add command to manually add the file to version control:

    /opt/IBM/tivoli/impact/platform/linux/svn/bin/svn add NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds
    
  6. Use the svn commit command to check in the file:

    /opt/IBM/tivoli/impact/platform/linux/svn/bin/svn commit -m "manual commit" NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds
    
  7. Log in to the Impact GUI Data Model tab. Open the AIOPs data source.

  8. Enter the following values in the connection details fields:

    Field Name       Value
    Host Name        <HOSTNAME>
    Resource Path    /aiops
    Port             443
    Use HTTPS        Enabled

    Replace <HOSTNAME> with the Cloud Pak for AIOps hostname. For example, cpd-aiops.apps.clustername.cp.ibm.com.

  9. In Request Headers, add the following headers:

    Header           Value
    x-tenant-id      cfd95b7e-3bc7-4006-a4a8-a73a79c71255
    Content-Type     application/json;charset=utf-8
    Authorization    ZenApiKey <KEY>

    In the Authorization field, replace <KEY> with the API key. Follow the procedure from Configuring a connection in IBM Tivoli Netcool/Impact for IBM Cloud Pak for AIOps to retrieve the key.

    Note: In Netcool/Impact 7.1.0.32, you can hide the values by declaring the request headers as protected headers instead.

  10. In the Proxy Settings tab, set the Proxy Port to 8080 and select No Authentication as the Authentication Method.

    Note: The Password field is ignored.

  11. Save the data source. Click Test Connection to confirm the connection works.

Ansible Automation Controller integration not running due to prohibited egress

After configuring an Ansible Automation Controller integration, the integration reports Not running, and no Ansible templates are shown in the Automation Actions table.

Use the following steps to check whether the failure is due to prohibited egress:

  1. Run the following commands to view the RBA Automation Service pod logs, and check whether responses are being received from Ansible:

    export AIOPS_NAMESPACE=<AIOps installation namespace>
    oc logs -l app.kubernetes.io/component=rba-as -n ${AIOPS_NAMESPACE}
    

    Example output if responses are not being received from Ansible:

    Error: awxp/requestAwx.requestAwx: Request to Ansible failed... Request was not responded after 5 seconds
    

  2. Run the following commands to attempt to connect to the Ansible instance from the RBA Automation Service pod:

    export AIOPS_NAMESPACE=<AIOps installation namespace>
    oc exec $(oc get pod -l app.kubernetes.io/component=rba-as -n ${AIOPS_NAMESPACE} -o jsonpath='{.items[0].metadata.name}') -c rba-as -n ${AIOPS_NAMESPACE} -it -- /bin/bash
    
    # An interactive shell inside the container will open
    curl -k -vvv <Your Ansible host>
    

Solution: If responses are not received from Ansible and you cannot connect to Ansible, create NetworkPolicies that allow egress from the RBA service pods. Run the following commands:

export AIOPS_NAMESPACE=<AIOps installation namespace>

cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-rba-rbs-egress
  namespace: ${AIOPS_NAMESPACE}
spec:
  egress:
  - {}
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rba-rbs
  policyTypes:
  - Egress
EOF

cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-rba-as-egress
  namespace: ${AIOPS_NAMESPACE}
spec:
  egress:
  - {}
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rba-as
  policyTypes:
  - Egress
EOF
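
After the policies are applied, you can optionally confirm that they were created and then repeat the connectivity test from step 2; the grep pattern matches the policy names that are created above:

oc get networkpolicy -n ${AIOPS_NAMESPACE} | grep aiops-rba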