Troubleshooting integrations
Review frequently encountered issues with creating and managing outgoing and incoming integrations.
- Multiple integrations
- Integrations page
- Metrics integrations
- Instana
- Kafka
- Netcool
- ServiceNow
- Splunk
- IBM Tivoli Netcool/Impact
Multiple integrations
Integrations do not initialize
If an integration gets stuck in a state that is not Running, there might be an issue with the creation of its secrets.
To check this, log in to Red Hat OpenShift and check the pods. For example, a Netcool integration might show the following issue:
NAME READY STATUS RESTARTS AGE
netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km 0/1 ContainerCreating 0 27m
If you describe the pod (oc describe pod netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km) and you see a mount error in the events log such as the following:
34s (x20 over 25m) Warning FailedMount Pod/netcool-conn-cfe182e0-d7eb-42f6-9677-fb86b180bd0f-8bd546cbvv5km MountVolume.SetUp failed for volume "grpc-bridge-service-binding" : secret "connector-cfe182e0-d7eb-42f6-9677-fb86b180bd0f" not found
you can conclude that the secrets failed to create. To fix this, run the following command:
oc get pods | grep connector-manager
Then delete the connector-manager pod. When the pod restarts, the secrets are re-created for all integrations.
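For example, a minimal sketch of the restart, assuming the connector-manager pod name matches the grep pattern in your cluster:
# Find the connector-manager pod
oc get pods | grep connector-manager
# Delete the pod; its controller re-creates it, and the secrets are regenerated on startup
oc delete pod $(oc get pods | grep connector-manager | awk '{print $1}')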
Integrations page
On viewing the Integrations page, you might encounter the following issues:
- Deleting an integration does not remove the PVC for the integration
- Error displays when you activate Kafka, Humio, Splunk, Mezmo, ELK, PagerDuty, ServiceNow integrations
- Integrations on the Integrations page show a red 'Unknown' status
- Integrations UI pod in CrashLoopBackOff for the Integrations page
- Integrations page becomes unresponsive after some time or shows an error
- Incorrect integration status on the Integrations page
- The integration keeps showing Error or Retrying and integration pod is stuck in Pending
- Ansible Automation Controller integration not running due to prohibited egress
Deleting an integration does not remove the PVC for the integration
For StatefulSet type integrations, when the integration is deleted, the Persistent Volume Claim (PVC) resource for the integration is not automatically deleted. You must manually delete the resource with the Red Hat OpenShift oc CLI or with the Red Hat OpenShift UI.
This behavior occurs because the StatefulSet PVC is created by specifying the volumeClaimTemplate, but the created PVC is not automatically deleted by Red Hat OpenShift when the StatefulSet resource is deleted.
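The following minimal sketch shows the mechanism; the resource names, image, and storage size are illustrative and are not taken from an actual integration. Kubernetes creates one PVC per replica from the template, but deleting the StatefulSet leaves those PVCs in place:
# Illustrative StatefulSet; deleting it does NOT delete the PVCs created from volumeClaimTemplates
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-conn
spec:
  serviceName: example-conn
  replicas: 1
  selector:
    matchLabels:
      app: example-conn
  template:
    metadata:
      labels:
        app: example-conn
    spec:
      containers:
        - name: connector
          image: example/connector:latest
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Mi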
To delete a PVC resource with the Red Hat OpenShift oc CLI, complete the following steps:
1. Get the list of PVC resources for your current integrations:
oc get pvc | grep 'conn'
2. Select the PVC resource that you want to delete and record the name.
3. Delete the PVC resource:
oc delete pvc <pvc-name>
Where <pvc-name> is the name of the PVC resource that you want to delete.
Error displays when you activate Kafka, Humio, Splunk, Mezmo, ELK, PagerDuty, ServiceNow integrations
When you activate a Kafka, Falcon LogScale, Splunk, Mezmo, ELK, PagerDuty, ServiceNow integration in the console, an error can occur when the required Flink task slots to run the integration are not available. When this error occurs, the following message is displayed:
Failure: Not enough task slots to activate the connection.
This error occurs as there is a finite number of Flink task slots to use for creating integrations to remote resources. The error message displays the number of slots that are needed to support the current active integration set, along with the total number of slots available.
Solution: To activate the integration, you must either reduce the number of slots that are needed by this (or another) integration, or allocate more slots. To decrease the number of slots that an integration needs, consider reducing the "base parallelism" or "logs per second" value for the integration. This reduced value decreases the rate at which data flows through the integration, making more slots available to other integrations. To allocate more slots, see Increasing data streaming capacity.
Integrations on the Integrations page show a red 'Unknown' status
All integrations show a red 'Unknown' integration status and data is not flowing after you create a new integration, such as a Mezmo integration. If you enable 'Data collection' for the new integration, you can see that the data is not flowing, and the integration shows the same red 'Unknown' status in the Data collection status field instead of the expected 'Running' status. The common cause is unstable Kafka in the background.
Solution: If this issue occurs, restart the bridge and operator pod.
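For example, a hedged sketch of the restart, assuming the pod names contain 'connector-bridge' and 'operator' (verify the actual pod names in your cluster first):
# Identify the bridge and operator pods
oc get pods | grep -E 'connector-bridge|operator'
# Delete them so that their controllers re-create them
oc delete pod <bridge-pod-name> <operator-pod-name>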
Integrations UI pod in CrashLoopBackOff for the Integrations page
You can encounter an intermittent issue that causes the Integrations page to fail and become unavailable for viewing.
Solution: If the page does not automatically recover, you can increase the timeout setting for the aimanager-aio-controller deployment to help prevent the issue. To increase this setting, edit the aimanager-aio-controller deployment (liveness and healthcheck probes) and set a longer timeout, such as 15 seconds, up from the default 5 seconds. This change can stop the connections-ui pod from going into CrashLoopBackOff.
1. Check the aimanager-aio-controller pod:
oc get pods | grep aio-controller
aimanager-aio-controller-6d9f68d5f5-8dkk9 1/1 Running 0 27h
2. Edit the aimanager-aio-controller deployment:
oc edit deployment aimanager-aio-controller
3. Set the timeoutSeconds value to 15 seconds for both the livenessProbe and the readinessProbe.
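After the edit, the probe sections of the deployment look similar to the following sketch; only the timeoutSeconds values change, and the other probe fields are unchanged and elided here:
livenessProbe:
  ...
  timeoutSeconds: 15
readinessProbe:
  ...
  timeoutSeconds: 15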
Integrations page becomes unresponsive after some time or shows an error
If you are using IBM Cloud Pak for AIOps, you might encounter an unresponsive Integrations page around 24 hours after you install your cluster. The integrations page might take a long time to load and then display no integrations despite configuring them earlier. Another symptom of this problem is that you might fail to create an integration and see the following error:
Failed submitting creating request
Solution: To ensure the Integrations page works normally, delete the aimanager-aio-controller pod:
for n in $(oc get pods |grep aimanager-aio-controller |awk '{print $1}')
do
oc delete pod $n
done
Within one to two minutes of running the script, the pod will automatically restart and the Integrations page will work as usual.
Incorrect integration status on the Integrations page
You might encounter an incorrect integration status for all gRPC integrations (Instana, Netcool, AppDynamics, Dynatrace, AWS CloudWatch, Splunk, Zabbix, and New Relic) on the Integrations page. The integration status might appear as Unknown when it should be Running. The incorrect status might be the result of the connector-bridge pod being in a CrashLoopBackOff state and restarting multiple times due to an unstable Kafka integration.
Solution: If this issue occurs, check the Kafka and ZooKeeper pods for any restarts. The Kafka integration becomes unstable and restarts when the PVC that Kafka uses is full. Review the current capacity of the PVC and increase the capacity as needed to resolve the issue. For more information about adjusting the size of the PVC, see Increasing Kafka PVC.
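For example, a quick check, assuming the pod and PVC names contain 'kafka' and 'zookeeper':
# Look for restarts on the Kafka and ZooKeeper pods (RESTARTS column)
oc get pods | grep -E 'kafka|zookeeper'
# Review the current capacity of the Kafka PVC
oc get pvc | grep kafka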
The integration keeps showing Error or Retrying and integration pod is stuck in Pending
On the Integrations UI page, an integration keeps showing Error or Retrying and the integration pod is stuck in Pending. This issue occurs because the Persistent Volume Claim (PVC) is requesting a volume from a storage provider that does not support the ReadWriteOnce (RWO) access mode.
Solution: You can use either one of the following methods to unblock the stuck integration pod:
- Change the default storage class
Or
- Re-create the Persistent Volume Claim with a different storage class
Method 1: Change the default storage class. Review the default storage class in the cluster and change it to a storage class that supports the ReadWriteOnce (RWO) access mode, such as file storage types.
Note: This method should resolve integration PVC issues but since it changes the global default storage class type in the cluster, it might also affect other workloads that rely on the default storage class settings when requesting a volume.
1. Run the following command to determine the current default storage class:
oc get storageclass
2. Note the name of the default storage class (marked (default)).
3. Change the default storage class to one that supports the ReadWriteOnce (RWO) access mode, such as file storage. The following commands set the storageclass.kubernetes.io/is-default-class annotation to true on the new storage class and set the annotation to false on the old storage class. (Replace <new-default-storageclass-name> and <old-default-storageclass-name> with the actual storage class names in your cluster.)
oc patch storageclass <old-default-storageclass-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
oc patch storageclass <new-default-storageclass-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
4. Verify that the default storage class is updated:
oc get storageclass
5. Log in to the Cloud Pak for AIOps console and navigate to the Integrations page.
6. Delete the existing integration, then reinstall it.
7. Verify that the integration enters a Running state.
Method 2: Re-create the Persistent Volume Claim with a different storage class. Delete the existing PVC and re-create a new one with the same name that uses a storage class that supports the ReadWriteOnce (RWO) access mode.
Note: This method updates only a single integration instance. Repeat these steps for every integration that is unable to request a persistent volume due to the incorrect storage class type.
1. Get the PVC name of the integration. You can list all PVCs by using oc get pvc. Look for the one that is still in Pending state with your integration name.
2. Delete the existing PVC and re-create a new one with the same name. Review the following sample command as a reference. Update <connector-pvc-name> with the PVC name from the previous step and set <storage-class-name> to a storage class that supports the required ReadWriteOnce (RWO) access mode.
# Set the PVC name
CONNECTOR_PVC_NAME=<connector-pvc-name>
# Set new storage class
STORAGE_CLASS=<new-storage-class>
CONNECTOR_APP=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.app}')
CONNECTOR_ID=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.instance}')
PVC_SIZE=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.spec.resources.requests.storage}')
oc delete pvc $CONNECTOR_PVC_NAME && \
cat << EOF | tee >(oc apply -f -) | cat
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $CONNECTOR_PVC_NAME
  labels:
    app: $CONNECTOR_APP
    app.kubernetes.io/instance: aiopsedge
    app.kubernetes.io/managed-by: aiopsedge-operator
    app.kubernetes.io/name: ibm-aiops-edge
    app.kubernetes.io/part-of: ibm-aiops-edge
    instance: $CONNECTOR_ID
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $PVC_SIZE
  storageClassName: $STORAGE_CLASS
  volumeMode: Filesystem
EOF
For example:
# Set the PVC name
CONNECTOR_PVC_NAME=netcool-vct-netcool-conn-325fba02-440f-4135-bd00-8560bcf5eda1-0
# Set new storage class
STORAGE_CLASS=rook-cephfs
CONNECTOR_APP=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.app}')
CONNECTOR_ID=$(oc get pvc $CONNECTOR_PVC_NAME -o jsonpath='{.metadata.labels.instance}')
oc delete pvc $CONNECTOR_PVC_NAME && \
cat << EOF | tee >(oc apply -f -) | cat
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $CONNECTOR_PVC_NAME
  labels:
    app: $CONNECTOR_APP
    app.kubernetes.io/instance: aiopsedge
    app.kubernetes.io/managed-by: aiopsedge-operator
    app.kubernetes.io/name: ibm-aiops-edge
    app.kubernetes.io/part-of: ibm-aiops-edge
    instance: $CONNECTOR_ID
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
  storageClassName: $STORAGE_CLASS
  volumeMode: Filesystem
EOF
3. Verify that the integration pod transitions to the Running state.
4. If necessary, repeat the preceding steps on any other integration that has the same problem.
Metrics integrations
If you are creating, or have, a metrics integration, you might encounter the following issue:
Metrics connection is unable to pull further historical data when historical mode is completed
When a metrics integration (for example, Dynatrace, AppDynamics, Zabbix, New Relic, Splunk, or AWS) has completed its historical data pull, historical mode cannot be restarted, even if the connection configuration is edited or updated from the UI to specify a different start date, a different end date, or a different set of metrics.
Solution: If you need to pull historical data again for another set of metrics or different dates, you can create a new connection.
Instana integrations
If you are creating, or have, an Instana integration, you might encounter the following issues:
- Integration is unable to update the metric list in the UI
- IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists
- After a repair of Instana, alerts in the Instana application do not associate with the corresponding IBM Cloud Pak for AIOps application
- Due to an out-of-memory error the Instana integration pod keeps restarting when retrieving data from Instana
Integration is unable to update the metric list in the UI
The Instana integration UI page displays a list of metrics from Instana that are updated when the Instana integration pod runs. Some Instana hosts can have many technology plug-ins installed, which collect many built-in and custom metrics. These metrics can cause the Instana integration to exceed the allowed message limit when the integration sends an update to the UI. If this limit is exceeded, the Instana integration pod logs the following message:
0000004d Connector I UI configuration patch exceeds the safe message limit, skip sending configuration patch. The metrics list in the UI could not be updated with the actual list of technologies and metrics. Please configure more filter patterns to reduce the number of metrics to query.
The integration logs each technology plug-in and the number of metrics that are included or filtered.
Solution: Apply additional filter patterns in the Instana configmap so that the integration excludes those metrics when it queries the metric list from Instana.
1. Get the Instana pod logs to determine which technology plug-ins contain many metrics:
oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep -e "Plugin \'.*\' \(.*\)"
The following example log indicates that the aceMessageFlow technology plug-in contains a large list of metrics and is a candidate to filter.
Tip: Look for any technology plug-ins that contain more than 100 metrics.
0000004d InstanaInstan I Plugin 'ACE Message Flow' (aceMessageFlow) metrics: 2,666 included, 0 filtered.
2. Determine the available metric names by technology from the Instana API or from your Instana administrator. To use the Instana API, you can use a command similar to the following sample command, which queries aceMessageFlow metrics:
INSTANA_ENDPOINT=<your Instana host>
INSTANA_TOKEN=<your Instana API token>
curl --request GET -s --url ${INSTANA_ENDPOINT}/api/infrastructure-monitoring/catalog/metrics/aceMessageFlow \
--header "authorization: apiToken ${INSTANA_TOKEN}" \
| jq 'group_by(.pluginId)[]| {(.[0].pluginId): [.[] | .metricId] | sort}'
3. Create a pattern to filter metrics from the output of the previous command. For example, flowNodes\\..* to filter all metrics with the flowNodes. prefix.
4. Get the name of the Instana integration configmap to which to add the filter pattern from the previous step:
oc get configmap | grep instana-connector
Note: If multiple Instana integrations exist, select the configmap with the UID that matches the pod name.
5. Extract the exclude-filter-patterns key from the configmap. This command saves the content of the configmap into a local file:
oc extract configmap/ibm-grpc-instana-connector-d59eea9f-501d-4bd1-b3a3-6f98b49b0930 --keys=exclude-filter-patterns --to=.
6. Edit the local exclude-filter-patterns file to add the metric pattern in addition to the default patterns that are defined in the filters array.
Note: To include a backslash \ character, you must escape it as a double backslash \\. For example:
{"filters":["fs\\./dev/.*","fs_mount\\./.*","metrics\\.counters\\..*\\..*\\..*","(.*\\.)?metrics\\.gauges\\..*\\..*\\..*","metrics\\.meters\\..*","(.*\\.)?metrics\\.timers\\..*","prometheus\\.metrics.*","flowNodes\\..*"]}
7. Apply the change to the configmap:
oc set data configmap/ibm-grpc-instana-connector-d59eea9f-501d-4bd1-b3a3-6f98b49b0930 --from-file=exclude-filter-patterns
8. Verify that the new filter pattern is loaded by the pod, and that the number of metrics that are filtered for the technology plug-in is changed:
oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep 'Loaded filter patterns from ConfigMap'
9. Verify that the Instana integration can send an update to the UI successfully:
oc logs $(oc get pod | grep 'instana-conn' | awk '{print $1}') | grep -e 'Status message .* write with configuration patch .* successful.'
10. In the Edit Instana integration UI page, verify that the list of technologies and metrics is now updated.
IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists
When an integration with Instana exists, IBM Cloud Pak for AIOps might not be able to close some alerts. Instead, you might need to check the status for the associated event within the Instana UI and clear the alert manually.
Solution: To find the alert URL for clearing the alert, complete the following actions:
1. From the IBM Cloud Pak for AIOps console, open the Incidents and alerts tool.
2. Select the alert in the table to display the Alert details side panel for that alert. The Information section is open by default in the side panel.
3. Copy the URL from links and paste the URL into your browser.
4. If you are an Instana authorized user, you see the associated event in the Instana UI.
If the event is not active and has an "end-time" in the Instana UI, clear the alert:
- Click Alert to see the alert details.
- Click Clear Alert to clear the alert.
After a repair of Instana, alerts in the Instana application do not associate with the corresponding IBM Cloud Pak for AIOps application
When you restart Instana after a repair, alerts in the Instana application do not map to the corresponding application in IBM Cloud Pak for AIOps. The reason is that Instana creates new identifiers for the resources in its application on restart, without updating the topology in IBM Cloud Pak for AIOps. As a result, the resource IDs do not match, which disassociates the alerts.
Solution: To map the alerts, you must complete the following steps:
- Clean up the application in Instana.
- Re-create the application after a few minutes.
To clean up and recreate applications in Instana, see Getting started with Instana.
Due to an out-of-memory error the Instana integration pod keeps restarting when retrieving data from Instana
When the Instana integration receives a large amount of data from one or more Instana APIs while the integration is observing many resources, and there is insufficient memory to process this large amount of data, an out of memory error can occur.
Solution: If this error occurs, you can apply custom memory settings to allocate more memory to the Instana integration pod by increasing the memory limit setting. This allocation can only be done post-installation as the integration must exist before applying the custom setting. To apply custom resource settings, follow the steps to patch the Instana integration:
Prerequisite: Ensure that you have the yq and jq packages installed on your system before running the following commands. For installation instructions, refer to the documentation or website for yq and jq.
1. From a command line, change to your IBM Cloud Pak for AIOps project (namespace):
namespace=<aiops-namespace>
oc project $namespace
2. Get the display name of the Instana integration that you want to increase resources for:
oc get connectorconfiguration
3. Set the required variables:
connectionname=<instana-connection-name>
connconfig=$(oc get connectorconfiguration --no-headers | grep $connectionname | awk '{print $1}')
connconfiguid=$(oc get connectorconfiguration $connconfig -o jsonpath='{.metadata.uid}')
gitappname=$(oc get gitapp -l connectors.aiops.ibm.com/connection-id==$connconfiguid --no-headers | awk '{print $1}')
Note: Update the connectionname variable with your Instana display name from the previous command.
4. Run the following command to create a variable with a patch specification for the Instana integration. The patch applies to only one Instana integration at a time. This command has no output. Copy the command directly from the documentation and do not paste it into a text editor, because a text editor can affect the format.
jsonpatch=$(yq -o json <<EOF
patch: |-
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: ibm-grpc-instana-connector
  spec:
    template:
      spec:
        containers:
          - name: ibm-grpc-instana-connector
            resources:
              requests:
                memory: 1536Mi
                ephemeral-storage: 2Gi
              limits:
                memory: 8Gi
                ephemeral-storage: 4Gi
target:
  group: apps
  kind: Deployment
  name: ibm-grpc-instana-connector
  version: v1
EOF
)
5. Run the following command to apply the custom resource setting that you created in the previous step:
oc get gitapp $gitappname -n $namespace -o json | jq ".spec.components[0].kustomization.patches += [$jsonpatch]" | oc apply -f -
Note: If you see an error message stating that there was a configuration conflict when you run this command, try running the command a second time.
6. It takes a moment for the changes to show in the integration pod.
Note: If the memory limit is not updated for the Instana integration pod after a few minutes, run oc get gitapp $gitappname and check whether there are any errors. If there is an error stating that the YAML is invalid, run the revert command below and go through all the steps again. The cause of this error is that the format of the commands might have been altered by a text editor; the commands should be copied directly from the documentation and run directly in the terminal.
Revert the custom resource settings to default resource settings:
To revert from custom to default resource settings, run the following command:
oc patch gitapp $gitappname -n $namespace --type='json' -p="[{'op': 'remove', 'path': '/spec/components/0/kustomization/patches'}]"
Kafka integrations
If you are creating, or have, a Kafka integration, you might encounter the following issues:
- Kafka events not displayed in IBM Cloud Pak for AIOps
- After copying log data into a Kafka cluster, verify that the data is imported into Elasticsearch
Kafka events not displayed in IBM Cloud Pak for AIOps
You cannot see Kafka events in IBM Cloud Pak for AIOps after successfully creating a Kafka integration. This failure occurs when the Kafka key or message violates the format requirements. A MismatchedInputException error in the task manager pod log indicates this formatting error.
Solution: To ensure that the Kafka events display in IBM Cloud Pak for AIOps, check whether the following conditions are met:
- The key in the Kafka payload is null.
Note: The key is determined by the Kafka producer. Most producer tools use a null object as the key unless the producer is programmed to assign a value.
- The message in the Kafka payload is a JSON message that satisfies the following requirements:
  - Exists in the format of the selected data source.
  - Contains sufficient data for the mapping.
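For example, a minimal sketch that produces a conforming test message with the console producer that ships with Apache Kafka; the broker address, topic name, and JSON fields are illustrative and must match the mapping of your selected data source. The console producer sends a null key by default:
bin/kafka-console-producer.sh --bootstrap-server <broker-host>:9092 --topic <your-topic>
# The '>' prompt is printed by the producer; type one JSON message per line
> {"host":"myhost.example.com","message":"disk usage at 95 percent","timestamp":"2024-04-16T03:23:33Z"}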
After copying log data into a Kafka cluster, verify that the data is imported into Elasticsearch
To publish log data, you can copy it into a Kafka cluster. For more information, see Copying log data into the Kafka cluster. Then, you can check whether the log data is imported into Elasticsearch.
Solution: Verify that the log data is imported into Elasticsearch by getting shell access to the ai-platform-api-server pod. Run the following commands from the namespace where IBM Cloud Pak for AIOps is installed:
oc exec -it $(oc get pods | grep ai-platform-api-server | awk '{print $1}') -- /bin/bash
Example:
curl -X GET -u $ES_USERNAME:$ES_PASSWORD $ES_URL/_cat/indices -k | sort
Example output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4710 100 4710 0 0 28373 0 --:--:-- --:--:-- --:--:-- 28373
green open .security L0Gou7AXSC6PyuZ8-skGRQ 1 0 1 0 4.6kb 4.6kb
green open aiops-searchservice_v10 Ymls6nL3S-KP4vC4ROdRRw 5 0 0 0 1kb 1kb
yellow open 1000-1000-20200803-logtrain QImZoOd7QiuM3mfxUIdFAA 8 2 24 0 34.5kb 34.5kb
yellow open 1000-1000-20200804-logtrain 54m5F8K6Q3iIbSZ4KvMlYQ 8 2 286 0 80.8kb 80.8kb
yellow open 1000-1000-20200805-logtrain I_GhHFz9Th62KV_7wGwyMA 8 2 137 0 61.1kb 61.1kb
yellow open 1000-1000-20200806-logtrain 7HZcyCaLRa2VoOwJrH7Ftg 8 2 292 0 76.2kb 76.2kb
yellow open 1000-1000-20200807-logtrain LzUFhGcGSs-Rp_uJtw17sw 8 2 686 0 126.1kb 126.1kb
yellow open 1000-1000-20200808-logtrain HR7n9NIATnm9MOjgnq2K_Q 8 2 268 0 77kb 77kb
yellow open 1000-1000-20200809-logtrain NSU1X31OTq2coxdo-_I1rQ 8 2 48 0 48.7kb 48.7kb
yellow open 1000-1000-20200810-logtrain 2M54aDMXRyOp_06XnInTEg 8 2 885 0 140.3kb 140.3kb
yellow open 1000-1000-20200811-logtrain pQjyF1pQTX6GT70WY7T7ZQ 8 2 797 0 132.2kb 132.2kb
yellow open 1000-1000-20200812-logtrain U0YasxsqTpuricpyyfQ62Q 8 2 1294 0 201.6kb 201.6kb
yellow open 1000-1000-20200813-logtrain pZA79rSSTYeTrsRuT8K44g 8 2 1155 0 178.3kb 178.3kb
yellow open 1000-1000-20200814-logtrain eu5qqE31TcGtz5yjVKxlnQ 8 2 720 0 134kb 134kb
yellow open 1000-1000-20200815-logtrain rKHrLUooSP-lx6r_ettalQ 8 2 42 0 46.2kb 46.2kb
yellow open 1000-1000-20200816-logtrain 0bRKhqgdQviTBc0FnJ4pvg 8 2 58 0 49.9kb 49.9kb
yellow open 1000-1000-20200817-logtrain PMrlHzjZQymb5hXiZanhOw 8 2 287 0 99.3kb 99.3kb
yellow open 1000-1000-20200818-logtrain ItLXqZIxQWy7pirLCPcfKw 8 2 483 0 103.7kb 103.7kb
yellow open 1000-1000-20200819-logtrain W2ti9RlVTRiW3ZpynEZ8wQ 8 2 879 0 141.8kb 141.8kb
yellow open 1000-1000-20200820-logtrain Zsz7Vd0rQpSuVwbUi4AfQw 8 2 488 0 98.7kb 98.7kb
yellow open 1000-1000-20200821-logtrain VVcsZ1HEQJi08WrjX6CJ_Q 8 2 268 0 43.4kb 43.4kb
yellow open algorithmregistry wLjGMGL4SI-weK666yr8pA 1 1 8 0 26.2kb 26.2kb
yellow open cp4waiops-cartridge-poli..... 6dQclBWSQA6FR3yRurpxVQ 1 1 17 0 66.8kb 66.8kb
yellow open dataset TA3TB-MNTkScyxDNZxuEcw 1 1 0 0 208b 208b
yellow open postchecktrainingdetails G8lHdh9WQ8OxIHhOAwBUnw 1 1 0 0 208b 208b
yellow open prechecktrainingdetails WA00PRwuTgWLsQNvcBcoug 1 1 0 0 208b 208b
yellow open trainingdefinition dC2UuOLiR_qGiax-8tEfEg 1 1 0 0 208b 208b
yellow open trainingrun fRqCLpQ1TCi_kt4zzNvwDA 1 1 0 0 208b 208b
yellow open trainingsrunning gzyQzqP7SJuy9dbUZI1dVA 1 1 0 0 208b 208b
yellow open trainingstatus AiLkAg0yRQKzzIpbsk0CCw 1 1 0 0 208b 208b
Netcool integrations
If you are creating, or have, a Netcool integration, you might encounter the following issues:
- Runbook automation pod "rba-rbs" crashes during an alert burst, for example, while connecting to Netcool
- ObjectServer alerts missing in IBM Cloud Pak for AIOps due to a No object found with id error in the data layer
- Setting values for mandatory fields for events sent to the embedded ObjectServer
- Events from IBM Tivoli Netcool/OMNIbus are not updated in Alert Viewer
- Test connection fails in the Netcool connector UI configuration page
- Fresh alerts are missing in Alert Viewer
Runbook automation pod "rba-rbs" crashes during an alert burst, for example, while connecting to Netcool
The "rba-rbs" pod crashes and gets restarted automatically, and most of the fully automated runbooks that are associated with the new alerts are not run. This restart behavior might be observed multiple times until all the events from the event burst have been ingested, and the system reaches its regular state of operation again.
Solution: During an event burst like the one that occurs when connecting to a Netcool server, all enabled policies are evaluated, and for the matching events, the associated fully automated runbooks are started. If many matching events exist, these events might result in a high load of concurrent runbook invocations that the runbook service rba-rbs is not able to handle. The system recovers automatically. However, some runbooks will not be run during the burst.
ObjectServer alerts missing in IBM Cloud Pak for AIOps due to a No object found with id error in the data layer
If a new alert is rejected by the data layer, subsequent update alerts might be rejected by the data layer with the error No object found with id appearing in the cp4waiops-cartridge.irdatalayer.errors Kafka topic.
Solution: The AIOpsAlertId column must be cleared so that, when its updates reach the connector, the data is perceived as a fresh alert by IBM Cloud Pak for AIOps.
To do this, use the following steps:
1. Disable the connector's data collection (this will trigger the gateway to shut down).
2. Go to the connector pod and remove the gateway cache file and SAF file under the following directory (see the sketch after this procedure):
/bindings/netcool-connector/omnibus/var/G_CONNECTOR
3. In the ObjectServer:
   - Rectify the data columns that had resulted in wrongly mapped values.
   - Clear the AIOpsAlertId column of the alert.
4. Enable the connector's data collection.
When processing the next update event to that alert, the connector sees an empty AIOpsAlertId and generates "type": "create" in the northbound CloudEvent payload.
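The following hedged sketch illustrates steps 2 and 3 of the procedure; the connector pod name pattern, ObjectServer name, credentials, and Serial value are illustrative and must be adapted to your environment:
# Step 2: remove the gateway cache and SAF files in the connector pod
oc exec -it $(oc get pod | grep 'netcool-conn' | awk '{print $1}') -- \
  sh -c 'rm -f /bindings/netcool-connector/omnibus/var/G_CONNECTOR/*'
# Step 3: clear the AIOpsAlertId column of the alert in the ObjectServer (run on the Netcool host)
$OMNIHOME/bin/nco_sql -server AGG_P -user root -password '' << EOF
update alerts.status set AIOpsAlertId = '' where Serial = 12345;
go
EOF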
Setting values for mandatory fields for events sent to the embedded ObjectServer
If you are manually inserting events into the embedded ObjectServer by using nco_sql, you must set values for the following fields so that the events can be processed correctly:
- Node
- AlertKey
- AlertGroup
- Manager
Note: This is only a concern if you are creating an event manually using an insert statement. All IBM Netcool Operations Insight probes will set all required columns.
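For example, a hedged nco_sql insert that sets the four mandatory fields; the ObjectServer name, credentials, and field values are illustrative:
$OMNIHOME/bin/nco_sql -server AGG_P -user root -password '' << EOF
insert into alerts.status (Identifier, Node, AlertKey, AlertGroup, Manager, Summary, Severity, Type)
values ('manual.test.event.001', 'myhost.example.com', 'TestAlertKey', 'TestAlertGroup', 'ManualInsert', 'Manually inserted test event', 3, 1);
go
EOF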
Events from IBM Tivoli Netcool/OMNIbus are not updated in Alert Viewer
You might notice an issue where events from IBM Tivoli Netcool/OMNIbus reach Cloud Pak for AIOps, but the updates are not reflected in the Alert Viewer. The reason for this issue is that the Netcool connector might have restarted. The connector restart causes existing alerts in Cloud Pak for AIOps to dissociate from the origin alerts in the ObjectServer.
Solution: Upgrade Cloud Pak for AIOps to version 4.5.1 or newer. After the upgrade, the new alerts are not impacted by the Netcool connector restart.
Test connection fails in the Netcool connector UI configuration page
You might notice that the test connection fails in the Netcool connector UI configuration page.
Solution: Use the following steps to diagnose and resolve the connection failure.
1. Set the IBM Netcool Operations Insight ObjectServer log message level to debug. You can do this by updating the MessageLevel property in the $OMNIHOME/etc/{ObjectServer_name}.props file on the Netcool host, where ObjectServer_name is the name of the IBM Netcool Operations Insight ObjectServer.
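For example, the property in the props file would look similar to the following, assuming an ObjectServer named NCIP:
# $OMNIHOME/etc/NCIP.props
MessageLevel: 'debug'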
2. Click test connection in the Netcool UI. Check whether the IBM Netcool Operations Insight ObjectServer login attempt is received.
3. If no login attempt is seen in the IBM Netcool Operations Insight ObjectServer log, check whether the IBM Netcool Operations Insight ObjectServer host IP that is resolved on Cloud Pak for AIOps matches the IBM Netcool Operations Insight ObjectServer host IP.
4. If a login attempt is seen but rejected by the ObjectServer authentication, check the Transport Layer Security (TLS) certificate, especially the Common Name (CN) value, which must match the fully qualified domain name (FQDN) of the server.
5. If the Connection manager is down or not responsive to the request to connect to the IBM Netcool Operations Insight ObjectServer, restart the Connection manager pod.
6. For an on-premises IBM Netcool Operations Insight ObjectServer integration, you might notice these issues:
   - If the CN value in the TLS certificate does not match the host server FQDN, create a TLS certificate that contains the host FQDN.
   - If a firewall restriction exists on the IBM Netcool Operations Insight ObjectServer port, update the firewall rules to unblock the port.
Fresh alerts are missing in Alert Viewer
You might notice an issue where the most recent alerts are not showing in Alert Viewer. You need to inspect the data in different stages of the pipeline to diagnose the problem.
When an alert is missing, you can scan the north-bound Kafka topics for the alert message. The following example shows the direction of the north-bound data flow:
Netcool connector -> cp4waiops-cartridge.lifecycle.input.alerts [topic] -> cp4waiops-cartridge.irdatalayer.requests.alerts [topic] -> Alert Viewer
Where:
- [topic] refers to a Kafka topic.
- cp4waiops-cartridge.lifecycle.input.alerts is a Kafka topic where all the new alerts are collected.
- cp4waiops-cartridge.irdatalayer.requests.alerts is a Kafka topic that is one of the outputs of the lifecycle process.
Solution: Use the following steps to diagnose and resolve the missing alerts issue.
Note: Download and extract the Ext_IAFKafkaTools.tar.gz file located in Download Ext_IAFKafkaTools.tar.gz. After you extract the Ext_IAFKafkaTools.tar.gz file, follow the steps in the README.txt file.
1. Check whether an alert message appears in the cp4waiops-cartridge.irdatalayer.errors topic. Look for the alert message by running the following command, then inspect the cause of the error in the message:
consumer.sh | grep <pattern>
2. Check whether the alert message appears in the cp4waiops-cartridge.lifecycle.input.alerts topic. The following example shows an alert message with type: update.
{"tenantid":"cfd95b7e-3bc7-4006-a4a8-a73a79c71255","requestid":"b6079c67-8b2c-4a84-ae1d-9e50df05bfa1","requestTime":"2024-04-16T03:23:37.443484418Z","type":"update","entityType":"alert","entity":{"insights":[{"details":{"lastProcessedEventOccurrenceTime":"2024-04-16T03:23:33Z"},"id":"event-occurrence","type":"aiops.ibm.com/insight-type/deduplication-details"}],"deduplicationKey":"{hostname=null, name=null}-UserLogoutSessionEvent-","eventCount":84341,"state":"open","id":"9e96637d-15cf-4618-bdf7-ef97e6016128","type":{"classification":"UserLogoutSessionEvent","eventType":"problem"},"lastOccurrenceTime":"2024-04-16T03:23:33.000Z"}}
3. If a new alert message appears in cp4waiops-cartridge.lifecycle.input.alerts and fulfills the positive criteria of type: create, check whether the message is delivered to cp4waiops-cartridge.irdatalayer.requests.alerts. If the message is not found in the requests.alerts topic, the message is either blocked or rejected by the lifecycle process, or not yet consumed from the input.alerts topic.
4. If the message passes through cp4waiops-cartridge.irdatalayer.requests.alerts and is not reflected in the Alert Viewer, check the data layer process.
5. Restart the pods of the processes that are not working, based on the error in the message. Depending on the error, the pods that might need a restart are the gRPC server, the lifecycle processor, and the Kafka broker.
6. If Kafka topic storage limitation is an issue, update the Kafka storage size.
ServiceNow integrations
Incidents aren't created
Incidents might not be created for various reasons.
Solution: Complete the following checks to help ensure that incidents can be created.
- Validate the ServiceNow URL and credentials. Go to the ServiceNow integration page, select the integration, and select Test connection. If the test fails, check that the URL and credentials are correct.
- Make sure that the versions of Cloud Pak for AIOps and the IBM ServiceNow marketplace applications are compatible. For more information about which version of Cloud Pak for AIOps is in use, see Prerequisites in Creating a ServiceNow integration.
- Make sure that a policy is configured to use the existing ServiceNow connection. For more information about creating policies, see Promote alerts to an incident.
- Make sure that the ServiceNow integration user has the correct permissions. For more information, see Managing roles.
- Incident tables can have business rules that prevent incident creation. Check business rules and change them so that they don't affect incident tables.
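As a supplementary check outside the console, you can exercise the ServiceNow REST Table API directly with the integration user's credentials; the instance name and credentials below are illustrative. An HTTP 200 response with a JSON body indicates that the URL, the credentials, and read access to the incident table are all valid:
# Hypothetical manual check of the ServiceNow URL, credentials, and permissions
curl -u '<integration-user>:<password>' \
  "https://<instance>.service-now.com/api/now/table/incident?sysparm_limit=1"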
Splunk integrations
If you are creating, or have, a Splunk integration, you might encounter the following issues while creating or editing the integration:
- All fields in the Splunk form reset if edited while loading metrics
- Splunk integrations that are created before upgrade fail to update after upgrade
All fields in the Splunk form reset if edited while loading metrics
If you edit the fields in the Splunk form while the metrics are loading in the integration console, then you cannot save the changes made to the form.
Solution: To avoid this issue, do not modify the fields in the Splunk form while metrics are loading. If you do, redo the changes after the metrics are loaded so that the edits can be saved.
Splunk integrations that are created before upgrade fail to update after upgrade
On upgrading from 3.5.0, you might not be able to enable or disable log or metric data flows on Splunk integrations that were created before upgrade. This issue can occur intermittently.
Solution: When this issue occurs, you can refresh the console and try again after some time. Alternatively, you can create a new integration and later delete the Splunk integration that was defined before upgrade.
IBM Tivoli Netcool/Impact integrations
If you are creating, or have, an IBM Tivoli Netcool/Impact integration, you might encounter the following issues:
- IBM Tivoli Netcool/Impact integration status shows "Retrying"
- IBM Tivoli Netcool/Impact stops sending data to widgets in a JazzSM dashboard
- AIOps data source in IBM Tivoli Netcool/Impact is missing required headers
IBM Tivoli Netcool/Impact integration status shows "Retrying"
After creating an IBM Tivoli Netcool/Impact integration, you might see a Connection Orchestration status of "Done", but the IBM Tivoli Netcool/Impact integration shows an error with a status of "Retrying".
The reason for this error might be a locked IBM Tivoli Netcool/Impact file that is preventing the creation of the IBM Cloud Pak for AIOps data model in IBM Tivoli Netcool/Impact. If this is the issue, the impactserver-errors logs in IBM Tivoli Netcool/Impact display an error similar to the following:
30 Jun 2023 03:27:21,137 ERROR [DataModelUIResource] File etc/NCIP_datasourcelist is locked by user: root.
Solution: Locate and delete the file <servername>_versioncontrol.locks in IBM Tivoli Netcool/Impact, and restart the IBM Tivoli Netcool/Impact server and GUI.
1. Log in to the IBM Tivoli Netcool/Impact host and navigate to the $IMPACT_HOME/etc directory. Locate a file called <servername>_versioncontrol.locks. For example, NCIP_versioncontrol.locks.
2. Delete the file.
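For example, if the server name is NCIP:
rm $IMPACT_HOME/etc/NCIP_versioncontrol.locks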
3. Restart the Impact server and GUI:
$IMPACT_HOME/bin/stopImpactServer.sh
$IMPACT_HOME/bin/stopGUIServer.sh
$IMPACT_HOME/bin/startImpactServer.sh
$IMPACT_HOME/bin/startGUIServer.sh
The status of the IBM Tivoli Netcool/Impact integration should switch to Running in Cloud Pak for AIOps.
IBM Tivoli Netcool/Impact stops sending data to widgets in a JazzSM dashboard
After creating an IBM Tivoli Netcool/Impact integration, the IBM Tivoli Netcool/Impact server stops sending data to widgets in a JazzSM dashboard. The widget reports an ATKRST132E or 404 error code.
Solution: Re-create the parameter files for the AIOPS_HandleAction and AIOPS_ExecJS Impact policies.
For more information, refer to the troubleshooting article Netcool/Impact datasets for DASH stop working after deploying the AIOps to Impact integration.
AIOps data source in IBM Tivoli Netcool/Impact is missing required headers
When an integration is made to IBM Tivoli Netcool/Impact, a RESTful data source is created in Netcool/Impact during the initialization stage. However, the data source might be missing one or more fields, including required headers. Attempting to edit or delete such a data source fails with the following error message:
Error : Version Control System failed. Check system error logs for further details
Solution: To remedy this, you can add the required connection details and header values by using the Netcool/Impact GUI.
1. Log in to the Netcool/Impact GUI. Open the Data Model tab and make a note of the data source name. For example, AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.
2. Open an SSH terminal to the primary Netcool/Impact server.
3. Change to the etc directory of the application:
cd /opt/IBM/tivoli/impact/etc
4. Locate the data source file with the .ds extension. For example, NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds
Note: If the file does not exist, create an empty file with the name of the missing data source.
5. Use the svn add command to manually add the file to version control:
/opt/IBM/tivoli/impact/platform/linux/svn/bin/svn add NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds
6. Use the svn commit command to check in the file:
/opt/IBM/tivoli/impact/platform/linux/svn/bin/svn commit -m "manual commit" NCI_AIOps-impact-e92b7669-8809-4cc2-ab3c-5cbdf0db1534.ds
7. Log in to the Impact GUI Data Model tab. Open the AIOps data source.
8. Enter the following values in the connection details fields:
Field Name      Value
Host Name       <HOSTNAME>
Resource Path   /aiops
Port            443
Use HTTPS       Enabled
Replace <HOSTNAME> with the Cloud Pak for AIOps hostname. For example, cpd-aiops.apps.clustername.cp.ibm.com.
9. In Request Headers, add the following headers:
Header          Value
x-tenant-id     cfd95b7e-3bc7-4006-a4a8-a73a79c71255
Content-Type    application/json;charset=utf-8
Authorization   ZenApiKey <KEY>
In the Authorization field, replace <KEY> with the API key. Follow the procedure from Configuring a connection in IBM Tivoli Netcool/Impact for IBM Cloud Pak for AIOps to retrieve the key.
Note: In Netcool/Impact 7.1.0.32, you can hide the values by declaring the request headers as protected headers instead.
10. In the Proxy Settings tab, set the Proxy Port to 8080 and select No Authentication as the Authentication Method.
Note: The Password field is ignored.
11. Save the data source. Click Test Connection to confirm that the connection works.
Ansible Automation Controller integration not running due to prohibited egress
After configuring an Ansible Automation Controller integration, the integration reports Not running, and no Ansible templates are shown in the Automation Actions table.
Use the following steps to check whether the failure is due to prohibited egress:
1. Run the following commands to view the RBA Automation Service pod logs, and check whether responses are being received from Ansible:
export AIOPS_NAMESPACE=<AIOps installation namespace>
oc logs -l app.kubernetes.io/component=rba-as -n ${AIOPS_NAMESPACE}
Example output if responses are not being received from Ansible:
Error: awxp/requestAwx.requestAwx: Request to Ansible failed... Request was not responded after 5 seconds
2. Run the following commands to attempt to connect to the Ansible instance from the RBA Automation Service pod:
export AIOPS_NAMESPACE=<AIOps installation namespace>
oc exec $(oc get pod -l app.kubernetes.io/component=rba-as -n ${AIOPS_NAMESPACE} -o jsonpath='{.items[0].metadata.name}') -c rba-as -n ${AIOPS_NAMESPACE} -it -- /bin/bash
# An interactive shell inside the container will open
curl -k -vvv <Your Ansible host>
Solution: If responses are not received from Ansible and you cannot connect to Ansible, then create a NetworkPolicy that allows egress from the RBA service pods. Run the following commands:
export AIOPS_NAMESPACE=<AIOps installation namespace>
cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: aiops-rba-rbs-egress
namespace: ${AIOPS_NAMESPACE}
spec:
egress:
- {}
podSelector:
matchLabels:
app.kubernetes.io/component: rba-rbs
policyTypes:
- Egress
EOF
cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: aiops-rba-as-egress
namespace: ${AIOPS_NAMESPACE}
spec:
egress:
- {}
podSelector:
matchLabels:
app.kubernetes.io/component: rba-as
policyTypes:
- Egress
EOF