Known issues and limitations

Review the known issues for IBM Cloud Pak® for AIOps.

Also review the troubleshooting documentation for information about common issues. For more information, see troubleshooting.

Install and upgrade

Unable to reach inventory service

If your system is installed in the same namespace as IBM Cloud Pak® for Network Automation, the connection to the inventory service breaks. This is caused by a clash with another service with the same name.

Workaround: Use the config.yaml file to enforce the correct values for INVENTORY_SERVICE_HOST and INVENTORY_SERVICE_PORT.

Example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: noi-topology-sizing
  namespace: namespace
data:
  asm: |
    ui-api:
      containers:
        ui-api:
          env:
            - name: INVENTORY_SERVICE_HOST
              value: noi-topology-inventory.namespace.svc
            - name: INVENTORY_SERVICE_PORT
              value: "9178"

Limitation on number of instances

IBM Cloud Pak for AIOps and Infrastructure Automation can co-exist on the same cluster, but you cannot have multiple instances of IBM Cloud Pak for AIOps or Infrastructure Automation on the same cluster.

Manual adjustments are not persisted

Custom patches, labels, and manual adjustments to IBM Cloud Pak for AIOps resources (such as increased CPU and memory values) are lost when an event such as upgrade, pod restart, resource editing, or node restart triggers a reconciliation. Reconciliation causes any manually implemented adjustments to be reverted to their original default values. Depending on the parameters that you want to adjust, you might be able to use a custom profile to persist your changes. For more information about custom profiles, see Custom profiles.

Services fail to connect to Cassandra

After you install IBM Cloud Pak for AIOps for a production environment deployment, various services might not be available due to connection issues with Cassandra. To resolve this issue if it occurs, restart Cassandra and the schema creation pods.
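
The following commands are a minimal sketch of that restart; the pod names are illustrative, so confirm the actual names in your project (namespace) first:

# List the Cassandra and schema creation pods to confirm their names.
oc get pods -n <namespace> | grep -iE 'cassandra|schema'
# Delete the pods so that they are restarted.
oc delete pod <cassandra-pod-name> <schema-creation-pod-name> -n <namespace>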

The ibm-aiops-orchestrator pod throws an OOMKilled error

If your environment has many secrets and ConfigMaps, when the ibm-aiops-orchestrator (lead operator) attempts to build its cache, the operator can exceed its memory allocation and cause a Kubernetes out-of-memory error for the container. This error can prevent the IBM Cloud Pak for AIOps installation from reconciling, blocking the installation from completing.

If you encounter this issue, the operator requires more memory resources to build its cache. Override the subscription resource to increase the memory limits for the pod and avoid the out-of-memory issue.
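
As a hedged sketch, an OLM subscription config override can be applied with a patch similar to the following example. The subscription name, project, and memory values are illustrative and must be adapted to your environment:

oc patch subscription.operators.coreos.com ibm-aiops-orchestrator -n <namespace> --type merge \
  -p '{"spec":{"config":{"resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}}}'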

Automatic approval required for installation

The use of manual approval strategies for InstallPlans in a project (namespace) can affect the IBM Cloud Pak for AIOps installation.

For instance, if you use manual approval for any of your InstallPlans to install operators in All Namespaces mode (cluster scope), the manual approval can affect your installation. The installation of IBM Cloud Pak for AIOps requires the automatic approval strategy.
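
To check which subscriptions in the cluster use the manual approval strategy, a sketch similar to the following command can help (the output columns are illustrative):

oc get subscriptions.operators.coreos.com -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,APPROVAL:.spec.installPlanApproval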

Elasticsearch health status yellow after restoring from a backup

When you are restoring an Elasticsearch backup to a new single-node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.

ChatOps Microsoft Teams integration does not work with a proxy server

If you have an offline (air-gapped) deployment of IBM Cloud Pak for AIOps or an environment that uses a proxy server, then you cannot use the ChatOps Microsoft Teams integration. The use of a proxy with the ChatOps Microsoft Teams integration is not supported.

Kafka topics cannot be listed with oc get kafkatopics

Kafka topics are created by using the KafkaTopic custom resource or the API. Only topics that are created by using the custom resource are shown when you run oc get kafkatopics. To see a full list of all Kafka topics, complete the following steps:

  1. Install kcat.
  2. Get the waiops-mustgather.sh script. For more information, see Installing the IBM Cloud Pak for AIOps MustGather tool.
  3. Run waiops-mustgather.sh -V kafka-topics

Backup and restore

Backup and restore of IBM Cloud Pak for AIOps with older versions of IBM Fusion does not work with Portworx

If you are using Portworx as your storage provider, then IBM Fusion must be version 2.9.0 or later; otherwise, backup and restore fails.

Access control

Sharing topology URLs overrides group restrictions

When users with different group profiles share topology URLs, group restrictions are ignored. For example, if user one can see and view a group-based topology, including its members, but user two is restricted from seeing that group, the restriction is ignored when user one shares a URL to that topology with user two.

Workaround: The administrator can create a redaction policy to ensure that details are hidden for the group members.

Automation Analyst role unused in IBM Cloud Pak for AIOps

By default, an Automation Analyst role is displayed within the IBM Cloud Pak for AIOps console Access control page when you are assigning a role to a user. This default role is used within the IBM® Automation family of offerings, which includes IBM Cloud Pak for AIOps; however, this role is not used within IBM Cloud Pak for AIOps.

This role does not include or provide any permissions within IBM Cloud Pak for AIOps and should not be assigned to users within IBM Cloud Pak for AIOps.

Users from a user group remain after user group is deleted

When you delete a user group, the users that were included in the group remain in your list of users. Any role that is inherited through the deleted user group is removed from the users. If the users were assigned roles individually, they continue to have those roles and can continue to log in to the UI console and complete tasks. If the users that were in the deleted user group need to be removed completely, an administrator needs to manually remove the users. Users can be removed by clicking the Delete icon for the user's entry within the list of users on the Access control Users tab.

UI redirects to unexpected page when logging in after a session timeout

After a session timeout occurs and a user logs in to the UI console again, the user can be redirected to a different page than the page that they were on when the timeout occurred. For instance, a user that was working on the AI Model Management training page when their session timed out might be redirected to a graphql playground page after logging back in. This redirect occurs because the UI uses the last request URL that included the expired token to identify where to redirect the user when the user logs back in. If this redirect occurs, the user needs to manually return to the expected page in the UI to continue working.

Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page

When you have users within a user group and view the Manage assignees pane of the Incidents and alerts page, some users might not be listed. This error can occur when the users from the LDAP user group are not individually onboarded. To verify whether a user is onboarded, go to the Access control > Users tab and check whether the user is listed. If the user is not listed, that user must first log in to the console, which validates their roles and permissions. After logging in, the user is displayed in the list of users and on the Manage assignees pane.

The Manage assignees pane is viewable from the list of all incidents. Select an incident and then click Manage assignees. After you select an existing user group, the included users are listed.

Identity Management

The IBM Cloud Pak foundational services Identity Management (IM) service is used by IBM Cloud Pak for AIOps. This service includes the following known issues and limitations:

Note: Some descriptions of the listed known issues in Identity Management are shortened.

  • Login failure in Platform UI console while upgrading foundational services version 3.22 or version 3.23 to foundational services version 4.x.x.
  • LDAP user names are case-sensitive.
  • The OpenShift group does not synchronize when a user is added or removed from an LDAP group.
  • The OpenShift users are not removed when you remove them from the LDAP group.
  • You cannot onboard an OpenShift user group to Identity Management because the groups property of the user.openshift.io API is deprecated.

For more information about all the known issues and limitations that are related to Identity Management, see Known issues in foundational services.

Observers and integrations

Kubernetes Observer job fails to restart after OOM

Kubernetes Observer jobs with very large payloads can encounter an OOM (out-of-memory) error, after which they may fail to restart. The observer appears offline, but a health check fails to flag any errors.

Workaround: Restart the observer if it appears as offline in the UI.

ServiceNow Observer UI displays superfluous characters

If the Entity_mapping field is updated from the Observer UI, superfluous curly brackets and quotation marks are displayed in the Resource type-mapping field.

Workaround: This is a cosmetic issue and can be ignored.

'Failed to read certificate' error

This error can occur when an observer attempts to create an SSL certificate and the endpoint server does not respond.

In the example error message below, a vmvcenter.crt certificate error occurs because the endpoint server does not respond.

Failed to read certificate for field [certificate].
The file 'vmvcenter.crt' could not be found under /opt/ibm/netcool/asm/security/.

Workaround: Ensure that the endpoint server is running correctly.

Duplication of resources in the topology if certain observer job parameters are changed after a job has been run

Certain resource parameters are used to uniquely identify a resource. If one of these parameters is changed after the initial job run, then any subsequent job run will result in duplicate records. For example, if the parameter of 'hostname' is replaced with 'Ipaddress' after a topology has been created, a subsequent discovery will consider the resource as new, and create a duplicate record.

The following resource parameters uniquely identify a resource. Changing them after the initial job has been run will result in duplicate records.

Workaround: If you need to modify these values, do not modify the existing job. Instead, create a new job.

Table. Observer job parameters (Observer: Job parameters)
ALM: n/a
AppDynamics: account
AWS: region, dataTenant
Ansible AWX: host, user
Azure: data_center
BigFix Inventory: data_center
Big Cloud Fabric: proxy-hostname, proxy-username, bcf-controllers
Ciena Blue Planet: data_center, tenant
Cisco ACI: tenant_name
DNS: addressTypes, server, port, recurse
Docker: endPoint.port
Dynatrace: datatenant, hostname
File: provider, file
GitLab: datatenant, hostname
GoogleCloud: project_id
HPNFVD: datacenter, username, cnf_job
IBM Cloud: instance, username, region
ITNM: instance
Jenkins: jenkins_observation_namespace
Juniper CSO: cso_central_ms_url, user_domain_name, domain_project_tenant_name
Juniper Contrail: api_server_url, os_project_name, os_tenant_name
Kubernetes: data_center, namespace
NewRelic: accountName, accountId
OpenStack: data_center, os_project_name
Rancher: accessKey, clusterId
REST: provider
SDC ONAP: host, username
ServiceNow: instance_url, username
SevOne: datatenant, hostname
TADDM: api_url, username
Viptela: data_center
VMware NSX: data_center
VMware vCenter: data_center
Zabbix: data_center

Incomplete historical data processing in the event of integration pods restarting

If you create an integration to collect historical data for Metric Anomaly AI Training, you might come across an issue where the integration pod restarts, but does not retrieve all historical data for training. As a result, you might suffer data loss.

An integration pod can restart due to outages, the target system crashing, or pod crashes in the environment. This issue can occur intermittently, depending on the number of metrics that are selected for the integration and the amount of data to be retrieved.

File and REST observer topology service location URL not accessible

When you create an edge by using either the File or REST observer, the POST request returns a Topology service location URL that is not accessible. The URL cannot be used to manage the edge because the relevant API is not exposed.

Workaround: None.

Integration console displays special characters incorrectly

If you use special characters in the Name and Description fields while creating or editing an integration, the console Integrations page might display the special characters incorrectly. Nevertheless, the integration is saved.

Turbonomic integration affects other integrations in Turbonomic

The Turbonomic integration with IBM Cloud Pak for AIOps enables IBM Cloud Pak for AIOps to be notified of actions that are created or executed in Turbonomic through the enabled webhook workflow. However, Turbonomic allows only one webhook workflow per action. Therefore, other integrations that are enabled in Turbonomic, like ServiceNow, might not get any notification when actions are created or executed in Turbonomic.

No notification in IBM Cloud Pak for AIOps on Turbonomic actions closed without execution

IBM Cloud Pak for AIOps does not receive any notification from Turbonomic for actions that are closed without being executed. For example, an action related to an erroneous condition that is no longer occurring gets automatically closed in Turbonomic. But its corresponding IBM Cloud Pak for AIOps alert remains open indefinitely and must be cleared manually from the console.

New Relic observer does not support dashboard tokens for new users

For new users, the New Relic observer does not work because it no longer supports the New Relic One dashboard token. However, it continues to work for existing users who are using the old token that was generated previously through the old dashboard.

All dates and times are in US-en format

When you are scheduling data collection for an integration, all dates and times are presented in the US-en formats:

  • All dates are configured and presented in the mm/dd/yyyy format.
  • All times are configured and presented in the hh:mm AM/PM 12-hour clock format.

You cannot switch the date or time format.

AppDynamics historical start date and time cannot be older than 4 hours

The historical start date and time is configurable, but if you set it to more than 4 hours in the past, the integration ignores it and retrieves only the past 4 hours of data.

AppDynamics live mode aggregation interval is 1 minute

In live mode, the only aggregation interval that is allowed is 1 minute.

The observer-service pod is in a crash loop due to a ghost vertex

If you notice that the topology observer-service pod is not functioning correctly and that restarting the pod does not correct the issue, a ghost vertex might need to be removed. To remove the vertex, you need to traverse an edge to the vertex, and then delete the vertex. To traverse to the vertex, use the type vertex and definesType edge.

  1. Run the following command to find the ID for the type vertex.

    oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/types?_filter=keyIndexName=ASM::entityType::mgmtArtifact::ASM_OBSERVER_JOB' -u <username>:<password> --insecure
    

    Where

    • <username> - Your Topology IBM Cloud Pak for AIOps API username
    • <password> - Your Topology IBM Cloud Pak for AIOps API password
  2. Run the following command to use the definesType edge to get the ID for the vertex that is causing the issue.

    oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType' -u <username>:<password> --insecure
    

    Where

    • <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
    • <username> - Your Topology IBM Cloud Pak for AIOps API username
    • <password> - Your Topology IBM Cloud Pak for AIOps API password
  3. Run the following command to delete the vertex.

    oc exec -it <topology pod> -- curl -X DELETE --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType?_delete=nodes&_delete_self=false&_node_label=vertex' -u <username>:<password> --insecure
    

    Where

    • <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
    • <username> - Your Topology IBM Cloud Pak for AIOps API username
    • <password> - Your Topology IBM Cloud Pak for AIOps API password

Delay query time for integrations

If you set an integration to retrieve Live data for continuous AI training and anomaly detection or Live data for initial AI training, you might need to configure a delay to offset the query time window to provide a time buffer for preventing the partial retrieval of real-time data. You cannot configure this delay within the UI console. You must use a command line to configure the delay. For more information about configuring this delay, see Delay configuration in data integrations.

Alerts for Instana without associated topologies

In some cases, Instana alerts do not have associated topologies. In most cases, this happens because the resource that originated the event is no longer available in Instana. For example, a pod that triggers an Instana event can be redeployed by the underlying Kubernetes engine.

Alerts for Instana topology not mapping properly

In some cases, alerts are not mapped correctly to a corresponding Instana topology node. For example, alerts that are generated from log anomaly detection or metric anomaly detection (or other sources) might not show as associated with an Instana topology node.

As a workaround, you need to define your own match rules to correlate with the source data. To define a match rule, click Resource management, then click Settings > Topology configuration, and then click Configure on the Rules tile. When you are configuring the match token values to use, the values depend on the data that you are sending to Instana.

Instana metric collection API rate limit exceeded error

The recommended rate limit is double the number of resources. There might be situations where the limits need to be increased. When a more precise limit is required, use the following formula to determine the limit to use:

number-of-metric-API-calls-per-hour ~= (number-of-selected-technologies x 2) x (snapshots-for-selected-technologies / 30) x (60 / collection-interval)
number-of-topology-API-calls-per-hour ~= (number-of-application-perspectives x (60 / collection-interval)) +(number-of-services x (60 / collection-interval))
number-of-events-API-calls-per-hour = 60

total= number-of-metric-API-calls-per-hour + number-of-topology-API-calls-per-hour + number-of-events-API-calls-per-hour

Note: Each plugin can have a different number of metrics collected. The mean value across these is used, which is 2 collection cycles per plugin. If the environment is unbalanced, for instance you have mostly hosts that define most metrics, then the formula might underestimate the required limit.
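
For illustration only, assume 2 selected technologies, 600 snapshots across those technologies, 1 application perspective, 20 services, and a 5-minute collection interval:

number-of-metric-API-calls-per-hour ~= (2 x 2) x (600 / 30) x (60 / 5) = 960
number-of-topology-API-calls-per-hour ~= (1 x (60 / 5)) + (20 x (60 / 5)) = 252
number-of-events-API-calls-per-hour = 60

total ~= 960 + 252 + 60 = 1272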

To determine the number of resources (snapshots) for each infrastructure plugin, use the following API:

api/infrastructure-monitoring/snapshots?plugin=technology_name

Example:

api/infrastructure-monitoring/snapshots?plugin=host

For more information about Instana APIs, see Instana API.

The following example curl commands allow you to retrieve the number of:

  • snapshots-for-selected-technologies (such as host)

    curl -k -s --request GET 'https://<instana server hostname>/api/infrastructure-monitoring/snapshots?plugin=host' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    
  • number-of-application-perspectives

    curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/applications' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    
  • number-of-services

    curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/services' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    

IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists

If you have an Instana integration created and have Instana release 221 (SaaS or self-hosted), you might encounter an issue where IBM Cloud Pak for AIOps cannot close some alerts. Instead, you might need to check the status of the associated event within the Instana dashboard and clear the alert manually. For more information, see Troubleshooting integrations: Instana Event integration.

When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run

If you create a Kubernetes observer job with the name weave_scope, load, kubeconfig, or local, the job always fails to run. When this error occurs, you can view an error icon in the schedule column for the job. To avoid this issue, do not use these names for the observer job.

Log Anomaly 8k limit on field mapping in details field of the alert schema

The limitation is that the data layer imposes an 8 KB size limit on the details field in the alert schema. The details field is populated by the log anomaly event, which provides the relevant information to display in Slack when you view alerts in ChatOps. Whenever the details field size exceeds 8 KB, the returned JSON object is truncated. Therefore, when the user clicks view alerts to retrieve the alerts that are related to an incident, the expected results are not seen and an error is recorded.

The current fields under the details object are:

 end_timestamp: int
 original_group_id: str
 causality: dict
 detected_at: float
 source_application_id: str
 log_anomaly_confidence: float
 log_anomaly_model: List[str]
 prediction_error: dict
 error_templates: List[int]
 count_vector: List[int]
 text_dict: dict
 application_group_id: str
 application_id: str
 model_version: str
 severity_from_model: int
 description: str

Log integration does not start when multiple log integrations are active

If you have many active log integrations, such as multiple Kafka, ELK, Splunk, and Falcon LogScale integrations, and you create and enable another Falcon LogScale integration, you might notice that the integration status is stuck in an error or restarting state. This state can occur even after the integration was operating as expected.

This issue can occur if you exceed the limit for the number of jobs that can run on the underlying service, which results in insufficient resources being available to start the integration. To resolve this issue, complete one or more of the following tasks:

  • Increase the replica count of your task managers.
  • Increase the task manager count per replica.
  • Change the parallelism of your integrations.
  • Cancel other integrations.

After a restore, data from integrations is not processed

If you have an integration that you are restoring, the status for these integrations can be in error after the restore process completes. To resolve this status, you need to edit and save your integrations on the Integrations page in the IBM Cloud Pak for AIOps console. Editing the integration regenerates the associated Flink job for the integration, which updates the status.

Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled

If you enable historical data flow for a ServiceNow integration, you might notice that the historical data is not collected from ServiceNow. For instance, when you check the grpc-snow pod, you can see ticket data available, but when you check the Flink job or Elasticsearch, you can see that no data was collected. If this issue occurs, turn off the historical data flow and then turn it back on to cause the data collection to begin.

Dynatrace integration pod restarted and does not retrieve all historical data

If you have a Dynatrace integration created and pull historical data with multiple metrics for Metric Anomaly AI Training, you can encounter an issue where the Dynatrace pod restarts, but does not complete retrieving the expected historical data for training. This issue can occur intermittently, depending on the number of metrics that are selected for the integration and the amount of data to be retrieved.

If this potential out-of-memory or out-of-resources issue occurs, consider creating separate integrations to monitor different and smaller sets of metrics. By splitting the integrations, you can reduce the amount of data to be retrieved through the initial integration that can cause this issue.

ServiceNow user account gets locked out after a few hours

If there is an active ServiceNow integration with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script called "SNC User Lockout Check", which locks users out after more than 5 failed attempts (including any failed API calls).

If you check the Incidents and alerts page, you also see an alert that says "ServiceNow instance authentication failed".

When this problem occurs, unlock the user in ServiceNow. Then, change the password in the ServiceNow integration and save. When authentication fails in the ServiceNow integration, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.

Scale resources when running log anomaly training on large data

In some cases, log anomaly training fails on large data due to an out-of-memory (OOM) error or a problem with Elasticsearch shards. The solution is to scale up the resources to handle large data training.

For more information about shard management, see About indices and shards. For more information about increasing Elasticsearch resources, see Log anomaly training pods CPU and Memory resource management.

The integration status for ELK, Custom Logs, Mezmo, and Falcon LogScale sometimes shows 'not running' even though the Flink job and gRPC pod are running correctly

After you create an integration, the Flink job retrieves data normally and the gRPC pod runs without error. However, the console shows that the integration status is 'not running'.

Log data integrations status is "Done" even though historical data is still loading

When a log data integration (Falcon LogScale, ELK, Mezmo, Custom, Splunk) is running in Historical data for initial AI training mode, and a custom regex is added in the field_mapping section, the data processing can take a long time. Although the Data collection status might be shown in the UI as Done, data could still be processed and written to Elasticsearch in the background.

To speed up this process, you can increase the Base parallelism number that is associated with that integration. For more information, see Increasing data streaming capacity.

IBM Tivoli Netcool/Impact integration stops event processing with exceptions

If you have an IBM Tivoli Netcool/Impact integration, you can encounter an issue where the integration temporarily stops processing during the sending of an event stream to IBM Cloud Pak for AIOps.

This issue can occur when you have an IBM Cloud Pak for AIOps policy that triggers an IBM Tivoli Netcool/Impact policy when certain types of events are received. If this issue occurs and stops the event processing, the Impact integration logs or Impact policylogger logs can include messages that are similar to the following example exceptions:

[6/14/23, 11:38:45:816 UTC] 0000005d ConnectorMana W failed to send status update
...
[6/14/23, 11:38:45:815 UTC] 000023ca StandardConne W configuration stream terminated with an error
...
[6/14/23, 11:38:45:816 UTC] 000023cc GRPCCloudEven W consume stream terminated with an error: channel=cp4waiops-cartridge.lifecycle.output.connector-requests

If you encounter this issue, you might need to restart the impact-connector pod to begin the processing of the event stream again.

IBM Tivoli Netcool/Impact integration fails for IBM Tivoli Netcool/Impact server with non-default cluster name

If the IBM Tivoli Netcool/Impact cluster uses a cluster name other than the default ("NCICLUSTER"), validation of the integration can fail. The IBM Tivoli Netcool/Impact server might report DynamicBindingException errors in the impactgui.log:

com.micromuse.common.nameserver.DynamicBindingException: DynamicBindingException: Service [NCICLUSTER] not in nameserver.

To resolve the issue, wait for the backend IBM Tivoli Netcool/Impact server to finish initializing before starting or restarting the Impact GUI server.

If IBM Tivoli Netcool/Impact is running fix pack 7.1.0.26 or later, you can also resolve the issue by setting the nameserver.defaultcluster property in the GUI server. Add the following line to $IMPACT_HOME/etc/nameserver.props:

impact.nameserver.defaultcluster=CLUSTERNAME

where CLUSTERNAME is the name of the IBM Tivoli Netcool/Impact cluster.

IBM Cloud Pak for AIOps and IBM Netcool Operations Insight

IBM Cloud Pak for AIOps Strimzi Kafka topics created without replication

IBM Cloud Pak for AIOps supports multiple replicas of Kafka topics for large production installations, such as for data redundancy. The IBM Cloud Pak for AIOps console can automatically create Kafka topics when integrations are created. When a topic is dynamically created by the IBM Cloud Pak for AIOps console, the replication is set to 1 in the controller. As such, Kafka topics that are created during installation can have multiple replicas, but topics that are created dynamically do not.

If you are implementing a production (large) deployment of IBM Cloud Pak for AIOps, you might lose data if your Kafka pods fail or restart. If the data flow is enabled in your Kafka integration when the Kafka pods go down, you might experience a gap in the data that your integration generated during that down period. Upgrades or updates to workers can cause a Kafka broker restart.

You can manually modify the Kafka topic replication inside the Kafka container from a value of 1 to 3 to mitigate any potential data loss from this issue.
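
Before you make changes, you can check the current replication factor of a topic from inside a Kafka broker container. The following command is a hedged sketch; the pod name, script path, and listener address depend on your deployment and listener security settings:

oc exec -it <kafka-broker-pod> -n <namespace> -- /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 --describe --topic <topic-name>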

IBM Cloud Pak for AIOps pods not starting after a cluster restart

When the cluster is restarted, all nodes have a STATUS of Ready. Most pods return to a STATUS of Running except for some IBM Cloud Pak for AIOps pods.

One potential cause is that Elasticsearch must be up and running before the IBM Cloud Pak for AIOps pods can start.

Restart the Elasticsearch pod to get all pods back to a STATUS of Running.

NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator

NOIHybrid is incorrectly included in the Provided APIs list for the IBM Netcool Operations Insight operator. This list is displayed in the Red Hat OpenShift Container Platform web console under Installed Operators > Netcool Operations Insight > Operator Details. Do not use NOIHybrid APIs.

Rare issue: unable to deploy log anomaly detection model

Very occasionally, after log anomaly detection training completes successfully, an error similar to the following is displayed on the AI model management training page when you attempt to deploy the model.

Error
Model deployment failed

Within the error textbox, you will also see the text "Forbidden".

If you investigate the aiops-ai-model-ui pod logs, you will also see the following error.

ForbiddenError: invalid csrf token

If this occurs, first refresh the browser and try to deploy again.

If that does not remedy the situation, then log out and log back in, and then try to deploy the model again.

Connector experienced a failure due to a Bad Request for connection when data flow was enabled

When data flow is enabled for the Splunk connector, the connection fails with a Bad Request for connection error.

Workaround: Check that the models are deployed correctly. Then deploy the models in the AI hub UI.

GitHub connector issues with similar tickets and adding assignees to mappings

The following known issues have been observed with the GitHub connector:

  • In the Incident Overview > Add tickets to this incident panel, GitHub issues are missing the Updated by information. Additionally, searching for GitHub similar past resolution tickets from the Source drop-down menu in this panel will not display any tickets.
  • The GitHub connector might be missing from the list of integrations in training modules such as Similar tickets and Change risk.
  • The default issue mappings in a GitHub integration do not have assignees. However, if assignees are added to the mappings, issues are not created in GitHub.

ServiceNow ticket contains too much text

If the ServiceNow change request, incident, or problem contains a large amount of text, such as work notes with close to 150,000 characters, the ticket is dropped and a warning is logged in the pod log. Dropping the ticket affects change risk for that ticket, as change risk is not assessed for the ticket and the ticket is not used for similar ticket detection.

ServiceNow user account locked out

If there is an active ServiceNow integration with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script that is called "SNC User Lockout Check", which locks users out after more than five failed attempts (including any failed API calls).

If you check the Incidents and alerts page, you also see an alert that says "ServiceNow instance authentication failed".

When this problem occurs, unlock the user in ServiceNow. Then, change the password in the ServiceNow integration and save. When authentication fails in the ServiceNow integration, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.

ServiceNow observer status given as Unknown for an extended period

There is currently a known issue whereby the status of the ServiceNow observer in the UI may be shown as Unknown for an extended period (up to 15 minutes) following a successful initialization before finally changing to Running. This issue is intermittent. It does not happen every time the ServiceNow observer is started, and it does not affect data collection.

Integrations unable to pause and resume both inbound and outbound data flow

Integrations with bidirectional data flow do not completely pause the flow of data both inbound to IBM Cloud Pak for AIOps and outbound from IBM Cloud Pak for AIOps when the data flow is disabled in the UI.

Currently, when data flow is disabled, only the inbound data flow from the event source to IBM Cloud Pak for AIOps is disabled. Outbound data (in the form of actions) is still pushed from IBM Cloud Pak for AIOps to the event source. This behavior can cause some alerts to be out of sync between the two systems.

Workaround: Resynchronize the alerts between Netcool and IBM Cloud Pak for AIOps by completing the following steps:

  1. Disable the existing Netcool connector dataflow in the UI. This is to ensure that the connector releases any file locks before deletion.
  2. Get the existing Netcool connection ID.
  3. Delete the Netcool connection from the UI.
  4. Clear the AIOpsAlertId and AIOpsState columns in Netcool alerts.status table.
  5. Close AIOps Alerts using the connectionId as a filter.
  6. Install a new Netcool connection.

Getting the existing Netcool connection ID

  1. Get the connector name.

    oc get connectorconfiguration -l aiops.connector.type=netcool-connector
    
  2. Set the connectionname variable with the connection name and namespace variable with the AIOps namespace.

    connectionname="netcool"
    namespace="aiops"
    
    oc project $namespace
    connconfig=$(oc get connectorconfiguration --no-headers | grep "$connectionname" | awk '{print $1}')
    connconfiguid=$(oc get connectorconfiguration $connconfig -o jsonpath='{.metadata.uid}')
    
  3. Make a note of the connection ID value which will be used in the subsequent steps. Make sure the variable is not empty before proceeding to the next step.

    echo $connconfiguid
    

Clearing the AIOpsAlertId and AIOpsState columns in Netcool alerts.status table

  1. Log in to the ObjectServers by using the NCO_SQL ($OMNIHOME/bin/nco_sql) utility and run the following command:

    -- Clear AIOps columns
    update alerts.status set AIOpsAlertId='',AIOpsState='';
    go
    
  2. Exit the NCO_SQL utility.

Closing AIOps alerts using the connectionId as a filter

  1. Find one of the ir-core-ncodl-std pods.

    irCorePod=$(oc get pods --no-headers | grep 'ir-core-ncodl-std' | awk '{print $1}' | head -n 1)
    
  2. Get into the pod terminal.

    oc exec -it $irCorePod -- /bin/bash
    
  3. Update the following command and replace <connectionId> with the connection ID value that you noted earlier. This command lists all alerts that match the filter.

    curl --insecure -X GET -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27<connectionId>%27"
    

    Example command with a4e3c212-5f84-4be3-989b-d3f293f0183e as the sender.connectionId filter.

    curl --insecure -X GET -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27a4e3c212-5f84-4be3-989b-d3f293f0183e%27"
    
  4. Run the following command to close the alerts that match the sender.connectionId filter, replacing the example connection ID with your own value.

    curl --insecure -X PATCH -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27a4e3c212-5f84-4be3-989b-d3f293f0183e%27" -d '{"state": "closed"}'
    
  5. Alerts from the old connection should now be closed.

Dynatrace topology does not support proxy target system

The Dynatrace topology feature in the Dynatrace metrics, events, and topology integration does not currently support a proxy target system.

Error messages in the agent log when the Dynatrace integration is disabled

There is currently a known issue whereby you may see the following error messages if you disable the Dynatrace integration:

java.lang.ArrayIndexOutOfBoundsException
Skipping concurrent rule based metrics collection

Dynatrace topology unable to create a second instance on the same cluster

Currently Dynatrace topology only allows you to create a single local deployment in a cluster. If you attempt to create a second Dynatrace topology instance using another name, the attempt will fail and no new pods will be created in the cluster.

Output to Db2 via the connector is not keeping up

There is currently a known issue with the IBM Db2 integration whereby even at a very low alert rate, the output to the Db2 instance through the connector cannot keep up. If you are expecting an alert rate of around 700/s, you see the data in your Db2 instance appear with a delay of seconds.

Workaround: Increase the resource management and HPA on both the IntegrationRuntime and Db2 connector pods.

When using the Db2 integration, data may not be in sync when using an update trigger in a policy

There is currently a known issue with the IBM Db2 integration whereby the data in the INCIDENTS_REPORTER_STATUS table might not be in sync with IBM Cloud Pak for AIOps updates if you are using an update trigger in the policy.

When there is just a create trigger in the policy, the data is valid. But if you are using an update trigger, the updates happen asynchronously, so the latest update can be overridden by the previous update if they happen at the same time.

There is no workaround currently available.

Db2 integration shows a status of Running regardless of whether correct login details have been specified

There is currently a known issue with the IBM Db2 integration whereby the Data collection status of the Db2 integration always displays a green tick to indicate that it is running even if incorrect login details have been specified when creating or updating the IBM Db2 integration.

There is no workaround currently available. If the Db2 connector does not return any table or the Db2 database appears not to be receiving incident or alert data, check the log files to see whether you have specified the Db2 login details correctly, and update the Username and Password fields in the IBM Db2 integration accordingly.

Integrations stuck in Initializing state

There is currently a known issue whereby integrations occasionally appear to be stuck in the Initializing state.

This issue will resolve itself after a delay.

ServiceNow incidents are missing on historical pull

If you do a historical pull of incidents on GitHub or Jira first and then on ServiceNow, you might notice that the ServiceNow incidents are missing.

The problem is due to changes in the ServiceNow incident schema. To resolve this issue, delete the snow incident index, which all the ticket systems use. Use the following steps to fix the issue with the schema:

  1. Open a terminal window.

  2. Run the following 4 commands to enable the port forwarding:

    export EL_SECRET_NAME=`oc get AIOpsEdge aiopsedge -o jsonpath='{.spec.elasticsearchSecret}'`
    export EL_USER=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.username|base64decode}}"`
    export EL_PWD=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.password|base64decode}}"`
    oc port-forward aiops-ibm-elasticsearch-es-server-all-0 9200:9200
    
  3. Open another terminal window and run the following 4 commands to delete the snow incident index:

    export EL_SECRET_NAME=`oc get AIOpsEdge aiopsedge -o jsonpath='{.spec.elasticsearchSecret}'`
    export EL_USER=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.username|base64decode}}"`
    export EL_PWD=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.password|base64decode}}"`
    curl -X DELETE --user $EL_USER:$EL_PWD https://localhost:9200/snowincident/ -k
    
  4. Re-run the ServiceNow historical pull to populate the index again. Your incidents now show up and can be used for similar incident training. If ServiceNow is used first, then historical pulls to GitHub and Jira do work.

Topology status processing of Instana events slows over time

There is currently a known issue whereby the Topology Status pod can run more slowly over time due to the way that Instana events are processed by IBM Cloud Pak for AIOps, until it is ultimately unable to keep up with incoming alerts. As a consequence, the Instana Event Collector should not currently be used; instead, use a webhook for the collection of events from Instana.

For details about creating a webhook integration, see Creating Generic Webhook integrations.

For details of how to configure a webhook as an alternative way of collecting Instana events, see the Utilise AIOps Generic Webhook with JSONata mapping to ingest Instana alerts blog.

Instana connector not correctly generating event clears for events incoming

The deduplicationKey that is generated by the Instana connector depends on the resourceName field. For a given Instana event, different values for resourceName can be generated based on the timing of the topology call. This behavior can result in duplicate alerts as well as orphaned alerts, as the clear events do not match the initial alert that was created.

Workaround: The Instana Event Collector should not currently be used and a webhook should be used for the collection of events from Instana instead.

For details about creating a webhook integration, see Creating Generic Webhook integrations.

For details of how to configure a webhook as an alternative way of collecting Instana events, see the Utilise AIOps Generic Webhook with JSONata mapping to ingest Instana alerts blog.

New Relic observer job fails with an authorization error

You might notice that the New Relic observer job fails due to an authorization error. The issue occurs due to changes in the New Relic API.

The error can resemble the following example:

ERROR  [2025-02-06 16:02:51,209] [pool-12-thread-1] c.i.i.t.o.n.j.NewRelicLoadJob -  Failed to validate connection: javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized

There is no workaround currently available.

Applications and topologies

Composite resources with differing geolocation markers are plotted separately in the Resource map

On rare occasions, a composite resource can contain more than one geolocation marker. All of these markers are plotted on the Resource map. If one of these locations falls outside the displayed map area, its status is not displayed.

Workaround: None. Be aware of this behavior when viewing composite resources on the Resource map.

Resources with a _compositeId value have 'Related Services' and 'Related resource groups' tabs disabled

Group and service information cannot be fetched for resources that are part of a composite.

This defect is only encountered when viewing the related service or related resource group details of a resource that is part of a composite. This resource will have the 'Related services' and 'Related resource groups' tabs disabled.

Workaround: None.

JVM heap out-of-memory (OOM) failures when loading large number of resources

When running topology loads in quick succession, it is possible to experience some OOM errors and undesired topology pod restarts, even though the pods will continue the processing after restarting.

This error can occur when you run resource loads of several million resources in a large deployment, and it can slow down the loading process. The following type of error message can be seen in the pod logs:

WARN [2022-10-25 15:43:31,906] [JanusGraph Session-io-4] c.d.o.d.i.c.c.CqlRequestHandler - Query ‘[4 values] SELECT column1,value FROM janusgraph.graphindex WHERE key=:key AND column1>=:slicestart AND column1<:sliceend LIMIT :maxrows [key=0x02168910cfd95b7e3bc74006a4a8a73a79c71255a0726573...<truncated>, slicestart=0x00, sliceend=0xff, maxrows=2147483647]’ generated server side warning(s): Read 5000 live rows and 1711 tombstone cells for query SELECT value FROM janusgraph.graphindex WHERE key = 02168910cfd95b7e3bc74006a4a8a73a79c71255a07265736f757263e5 AND column1 > 003924701180012871590290012871500290 AND column1 < ff LIMIT 5000; token 9157578746393928897 (see tombstone_warn_threshold) JVMDUMP039I Processing dump event “systhrow”, detail “java/lang/OutOfMemoryError” at 2022/10/25 15:43:32 - please wait. JVMDUMP032I JVM requested System dump using ‘/tmp/cassandra-certs/core.20221025.154332.1.0001.dmp’ in response to an event

Cause: Not enough headroom exists between JVM memory limit and the pod memory limit, usually because one was increased without also increasing the other.

Workaround: Ensure that any changes in heap size maintain enough headroom between these settings.

Example: In this example (for a topology size1) the pod limits are set to 3.6 GB while the maximum memory for the JVM (-Xmx) is set to 3 GB, thereby leaving 0.6 GB of headroom free for use by the OS.

size1:
   enableHPA: false
   replicas: 2
   jvmArgs: "-Dcom.ibm.jsse2.overrideDefaultTLS=true -Xms1G -Xmx3G"
   resources:
      requests:
         memory: "1200Mi"
         cpu: "2.0"
      limits:
         memory: "3600Mi"
         cpu: "3.0"

Critical error message displayed when attempting to render an application

This problem occurs when all of the groups within the application that you are attempting to render have no members. When the application is selected in Application management, it does not render, and a critical error message is displayed in the UI.

Avoid creating applications with no members. If an application with no members was created for test purposes only, then ignore this error.

Different date and time in Cloud Pak for AIOps console and ChatOps between users

The date and time format for an Incident in the IBM Cloud Pak for AIOps console Application management tool and the associated ChatOps notification can be different between users. The format and time zone that is used in the Cloud Pak for AIOps console and ChatOps notification is set to the user's locale. If different users are in different time zones, the displayed date and time are different in the Cloud Pak for AIOps console and ChatOps notification.

Deleting a tag template can cause out-of-memory errors

If a tag is applied to a large number (that is, thousands) of topology resources, then deleting the tag template can cause out-of-memory errors with the topology-merge pod.

Avoid creating tag templates that use tags that occur with such frequency. Do not tag thousands of resources with the same tag, and avoid using such tags in a group.

The Find Path tool ignores filters

The topology path tool fails to launch with filters applied.

Launch the path tool without filters, then manually apply the filter settings on the path page.

Probable cause is not producing accurate results

The correlation algorithms for probable cause currently require the use of a Kubernetes model with service-to-service relationships, or the use of dependency relationships between non-Kubernetes resources.

Complete the following steps to create the required relationships for Kubernetes. This procedure configures Topology Manager to overlay relationships provided by the File observer onto the Kubernetes topology.

Note: The Kubernetes observer must be configured and loading data.

  1. Log in to the IBM Cloud Pak for AIOps console.

  2. From the main navigation, expand Operate and click Topology viewer.

  3. From the topology navigation toolbar, expand Settings, click Topology configuration.

  4. On the Rules tile, click Configure to navigate to the Rules administration page.

  5. On the Merge tab, click New to create a New merge rule.

    In this scenario, data that is provided by the File observer will be used to add the relationships.

  6. Specify the following information on the New merge rule page:

    1. Rule name: k8-file-service.

    2. Set the rule Status to Enabled.

    3. Add the uniqueId property to the set of Tokens.

    4. Expand the Conditions section and select File and Kubernetes from the set of available Observers and click Add.

    5. Specify service for Resource types and click Add.

    6. Click Save to save the new Merge rule.

  7. Locate the services that you want to relate together and make a note of their source-specific uniqueId, such as 05f337a1-5783-43bb-9323-dfba941455c7 (shipping) and ae076382-3df9-46cb-97e9-a0342d219efb (web).

  8. Create a file for the File Observer that contains the service-dependsOn-service relationships necessary for the correlation algorithms to work.

    The following example creates two services, web and shipping, and states that web dependsOn shipping. Repeat this as required to relate your services together.

    V:{"uniqueId": "05f337a1-5783-43bb-9323-dfba941455c7", "name": "shipping", "entityTypes": ["service"]}
    V:{"uniqueId": "ae076382-3df9-46cb-97e9-a0342d219efb", "name": "web", "entityTypes": ["service"]}
    E:{"_fromUniqueId":"ae076382-3df9-46cb-97e9-a0342d219efb", "_edgeType":"dependsOn", "_toUniqueId":"05f337a1-5783-43bb-9323-dfba941455c7”}
    
  9. Load this file into Topology Manager to relate the services. For more information, see Configuring File Observer jobs.

    If your topology changes, then re-create and reload the file as required. A similar process can be followed for non-Kubernetes sources.

High volumes of data can cause Spark workers to run out of memory

If your environment handles high workloads (10+ million) of alerts or events, your Spark workers can run out of ephemeral storage. If you encounter this issue, restart the affected Spark workers. This issue can also occur if you are running multiple jobs, which can cause the file system to fill up, such as with log or JAR files.

Azure observer missing subnet relationship in topology

For the Azure Observer, a subnet can be intermittently missing the relationship with an IP address in the topology for a resource. While the relationship can be intermittently missing, both the subnet and IP address vertices remain available in the topology.

Topologies not visible on Incident Topology page after resource merge

When resources from two observer sources have been merged using the topology merge functionality, the topology is no longer displayed in the Incident view. This known issue affects only the Incident view, and the topology is still present in all other views.

OpenStack observer missing edge-runsOn connectivity in topology

After you run an OpenStack observer job, the edge-runsOn connectivity between ComputeHost and Hypervisor elements is not shown in Resource management > Resources when it should be.

Topology viewer user interface crashes after the update manager is displayed

The issue is that you cannot use the update manager feature in topology viewer. The solution is to change your topology viewer user preferences to auto render changes on refresh, which prevents the update manager from appearing.

Infrastructure Automation

Kubernetes permissions are missing for user roles for using Managed services and the Service catalog

If you install Infrastructure Automation, you, or an administrator, must add the required Kubernetes permissions to user roles before your users can begin to access and use Managed services or the Service catalog.

As an administrator, add the following permissions to your user roles:

Table. Required permissions (Role: Required permission for Infrastructure Automation)
Automation Administrator: Administer Kubernetes resources
Automation Operator: Manage Kubernetes resources
Automation Developer: Edit Kubernetes resources
Automation Analyst: View Kubernetes resources

For more information about how to add permissions to a role, see Managing roles for Infrastructure Automation.

Non-LDAP users cannot access Infrastructure Management

Non-LDAP authenticated users cannot be used with Single Sign-On for Infrastructure Management. While you are logged in to the Infrastructure Automation UI console with a non-LDAP user, attempting to start Infrastructure Management fails with the following error. This is a limitation.

OpenID Connect Provider error: Error in handling response type.

Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported

Installation of Infrastructure Management in IBM Cloud Pak for AIOps does not support Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core. You can continue to use the Kubernetes cluster life-cycle templates and services to create a Kubernetes cluster and import the cluster to an existing installation of Red Hat Advanced Cluster Management, if an installation is available. Deploying hybrid applications is also not supported by Infrastructure Automation.

Users are redirected to the Administration panel when logging back into the UI

When you are working within Infrastructure Automation and log out and then log back in, you can be redirected to the Administration panel instead of the Infrastructure Automation home page or other page that you were previously using. If this occurs, you can use the Cloud Pak switcher in the upper right of the UI console to switch to the Infrastructure Automation home page and then return to the page that you were previously using.

Database fails to reset when error occurs during database creation for Infrastructure Management

If you are creating the database for the Infrastructure Management appliance and you encounter an error, such as the database creation failing to complete successfully, you might not be able to continue with your setup without redeploying. For instance, if the creation fails, resetting the database to clean up your database and deployment can also fail. To resolve this issue, you need to redeploy the Infrastructure Management appliance image before reattempting to create the database.

The cam-tenant-api pod is not in a ready state after installing the iaconfig CR

After you install Infrastructure Automation, you can encounter an error where the cam-tenant-api pod displays as running, but not in a ready state. When this error occurs, you can see the following message:

[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]

If this error occurs, delete the cam-tenant-api pod to cause the pod to restart and attempt to enter a ready state.
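
For example, you can delete the pod with a command like the following (a sketch; the pod and project names are placeholders):

oc delete pod <cam-tenant-api-xxxx> -n <namespace>

Where <namespace> is the project (namespace) where Infrastructure Automation is installed, and <cam-tenant-api-xxxx> is the name of the cam-tenant-api pod.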

Print or export as PDF for entire tables does not work as expected

If you are using the Firefox browser, and you select Print or export as PDF on the Compute > Containers > Projects page to print or export the entire table of data, the print or export might not work as expected. For instance, some data, such as table rows, might be missing. If you encounter this issue, try a different browser for printing or exporting the data.

Infrastructure Management log display in the UI is removed

Log display support on the UI is removed for Infrastructure Management. As an alternative for viewing these logs, use standard Kubernetes methods such as oc logs commands, viewing the output in Red Hat OpenShift Container Platform or Kubernetes, or setting up a log aggregator for your cluster.

You can still see the log tabs (Collect Logs, IA:IM Log, Audit Log, and Production Log) on the Settings > Application Settings Diagnostic page. However, instead of displaying the log information, the following message is displayed: Logs for this IA:IM Server are not available for viewing.
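
As a minimal sketch of the alternative, assuming placeholder pod and project names, you can view logs with the oc CLI:

oc logs <im-pod-name> -n <namespace>
oc logs -f <im-pod-name> -n <namespace>

The -f option follows the log output as new entries are written.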

Infrastructure Automation Test deploy fails

A Test deploy that is started from a Service Overview page in Infrastructure Automation fails.

On Infrastructure Management appliances, an Ansible playbook deployment fails

When you attempt to deploy an Ansible playbook on an Infrastructure Management appliance through an embedded Ansible deployment, the playbook deployment can fail with the following error:

<35.237.119.31> ESTABLISH SSH CONNECTION FOR USER: ubuntu
fatal: [35.237.119.31]: FAILED! => {
"msg": "Unable to create local directories(/home/manageiq/.ansible/cp): [Errno 13] Permission denied: b'/home/manageiq'"
}

If you encounter this error, log in to the appliance as the root user and then deploy the playbook again:

  1. Run the command:

    mkdir -p /home/manageiq
    
  2. Run the command:

    chown manageiq:manageiq /home/manageiq
    
  3. Deploy the Ansible playbook again.

After restoring Managed services from a backup, the Managed services deployment fails

After you restore Managed services (cam) from a backup, the deployment instance fails with a socket hang up error.

If this error occurs, restart the cam-iaas pod by running the following command:

oc delete pod <cam-iaas-xxxx> -n <namespace>

Where <namespace> is the project (namespace) where Infrastructure Automation is installed, and <cam-iaas-xxxx> is the name of the cam-iaas pod to restart.

With this restart, the service deployment can complete successfully.

Infrastructure Management fails to save container provider after changing to a new token

Infrastructure Management fails to save changes to a container provider after updating the access token when the Metrics collection is enabled.

Workaround: After updating and validating the new token in the Edit provider dialog box, switch to the Metrics tab and validate the existing endpoint. The Save button is now enabled.

Infrastructure Automation install fails on FIPS enabled Power cluster

There is an intermittent issue with the installation of standalone Infrastructure Automation on a FIPS enabled Linux on Power (ppc64le) cluster. This occurs when AllNamespace or OwnNamespace mode is used. A problem with the events operator pod causes the installation to fail.

Embedded Terraform feature not supported with FIPS enabled OpenShift Container Platform cluster

The Embedded Terraform feature in Infrastructure Automation is not yet supported in a FIPS enabled OpenShift Container Platform cluster.

Auto-generated Service Dialog contains the wrong values for variables of type Integer and Boolean

You might notice that the auto-generated Service Dialog contains the wrong types for Integer and Boolean variables.

The workaround for this issue is to manually edit the auto-generated Service Dialog. For a variable of type Integer, set the Value type field to Integer and the Validation field to Yes.

For the variable of type Boolean, delete the field and replace it with a checkbox.

UI console

Tour icon disappears when browsing to another console page while a guided tour is running

When you start a guided tour and navigate away from the page to the IBM Automation Home page, the Tour icon might not display on the toolbar. This behavior can occur when an IBM Cloud Pak for AIOps tour is still running; only one guided tour can run at a time. To resolve this issue, return to an IBM Cloud Pak for AIOps page, click the Tour icon, and close the tour. When you return to the IBM Automation Home page, the Tour icon reappears.

'Data cannot be displayed' error for the 'Defined Applications' and 'Favorite Applications' tiles on the Home page

When you open the console home page, neither the 'Defined Applications' nor the 'Favorite Applications' tile lists any applications. However, the applications do exist and can be viewed under the 'Resource Management' section. To fix this issue, restart the aiops-base-ui pods.
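
For example, a minimal sketch of the restart, assuming a placeholder project name:

oc get pods -n <cp4aiops_namespace> | grep aiops-base-ui
oc delete pod <aiops-base-ui-pod-name> -n <cp4aiops_namespace>

The deleted pods are automatically re-created.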

User credential timeout starts in the backend

The first indications of a user credential timeout might be backend failures. For example, failure to load incident, alert, or policy lists. To resolve this issue if it occurs, log out and log back in again, or wait for the frontend logout to occur.

Unable to access Identity Providers link in console

You might notice that you are unable to access the Identity Providers link located in the UI console. You might see an error such as HTTP ERROR 431. If this issue occurs, configure the LDAP connection. For more information, see Identity Management (IM).

Slow loading pages

If you are experiencing slow loading of Cloud Pak for AIOps pages, it might be because the server TLS certificate is not trusted by your browser. Some browsers (for example, Microsoft Edge and Google Chrome) prevent caching of resources when the server’s certificate is untrusted. This means all static resources associated with the page are fully reloaded on every refresh, significantly slowing down page loads.

To resolve the issue, use a certificate signed by a certificate authority that is trusted by your client devices. This can be a certificate signed by a well-known certificate authority or by an internal certificate authority that is preconfigured on your client devices. For more information, see Using a custom certificate.
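
To check which certificate the console is serving and which authority issued it, you can use a command like the following sketch (the host name is a placeholder):

openssl s_client -connect <cp4aiops-console-host>:443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates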

Alert Viewer page shows 400 Error

You might notice that you are unable to access the Alert Viewer page. You might see a 400 error. This error is caused when the cookies for the IBM Cloud Pak for AIOps domain in your browser get too large for the data layer service to consume.

To resolve this issue, clear the cookies in your browser and then reload the page.

AI Model management and training

Log parsing assigns messages to catch-all template instead of generating expected template

If you use catch-all templates for mapping uncategorized messages during AI model training, you can encounter an issue where the log parsing assigns messages for an error to the catch-all templates instead of generating an expected template for that error. If this issue occurs, you might not see expected anomalies.

If you suspect this issue is occurring and you do not see expected anomalies, complete the following steps to manually verify your training templates, and remove any catch-all templates that were incorrectly generated.

  1. Retrieve the normalized logs from your logtrain indices. (A sketch of an example query is shown after these steps.)

  2. Identify the logs that are error logs. Review those logs to determine the template mappings.

  3. Retrieve the identified templates from Elasticsearch.

  4. Use the error log contents and the template ID from the retrieved normalized logs to identify the template string within the retrieved templates.

  5. If the template string consists entirely of parameters, or of a single word and parameters, the template might be a catch-all template. For example, the following strings are examples of catch-all templates:

    <>to <><><><><>
    <> <> <> <>-<>-<> <> <> <> <> <> <> <> <> <> <>
    
  6. Manually delete any catch-all templates.
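
The following is a minimal sketch of retrieving logs from a logtrain index by using the Elasticsearch REST API (step 1). The route, credentials, index name, and query string are placeholders and assumptions; adjust them for your environment:

curl -k -u <elasticsearch-user>:<elasticsearch-password> \
  "https://<elasticsearch-route>/<logtrain-index-name>/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "size": 20, "query": { "query_string": { "query": "error" } } }'

The same _search endpoint can be used against the index that holds your templates in step 3.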

Elasticsearch record count does not match record count published to Kafka topic

When you push a large training file (for example, 60 million records, such as logs or events) to Kafka through your configured integration, the number of records that are ingested and displayed in Elasticsearch might not match the number that you pushed. The Elasticsearch record count can be lower than the Kafka count because of deduplication. If you encounter this issue, split large files into smaller batches (for example, 5 million records each) and send them to Kafka individually. After you push a batch, wait for the ingest to complete and for the associated records to display in Elasticsearch before you push the next batch of records.
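
For example, a minimal sketch of splitting a newline-delimited training file into batches of 5 million records each by using the split command (the file name is a placeholder):

split -l 5000000 <large-training-file> training-batch-

Each resulting training-batch-* file can then be sent to Kafka as a separate batch.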

Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade

If you set a value for the EXPIRY_SECONDS environment variable and upgrade, the environment variable is not retained after the upgrade.

After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly detection alerts.
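
As a minimal sketch of reapplying the variable with the oc CLI, assuming a placeholder deployment name (use the deployment that is documented in Configuring expiry time for log anomaly detection alerts):

oc set env deployment/<log-anomaly-deployment> EXPIRY_SECONDS=<value> -n <cp4aiops_namespace>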

Log anomalies are not detected by natural language log anomaly detection algorithm

In some cases a model that has been trained successfully is unable to detect certain log anomalies. The quality of the model is independent of whether it trained successfully, and model quality tends to improve as more training data is available. If the model is not detecting anomalies in your logs, consider training the model again but using additional days of training data to improve the model quality.

Metric anomaly detection training does not run on schedule

If you have metric anomaly detection training scheduled to run, such as daily, you can encounter an issue where the training does not run as scheduled. If the training job does not run on schedule, log in to the IBM Cloud Pak for AIOps console, click the Metric anomaly detection algorithm tile, and then click Train models.

In Change risk training, Precheck indicates “Good data” but models fail to create

On rare occasions, a Change risk model fails to create, even though Precheck data indicates that the data is good. This failure is caused by an insufficient number of problematic change risk tickets being available to create a good model. This problem resolves itself when enough tickets become available for the model. (For more information, see Closed change ticket count requirements).

To confirm that an insufficient number of problematic tickets is causing the failure, view the Change risk logs on the training pods.

To retrieve the pod:

oc get pod | grep training-cr

View the logs for training Change risk models.

oc logs <pod-name> # Ex: training-cr-1b5ef57f-9053-4037-95ca-c1e8b8748fc5

Check whether the log contains the following message:

size of the problematic (aka labels) tickets is insufficient
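
For example, a quick check for the message (a sketch; the pod name is a placeholder):

oc logs <pod-name> | grep -i insufficient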

If confirmed, ensure that enough problematic change tickets are available before training the model again.

Alerts for the Log Anomaly - Golden Signals algorithm are not generated when inference log data contains name and value pairs

In IBM Cloud Pak for AIOps 4.8.1, alerts might not be generated for the Log Anomaly - Golden Signals algorithm when inference log data contains name and value pairs. These pairs are tokens with the key=value pattern. They might prevent anomalies from being matched to their respective templates.

For example, training generates the following log template:

<> exe="/usr/bin/dbus-daemon" sauid=UNKNOWN_VAR hostname=? addr=? terminal=?'

Then, during inference, incoming log data is matched against the template from training. If the incoming log data contains tokens with the key=value pattern like in the following example, the logs are classified as unmatched.

[3557470.922719]  exe=\"/usr/bin/dbus-daemon\" sauid=103 hostname=? addr=? terminal=?'
[3557971.107893]  exe=\"/usr/bin/dbus-daemon\" sauid=103 hostname=? addr=? terminal=?'

Alerts from this set of logs are not displayed.

Counts against template patterns are not updated in the training UI

When the log anomaly detection - golden signals algorithm generates alerts in IBM Cloud Pak for AIOps 4.8.1, the counts against those template patterns are not updated in the training UI table. Enable historic alert storing in Elasticsearch to access alert counts.

Workaround: Enable the components by editing the installation. Use the following commands to access the installation through the command line. Replace <installation-name> with the name of the installation:

oc get installation
oc edit installation <installation-name>

Or access the installation with the Red Hat® OpenShift® console. Go to Operators > Installed Operators > IBM Cloud Pak for AIOps > IBM Cloud Pak for AIOps and edit the YAML file for aiops-installation.

After you access the installation, edit it to include the following values:

spec:
  automationFoundation: {}
  license:
    accept: true
  pakModules:
    - config:
      - name: ir-core-operator   # Find the config item with this name, or add this item if it does not exist
        spec:
          issueresolutioncore:
            customSizing:
              deployments:
              - name: datarouting
                replicas: 1   # Use 3 for large deployment
              - name: esarchiving
                replicas: 1
      enabled: true
      name: applicationManager  # Find the pakModules item with this name, or add this item if it does not exist

After the components are enabled, new alerts are stored to Elasticsearch indices, and the alert counts for subsequent alerts are updated correctly in the training UI table.

Similar tickets training in IBM Cloud Pak for AIOps on Linux

Similar tickets training is not available in IBM Cloud Pak for AIOps on Linux. You can manually add tickets from your ticketing integrations in the incident overview.

ChatOps

In Slack, a ChatOps communication to IBM Cloud Pak for AIOps times out without establishing a connection

When sending a ChatOps communication from Slack to IBM Cloud Pak for AIOps, a known intermittent issue can occur: the communication between Slack and IBM Cloud Pak for AIOps can time out after 3 seconds of no response. A potential solution is to reconfigure your connection in the connection onboarding, and to ensure that your Slack bot can reach your IBM Cloud Pak for AIOps instance promptly. A more permanent and robust solution to this issue is being devised.

In a Microsoft Teams ChatOps, the attach template logs feature does not work

If you have a Microsoft Teams ChatOps, clicking the Attach template logs button does not work and the logs are not sent to your Microsoft Teams channel for review. As an alternative, use the Preview logs button to view the template logs.

Incidents cannot be reopened or restored in ChatOps

When an incident is closed it is archived and can no longer be modified. If a new alert occurs that is related to the archived incident, a new incident is created instead of reopening the archived incident.

No Recommended runbooks found in incident overview

If a runbook that is recommended to remediate an incident is deleted from the runbook library, the Recommended runbooks link in a ChatOps notification is not removed. This can result in the ChatOps runbook section linking to an empty runbooks page in the incident overview.

No incident content viewable in Microsoft Teams on mobile devices

On mobile devices, when viewed in Microsoft Teams, incidents can appear with no viewable data. If this happens, switch to a computer to see the full incident data.

During a ChatOps secure tunnel creation an 'installation failed' message displays

If you create a ChatOps integration and it fails, wait a few minutes to see whether the installation retries. If it does not, create the integration again.

Incidents and alerts

Active incident count is wrong on the Resource management page

The Resource management page displays a number of incidents in the Active incidents column instead of displaying one incident with a number of alerts. The Application viewer displays the correct information, however.

This error can occur when related groups or applications are erroneously linked to the incident count.

Workaround: Verify the correct number of incidents and alerts on the Application viewer.

"An error occurred while fetching data from the server"

When viewing the Incidents UI or when creating a policy to assign runbooks to alerts, you might see the message "An error occurred while fetching data from the server". Or in the Alert Viewer, you might see "An unknown error occurred". If you encounter these error messages, complete the following steps to delete the aiops-ir-ui-api-graphql pod. The pod is then automatically re-created, which should resolve the error.

  1. Log in to your cluster by running the oc login command.

    oc login -u kubeadmin -p <password>
    
  2. Delete the aiops-ir-ui-api-graphql pod.

    oc delete pod -l component=aiops-ir-ui-api-graphql -n <cp4aiops_namespace>
    
  3. Wait for the pod to restart.

Alerts tab shows "An unknown error occurred" error when all alerts are closed

If you are viewing a closed incident that has all associated alerts resolved, you can encounter an error when you view the Alerts tab. This "An unknown error occurred" error displays when there are no associated alerts. You can ignore this error message as the incident and alerts are resolved and closed.

Same runbook status for multiple alerts in an incident

If more than one alert meets a runbook policy's conditions, the same runbook can be assigned to multiple alerts in an incident. From the incident overview page, you can select an alert and run an associated runbook. The runbook Status of the selected alert will be updated on the UI. However, the runbook status of other alerts in the incident might be updated with the same status. This is a known issue.

Closed incidents are missing details and displaying a critical error

Incidents with status of "Closed" are missing topology information on the incident Overview tab. The associated alerts of the closed incidents are also missing from Alerts tab. The following critical error is displayed on the Topology tab for closed incidents: "No resource exists with the specified identifier and time point".

Metric anomaly chart unexpectedly changes from zoom view to normal view

This can occur when a related alert is selected and added to a chart that you are zoomed in on. The chart resets to the normal view. To resolve the issue, zoom in again after the related alert is added to the chart.

Unable to add metric anomaly in Related alerts to chart

In some cases, when you select the checkbox in the Related alerts section, the related anomaly is not added to the metric anomaly chart.

Limitation of preview text for default recommended action

When an alert is generated from any default log anomaly detection model, the preview of a recommended action might contain partial text and not reflect the full recommended action. Where possible, the first 4000 characters are extracted from the original resolution or action document webpage, and nonreadable text such as URLs is excluded to form the preview text.

Metric search page chart lines are disjointed

In the Metric search page chart, the normalized forecast line is disjointed from the baseline data line because the forecast data is normalized independently of the baseline data. Although the lines might not match up, the values shown in the tooltips are correct.

Some alerts not cleared even with a resolution event

In scenarios where large amounts of historical event data are ingested into the system, it is possible for problems and resolutions to be processed out of order, sometimes resulting in alerts not clearing as expected. To avoid this issue, try ingesting smaller batches of event data into the system.

Alert views are unusable without ID and SUMMARY columns

When creating views in the Alert Viewer, you must include the ID and SUMMARY columns in the view. Otherwise, the view will be unusable and can only be deleted by using the API.

Filter on short ID in the incident table doesn't work

You cannot use the short ID from the incident table as a filter condition under Other properties. Instead, use the full incident "id" which can be found in the incident details side panel > raw data format.

Policies

Condition Values field changes "String:" to "Value of:"

For example, in a policy condition, if the string "alert.id" is typed in the Values field and "String:alert.id" is then selected, it is changed to "Value of:alert.id".

To prevent this, avoid a string that exactly matches the keyword. In this example, use the following condition instead:

Figure. Policy condition

Note: This example is not an exact match of the string alert.id. This workaround finds all summaries containing alert. and .id.

Breadcrumb navigation missing in policy editor

In some cases where a policy name is long, the breadcrumb navigation in the top left of the policy edit session can be abbreviated. Clicking the breadcrumb still returns you to the Policy UI.

Condition "Matches" field for numeric "Operator" selections

When you use a numeric Operator in a policy condition set (for example, greater than, less than, or greater or equal), all options can be selected under Matches. However, always select Only when you use a numeric operator.

"Last run" time and "Matched" count updated in policies other than the trigger policy

In a case where an alert meets the incident-creation conditions of multiple policies, only one incident is created. However, all policies that proposed an incident, and the system incident creation policy, have the same Last run time on the Policies hub. Each of these policies also increments its Matched count by 1 in the Details tab of the side panel.

Double scroll bars on browser window

An extra scroll bar might appear on the right side of the browser window.

Policy processing failing due to long policy conditions

Policies that have conditions that are long (100,000+ characters) can cause policy processing to fail, resulting in no alerts or incidents being created. Failure can occur when temporal correlation training generates groups containing many alerts, and those alerts have long resource or type fields.

If this problem occurs, disable such policies from the automation policy UI. Identify the policies by:

  1. Filtering for analytics policies.
  2. Sorting by last run time (to identify those that have been processed recently and are likely triggering the problem).
  3. Viewing the specification for each to see whether any have a long condition.

Disable any policies that are identified by these steps.

No ServiceNow ticket created by incident creation policy

An alert meets the incident-creation conditions of a policy, but no ServiceNow ticket is associated to the incident.

To avoid this issue, complete the following steps under Actions in the policy editor:

  1. Click Assign and notify.
  2. Select In an existing channel.

Datarouting pods can fail and need to be restarted

After installation, data for display in the UI, such as in the policy list, can be missing or stale.

If this issue occurs, restart the datarouting pods. You can identify these pods by using the following command:

oc get pods | grep datarouting
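
For example, a minimal sketch of the restart, assuming placeholder pod and project names:

oc delete pod <datarouting-pod-name> -n <cp4aiops_namespace>

The deleted pod is automatically re-created.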

Using default values in automatically run runbooks not working

This problem can occur when you select a Default parameter value while creating a policy to assign runbooks to alerts. The useDefault value is not passed during automatic execution, but is passed during manual execution.

As a workaround, you can run the runbook manually from the Runbooks page.

Netcool alert not suppressed when X in Y suppression policy conditions are met

If you create an X in Y suppression policy that matches an alert originating from an IBM Tivoli Netcool/OMNIbus environment, the alert will not be suppressed.

Cloud Pak for AIOps Impact policy fails to run when 'No input' is selected

When enabling a Cloud Pak for AIOps Impact policy to Invoke an IBM Tivoli Netcool/Impact policy, the policy fails to run when No input is selected under the policy parameter mapping options. To avoid this, select one of the other mapping options. For example, select Send the full alert.

Policy triggered based on 'incident-updated' runs more times than expected

If you have a Cloud Pak for AIOps policy that is triggered based on incident-updated, in some cases, the policy might run more times than expected. This is because not all updates can always be processed atomically and the incident might then be updated multiple times. In turn, the incident-updated trigger will be activated more than once.

Parameter mapping issues in Netcool/Impact policies

The following known issues have been observed when creating policies to invoke IBM Tivoli Netcool/Impact:

  • The policy cannot be created with a parameter mapping option of No Input selected. The policy is instead saved with a Send the full alert mapping.
  • A Netcool/Impact policy cannot be edited to change the parameter mapping option. The change is not persisted. To work around this problem, create a new Netcool/Impact policy with the new parameter mapping option and disable the current policy.
  • When a Netcool/Impact policy is created with a trigger entity of Incident and a parameter mapping option of Send the full incident, the mapping changes to the Customize option if the policy is edited.

Can't use alert.details.name with "any of" in a policy condition

When alert.details is selected in the property field or the value field, the Details name field is an optional input where you can minimize the scope to a singular key within the alert's details. However, Details name cannot be used in a policy condition with the Matches option of "any of".

Policy is triggered against alerts that don't match the condition

This can occur when you try to compare a NULL value in the policy condition. The policy is triggered because you can have a condition of NULL = NULL matched when the parameters referenced in the policy are not present.

To avoid this problem, you can ensure that the alert property is not equal to NULL in the condition set. For example, see the following policy conditions for alert.details.name:

Figure. Policy conditions to avoid comparing a NULL value

WebSphere resolution action recommendation policy showing a failed status

The preset policy "WebSphere resolution action recommendation policy" might initially show a failed status in the policy table. If you encounter this issue, the policy status should self-correct after a period of time.

Secure Tunnel

Secure Tunnel connector is not running after a restart with Podman

This issue can occur when you install the Secure Tunnel connector on a host machine where Podman is installed. When the host machine is rebooted and the Secure Tunnel connector is checked by using the podman ps -a command, the Secure Tunnel connector container does not show a running status.

If this issue occurs, the podman-restart service must be activated by using the systemctl command:

systemctl start podman-restart
systemctl enable podman-restart

After you enter the commands, check that the podman-restart service worked by using the following command:

systemctl status podman-restart

If the Connector is still not running, try restarting the host machine.

Runbook Automation

When alert.suppressed value is used, runbook does not automatically run

Normally, you can select a runbook and configure it to run automatically: when an alert is converted to an incident, the runbook is assigned and runs automatically. However, if the parameter value alert.suppressed is used, the runbook does not run automatically because the value is read as a Boolean value rather than a string value. Therefore, you must run the runbook manually.

AIOps Insights

Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events

In the AIOps Insights dashboard, the Noise reduction chart normally indicates the number of alerts, events, and incidents reported over a specific time period. However, the inclusion of historical data containing longstanding, unresolved alerts can skew the data that is presented on the chart: the number of incidents that are presented can exceed the number of alerts and events. Normally, the number of incidents is less than either, because events reduce to a smaller number of alerts, and alerts reduce to a smaller number of incidents.

The anomalous incident number occurs because the chart counts the alerts and events that are generated in the selected time period (for example, 7 Days). However, the incidents are generated from all outstanding alerts, including unresolved historical alerts that occurred before the selected time period. So, in these circumstances, while the number of alerts and events is correct, the number of incidents is not.

AIOps Insights dashboard fails to load even when data is available

Large amounts of data can cause the dashboard to fail to load or to time out, with the message Error – Metrics unavailable displayed for each chart. The problem is a scaling issue: the AIOps Insights dashboard cannot yet handle very large amounts of data. A possible workaround is to increase resources for the insights-api and elasticsearch pods, although this approach might not be successful.

Events not showing up on Noise reduction chart

The charts in AIOps Insights cover a timeline no greater than 30 days. The dashboard reads the firstOccurrenceTime value from only within that period. If an alert was created outside of that timeline, and deduplicated, it is not added to the eventCount in the AIOps Insights Noise reduction chart. In this scenario, the eventCount for the alert increments in IBM Cloud Pak for AIOps, but not in the Events segment of the Noise reduction chart.

Ticketing

IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details

In a proactive ChatOps channel, you can click the Change request ticket URL to view its details in a ServiceNow instance. However, in some cases in version 4.1.1 of the ServiceNow App, details might not be displayed in the IBM Change Risk Assessment tab.

To avoid this issue, update the ServiceNow App to version 4.2.1 or higher.

Select incident state transitions are not permitted in ServiceNow

Cloud Pak for AIOps enforces certain state transitions for its incidents. To synchronize incident data with ServiceNow, the following state transitions are not permitted and must also be restricted in ServiceNow:

  • New -> On Hold
  • On Hold -> New
  • On Hold -> Resolved
  • On Hold -> Closed
  • On Hold -> Cancelled

An incident must be set to In Progress before it can go On Hold. An On Hold incident can only transition to In Progress.

Data synchronization in ServiceNow is not working consistently

If you have a ServiceNow integration, you can encounter an issue where updates to records in ServiceNow are not displaying for those incidents and alerts within Cloud Pak for AIOps.

To address this issue, run the following command from the namespace where Cloud Pak for AIOps is installed:

oc set env deployment/$(oc get deploy -l app.kubernetes.io/component=chatops-orchestrator -o jsonpath='{.items[*].metadata.name }') GUNICORN_TOTAL_WORKERS=1

If this issue occurs, review the following tasks to ensure your integration and processes are set up correctly: