Known issues and limitations

Review the known issues for IBM Cloud Pak® for Watson AIOps.

Additionally, the troubleshooting documentation covers common issues. For more information, see troubleshooting.

Install and upgrade

OpenShift Container Platform 4.10 FIPS limitation

The ASM operator fails to create secrets on FIPS clusters for Red Hat OpenShift Container Platform Version 4.10.

Limitation on number of instances

IBM Cloud Pak for Watson AIOps and Infrastructure Automation can co-exist on the same cluster, but you cannot have multiple instances of IBM Cloud Pak for Watson AIOps or Infrastructure Automation on the same cluster.

Manual adjustments are not persisted

Custom patches, labels, and manual adjustments to IBM Cloud Pak for Watson AIOps resources (such as increased CPU and memory values) are lost when an event such as upgrade, pod restart, resource editing, or node restart triggers a reconciliation. Reconciliation causes any manually implemented adjustments to be reverted to their original default values. Depending on the parameters that you want to adjust, you might be able to use a custom profile to persist your changes. For more information about custom profiles, see Custom profiles.

Services fail to connect to Cassandra

After you install IBM Cloud Pak for Watson AIOps for a production environment deployment, various services might not be available due to connection issues with Cassandra. To resolve this issue if it occurs, restart Cassandra and the schema creation pods.
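
For example, a minimal sketch of that restart using standard oc commands; the pod names are placeholders and depend on your deployment:

oc get pods -n <namespace> | grep cassandra
oc delete pod <cassandra-pod-name> -n <namespace>
oc delete pod <schema-creation-pod-name> -n <namespace>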

The ibm-aiops-orchestrator pod throws an OOMKilled error

If your environment has many secrets and ConfigMaps, when the ibm-aiops-orchestrator (lead operator) attempts to build its cache, the operator can exceed its memory allocation and cause a Kubernetes out-of-memory error for the container. This error can prevent the IBM Cloud Pak for Watson AIOps installation from reconciling, blocking the installation from completing.

If you encounter this issue, the operator requires more memory resources to build its cache. Override the subscription resource to increase the memory limits for the pod and avoid the out-of-memory issue.
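
For example, Operator Lifecycle Manager supports resource overrides on the Subscription resource. The following sketch assumes that the subscription is named ibm-aiops-orchestrator and shows only the fields that are relevant to the override; adjust the name, namespace, and memory values for your environment:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-aiops-orchestrator
  namespace: <namespace>
spec:
  config:
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 2Gi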

Automatic approval required for installation

The use of manual approval strategies for InstallPlans in a project (namespace) can affect the IBM Cloud Pak for Watson AIOps installation.

For instance, if you use manual approval for any of your InstallPlans to install operators in All Namespaces mode (cluster scope), the manual approval can affect your installation. The installation of IBM Cloud Pak for Watson AIOps requires automatic approval.
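
For reference, the approval strategy is controlled by the installPlanApproval field in the operator Subscription. A minimal sketch with placeholder values:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: <operator-subscription>
  namespace: <namespace>
spec:
  channel: <channel>
  name: <operator-package-name>
  source: <catalog-source>
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic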

Kong gateway pod is stuck in CrashLoopBackOff restarting issue

In some cases, the Kong gateway pod might have problems reaching a ready state. If this issue occurs, the Kong gateway pod can get stuck in CrashLoopBackOff and keep restarting. If you check the Kong gateway pod, you can see an error message similar to the following message:

bind() to unix:/usr/local/kong/stream_rpc.sock failed (98: Address already in use)

This issue occurs because of a problem with the nginx process in the Kong gateway proxy container. To resolve this issue, manually delete the Kong gateway pod with the following command:

oc delete pod gateway-kong-xxxxxxxxx-xxxxx

Where gateway-kong-xxxxxxxxx-xxxxx is the name of the pod.
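
To find the exact pod name, you can list the gateway pods first, for example:

oc get pods -n <namespace> | grep gateway-kong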

Elasticsearch health status yellow after restoring from a backup

When you are restoring an Elasticsearch backup to a new single node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.

ChatOps Microsoft Teams integration does not work with a proxy server

If you have an offline (air-gapped) deployment of IBM Cloud Pak for Watson AIOps or an environment that uses a proxy server, then you cannot use the ChatOps Microsoft Teams connection. The use of a proxy with the ChatOps Microsoft Teams connection is not supported.

Access control

Automation Analyst role unused in IBM Cloud Pak for Watson AIOps

By default, an Automation Analyst role is displayed within the IBM Cloud Pak Automation console Access control page when you are assigning a role to a user. This default role is used within the IBM® Automation family of offerings, which includes IBM Cloud Pak for Watson AIOps. However, this role is not used within IBM Cloud Pak for Watson AIOps.

This role does not include or provide any permissions within IBM Cloud Pak for Watson AIOps and should not be assigned to users within IBM Cloud Pak for Watson AIOps.

Service Administrators cannot manage other users' roles or view role details

Users with the Service Administrator role do not have permission to add or update a role, or to view the details of a user's assigned role. If a user with the Service Administrator role attempts to view details about a role, a 401 error page is displayed instead.

Users from a user group remain after user group is deleted

When you delete a user group, the users that were included in the group remain in your list of users. Any role that is inherited through the deleted user group is removed from the users. If the users were assigned roles individually, they continue to have those roles and can continue to log in to the UI console and complete tasks. If the users that were in the deleted user group need to be removed completely, an administrator needs to manually remove the users. Users can be removed by clicking the Delete icon for the user's entry within the list of users on the Access control Users tab.

UI redirects to unexpected page when logging in after a session timeout

After a session timeout occurs and a user logs in to the UI console again, the user can be redirected to a different page than the page that they were on when the timeout occurred. For instance, a user that was working on the AI Model Management training page when their session timed out might be redirected to a graphql playground page after logging back in. This redirect occurs because the UI uses the last request URL that included the expired token to identify where to redirect the user when the user logs back in. If this redirect occurs, the user needs to manually return to the expected page in the UI to continue working.

Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page

When you have users within a user group and view the Manage assignees pane of the Incidents and alerts page, you might not see some users listed. This error can occur when the users from the LDAP user group are not individually onboarded. To verify whether a user is onboarded, go to the Access control > Users tab and check whether the user is listed. If the user is not listed, that user must first log in to the console, which validates their roles and permissions. After logging in, the user is displayed in the list of users and on the Manage assignees pane.

The Manage assignees pane is viewable from the list of all Incidents. Select an incident and then click Manage assignees. After you select an existing user group, you should see the included users listed.

Observers and connections

ServiceNow Observer UI displays superfluous characters

If the Entity_mapping field is updated from the Observer UI, superfluous curly brackets and quotes are displayed in the Resource type mapping field.

Workaround: This is a cosmetic issue and can be ignored.

Scheduled job for ServiceNow observer fails after upgrade

Following the latest upgrade, a previously scheduled ServiceNow Observer job enters an Error state when it runs.

Cause: An observer job was started with tables that are no longer recognized after the upgrade.

Workaround:

  • On the Data and tools connections page, select the ServiceNow connection that was created before the upgrade and click Edit on the overflow menu.
  • Navigate to the Collect topology data (optional) page and specify the values for the 'ServiceNow tables to be discovered by observer' and 'Maximum number of tables per cmdb_ci_rel api batch call' fields.
  • Click Next, and then Save to save the changes. The job should now run correctly as scheduled.

'Failed to read certificate' error

This error can occur when an observer attempts to create an SSL certificate and the endpoint server does not respond.

In the example error message below, a vmvcenter.crt certificate error occurs because the endpoint server does not respond.

Failed to read certificate for field [certificate].
The file 'vmvcenter.crt' could not be found under /opt/ibm/netcool/asm/security/.

Workaround: Ensure the endpoint server is running correctly.
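
To check that the endpoint server is reachable and responds with its certificate, you can test the SSL handshake from a host that can reach the endpoint; the host and port are placeholders:

openssl s_client -connect <endpoint-host>:<endpoint-port> -showcerts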

Duplication of resources in the topology if certain observer job parameters are changed after a job has been run

Certain resource parameters are used to uniquely identify a resource. If one of these parameters is changed after the initial job run, then any subsequent job run will result in duplicate records. For example, if the parameter of 'hostname' is replaced with 'Ipaddress' after a topology has been created, a subsequent discovery will consider the resource as new, and create a duplicate record.

The following resource parameters uniquely identify a resource. Changing them after the initial job has been run will result in duplicate records.

Workaround: If you need to modify these values, do not modify the existing job. Instead, create a new job.

Table. Observer job parameters
Observer: Job parameters
ALM: n/a
AppDynamics: account
AWS: region, dataTenant
Ansible AWX: host, user
Azure: data_center
BigFix Inventory: data_center
Big Cloud Fabric: proxy-hostname, proxy-username, bcf-controllers
Ciena Blue Planet: data_center, tenant
Cisco ACI: tenant_name
DNS: addressTypes, server, port, recurse
Docker: endPoint.port
Dynatrace: datatenant, hostname
File: provider, file
GitLab: datatenant, hostname
GoogleCloud: project_id
HPNFVD: datacenter, username, cnf_job
IBM Cloud: instance, username, region
ITNM: instance
Jenkins: jenkins_observation_namespace
Juniper CSO: cso_central_ms_url, user_domain_name, domain_project_tenant_name
Juniper Contrail: api_server_url, os_project_name, os_tenant_name
Kubernetes: data_center, namespace
NewRelic: accountName, accountId
OpenStack: data_center, os_project_name
Rancher: accessKey, clusterId
REST: provider
SDC ONAP: host, username
ServiceNow: instance_url, username
SevOne: datatenant, hostname
TADDM: api_url, username
Viptela: data_center
VMware NSX: data_center
VMware vCenter: data_center
Zabbix: data_center

Incomplete historical data processing in the event of connector pods restarting

If you create a connection to collect historical data for Metric Anomaly AI Training, you might encounter an issue where the connector pod restarts but does not retrieve all historical data for training. As a result, data can be lost.

A connector pod can restart due to outages, the target system crashing, or pod crashes in the environment. This issue can occur intermittently, depending on the number of metrics that are selected for the connection and the amount of data to be retrieved.

File and REST observer topology service location URL not accessible

When you create an edge through either the File or REST observer, the POST request returns a Topology service location URL that is not accessible. The URL cannot be used to manage the edge because the relevant API is not exposed.

Workaround: None

Connector console displays special characters incorrectly

If you use special characters in the Name and Description fields while creating or editing a connection, the Connector console might display the special characters incorrectly. Nevertheless, the connection is saved.

Turbonomic integration affects other integrations in Turbonomic

The Turbonomic integration with IBM Cloud Pak for Watson AIOps enables notifications in AIOps, through the enabled webhook workflow, for actions that are created or executed in Turbonomic. However, Turbonomic allows only one webhook workflow per action. Therefore, other integrations that are enabled in Turbonomic, like ServiceNow, might not get any notification when actions are created or executed in Turbonomic.

No notification in Watson AIOps on Turbonomic actions closed without execution

Watson AIOps does not receive any notification from Turbonomic for actions that are closed without being executed. For example, an action related to an erroneous condition that is no longer occurring gets automatically closed in Turbonomic, but its corresponding AIOps alert remains open indefinitely and must be cleared manually from the console.

New Relic observer does not support dashboard tokens for new users

For new users, the New Relic observer does not work because it no longer supports the New Relic One dashboard token. However, it continues to work for existing users who use the old token that was generated previously through the old dashboard.

All dates and time are in US-en format

When you are scheduling data collection for a connection, all dates and times are presented in the US-en formats:

  • All dates are configured and presented in the mm/dd/yyyy format.
  • All times are configured and presented in the hh:mm AM/PM 12-hour clock format.

You cannot switch the date or time format.

AppDynamics historical start date and time cannot be older than 4 hours

The historical start date and time are configurable, but if you set them to earlier than the past 4 hours, the connector ignores the setting and retrieves only the past 4 hours of data.

AppDynamics live mode aggregation interval is 1 minute

In live mode, the only supported aggregation interval is 1 minute.

The observer-service pod is in a crash loop due to a ghost vertex

If you notice that the topology observer-service pod is not functioning correctly and that restarting the pod does not correct the issue, a ghost vertex might need to be removed. To remove the vertex, you need to traverse an edge to the vertex, and then delete the vertex. To traverse to the vertex, use the type vertex and definesType edge.

  1. Run the following command to find the ID for the type vertex.

    oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/types?_filter=keyIndexName=ASM::entityType::mgmtArtifact::ASM_OBSERVER_JOB' -u <username>:<password> --insecure
    

    Where

    • <username> - Your Topology AIOps API username
    • <password> - Your Topology AIOps API password
  2. Run the following command to use the definesType edge to get the ID for the vertex that is causing the issue.

    oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0'  'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType' -u <username>:<password> --insecure
    

    Where

    • <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
    • <username> - Your Topology AIOps API username
    • <password> - Your Topology AIOps API password
  3. Run the following command to delete the vertex.

    oc exec -it <topology pod> -- curl -X DELETE --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType?_delete=nodes&_delete_self=false&_node_label=vertex' -u <username>:<password> --insecure
    

    Where

    • <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
    • <username> - Your Topology AIOps API username
    • <password> - Your Topology AIOps API password

Changing the codec for a connection can cause errors

When you are creating or editing a connection with the Data and tools connections tool, avoid changing the codec property value that is set in the mapping field. This property is not a configurable property. The IBM Cloud Pak Automation console sets the correct codec for the connection. If you change the value, the connection might fail, or it might not retrieve data for IBM Cloud Pak for Watson AIOps. If you do need to set or change the value, ensure that you use the correct codec for the connection, such as splunk for a Splunk connection, elk for an ELK connection, or Falcon LogScale for a Falcon LogScale connection.
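
For illustration only, the following hypothetical fragment shows where a codec property might appear in a mapping; the actual structure and surrounding fields of the mapping depend on the connection type and are set by the console:

{
  "codec": "splunk"
}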

Search on the Add connection page does not function as expected

When you search on the Add connections page, you can view connectors that exist for categories other than the category under which you are searching. The search shows connectors that match the search query, regardless of the category of the connector.

Delay query time for connections

If you set a connection to retrieve Live data for continuous AI training and anomaly detection or Live data for initial AI training, you might need to configure a delay to offset the query time window to provide a time buffer for preventing the partial retrieval of real-time data. You cannot configure this delay within the UI console. You must use a command line to configure the delay. For more information about configuring this delay, see Delay configuration in data connections.

Alerts for Instana without associated topologies

In some cases, Instana alerts do not have associated topologies. This usually happens because the resource that originated the event is no longer available in Instana. For example, a pod that triggers an Instana event can be redeployed by the underlying Kubernetes engine.

Alerts for Instana topology not mapping properly

In some cases, alerts are not mapped correctly to a corresponding Instana topology node. For example, alerts generated from log anomaly detection or metric anomaly detection (or other sources) might not show as associated with an Instana topology node.

As a workaround, you need to define your own match rules to correlate with the source data. To define a match rule, click Resource Management, then click Settings > Topology configuration, and then click Configure on the Rules tile. When you are configuring the match token values to use, the values depend on the data that you are sending to Instana.

Instana metric collection API rate limit exceeded error

The recommended rate limit is double the number of resources. There might be situations where the limits need to be increased. When a more precise limit is required, use the following formula to determine the limit to use:

number-of-metric-API-calls-per-hour ~= (number-of-selected-technologies x 2) x (snapshots-for-selected-technologies / 30) x (60 / collection-interval)
number-of-topology-API-calls-per-hour ~= (number-of-application-perspectives x (60 / collection-interval)) + (number-of-services x (60 / collection-interval))
number-of-events-API-calls-per-hour = 60

total= number-of-metric-API-calls-per-hour + number-of-topology-API-calls-per-hour + number-of-events-API-calls-per-hour

Note: Each plugin can have a different number of metrics collected. The mean value across these is used, which is 2 collection cycles per plugin. If the environment is unbalanced, for instance you have mostly hosts that define most metrics, then the formula might underestimate the required limit.
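
For example, assuming 2 selected technologies with 300 snapshots in total, 10 application perspectives, 40 services, and a 5-minute collection interval, the formula gives approximately:

number-of-metric-API-calls-per-hour   ~= (2 x 2) x (300 / 30) x (60 / 5) = 480
number-of-topology-API-calls-per-hour ~= (10 x (60 / 5)) + (40 x (60 / 5)) = 600
number-of-events-API-calls-per-hour   = 60

total ~= 480 + 600 + 60 = 1140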

To determine the number of resources (snapshots) for each infrastructure plugin, use the following API:

api/infrastructure-monitoring/snapshots?plugin=technology_name

Example:

api/infrastructure-monitoring/snapshots?plugin=host

For more information about Instana APIs, see Instana API.

The following example curl commands allow you to retrieve the number of:

  • snapshots-for-selected-technologies (such as host)

    curl -k -s --request GET 'https://<instana server hostname>/api/infrastructure-monitoring/snapshots?plugin=host' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    
  • number-of-application-perspectives

    curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/applications' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    
  • number-of-services

    curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/services' --header 'Authorization: apiToken <api token>' | jq '.items|length'
    

IBM Cloud Pak for Watson AIOps cannot close some alerts when an Instana connection exists

If you have an Instana connection created and have Instana 221 (SaaS or self-hosted), you might encounter an issue where IBM Cloud Pak for Watson AIOps cannot close some alerts. Instead, you might need to check the status for the associated event within the Instana dashboard and clear the alert manually. For more information, see Troubleshooting connections: Instana Event integration.

When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run

If you create a Kubernetes observer job with the name weave_scope, load, kubeconfig, or local, the job always fails to run. When this error occurs, you can view an error icon in the schedule column for the job. To avoid this issue, do not use these names for the observer job.

Log Anomaly 8 KB limit on field mapping in details field of the alert schema

The data layer imposes an 8 KB size limit on the details field in the alert schema. The details field is populated by the log anomaly event, which provides the relevant information to display in Slack when viewing alerts in ChatOps. Whenever the details field size exceeds 8 KB, the returned JSON object is truncated. As a result, when the user clicks view alerts to retrieve the alerts related to an incident, the expected results are not seen and an error is recorded.

The current fields under the details object are:

 end_timestamp: int
 original_group_id: str
 causality: dict
 detected_at: float
 source_application_id: str
 log_anomaly_confidence: float
 log_anomaly_model: List[str]
 prediction_error: dict
 error_templates: List[int]
 count_vector: List[int]
 text_dict: dict
 application_group_id: str
 application_id: str
 model_version: str
 severity_from_model: int
 description: str

Log connection does not start when multiple log connections are active

If you have many active log connections, such as multiple Kafka, ELK, Splunk, and Falcon LogScale connections, and you create and enable another Falcon LogScale connection, you might notice that the connection status is stuck in an error or restarting state. This state can occur even after the connection was operating as expected.

This issue can occur if you exceed the limit for the number of jobs that can run on the underlying service, which results in insufficient resources being available to start the connector. To resolve this issue, complete one or more of the following tasks:

  • Increase the replica count of your task managers.
  • Increase the task manager count per replica.
  • Change the parallelism of your connections.
  • Cancel other connections.

After a restore, data from connections is not processed

If you restored connections, the status for these connections can be in error after the restore process completes. To resolve this status, edit and save your connections with the Data and tool connections tool in the IBM Cloud Pak Automation console. Editing a connection regenerates the associated Flink job for the connection, which updates the status.

Turning on a disabled Dynatrace connection that collects live data results in an error

If you have a connection to Dynatrace enabled for live data collection and then disable the connection, enabling the connection again can result in a java.lang.NullPointerException error occurring. If this error occurs, delete and then create the connection again to enable the Dynatrace data collection.

Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled

If you enable historical data flow for a ServiceNow connection, you might notice that the historical data is not collected from ServiceNow. For instance, when you check the grpc-snow pod, you can see ticket data available, but when you check the Flink job or Elasticsearch, you can see that no data was collected. If this issue occurs, turn off the historical data flow and then turn it back on to cause the data to begin to be collected.

Cannot edit or disable AppDynamics or Dynatrace connection due to 400 Bad Request error

If you have created an AppDynamics or Dynatrace connection, you might not be able to disable the connection or edit the collection mode, such as to change the mode from historical to live or live to historical. If this issue occurs, a 400 Bad Request error message is displayed when you attempt to disable or edit the connection. As a workaround, delete the connection and create a replacement connection with your preferred settings when needed.

Dynatrace connector pod restarted and does not retrieve all historical data

If you have a Dynatrace connection created and pull historical data with multiple metrics for Metric Anomaly AI Training, you can encounter an issue where the Dynatrace pod restarts, but does not complete retrieving the expected historical data for training. This issue can occur intermittently, depending on the number of metrics that are selected for the connection and the amount of data to be retrieved.

If this potential out-of-memory or out-of-resources issue occurs, consider creating separate connections to monitor different and smaller sets of metrics. By splitting the connections, you can reduce the amount of data to be retrieved through the initial connection that can cause this issue.

AppDynamics and Dynatrace: unable to create a connector with description or special characters in name

You cannot use spaces or special characters for names for AppDynamics and Dynatrace connectors. You can only use alphanumeric values.

ServiceNow user account gets locked out after a few hours

If there is an active ServiceNow connection with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script called "SNC User Lockout Check", which locks users out after more than 5 failed attempts (including any failed API calls).

If you check the Incidents and alerts page, you also see an alert that says "ServiceNow instance authentication failed".

When this problem occurs, unlock the user in ServiceNow. Then change the password in the ServiceNow connection and save. When authentication fails in the ServiceNow connector, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.

Scale resources when running log anomaly training on large data

In some cases, log anomaly training fails on large data sets because of out-of-memory (OOM) errors or problems with Elasticsearch (ES) shards. The solution is to scale up the resources to handle training on large data.

For more information about shard management, see About indices and shards. For more information about increasing ES Resources, see Log anomaly training pods CPU and Memory resource management.

The connection status for ELK, Custom Logs, LogDNA, and Falcon LogScale sometimes shows 'not running' even though the Flink job and gRPC pod are running correctly

After you create a connection, the Flink job retrieves data normally and the gRPC pod runs without error. However, the console shows the connection status as 'not running'.

Log data connections status is "Done" even though historical data is still loading

When a log data connection (Falcon LogScale, ELK, LogDNA, Custom, Splunk) is running in Historical data for initial AI training mode, and a custom regex is added in the field_mapping section, the data processing can take a long time. Although the Data collection status might be shown on the UI as Done, data might still be processed and written to Elasticsearch in the background.

To speed up this process, you can increase the Base parallelism number that is associated with that connection. For more information, see Increasing data streaming capacity.

IBM Tivoli Netcool/Impact connection stops event processing with exceptions

If you have an IBM Tivoli Netcool/Impact connection, you can encounter an issue where the connection temporarily stops processing while an event stream is being sent to IBM Cloud Pak for Watson AIOps.

This issue can occur when you have an IBM Cloud Pak for Watson AIOps policy that triggers an IBM Tivoli Netcool/Impact policy when certain types of events are received. If this issue occurs and stops the event processing, the Impact connector logs or Impact policylogger logs can include messages that are similar to the following example exceptions:

[6/14/23, 11:38:45:816 UTC] 0000005d ConnectorMana W failed to send status update
...
[6/14/23, 11:38:45:815 UTC] 000023ca StandardConne W configuration stream terminated with an error
...
[6/14/23, 11:38:45:816 UTC] 000023cc GRPCCloudEven W consume stream terminated with an error: channel=cp4waiops-cartridge.lifecycle.output.connector-requests

If you encounter this issue, wait for the issue to resolve. This issue resolves itself over time to invoke the policy and begin to process the event stream again.

Impact connector fails for Impact server with non-default cluster name

If the Netcool/Impact cluster uses a cluster name other than the default ("NCICLUSTER"), the connector might fail to validate the connection. The Impact server might report DynamicBindingException errors in the impactgui.log file:

com.micromuse.common.nameserver.DynamicBindingException: DynamicBindingException: Service [NCICLUSTER] not in nameserver.

To resolve the issue, wait for the backend Impact server to finish initializing before starting or restarting the Impact GUI server.

If Netcool/Impact is running fix pack 7.1.0.26 or later, you can also resolve the issue by setting the nameserver.defaultcluster property in the GUI server. Add the following line to $IMPACT_HOME/etc/nameserver.props:

impact.nameserver.defaultcluster=CLUSTERNAME

where CLUSTERNAME is the name of the Impact cluster.

IBM Cloud Pak for Watson AIOps and Netcool Operations Insights

IBM Cloud Pak for Watson AIOps Strimzi Kafka topics created without replication

IBM Cloud Pak for Watson AIOps supports multiple replications of Kafka topics for large production installations, such as for data redundancy. The IBM Cloud Pak Automation console can automatically create Kafka topics when connections are created. When a topic is dynamically created by the IBM Cloud Pak Automation console, the replication is set to 1 in the controller. As such, Kafka topics that are created during installation can have multiple replicas, but topics that are created dynamically do not.

If you are implementing a production (large) deployment of IBM Cloud Pak for Watson AIOps, you might lose data if your Kafka pods fail or restart. If the data flow is enabled in your Kafka connection when the Kafka pods go down, you might experience a gap in the data that your connection generated during that down period. Upgrades or updates to workers can cause a Kafka broker restart.

You can manually modify the Kafka topic replication inside the Kafka container from a value of 1 to 3 to mitigate any potential data loss from this issue.
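
For example, a minimal sketch of increasing the replication factor for one partition of a dynamically created topic by using the standard Kafka partition reassignment tool from inside a Kafka broker pod. The pod name, tool path, topic name, and broker IDs are placeholders that depend on your deployment, and older Kafka versions might require the --zookeeper option instead of --bootstrap-server:

oc exec -it <kafka-broker-pod> -n <namespace> -- bash
# Inside the pod: describe the target replica assignment for each partition of the topic
cat > /tmp/increase-replication.json <<EOF
{"version":1,"partitions":[{"topic":"<topic-name>","partition":0,"replicas":[0,1,2]}]}
EOF
<kafka-bin-path>/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/increase-replication.json --execute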

IBM Cloud Pak for Watson AIOps pods not starting after a cluster restart

When the cluster is restarted, all nodes have a STATUS of Ready. Most pods return to a STATUS of Running except for some IBM Cloud Pak for Watson AIOps pods.

One potential cause is that Elasticsearch must be up and running before the IBM Cloud Pak for Watson AIOps pods can start.

Restart the Elasticsearch pod to get all pods back to a STATUS of Running.
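
For example, assuming the Elasticsearch pod name contains elasticsearch, you can delete the pod so that it is re-created; the exact name depends on your deployment:

oc get pods -n <namespace> | grep elasticsearch
oc delete pod <elasticsearch-pod-name> -n <namespace>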

Error when accessing the AI Model Management

The AI Model Management can fail to load when you click to open the tool from the Quick navigation links on the Home page.

This issue can occur if a network disruption occurred during installation. The disruption can make the cluster inaccessible and cause some steps to be missed during the platform UI startup, which can result in expected Nginx rules being missing.

To check whether the rules are missing, complete the following steps:

  1. Open a command line and connect to your cluster with the oc login command.

  2. Use the oc project command to set the context to the project where IBM Cloud Pak for Watson AIOps is deployed.

  3. Locate an ibm-nginx pod within your deployment:

    ibm_nginx_pod=$(oc get pods | grep ibm-nginx | head -1 | cut -f1 -d\ )
    echo $ibm_nginx_pod
    
  4. Run the following oc exec command for one of the ibm-nginx pods to run commands against that pod.

    oc exec $ibm_nginx_pod -- nginx -T | grep aimodel
    

    After the command runs, check whether there is an ingress rule or entry for aimodels. Your output can resemble the following sample output:

     	               location ~* (/aiops/([^/]+)/aimodels|/aiops/aimodels) {
    nginx: the configuration file /usr/local/openresty/nginx/conf/nginx.conf syntax is ok
    nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test is successful
    

    If you do not view an aimodels rule, complete the following steps to add the rule.

  5. Open the platform UI extension configuration for the AI Model Management for editing:

    oc edit cm aiops-ai-model-ui-zen-extension
    
  6. Increment the icpdata_addon_version metadata label within the configmap to 3.3.2.

    icpdata_addon_version: 3.3.2
    

NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator

NOIHybrid is incorrectly included in the Provided APIs list for the Netcool Operations Insight operator. This list is displayed in the Red Hat OpenShift Container Platform web console under Installed Operators > Netcool Operations Insight > Operator Details. Do not use NOIHybrid APIs.

Rare issue: unable to deploy log anomaly detection model

Very occasionally, after log anomaly detection training completes successfully, an error similar to the following is displayed on the AI Model Management training page when you attempt to deploy the model.

Error
Model deployment failed

Within the error textbox, you will also see the text "Forbidden".

If you investigate the aiops-ai-model-ui pod logs, you will also see the following error.

ForbiddenError: invalid csrf token

If this occurs, first refresh the browser and try to deploy again.

If that does not remedy the situation, then log out and log back in, and then try to deploy the model again.

Applications and topologies

JVM heap out-of-memory (OOM) failures when loading large number of resources

When you run topology loads in quick succession, you might experience some OOM errors and undesired topology pod restarts, even though the pods continue processing after they restart.

This error can occur when you run resource loads of several million resources in a large deployment, and it can slow down the loading process. The following type of error message can be seen in the pod logs:

WARN [2022-10-25 15:43:31,906] [JanusGraph Session-io-4] c.d.o.d.i.c.c.CqlRequestHandler - Query '[4 values] SELECT column1,value FROM janusgraph.graphindex WHERE key=:key AND column1>=:slicestart AND column1<:sliceend LIMIT :maxrows [key=0x02168910cfd95b7e3bc74006a4a8a73a79c71255a0726573...<truncated>, slicestart=0x00, sliceend=0xff, maxrows=2147483647]' generated server side warning(s): Read 5000 live rows and 1711 tombstone cells for query SELECT value FROM janusgraph.graphindex WHERE key = 02168910cfd95b7e3bc74006a4a8a73a79c71255a07265736f757263e5 AND column1 > 003924701180012871590290012871500290 AND column1 < ff LIMIT 5000; token 9157578746393928897 (see tombstone_warn_threshold) JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2022/10/25 15:43:32 - please wait. JVMDUMP032I JVM requested System dump using '/tmp/cassandra-certs/core.20221025.154332.1.0001.dmp' in response to an event

Cause: Not enough headroom exists between JVM memory limit and the pod memory limit, usually because one was increased without also increasing the other.

Workaround: Ensure that any changes in heap size maintain enough headroom between these settings.

Example: In this example (for a topology size1) the pod limits are set to 3.6 GB while the maximum memory for the JVM (-Xmx) is set to 3 GB, thereby leaving 0.6 GB of headroom free for use by the OS.

size1:
   enableHPA: false
   replicas: 2
   jvmArgs: "-Dcom.ibm.jsse2.overrideDefaultTLS=true -Xms1G -Xmx3G"
   resources:
      requests:
         memory: "1200Mi"
         cpu: "2.0"
      limits:
         memory: "3600Mi"
         cpu: "3.0"

Critical error message displayed when attempting to render an application

This problem occurs when all of the groups within the application that you are attempting to render have no members. When the application is selected in Application management, it does not render and a critical error message is displayed on the UI.

Avoid creating applications with no members. If an application with no members was created for test purposes only, then ignore this error.

Different date and time in Automation console and ChatOps between users

The date and time format for an Incident in the IBM Cloud Pak Automation console Application management tool and the associated ChatOps notification can be different between users. The format and time zone that is used in the Automation console and ChatOps notification is set to the user's locale. If different users are in different time zones, the displayed date and time are different in the Automation console and ChatOps notification.

Deleting a tag template can cause out-of-memory errors

If a tag is applied to a large number (that is, thousands) of topology resources, then deleting the tag template can cause out-of-memory errors with the topology-merge pod.

Avoid creating tag templates that use tags that occur with such frequency. Do not tag thousands of resources with the same tag, and avoid using such tags in a group.

The Find Path tool ignores filters

The topology path tool fails to launch with filters applied.

Launch the path tool without filters, then manually apply the filter settings on the path page.

Probable cause is not producing accurate results

The correlation algorithms for probable cause currently require the use of a Kubernetes model with service-to-service relationships, or the use of dependency relationships between non-Kubernetes resources.

Complete the following steps to create the required relationships for Kubernetes. This procedure configures Topology Manager to overlay relationships provided by the File observer onto the Kubernetes topology.

Note: The Kubernetes observer must be configured and loading data.

  1. Log in to the IBM Cloud Pak Automation console.

  2. From the main navigation, expand Operate and click Topology viewer.

  3. From the topology navigation toolbar, expand Settings, click Topology configuration.

  4. On the Rules tile, click Configure to navigate to the Rules administration page.

  5. On the Merge tab, click New to create a New merge rule.

    In this scenario, data that is provided by the File observer will be used to add the relationships.

  6. Specify the following information on the New merge rule page:

    1. Rule name: k8-file-service.

    2. Set the rule Status to Enabled.

    3. Add the uniqueId property to the set of Tokens.

    4. Expand the Conditions section and select File and Kubernetes from the set of available Observers and click Add.

    5. Specify service for Resource types and click Add.

    6. Click Save to save the new Merge rule.

  7. Locate the services that you want to relate together and make a note of their source-specific uniqueId, such as 05f337a1-5783-43bb-9323-dfba941455c7 (shipping) and ae076382-3df9-46cb-97e9-a0342d219efb (web).

  8. Create a file for the File Observer that contains the service-dependsOn-service relationships necessary for the correlation algorithms to work.

    The following example creates two services, web and shipping, and states that web dependsOn shipping. Repeat this as required to relate your services together.

    V:{"uniqueId": "05f337a1-5783-43bb-9323-dfba941455c7", "name": "shipping", "entityTypes": ["service"]}
    V:{"uniqueId": "ae076382-3df9-46cb-97e9-a0342d219efb", "name": "web", "entityTypes": ["service"]}
    E:{"_fromUniqueId":"ae076382-3df9-46cb-97e9-a0342d219efb", "_edgeType":"dependsOn", "_toUniqueId":"05f337a1-5783-43bb-9323-dfba941455c7”}
    
  9. Load this file into Topology Manager to relate the services. For more information, see Configuring File Observer jobs.

    If your topology changes, then re-create and reload the file as required. A similar process can be followed for non-Kubernetes sources.

High volumes of data can cause Spark workers to run out of memory

If your environment handles high workloads of alerts or events (more than 10 million), your Spark workers can run out of ephemeral storage. If you encounter this issue, restart the affected Spark workers. This issue can also occur if you are running multiple jobs, which can cause the file system to fill, such as with log or JAR files.
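
For example, assuming the Spark worker pod names contain spark-worker, you can restart them by deleting the pods; the names depend on your deployment:

oc get pods -n <namespace> | grep spark-worker
oc delete pod <spark-worker-pod-name> -n <namespace>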

Azure observer missing subnet relationship in topology

For the Azure Observer, a subnet can be intermittently missing the relationship with an IP address in the topology for a resource. While the relationship can be intermittently missing, both the subnet and IP address vertices remain available in the topology.

Topologies not visible on Incident Topology page after resource merge

When resources from two observer sources have been merged using the topology merge functionality, the topology is no longer displayed in the Incident view. This known issue affects only the Incident view, and the topology is still present in all other views.

OpenStack observer missing edge-runsOn connectivity in topology

After you run an OpenStack observer job, the edge-runsOn connectivity between ComputeHost and Hypervisor elements is not shown in Resource Management > Resources when it should be.

Infrastructure Automation

Kubernetes permissions are missing for user roles for using Managed services and the Service catalog

If you install Infrastructure Automation, you, or an administrator, must add the required Kubernetes permissions to user roles before your users can begin to access and use Managed services or the Service catalog.

As an administrator, add the following permissions to your user roles:

Table. Required permissions
Role: Required permission for Infrastructure Automation
Automation Administrator: Administer Kubernetes resources
Automation Operator: Manage Kubernetes resources
Automation Developer: Edit Kubernetes resources
Automation Analyst: View Kubernetes resources

For more information about how to add permissions to a role, see Managing roles for Infrastructure Automation.

Non-LDAP users cannot access Infrastructure management

Non-LDAP authenticated users cannot be used with Single Sign-On for Infrastructure management. This is a limitation. If you are logged in to the Infrastructure Automation UI console with a non-LDAP user, attempting to start Infrastructure management fails with the following error:

OpenID Connect Provider error: Error in handling response type.

Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported

Installation of Infrastructure management in IBM Cloud Pak for Watson AIOps does not support Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core. You can continue to use the Kubernetes cluster life-cycle templates and services to create a Kubernetes cluster and import the cluster to an existing installation of Red Hat Advanced Cluster Management, if an installation is available. Deploying hybrid applications is also not supported by Infrastructure Automation.

Users are redirected to the Administration panel when logging back into the UI

When you are working within Infrastructure Automation and log out and then log back in, you can be redirected to the Administration panel instead of the Infrastructure Automation home page or other page that you were previously using. If this occurs, you can use the Cloud Pak switcher in the upper right of the UI console to switch to the Infrastructure Automation home page and then return to the page that you were previously using.

Database fails to reset when error occurs during database creation for Infrastructure management

If you are creating the database for the Infrastructure management appliance and you encounter an error, such as the database creation failing to complete successfully, you might not be able to continue with your setup without redeploying. For instance, if the creation fails, resetting the database to clean up your database and deployment can also fail. To resolve this issue, you need to redeploy the Infrastructure management appliance image before reattempting to create the database.

The cam-tenant-api pod is not in a ready state after installing the iaconfig CR

After you install Infrastructure Automation, you can encounter an error where the cam-tenant-api pod displays as running, but not in a ready state. When this error occurs, you can see the following message:

[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]

If this error occurs, delete the cam-tenant-api pod to cause the pod to restart and attempt to enter a ready state.
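
For example, assuming the default pod naming, you can delete the pod with a command similar to the following:

oc delete pod $(oc get pods -n <namespace> | grep cam-tenant-api | awk '{print $1}') -n <namespace>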

Print or export as PDF entire tables does not work as expected

If you are using the Firefox browser, and you select Print or export as PDF on the Compute > Containers > Projects page to print or export the entire table of data, the print or export might not work as expected. For instance, some data, such as table rows, might be missing. If you encounter this issue, try a different browser for printing or exporting the data.

Infrastructure management log display in the UI is removed

Log display support on the UI is removed for Infrastructure management. As an alternative for viewing these logs, use Kubernetes standard methods such as oc log commands, viewing the output in Red Hat OpenShift Container Platform or Kubernetes, or setting up a log aggregator for your cluster.
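
For example, you can view or follow the logs for a specific pod with standard commands:

oc logs <pod-name> -n <namespace>
oc logs -f <pod-name> -n <namespace> --tail=100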

You can still see the log tabs (Collect Logs, IA:IM Log, Audit Log, and Production Log) on the Settings > Application Settings Diagnostic page. However, instead of displaying the log information, the following message is displayed: Logs for this IA:IM Server are not available for viewing.

Infrastructure Automation Test deploy fails

Infrastructure Automation Test deploy from a Service Overview page fails to deploy.

On Infrastructure management appliances, an Ansible playbook deployment fails

When you attempt to deploy an Ansible playbook on an Infrastructure management appliance through an embedded Ansible deployment, the playbook deployment can fail with the following error:

<35.237.119.31> ESTABLISH SSH CONNECTION FOR USER: ubuntu
fatal: [35.237.119.31]: FAILED! => {
"msg": "Unable to create local directories(/home/manageiq/.ansible/cp): [Errno 13] Permission denied: b'/home/manageiq'"
}

If you encounter this error, log in to the appliance as the root user and then deploy the playbook again:

  1. Run the command:

    mkdir -p /home/manageiq
    
  2. Run the command:

    chown manageiq:manageiq /home/manageiq
    
  3. Deploy the Ansible playbook again.

After restoring from a backup, the Managed services deployment can fail

After you restore Managed services (cam) from a backup, the deployment instance can fail with a socket hang up error.

If this error occurs, restart the cam-iaas pod by running the following command:

oc delete pod <cam-iaas-xxxx> -n <namespace>

Where <namespace> is the project (namespace) where Infrastructure Automation is installed, and <cam-iaas-xxxx> is the name of the cam-iaas pod to restart.

With this restart, the service deployment can complete successfully.

UI console

Tour icon disappears when browsing to another console page while a guided tour is running

When you start a guided tour and navigate away from the page to the IBM Automation Home page, the Tour icon might not display on the toolbar. This behavior can occur when an IBM Cloud Pak for Watson AIOps tour is still running. Only one guided tour can run at a time. To resolve this issue, return to an IBM Cloud Pak for Watson AIOps page, click the Tour icon, and close the tour. When you return to the IBM Automation Home page, the Tour icon reappears.

About page does not show the correct version of IBM Automation

If you are attempting to identify the version of an installed IBM Cloud Pak for Watson AIOps, the About page that is accessed from the console toolbar does not show the correct version.

To determine the correct version, use the Cloud Pak switcher to access the IBM Cloud Pak | Administration tool. In this tool, find the Cloud Pak deployment summary card and click View details. The side panel opens. Expand the entry for IBM Automation to view the installed instances. The details for the installed instance show the current deployed version of the IBM Cloud Pak.

'Data cannot be displayed': error given for 'Defined Applications' and 'Favorite Applications' tiles on the Home page

When you open the console Home page, neither 'Defined Applications' nor 'Favorite Applications' are listed in their tiles. However, they do exist, as they can be viewed under the 'Resource Management' section. To fix this issue, restart the aiops-base-ui pods.
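
For example, assuming the pod names contain aiops-base-ui, you can restart the pods with a command similar to the following:

oc delete pod $(oc get pods -n <namespace> | grep aiops-base-ui | awk '{print $1}') -n <namespace>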

User credential timeout starts in the backend

The first indications of a user credential timeout might be backend failures. For example, failure to load incident, alert, or policy lists. To resolve this issue if it occurs, log out and log back in again, or wait for the frontend logout to occur.

AI Model management and training

Log parsing assigns messages to catch-all template instead of generating expected template

If you use catch-all templates for mapping uncategorized messages during AI model training, you can encounter an issue where the log parsing assigns messages for an error to the catch-all templates instead of generating an expected template for that error. If this issue occurs, you might not see expected anomalies.

If you suspect that this issue is occurring and you do not see expected anomalies, complete the following steps to manually verify your training templates, and remove any catch-all templates that were incorrectly generated.

  1. Retrieve the normalized logs from your logtrain indices.

  2. Identify the logs that are error logs. Review those logs to determine the template mappings.

  3. Retrieve the identified templates from Elasticsearch.

  4. Use the error log contents and the template ID from the retrieved normalized logs to identify the template string within the retrieved templates.

  5. If the template string is composed entirely of parameters, or a single word and parameters, the template might be a catch-all template. For example, the following string is an example of a catch-all template:

    <>to <><><><><>
    <> <> <> <>-<>-<> <> <> <> <> <> <> <> <> <> <>
    
  6. Manually delete any catch-all templates.

Elasticsearch record count does not match record count published to Kafka topic

When you push a large training file (for example, 60 million records, such as logs or events) to Kafka through your configured connection, the number of records that are ingested and displayed on Elasticsearch might not match. The Elasticsearch record count might be lower than the Kafka count due to deduplication. If you encounter this issue, split large files into smaller batches and send them individually to Kafka (for example, 5 million records each). When you are pushing a batch, ensure that you wait for the ingest to complete and the associated records to display on Elasticsearch before you push the next batch of records.
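
For example, if the training data is in a newline-delimited file, you can split it into batches of 5 million records with a standard utility before you push each batch; the file name is a placeholder:

split -l 5000000 <training-data-file> training-batch-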

Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade

If you set a value for the EXPIRY_SECONDS environment variable and upgrade, the environment variable is not retained after the upgrade.

After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly detection alerts.

Log anomalies are not detected by natural language log anomaly detection algorithm

In some cases a model that has been trained successfully is unable to detect certain log anomalies. The quality of the model is independent of whether it trained successfully, and model quality tends to improve as more training data is available. If the model is not detecting anomalies in your logs, consider training the model again but using additional days of training data to improve the model quality.

Metric anomaly detection training does not run on schedule

If you have metric anomaly detection training scheduled to run, such as daily, you can encounter an issue where the training does not run as scheduled. If the training job does not run on schedule, log in to the IBM Cloud Pak Automation console and click the Metric anomaly detection algorithm tile and then Train models.

In Change risk training, Precheck indicates “Good data” but models fail to create

On rare occasions, a Change risk model fails to create, even though Precheck data indicates that the data is good. This failure is caused by an insufficient number of problematic change risk tickets being available to create a good model. This problem resolves itself when enough tickets become available for the model. (For more information, see Closed change ticket count requirements).

To confirm that insufficient problem tickets is causing the failure, view the Change risk logs on the Luigi pods.

To retrieve the pod:

oc get pod | grep luigi-cr

View the logs for training Change risk models.

oc logs <pod-name> # Ex: luigi-cr-1b5ef57f-9053-4037-95ca-c1e8b8748fc5

Check whether the log contains the following message:

size of the problematic (aka labels) tickets is insufficient
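
For example, to search the log for that message directly:

oc logs <pod-name> | grep "size of the problematic"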

If confirmed, ensure that enough problematic change tickets are available before training the model again.

ChatOps

In Slack, a ChatOps communication to IBM Cloud Pak for Watson AIOps times out without establishing a connection

When you send a ChatOps communication from Slack to IBM Cloud Pak for Watson AIOps, a known intermittent issue can occur. The communication between Slack and IBM Cloud Pak for Watson AIOps can time out after 3 seconds of no response. A potential solution is to reconfigure your connection in the connection onboarding. Ensure that your Slack Bot can access your IBM Cloud Pak for Watson AIOps instance promptly. A more permanent and robust solution to this issue is being devised.

In a Microsoft Teams ChatOps, the attach template logs feature does not work

If you have a Microsoft Teams ChatOps, clicking the Attach template logs button does not work and the logs are not sent to your Microsoft Teams channel for review. As an alternative, use the Preview logs button to view the template logs.

Incidents cannot be reopened or restored in ChatOps

When an incident is closed it is archived and can no longer be modified. If a new alert occurs that is related to the archived incident, a new incident is created instead of reopening the archived incident.

No Recommended runbooks found in incident overview

If a runbook that is recommended to remediate an incident is deleted from the runbook Library, the Recommended runbooks link in a ChatOps notification is not removed. This can result in the ChatOps runbook section linking to an empty runbooks page in the incident overview.

No incident content viewable in Microsoft Teams on mobile devices

On mobile devices, when viewed in Microsoft Teams, incidents can appear with no viewable data. Where this happens, switch to using a computer to see the full incident data.

IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details

In a proactive ChatOps channel, you can click the Change request ticket URL to view its details in a ServiceNow instance. In some cases, there might be no details displayed in the IBM Change Risk Assessment tab.

Updated ChatOps connection fails

When the credentials for a ChatOps connection are updated, a failure in the caching mechanism can cause the old app credentials to be used rather than new credentials.

  • If this issue occurs for a Slack ChatOps connection, a channel_not_found error displays for the connections.
  • If this issue occurs for a Microsoft Teams ChatOps connection, a "the bot is not part of the conversation roster" error displays for the connections.

To use the updated credentials, restart the chatops-integrator pod by running the following command in the IBM Cloud Pak for Watson AIOps project (namespace):

For a Slack connection:

oc rollout restart deployment $(oc get deploy -l app.kubernetes.io/component=chatops-slack-integrator -o jsonpath='{.items[*].metadata.name }')  

For a Microsoft Teams connection:

oc rollout restart deployment $(oc get deploy -l app.kubernetes.io/component=chatops-teams-integrator -o jsonpath='{.items[*].metadata.name }')

Once the updated pod is running again, the new credentials are used. If errors continue, check that the Slack or Microsoft Teams application is a channel member of the channel ID that was input into the ChatOps connection form.
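
If you want to confirm that the restart completed before testing the connection again, a command such as the following can be used (shown here for the Slack integrator; swap the label for the Microsoft Teams integrator):

oc rollout status deployment $(oc get deploy -l app.kubernetes.io/component=chatops-slack-integrator -o jsonpath='{.items[*].metadata.name }')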

Incidents and alerts

"An error occurred while fetching data from the server"

When viewing the Incidents UI or when creating a policy to assign runbooks to alerts, you might see the message "An error occurred while fetching data from the server". Or in the Alert Viewer, you might see "An unknown error occurred". If you encounter these error messages, complete the following steps to delete the aiops-ir-ui-api-graphql pod. The pod is then automatically re-created, which should resolve the error.

  1. Log in to your cluster by running the oc login command.

    oc login -u kubeadmin -p <password>
    
  2. Delete the aiops-ir-ui-api-graphql pod.

    oc delete pod -l component=aiops-ir-ui-api-graphql -n <cp4waiops_namespace>
    
  3. Wait for the pod to restart.
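
As an optional check (a sketch, assuming the same component label that is used in step 2), you can wait for the re-created pod to become ready before retrying the UI:

oc wait --for=condition=Ready pod -l component=aiops-ir-ui-api-graphql -n <cp4waiops_namespace> --timeout=300s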

Alerts tab shows "An unknown error occurred" error when all alerts are closed

If you are viewing a closed incident that has all associated alerts resolved, you can encounter an error when you view the Alerts tab. This "An unknown error occurred" error displays when there are no associated alerts. You can ignore this error message as the incident and alerts are resolved and closed.

In the metric anomaly details chart, PNG or JPG export does not appear to work

This issue can occur intermittently, especially in the Firefox browser. The export can be slower than normal, so it might take a little longer to complete. If nothing happens, try again later, or use a different browser such as Google Chrome.

Same runbook status for multiple alerts in an incident

If more than one alert meets a runbook policy's conditions, the same runbook can be assigned to multiple alerts in an incident. From the incident overview page, you can select an alert and run an associated runbook. The runbook Status of the selected alert will be updated on the UI. However, the runbook status of other alerts in the incident might be updated with the same status. This is a known issue.

Closed incidents are missing details and displaying a critical error

Incidents with a status of "Closed" are missing topology information on the incident Overview tab. The associated alerts of the closed incidents are also missing from the Alerts tab. The following critical error is displayed on the Topology tab for closed incidents: "No resource exists with the specified identifier and time point".

Metric anomaly chart unexpectedly changes from zoom view to normal view

This issue can occur when a related alert is selected and added to a chart that you are zoomed in on: the chart resets to the normal view. To resolve the issue, zoom in again after the related alert is added to the chart.

Unable to add metric anomaly in Related alerts to chart

In some cases, clicking the checkbox in the Related alerts section does not add the related anomaly to the metric anomaly chart.

Limitation of preview text for default recommended action

When an alert is generated from any of the default log anomaly detection models, the preview of a recommended action might contain only partial text and not reflect the full recommended action. Where possible, the first 4000 characters are extracted from the original resolution or action document webpage, and nonreadable text such as URLs is excluded to form the content of the preview text.

Extra anomalies appear in Alert Viewer after update from 3.6.1

After an update from IBM Cloud Pak for Watson AIOps 3.6.1, you might notice that extra anomalies appear in the Alert Viewer. These anomalies are generated by the SimpleRobustBounds algorithm. You can confirm this by clicking the suspect alert, which opens Alert details > Information. If it is a SimpleRobustBounds anomaly, the component in the sender field contains SimpleRobustBounds, and SimpleRobustBounds appears in algorithmNames in the insight.anomaly field.

To avoid or resolve this problem, run Metric anomaly detection training as soon as possible after the 3.7 update.

Alert right-click menu items cannot be added if any actions have parameters with type 'array'

When automation actions exist that have any parameters of type 'array', the Add menu item button under menu configuration does not work because this type is not supported. The issue is limited to actions of type Ansible and Powershell, which allow parameters of type 'array'.

To avoid or resolve this problem, identify actions with 'array' type parameters, and convert them to non-array parameters or delete them.

'Error 500' message displayed under Seasonality or Temporal correlation in Alert details side panel

On upgrading from IBM Cloud Pak for Watson AIOps 3.7.x to 4.1.0, you might encounter an 'Error 500' message in the Alert Viewer side panel where seasonality or temporal correlation details should be displayed.

To avoid or resolve this problem, delete the associated alert seasonality policy or temporal grouping policy and rerun the AI training. For more information, see Managing AI modelling.

Metric search page chart lines are disjointed

In the Metric search page chart, the normalised forecast line is disjointed from the baseline data line. This is because the forecast data is normalised independently from the baseline data. Although the lines might not match up, the values shown in the tooltips are correct.

Policies

Condition Values field changes "String:" to "Value of:"

For example, in a policy condition, if the string "alert.id" is typed in the Values field and "String:alert.id" is then selected, it is changed to "Value of:alert.id".

To prevent this, avoid a string that exactly matches the keyword. In this example, use the following condition instead:

Figure. Policy condition

Note: This example is not an exact match of the string alert.id. This workaround finds all summaries containing alert. and .id.

Breadcrumb navigation missing in policy editor

In some cases where a policy name is long, the breadcrumb navigation in the top left of the policy edit session can be abbreviated. Clicking the breadcrumb still returns you to the Policy UI.

Condition "Matches" field for numeric "Operator" selections

When using a numeric Operator in a policy condition set (for example, greater than, less than, greater or equal) all options can be selected under Matches. However, always select Only for use with a numeric operator.

"Last run" time and "Matched" count updated in policies other than the trigger policy

In a case where an alert meets the incident-creation conditions of multiple policies, only one incident is created. However, all policies that proposed an incident, and the system incident creation policy, have the same Last run time on the Policies hub. Each of these policies also increments its Matched count by 1 in the Details tab of the side panel.

Double scroll bars on browser window

An extra scroll bar might appear on the right side of the browser window.

Policy processing failing due to long policy conditions

Policies that have conditions that are long (100,000+ characters) can cause policy processing to fail, resulting in no alerts or incidents being created. Failure can occur when temporal correlation training generates groups containing many alerts, and those alerts have long resource or type fields.

If this problem occurs, disable such policies from the automation policy UI. Identify the policies by completing the following steps:

  1. Filtering for analytics policies.
  2. Sorting by last run time (to identify those that have been processed recently and are likely triggering the problem).
  3. Viewing the specification for each to see whether any have a long condition.

Disable any policies that these steps identify.

Upgrading: customizations to preset policies are lost

When you upgrade, any renamed preset (or default) policies revert to their default policy names. Additionally, any customizations made to the following two preset incident creation policies are lost:

  • Default incident creation policy for high severity alerts
  • Default incident creation policy for all alerts

No ServiceNow ticket created by incident creation policy

An alert meets the incident-creation conditions of a policy, but no ServiceNow ticket is associated to the incident.

To avoid this, complete the following steps under actions in the policy editor:

  1. Click Assign and notify.
  2. Select In an existing channel.

Datarouting pods can fail and need to be restarted

After installation, data for display in the UI, such as in the policy list, can be missing or stale.

If this issue occurs, restart the datarouting pods. You can identify these pods by using the following command:

oc get pods | grep datarouting
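
The following sketch shows one way to restart them, assuming that deleting the pods is sufficient and that their controllers re-create them automatically:

for pod in $(oc get pods --no-headers | grep datarouting | awk '{print $1}'); do
  oc delete pod "$pod"
done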

Using default values in automatically run runbooks not working

This problem can occur when you select a Default parameter value while creating a policy to assign runbooks to alerts. The useDefault value is not passed during automatic execution, but is passed during manual execution.

As a workaround, run the runbook manually from the Runbooks page.

Netcool alert not suppressed when X in Y suppression policy conditions are met

If you create an X in Y suppression policy that matches an alert originating from an IBM Tivoli Netcool/OMNIbus environment, the alert will not be suppressed.

Secure Tunnel

Secure Tunnel connector is not running after a restart with Podman

This issue can occur when you install the Secure Tunnel connector on a host machine on which Podman is installed. When the host machine is rebooted and the Secure Tunnel connector is checked by using the podman ps -a command, the Secure Tunnel connector container does not show a running status.

If this issue occurs, the podman-restart service must be activated by using the systemctl command:

systemctl start podman-restart
systemctl enable podman-restart

After you enter the commands, check that podman-restart worked by using the following command:

systemctl status podman-restart

If the connector is still not running, try restarting the host machine.
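
As an optional check after the next reboot (a sketch; the exact connector container name can vary by installation), you can confirm that the service is enabled and list the container statuses:

systemctl is-enabled podman-restart
systemctl is-active podman-restart
podman ps --format '{{.Names}} {{.Status}}'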

Runbook Automation

When alert.suppressed value is used, runbook does not automatically run

Normally, you can select a runbook and configure it to run automatically: when an alert is converted to an incident, the runbook is assigned and runs automatically. However, if the parameter value alert.suppressed is used, the runbook does not run automatically because the value is read as a Boolean value rather than a string value. Therefore, you must run the runbook manually.

AIOps insights

Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events

In the AIOps insights dashboard, the Noise reduction chart normally indicates the number of alerts, events, and incidents reported over a specific time period. However, the inclusion of historical data that contains longstanding, unresolved alerts can skew the data that is presented on the chart: the number of incidents that are presented can exceed the number of alerts and events. Normally, the number of incidents is less than either, because events reduce to a smaller number of alerts and alerts reduce to a smaller number of incidents.

This anomalous incident number occurs because the reduction time frame covers alerts and events that are generated in the selected time period (for example, 7 Days). However, incidents are generated from all outstanding alerts, including historical alerts that are not resolved: alerts that occurred before the selected time period. So, in these circumstances, while the number of alerts and events is correct, the number of incidents is not.

AIOps insights dashboard fails to load even when data is available

Large amounts of data can cause the dashboard to fail to load or time out, with the message Error – Metrics unavailable displayed for each chart. The problem is a scaling issue: the AIOps insights dashboard cannot yet handle very large amounts of data. A possible workaround is to increase resources for the insights-api and elasticsearch pods. However, this approach might not be successful.

Events not showing up on Noise reduction chart

The charts in AIOps insights cover a timeline no greater than 30 days, and the dashboard reads the firstOccurenceTime value from only within that period. If an alert was created outside of that timeline and is deduplicated, it is not added to the eventCount in the AIOps insights Noise reduction chart. In this scenario, the eventCount for the alert increments in IBM Cloud Pak for Watson AIOps, but not in the Events segment of the Noise reduction chart.