Known issues and limitations
Review the known issues for IBM Cloud Pak® for Watson AIOps.
Additionally, review the troubleshooting documentation for help with common issues. For more information, see Troubleshooting.
- Install and upgrade
- Access control
- Observers and connections
- IBM Cloud Pak for Watson AIOps and Netcool Operations Insights
- Applications and topologies
- Infrastructure automation
- UI console
- AI model management and training
- ChatOps
- Incidents and alerts
- Policies
- Secure Tunnel
- Runbook Automation
- AIOps insights
Install and upgrade
- OpenShift Container Platform 4.10 FIPS limitation
- Limitation on number of instances
- Manual resource adjustments are not persisted
- Services fail to connect to Cassandra
- The ibm-aiops-orchestrator pod throws an OOMKilled error
- Automatic approval required for installation
- Kong gateway pod is stuck in CrashLoopBackOff restarting issue
- Elasticsearch health status yellow after restoring from a backup
- ChatOps Microsoft Teams integration does not work with a proxy server
OpenShift Container Platform 4.10 FIPS limitation
The ASM operator fails to create secrets on FIPS clusters for Red Hat OpenShift Container Platform Version 4.10.
Limitation on number of instances
IBM Cloud Pak for Watson AIOps and Infrastructure Automation can co-exist on the same cluster, but you cannot have multiple instances of IBM Cloud Pak for Watson AIOps or Infrastructure Automation on the same cluster.
Manual resource adjustments are not persisted
Custom patches, labels, and manual adjustments to IBM Cloud Pak for Watson AIOps resources (such as increased CPU and memory values) are lost when an event such as upgrade, pod restart, resource editing, or node restart triggers a reconciliation. Reconciliation causes any manually implemented adjustments to be reverted to their original default values. Depending on the parameters that you want to adjust, you might be able to use a custom profile to persist your changes. For more information about custom profiles, see Custom profiles.
Services fail to connect to Cassandra
After you install IBM Cloud Pak for Watson AIOps for a production environment deployment, various services might not be available due to connection issues with Cassandra. To resolve this issue if it occurs, restart Cassandra and the schema creation pods.
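The following is a minimal sketch of one way to perform that restart; the grep pattern, pod names, and namespace are placeholders (actual pod names vary by deployment):
# Find the Cassandra and schema creation pods in the installation project (names are deployment-specific).
oc get pods -n <namespace> | grep -E 'cassandra|schema'
# Delete the pods so that their controllers re-create them.
oc delete pod <cassandra-pod-name> <schema-pod-name> -n <namespace>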
The ibm-aiops-orchestrator pod throws an OOMKilled error
If your environment has many secrets and ConfigMaps, when the ibm-aiops-orchestrator
(lead operator) attempts to build its cache, the operator can exceed its memory allocation and cause a Kubernetes out-of-memory error for the
container. This error can prevent the IBM Cloud Pak for Watson AIOps installation from reconciling, blocking the installation from completing.
If you encounter this issue, the operator requires more memory resources to build its cache. Override the subscription resource to increase the memory limits for the pod and avoid the out-of-memory issue.
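One way to apply such an override is through the Operator Lifecycle Manager subscription configuration. The following sketch is illustrative only; the subscription name, namespace, and memory sizes are placeholders and assumptions, so adjust them to match your deployment.
# Patch the orchestrator operator subscription so that OLM redeploys the operator pod with larger memory limits.
# The subscription name, namespace, and sizes below are placeholders.
oc patch subscription <orchestrator-subscription-name> -n <operator-namespace> --type merge -p '{"spec":{"config":{"resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}}}'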
Automatic approval required for installation
The use of manual approval strategies for InstallPlans in a project (namespace) can affect the IBM Cloud Pak for Watson AIOps installation.
For instance, if you use manual approval for any of your InstallPlans to install operators in All Namespaces mode (cluster scope), the manual approval can affect your installation. The installation of IBM Cloud Pak for Watson AIOps requires automatic approval.
Kong gateway pod is stuck in CrashLoopBackOff restarting issue
In some cases, the Kong gateway pod might have problems reaching a ready state. If this issue occurs, the Kong gateway pod can get stuck in CrashLoopBackOff and keep restarting. If you check the Kong gateway pod, you can see an error message similar to the following:
bind() to unix:/usr/local/kong/stream_rpc.sock failed (98: Address already in use)
This issue occurs when the nginx process in the Kong gateway proxy container has a problem. To resolve this issue, manually delete the Kong gateway pod with the following command:
oc delete pod gateway-kong-xxxxxxxxx-xxxxx
Where gateway-kong-xxxxxxxxx-xxxxx
is the name of the pod.
Elasticsearch health status yellow after restoring from a backup
When you are restoring an Elasticsearch backup to a new single node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.
ChatOps Microsoft Teams integration does not work with a proxy server
If you have an offline (air-gapped) deployment of IBM Cloud Pak for Watson AIOps or an environment that uses a proxy server, then you cannot use the ChatOps Microsoft Teams connection. The use of a proxy with the ChatOps Microsoft Teams connection is not supported.
Access control
- Automation Analyst role unused in IBM Cloud Pak for Watson AIOps
- Service Administrators cannot manage others' roles or view role details
- Users from a user group remain after user group is deleted
- UI redirects to unexpected page when logging in after a session timeout
- Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page
Automation Analyst role unused in IBM Cloud Pak for Watson AIOps
By default, an Automation Analyst role is displayed within the IBM Cloud Pak Automation console Access control page when you are assigning a role to a user. This default role is used within the IBM® Automation family of offerings, which includes IBM Cloud Pak for Watson AIOps; however, this role is not used within IBM Cloud Pak for Watson AIOps.
This role does not include or provide any permissions within IBM Cloud Pak for Watson AIOps and should not be assigned to users within IBM Cloud Pak for Watson AIOps.
Service Administrators cannot manage others' roles or view role details
Users with the Service Administrator role do not have permission to add or update a role, or to view the details of a user's assigned role. If a user with the Service Administrator role selects to view details about a role, a 401 error page is displayed instead.
Users from a user group remain after user group is deleted
When you delete a user group, the users that were included in the group remain in your list of users. Any role that is inherited through the deleted user group is removed from the users. If the users were assigned roles individually, they continue to have those roles and can continue to log in to the UI console and complete tasks. If the users that were in the deleted user group need to be removed completely, an administrator needs to manually remove the users. Users can be removed by clicking the Delete icon for the user's entry within the list of users on the Access control Users tab.
UI redirects to unexpected page when logging in after a session timeout
After a session timeout occurs and a user logs in to the UI console again, the user can be redirected to a different page than the page that they were on when the timeout occurred. For instance, a user that was working on the AI Model Management training page when their session timed out might be redirected to a graphql playground page after logging back in. This redirect occurs because the UI uses the last request URL that included the expired token to identify where to redirect the user when the user logs back in. If this redirect occurs, the user needs to manually return to the expected page in the UI to continue working.
Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page
When you have users within a user group and view the Manage assignees pane of the Incidents and alerts page, you might not see some users listed. This error can occur when the users from the LDAP user group are not individually onboarded. To verify whether a user is onboarded, go to the Access control > Users tab and check whether the user is listed. If the user is not listed, that user must first log in to the console, which validates their roles and permissions. After logging in, the user is displayed in the list of users and on the Manage assignees pane.
The Manage assignees pane is viewable from the list of all Incidents. Select an incident and then click Manage assignees. After you select an existing user group, you should see the included users listed.
Observers and connections
- ServiceNow Observer UI displays superfluous characters
- Scheduled job for ServiceNow observer fails after upgrade
- 'Failed to read certificate' error
- Duplication of resources in the topology if certain observer job parameters are changed after a job has been run
- Incomplete historical data processing in the event of connector pods restarting
- File and Rest observer topology service location URL not accessible
- Connector console displays special characters incorrectly
- Turbonomic integration with Watson AIOps affects other integrations in Turbonomic
- No notification in Watson AIOps on Turbonomic actions closed without execution
- New Relic observer does not support dashboard tokens for new users
- All dates and time are in US-en format
- AppDynamics historical start date and time cannot be older than 4 hours
- AppDynamics live mode aggregation interval is 1 minute
- The observer-service pod is in a crash loop due to a ghost vertex
- Alerts for Instana without associated topologies
- Alerts for Instana topology not mapping properly
- Instana metric collection API rate limit exceeded error
- IBM Cloud Pak for Watson AIOps cannot close some alerts when an Instana connection exists
- Changing the codec for a connection can cause errors
- Delay query time for connections
- When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run
- Log Anomaly 8k limit on field mapping in details field of the alert schema
- Log connection does not start when multiple log connections are active
- After a restore, data from connections is not processed
- Turning on a disabled Dynatrace connection that collects live data results in an error
- Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled
- Cannot edit or disable AppDynamics or Dynatrace connection due to 400 Bad Request error
- Dynatrace connector pod restarted and does not retrieve all historical data
- AppDynamics and Dynatrace: unable to create a connector with description or special characters in name
- ServiceNow user account gets locked out after a few hours
- Scale resources when running log anomaly training on large data
- The connection status for Elk, Custom Logs, LogDNA, and Falcon LogScale sometimes shows 'not running' even though the Flink job and gRPC pod are running correctly
- Log data connections status is "Done" even though historical data is still loading
- IBM Tivoli Netcool/Impact connection stops event processing with exceptions
- Impact connector fails for Impact server with non-default cluster name
ServiceNow Observer UI displays superfluous characters
If the Entity_mapping field is updated from the Observer UI, superfluous curly brackets and quotes are displayed in the Resource type mapping field.
Workaround: This is a cosmetic issue and can be ignored.
Scheduled job for ServiceNow observer fails after upgrade
Following the latest upgrade, a previously scheduled ServiceNow Observer job enters an Error state when it runs.
Cause: An observer job was started with tables that are no longer recognized after the upgrade.
Workaround:
- On the Data and tools connections page, select the ServiceNow connection that was created before the upgrade and click Edit on the overflow menu.
- Navigate to the Collect topology data (optional) page and specify the values for the 'ServiceNow tables to be discovered by observer' and 'Maximum number of tables per cmdb_ci_rel api batch call' fields.
- Click Next, and then Save to save the changes. The job should now run correctly as scheduled.
'Failed to read certificate' error
This error can occur when an observer attempts to create an SSL certificate and the endpoint server does not respond.
In the example error message below, a vmvcenter.crt certificate error occurs because the endpoint server does not respond.
Failed to read certificate for field [certificate].
The file 'vmvcenter.crt' could not be found under /opt/ibm/netcool/asm/security/.
Workaround: Ensure the endpoint server is running correctly.
Duplication of resources in the topology if certain observer job parameters are changed after a job has been run
Certain resource parameters are used to uniquely identify a resource. If one of these parameters is changed after the initial job run, then any subsequent job run will result in duplicate records. For example, if the parameter of 'hostname' is replaced with 'Ipaddress' after a topology has been created, a subsequent discovery will consider the resource as new, and create a duplicate record.
The following resource parameters uniquely identify a resource. Changing them after the initial job has been run will result in duplicate records.
Workaround: If you need to modify these values, do not modify the existing job. Instead, create a new job.
Observer | Job parameter |
---|---|
ALM | n/a |
AppDynamics | account |
AWS | region, dataTenant |
Ansible AWX | host, user |
Azure | data_center |
BigFix Inventory | data_center |
Big Cloud Fabric | proxy-hostname, proxy-username, bcf-controllers |
Ciena Blue Planet | data_center, tenant |
Cisco ACI | tenant_name |
DNS | addressTypes, server, port, recurse |
Docker | endPoint.port |
Dynatrace | datatenant, hostname |
File | provider, file |
GitLab | datatenant, hostname |
GoogleCloud | project_id |
HPNFVD | datacenter, username, cnf_job |
IBM Cloud | instance, username, region |
ITNM | instance |
Jenkins | jenkins_observation_namespace |
Juniper CSO | cso_central_ms_url, user_domain_name, domain_project_tenant_name |
Juniper Contrail | api_server_url, os_project_name, os_tenant_name |
Kubernetes | data_center, namespace |
NewRelic | accountName, accountId |
OpenStack | data_center, os_project_name |
Rancher | accessKey, clusterId |
REST | provider |
SDC ONAP | host, username |
ServiceNow | instance_url, username |
SevOne | datatenant, hostname |
TADDM | api_url, username |
Viptela | data_center |
VMware NSX | data_center |
VMware vCenter | data_center |
Zabbix | data_center |
Incomplete historical data processing in the event of connector pods restarting
If you create a connection to collect historical data for Metric Anomaly AI Training, you might come across an issue where the connector pod restarts, but does not retrieve all historical data for training. As a result, you might suffer data loss.
A connector pod can restart due to outages, the target system crashing, or pod crashes in the environment. This issue can occur intermittently, depending on the number of metrics that are selected for the connection and the amount of data to be retrieved.
File and Rest observer topology service location URL not accessible
When creating an edge via either the File or Rest observers, the POST request returns a Topology service location URL that is not accessible. The URL cannot be used to manage the edge because the relevant API is not exposed.
Workaround: None.
Connector console displays special characters incorrectly
If you use special characters in the Name and Description fields while creating or editing a connection, the Connector console might display the special characters incorrectly. Nevertheless, the connection is saved.
Turbonomic integration with Watson AIOps affects other integrations in Turbonomic
The Turbonomic integration with IBM Cloud Pak for Watson AIOps enables notifications to be sent to AIOps through the enabled webhook workflow when actions are created or executed in Turbonomic. However, Turbonomic allows only one webhook workflow per action. Therefore, other integrations that are enabled in Turbonomic, like ServiceNow, might not get any notification when actions are created or executed in Turbonomic.
No notification in Watson AIOps on Turbonomic actions closed without execution
Watson AIOps does not receive any notification from Turbonomic for actions that are closed without being executed. For example, an action related to an erroneous condition that is no longer occurring gets automatically closed in Turbonomic, but its corresponding AIOps alert remains open indefinitely and must be cleared manually from the console.
New Relic observer does not support dashboard tokens for new users
For new users of the New Relic observer, the observer does not work as it no longer supports the New Relic One dashboard token. However, it will continue to work for existing users who are using the old token that was generated previously through the old dashboard.
All dates and time are in US-en format
When you are scheduling data collection for a connection, all dates and times are presented in the US-en formats:
- All dates are configured and presented in the mm/dd/yyyy format.
- All times are configured and presented in the hh:mm AM/PM 12-hour clock format.
You cannot switch the date or time format.
AppDynamics historical start date and time cannot be older than 4 hours
The historical start date and time is configurable, but if you set it to more than 4 hours in the past, the connector ignores it and retrieves only the past 4 hours of data.
AppDynamics live mode aggregation interval is 1 minute
In live mode, the only aggregation interval allowed is 1 minute.
The observer-service pod is in a crash loop due to a ghost vertex
If you notice that the topology observer-service pod is not functioning correctly and that restarting the pod does not correct the issue, a ghost vertex might need to be removed. To remove the vertex, you need to traverse an edge to the vertex, and then delete the vertex. To traverse to the vertex, use the type vertex and definesType edge.
-
Run the following command to find the ID for the type vertex.
oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/types?_filter=keyIndexName=ASM::entityType::mgmtArtifact::ASM_OBSERVER_JOB' -u <username>:<password> --insecure
Where
<username> - Your Topology AIOps API username
<password> - Your Topology AIOps API password
-
Run the following command to use the definesType edge to get the ID for the vertex that is causing the issue.
oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType' -u <username>:<password> --insecure
Where
<type vertex ID> - The ID for the type vertex that you retrieved in step 1.
<username> - Your Topology AIOps API username
<password> - Your Topology AIOps API password
-
Run the following command to delete the vertex.
oc exec -it <topology pod> -- curl -X DELETE --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType?_delete=nodes&_delete_self=false&_node_label=vertex' -u <username>:<password> --insecure
Where
<type vertex ID> - The ID for the type vertex that you retrieved in step 1.
<username> - Your Topology AIOps API username
<password> - Your Topology AIOps API password
Changing the codec for a connection can cause errors
When you are creating or editing a connection with the Data and tools connections tool, avoid changing the codec property value that is set in the mapping field. This property is not a configurable property. The IBM Cloud Pak Automation console
sets the correct codec for the connection. If you change the value, the connection might fail, or it might not retrieve data for IBM Cloud Pak for Watson AIOps. If you do need to set or change the value, ensure that you use the correct codec
for the connection, such as splunk
for a Splunk connection, elk
for an ELK connection, or Falcon LogScale
for a Falcon LogScale connection.
Search on the Add connections page does not function as expected
When you search on the Add connections page, you can see connectors from categories other than the category under which you are searching. The search shows connectors that match the search query, regardless of the category of the connector.
Delay query time for connections
If you set a connection to retrieve Live data for continuous AI training and anomaly detection or Live data for initial AI training, you might need to configure a delay to offset the query time window to provide a time buffer for preventing the partial retrieval of real-time data. You cannot configure this delay within the UI console. You must use a command line to configure the delay. For more information about configuring this delay, see Delay configuration in data connections.
Alerts for Instana without associated topologies
In some cases, Instana alerts will not have associated topologies. In most cases, this happens because the resource that originated the event is no longer available in Instana. For example, a pod that triggers an Instana event can be redeployed by the underlying Kubernetes engine.
Alerts for Instana topology not mapping properly
In some cases, alerts are not mapped correctly to a corresponding Instana topology node. For example, alerts generated from log anomaly detection or metric anomaly detection (or other sources) might not show as associated with an Instana topology node.
As a workaround, you need to define your own match rules to correlate with the source data. To define a match rule, click Resource management, then click Settings > Topology configuration, and then click Configure on the Rules tile. When you are configuring the match token values to use, the values depend on the data that you are sending to Instana.
Instana metric collection API rate limit exceeded error
The recommended rate limit is double the number of resources. There might be situations where the limits need to be increased. When a more precise limit is required, use the following formula to determine the limit to use:
number-of-metric-API-calls-per-hour ~= (number-of-selected-technologies x 2) x (snapshots-for-selected-technologies / 30) x (60 / collection-interval)
number-of-topology-API-calls-per-hour ~= (number-of-application-perspectives x (60 / collection-interval)) +(number-of-services x (60 / collection-interval))
number-of-events-API-calls-per-hour = 60
total= number-of-metric-API-calls-per-hour + number-of-topology-API-calls-per-hour + number-of-events-API-calls-per-hour
Note: Each plugin can have a different number of metrics collected. The mean value across these is used, which is 2 collection cycles per plugin. If the environment is unbalanced, for instance if you have mostly hosts, which define the most metrics, then the formula might underestimate the required limit.
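For example, for a hypothetical environment with 3 selected technologies, 300 snapshots in total, a 5-minute collection interval, 2 application perspectives, and 10 services, the formula gives approximately:
number-of-metric-API-calls-per-hour ~= (3 x 2) x (300 / 30) x (60 / 5) = 720
number-of-topology-API-calls-per-hour ~= (2 x (60 / 5)) + (10 x (60 / 5)) = 144
number-of-events-API-calls-per-hour = 60
total ~= 720 + 144 + 60 = 924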
To determine the number of resources (snapshots) for each infrastructure plugin, use the following API:
api/infrastructure-monitoring/snapshots?plugin=technology_name
Example:
api/infrastructure-monitoring/snapshots?plugin=host
For more information about Instana APIs, see Instana API.
The following example curl commands allow you to retrieve the number of:
-
snapshots-for-selected-technologies (such as host)
curl -k -s --request GET 'https://<instana server hostname>/api/infrastructure-monitoring/snapshots?plugin=host' --header 'Authorization: apiToken <api token>' | jq '.items|length'
-
number-of-application-perspectives
curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/applications' --header 'Authorization: apiToken <api token>' | jq '.items|length'
-
number-of-services
curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/services' --header 'Authorization: apiToken <api token>' | jq '.items|length'
IBM Cloud Pak for Watson AIOps cannot close some alerts when an Instana connection exists
If you have an Instana connection created and have Instana 221 (SaaS or self-hosted), you might encounter an issue where IBM Cloud Pak for Watson AIOps might not be able to close some alerts. Instead, you might need to check the status for the associated event within the Instana dashboard and clear the alert manually. For more information, see Troubleshooting connections: Instana Event integration.
When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run
If you create a Kubernetes observer job with the name weave_scope, load, kubeconfig, or local, the job always fails to run. When this error occurs, you can view an error icon in the schedule column for the job. To avoid this issue, do not use these names for the observer job.
Log Anomaly 8k limit on field mapping in details field of the alert schema
The limitation is that the data layer imposes an 8 KB size limit on the details field in the alert schema. The details field is populated by the log anomaly event, which provides the relevant information to display in Slack when viewing alerts in ChatOps. Whenever the details field size exceeds 8 KB, the returned JSON object is truncated. Therefore, when the user clicks view alerts to retrieve the alerts related to an incident, the expected results are not seen and an error is recorded.
The current fields under the details object are:
end_timestamp: int
original_group_id: str
causality: dict
detected_at: float
source_application_id: str
log_anomaly_confidence: float
log_anomaly_model: List[str]
prediction_error: dict
error_templates: List[int]
count_vector: List[int]
text_dict: dict
application_group_id: str
application_id: str
model_version: str
severity_from_model: int
description: str
Log connection does not start when multiple log connections are active
If you have many active log connections, such as multiple Kafka, ELK, Splunk, and Falcon LogScale connections, and you create and enable another Falcon LogScale connection, you might notice that the connection status is stuck in an error or restarting state. This state can occur even after the connection was operating as expected.
This issue can occur if you exceed the limit for the number of jobs that can run on the underlying service, which results in insufficient resources being available to start the connector. To resolve this issue, complete one or more of the following tasks:
- Increase the replica count of your task managers.
- Increase the task manager count per replica.
- Change the parallelism of your connections.
- Cancel other connections.
After a restore, data from connections is not processed
If you have connections that you are restoring, the status for these connections can be in an error state after the restore process completes. To resolve this status, you need to edit and save your connections with the Data and tool connections tool in the IBM Cloud Pak Automation console. Editing the connection regenerates the associated Flink job for the connection, which updates the status.
Turning on a disabled Dynatrace connection that collects live data results in an error
If you have a connection to Dynatrace enabled for live data collection and then disable the connection, enabling the connection again can result in a java.lang.NullPointerException
error occurring. If this error occurs, delete
and then create the connection again to enable the Dynatrace data collection.
Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled
If you enable historical data flow for a ServiceNow connection, you might notice that the historical data is not collected from ServiceNow. For instance, when you check the grpc-snow
pod, you can see ticket data available, but
when you check the Flink job or in Elasticsearch, you can notice that no data was collected. If this issue occurs, turning off the historical data flow and turning it back on can cause the data to begin to be collected.
Cannot edit or disable AppDynamics or Dynatrace connection due to 400 Bad Request error
If you have created an AppDynamics or Dynatrace connection, you might not be able to disable the connection or edit the collection mode, such as to change the mode from historical to live or from live to historical. If this issue occurs, a 400 Bad Request error message is displayed when you attempt to disable or edit the connection. Instead of disabling the connection, delete the connection and create a replacement connection when needed. As a workaround if you cannot edit the connection, you can create a replacement connection with your preferred settings.
Dynatrace connector pod restarted and does not retrieve all historical data
If you have a Dynatrace connection created and pull historical data with multiple metrics for Metric Anomaly AI Training, you can encounter an issue where the Dynatrace pod restarts, but does not complete retrieving the expected historical data for training. This issue can occur intermittently, depending on the number of metrics that are selected for the connection and the amount of data to be retrieved.
If this potential out-of-memory or out-of-resources issue occurs, consider creating separate connections to monitor different and smaller sets of metrics. By splitting the connections, you can reduce the amount of data to be retrieved through the initial connection that can cause this issue.
AppDynamics and Dynatrace: unable to create a connector with description or special characters in name
You cannot use spaces or special characters for names for AppDynamics and Dynatrace connectors. You can only use alphanumeric values.
ServiceNow user account gets locked out after a few hours
If there is an active ServiceNow connection with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script called "SNC User Lockout Check", which locks users out after more than 5 failed attempts (including any failed API calls).
If you check the Incidents and alerts page, you will also see an alert that says "ServiceNow instance authentication failed".
When this problem occurs, unlock the user in ServiceNow. Then change the password in the ServiceNow connection and save. When authentication fails in the ServiceNow connector, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.
Scale resources when running log anomaly training on large data
In some cases, log anomaly training fails on large data because of out-of-memory (OOM) errors or problems with Elasticsearch (ES) shards. The solution is to scale up the resources to handle large data training.
For more information about shard management, see About indices and shards. For more information about increasing ES Resources, see Log anomaly training pods CPU and Memory resource management.
The connection status for Elk, Custom Logs, LogDNA, and Falcon LogScale sometimes shows 'not running' even though the Flink job and gRPC pod are running correctly
On creating a connection, the Flink job retrieves data normally and the gRPC pod is running without error. However, the console shows that the connection status is 'not running'.
Log data connections status is "Done" even though historical data is still loading
When a log data connection (Falcon Logscale, ELK, LogDNA, Custom, Splunk) is running in Historical data for initial AI training mode, and a custom regex is added in the field_mapping
section, the data processing
can take a long time. Although the Data collection status might be shown on the UI as Done, data could still be being processed and written to Elastic in the background.
To speed up this process, you can increase the Base parallelism number that is associated with that connection. For more information, see Increasing data streaming capacity.
IBM Tivoli Netcool/Impact connection stops event processing with exceptions
If you have an IBM Tivoli Netcool/Impact connection, you can encounter an issue where the connection temporarily stops processing during the sending of an event stream to IBM Cloud Pak for Watson AIOps.
This issue can occur when you have an IBM Cloud Pak for Watson AIOps policy that triggers an IBM Tivoli Netcool/Impact policy when certain types of events are received. If this issue occurs and stops the event processing, the Impact connector logs or Impact policylogger logs can include messages that are similar to the following example exceptions:
[6/14/23, 11:38:45:816 UTC] 0000005d ConnectorMana W failed to send status update
...
[6/14/23, 11:38:45:815 UTC] 000023ca StandardConne W configuration stream terminated with an error
...
[6/14/23, 11:38:45:816 UTC] 000023cc GRPCCloudEven W consume stream terminated with an error: channel=cp4waiops-cartridge.lifecycle.output.connector-requests
If you encounter this issue, wait for it to resolve. The issue resolves itself over time, after which the policy is invoked and the event stream processing resumes.
Impact connector fails for Impact server with non-default cluster name
If the Netcool/Impact cluster uses a non-default cluster name (the default is "NCICLUSTER"), the connector might fail to validate the connection. The Impact server might report DynamicBindingException errors in the impactgui.log:
com.micromuse.common.nameserver.DynamicBindingException: DynamicBindingException: Service [NCICLUSTER] not in nameserver.
To resolve the issue, wait for the backend Impact server to finish initializing before starting or restarting the Impact GUI server.
If Netcool/Impact is running fix pack 7.1.0.26 or later, you can also resolve the issue by setting the nameserver.defaultcluster
property in the GUI server. Add the following line to $IMPACT_HOME/etc/nameserver.props:
impact.nameserver.defaultcluster=CLUSTERNAME
where CLUSTERNAME is the name of the Impact cluster.
IBM Cloud Pak for Watson AIOps and Netcool Operations Insights
- IBM Cloud Pak for Watson AIOps Strimzi Kafka topics created without replication
- IBM Cloud Pak for Watson AIOps pods not starting after a cluster restart
- Error when accessing the AI Model Management
- NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator
- Rare issue: unable to deploy log anomaly detection model
IBM Cloud Pak for Watson AIOps Strimzi Kafka topics created without replication
IBM Cloud Pak for Watson AIOps supports multiple replications of Kafka topics for large production installations, such as for data redundancy. The IBM Cloud Pak Automation console can automatically create Kafka topics when connections are created. When a topic is dynamically created by the IBM Cloud Pak Automation console, the replication is set to 1 in the controller. As such, Kafka topics that are created during installation can have multiple replicas, but topics that are created dynamically do not.
If you are implementing a production (large) deployment of IBM Cloud Pak for Watson AIOps, you might lose data if your Kafka pods fail or restart. If the data flow is enabled in your Kafka connection when the Kafka pods go down, you might experience a gap in the data that your connection generated during that down period. Upgrades or updates to workers can cause a Kafka broker restart.
You can manually modify the Kafka topic replication inside the Kafka container from a value of 1
to 3
to mitigate any potential data loss from this issue.
IBM Cloud Pak for Watson AIOps pods not starting after a cluster restart
When the cluster is restarted, all nodes have a STATUS of Ready. Most pods return to a STATUS of Running except for some IBM Cloud Pak for Watson AIOps pods.
One potential cause is that Elasticsearch must be up and running before the IBM Cloud Pak for Watson AIOps pods can start.
Restart the Elasticsearch pod to get all pods back to a STATUS of Running.
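For example (the grep pattern, pod name, and namespace are placeholders; actual pod names vary by deployment):
# Locate the Elasticsearch pods and delete them so that they restart.
oc get pods -n <namespace> | grep elasticsearch
oc delete pod <elasticsearch-pod-name> -n <namespace>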
Error when accessing the AI Model Management
The AI Model Management can fail to load when you click to open the tool from the Quick navigation links on the Home page.
This can occur if a network disruption occurred during installation. The disruption can result in the cluster becoming inaccessible and cause some steps to be missed during the platform UI startup, which can leave expected Nginx rules missing.
To check whether the rules are missing, complete the following steps:
-
Open a command line and connect to your cluster with the oc login command.
-
Use the oc project command to set the context to the project where IBM Cloud Pak for Watson AIOps is deployed.
-
Locate an ibm-nginx pod within your deployment:
ibm_nginx_pod=$(oc get pods | grep ibm-nginx | head -1 | cut -f1 -d\ )
echo $ibm_nginx_pod
-
Run the following oc exec command for one of the ibm-nginx pods to run commands against that pod.
oc exec $ibm_nginx_pod -- nginx -T | grep aimodel
After the command runs, check whether there is an ingress rule or entry for aimodels. Your output can resemble the following sample output:
location ~* (/aiops/([^/]+)/aimodels|/aiops/aimodels) {
nginx: the configuration file /usr/local/openresty/nginx/conf/nginx.conf syntax is ok
nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test is successful
If you do not see an aimodels rule, complete the following steps to add the rule.
-
Open the platform UI extension configuration for the AI Model Management for editing:
oc edit cm aiops-ai-model-ui-zen-extension
-
Increment the icpdata_addon_version version metadata label within the configmap to be 3.3.2.
icpdata_addon_version: 3.3.2
NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator
NOIHybrid is incorrectly included in the Provided APIs list for the Netcool Operations Insight operator. This list is displayed in the Red Hat OpenShift Container Platform web console under Installed Operators > Netcool Operations Insight > Operator Details. Do not use NOIHybrid APIs.
Rare issue: unable to deploy log anomaly detection model
Very occasionally, after log anomaly detection training completes successfully, an error similar to the following is displayed on the AI management UI training page when you attempt to deploy the model.
Error
Model deployment failed
Within the error textbox, you will also see the text "Forbidden".
If you investigate the aiops-ai-model-ui pod logs, you will also see the following error.
ForbiddenError: invalid csrf token
If this occurs, first refresh the browser and try to deploy again.
If that does not remedy the situation, then log out and log back in, and then try to deploy the model again.
Applications and topologies
- JVM heap out-of-memory (OOM) failures when loading large number of resources
- Critical error message displayed when attempting to render an application
- Different date and time in Automation console and ChatOps between users
- Deleting a tag template can cause out-of-memory errors
- The Find Path tool ignores filters
- Fault localization and blast radius are not producing accurate results
- High volumes of data can cause Spark workers to run out of memory
- Probable cause is not producing accurate results
- Azure observer missing subnet relationship in topology
- Topologies not visible on Incident Topology page after resource merge
- Openstack observer missing edge-runsOn connectivity in topology
JVM heap out-of-memory (OOM) failures when loading large number of resources
When running topology loads in quick succession, it is possible to experience some OOM errors and undesired topology pod restarts, even though the pods will continue the processing after restarting.
This error can occur when running resource loads of several million resources in a large deployment, and the errors can slow down the loading process. The following type of error message can be seen in the pod logs:
WARN [2022-10-25 15:43:31,906] [JanusGraph Session-io-4] c.d.o.d.i.c.c.CqlRequestHandler - Query ‘[4 values] SELECT column1,value FROM janusgraph.graphindex WHERE key=:key AND column1>=:slicestart AND column1<:sliceend LIMIT :maxrows [key=0x02168910cfd95b7e3bc74006a4a8a73a79c71255a0726573...<truncated>, slicestart=0x00, sliceend=0xff, maxrows=2147483647]’ generated server side warning(s): Read 5000 live rows and 1711 tombstone cells for query SELECT value FROM janusgraph.graphindex WHERE key = 02168910cfd95b7e3bc74006a4a8a73a79c71255a07265736f757263e5 AND column1 > 003924701180012871590290012871500290 AND column1 < ff LIMIT 5000; token 9157578746393928897 (see tombstone_warn_threshold) JVMDUMP039I Processing dump event “systhrow”, detail “java/lang/OutOfMemoryError” at 2022/10/25 15:43:32 - please wait. JVMDUMP032I JVM requested System dump using ‘/tmp/cassandra-certs/core.20221025.154332.1.0001.dmp’ in response to an event
Cause: Not enough headroom exists between JVM memory limit and the pod memory limit, usually because one was increased without also increasing the other.
Workaround: Ensure that any changes in heap size maintain enough headroom between these settings.
Example: In this example (for a topology size1) the pod limits are set to 3.6 GB while the maximum memory for the JVM (-Xmx
) is set to 3 GB, thereby leaving 0.6 GB of headroom free for use by the OS.
size1:
enableHPA: false
replicas: 2
jvmArgs: "-Dcom.ibm.jsse2.overrideDefaultTLS=true -Xms1G -Xmx3G"
resources:
requests:
memory: "1200Mi"
cpu: "2.0"
limits:
memory: "3600Mi"
cpu: "3.0"
Critical error message displayed when attempting to render an application
This problem occurs when all of the groups within the application that you are attempting to render have no members. When the application is selected in Application management, it does not render and a critical error message is displayed on the UI.
Avoid creating applications with no members. If an application with no members was created for test purposes only, then ignore this error.
Different date and time in Automation console and ChatOps between users
The date and time format for an Incident in the IBM Cloud Pak Automation console Application management tool and the associated ChatOps notification can be different between users. The format and time zone that is used in the Automation console and ChatOps notification is set to the user's locale. If different users are in different time zones, the displayed date and time are different in the Automation console and ChatOps notification.
Deleting a tag template can cause out-of-memory errors
If a tag is applied to a large number (that is, thousands) of topology resources, then deleting the tag template can cause out-of-memory errors with the topology-merge
pod.
Avoid creating tag templates that use tags that occur with such frequency. Also, do not tag thousands of resources with the same tag, and avoid using such tags in a group.
The Find Path tool ignores filters
The topology path tool fails to launch with filters applied.
Launch the path tool without filters, then manually apply the filter settings on the path page.
Probable cause is not producing accurate results
The correlation algorithms for probable cause currently require the use of a Kubernetes model with service-to-service relationships, or the use of dependency relationships between non-Kubernetes resources.
Complete the following steps to create the required relationships for Kubernetes. This procedure configures Topology Manager to overlay relationships provided by the File observer onto the Kubernetes topology.
Note: The Kubernetes observer must be configured and loading data.
-
Log in to the IBM Cloud Pak Automation console.
-
From the main navigation, expand Operate and click Topology viewer.
-
From the topology navigation toolbar, expand Settings, click Topology configuration.
-
On the Rules tile, click Configure to navigate to the Rules administration page.
-
On the Merge tab, click New to create a New merge rule.
In this scenario, data that is provided by the File observer will be used to add the relationships.
-
Specify the following information on the New merge rule page:
-
Rule name: k8-file-service.
-
Set the rule Status to Enabled.
-
Add the uniqueId property to the set of Tokens.
-
Expand the Conditions section and select File and Kubernetes from the set of available Observers and click Add.
-
Specify service for Resource types and click Add.
-
Click Save to save the new Merge rule.
-
Locate the services that you want to relate together and make a note of their source-specific uniqueId, such as 05f337a1-5783-43bb-9323-dfba941455c7 (shipping) and ae076382-3df9-46cb-97e9-a0342d219efb (web).
-
Create a file for the File Observer that contains the service-dependsOn-service relationships necessary for the correlation algorithms to work.
The following example creates two services, web and shipping, and states that web dependsOn shipping. Repeat this as required to relate your services together.
V:{"uniqueId": "05f337a1-5783-43bb-9323-dfba941455c7", "name": "shipping", "entityTypes": ["service"]}
V:{"uniqueId": "ae076382-3df9-46cb-97e9-a0342d219efb", "name": "web", "entityTypes": ["service"]}
E:{"_fromUniqueId":"ae076382-3df9-46cb-97e9-a0342d219efb", "_edgeType":"dependsOn", "_toUniqueId":"05f337a1-5783-43bb-9323-dfba941455c7"}
-
Load this file into Topology Manager to relate the services. For more information, see Configuring File Observer jobs.
If your topology changes, then re-create and reload the file as required. A similar process can be followed for non-Kubernetes sources.
High volumes of data can cause Spark workers to run out of memory
If your environment handles high workloads of alerts or events (10+ million), your Spark workers can run out of storage (ephemeral storage). If you encounter this issue, restart the affected Spark workers. This issue can also occur if you are running multiple jobs, which can cause the file system to fill, such as with log or JAR files.
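A minimal sketch for restarting the affected workers follows; the grep pattern, pod names, and namespace are placeholders and actual names vary by deployment.
# Find the Spark worker pods and delete the affected ones so that they are re-created with clean ephemeral storage.
oc get pods -n <namespace> | grep spark
oc delete pod <spark-worker-pod-name> -n <namespace>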
Azure observer missing subnet relationship in topology
For the Azure Observer, a subnet can be intermittently missing the relationship with an IP address in the topology for a resource. While the relationship can be intermittently missing, both the subnet and IP address vertices remain available in the topology.
Topologies not visible on Incident Topology page after resource merge
When resources from two observer sources have been merged using the topology merge functionality, the topology is no longer displayed in the Incident view. This known issue affects only the Incident view, and the topology is still present in all other views.
Openstack observer missing edge-runsOn connectivity in topology
After running an Openstack observer job, the edge-runsOn connectivity between ComputeHost and Hypervisor elements is not shown in Resource Management -> Resources when it should be.
Infrastructure Automation
- Kubernetes permissions are missing for user roles for using Managed services and the Service catalog
- Non-LDAP users cannot access Infrastructure management
- Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported
- Users are redirected to the Administration panel when logging back into the UI
- Managed services secrets and certificates cannot be customized during installation
- Database fails to reset when error occurs during database creation for Infrastructure management
- The cam-tenant-api pod is not in a ready state after installing the iaconfig CR
- Print or export as PDF entire tables does not work as expected
- Infrastructure management log display in the UI is removed
- Infrastructure Automation Test deploy fails
- On Infrastructure management appliances, an Ansible playbook deployment fails
- After restoring from a backup, the Managed services deployment can fail
Kubernetes permissions are missing for user roles for using Managed services and the Service catalog
If you install Infrastructure Automation, you, or an administrator, must add the required Kubernetes permissions to user roles before your users can begin to access and use Managed services or the Service catalog.
As an administrator, add the following permissions to your user roles:
Role | Required permission for Infrastructure Automation |
---|---|
Automation Administrator | Administer Kubernetes resources |
Automation Operator | Manage Kubernetes resources |
Automation Developer | Edit Kubernetes resources |
Automation Analyst | View Kubernetes resources |
For more information about how to add permissions to a role, see Managing roles for Infrastructure Automation.
Non-LDAP users cannot access Infrastructure management
Non-LDAP authenticated users cannot be used with Single Sign-On for Infrastructure management. This is a limitation. While logged in to the Infrastructure Automation UI console with a non-LDAP user, attempting to start Infrastructure management fails with an error.
The error states:
OpenID Connect Provider error: Error in handling response type.
Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported
Installation of Infrastructure management in IBM Cloud Pak for Watson AIOps does not support Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core. You can continue to use the Kubernetes cluster life-cycle templates and services to create a Kubernetes cluster and import the cluster to an existing installation of Red Hat Advanced Cluster Management, if an installation is available. Deploying hybrid applications is also not supported by Infrastructure Automation.
Users are redirected to the Administration panel when logging back into the UI
When you are working within Infrastructure Automation and log out and then log back in, you can be redirected to the Administration panel instead of the Infrastructure Automation home page or other page that you were previously using. If this occurs, you can use the Cloud Pak switcher in the upper right of the UI console to switch to the Infrastructure Automation home page and then return to the page that you were previously using.
Database fails to reset when error occurs during database creation for Infrastructure management
If you are creating the database for the Infrastructure management appliance and you encounter an error, such as the database creation failing to complete successfully, you might not be able to continue with your setup without redeploying. For instance, if the creation fails, resetting the database to clean up your database and deployment can also fail. To resolve this issue, you need to redeploy the Infrastructure management appliance image before reattempting to create the database.
The cam-tenant-api pod is not in a ready state after installing the iaconfig CR
After you install Infrastructure Automation, you can encounter an error where the cam-tenant-api
pod displays as running, but not in a ready state. When this error occurs, you can see the following message:
[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]
If this error occurs, delete the cam-tenant-api
pod to cause the pod to restart and attempt to enter a ready state.
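For example (the namespace and pod name are placeholders):
# Find and delete the cam-tenant-api pod so that its controller re-creates it.
oc get pods -n <namespace> | grep cam-tenant-api
oc delete pod <cam-tenant-api-pod-name> -n <namespace>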
Print or export as PDF entire tables does not work as expected
If you are using the Firefox browser and you select Print or export as PDF on the Compute > Containers > Projects page to print or export the entire table of data, the print or export might not work as expected. For instance, some data, such as table rows, might be missing. If you encounter this issue, try a different browser for printing or exporting the data.
Infrastructure management log display in the UI is removed
Log display support on the UI is removed for Infrastructure management. As an alternative for viewing these logs, use standard Kubernetes methods such as oc logs commands, viewing the output in Red Hat OpenShift Container Platform or Kubernetes, or setting up a log aggregator for your cluster.
You can still see the log tabs (Collect Logs, IA:IM Log, Audit Log, and Production Log) on the Settings > Application Settings Diagnostic page. However, instead of displaying the log information, the following message is displayed: Logs for this IA:IM Server are not available for viewing
.
Infrastructure Automation Test deploy fails
The Infrastructure Automation Test deploy action from a Service Overview page fails to deploy the service.
On Infrastructure management appliances, an Ansible playbook deployment fails
When you attempt to deploy an Ansible playbook on an Infrastructure management appliance through an embedded Ansible deployment, the playbook deployment can fail with the following error:
<35.237.119.31> ESTABLISH SSH CONNECTION FOR USER: ubuntu
fatal: [35.237.119.31]: FAILED! => {
"msg": "Unable to create local directories(/home/manageiq/.ansible/cp): [Errno 13] Permission denied: b'/home/manageiq'"
}
If you encounter this error, log in to the appliance as the root user and then deploy the playbook again:
-
Run the command:
mkdir -p /home/manageiq
-
Run the command:
chown manageiq:manageiq /home/manageiq
-
Deploy the Ansible playbook again.
After restoring from a backup, the Managed services deployment can fail
After you restore Managed services (cam
) from a backup, the deployment instance can fail with a socket hang up
error.
If this error occurs, restart the cam-iaas
pod by running the following command:
oc delete pod <cam-iaas-xxxx> -n <namespace>
Where <namespace>
is the project (namespace) where Infrastructure Automation is installed, and <cam-iaas-xxxx>
is the name of the cam-iaas
pod to restart.
With this restart, the service deployment can complete successfully.
UI console
- Tour icon disappears when browsing to another console page while a guided tour is running
- About page does not show the correct version of IBM Automation
- 'Data cannot be displayed': error given for 'Defined Applications' and 'Favorite Applications' tiles on the Home page
- User credential timeout starts in the backend
Tour icon disappears when browsing to another console page while a guided tour is running
When you start a guided tour and navigate away from the page to the IBM Automation Home page, the Tour icon might not display on the toolbar. This behavior can occur when an IBM Cloud Pak for Watson AIOps tour is still running. Only one guided tour can run at a time. To resolve this issue, return to an IBM Cloud Pak for Watson AIOps page, click on the Tour icon and close a tour. When you return to the IBM Automation Home page, the Tour icon reappears.
About page does not show the correct version of IBM Automation
If you are attempting to identify the version of an installed IBM Cloud Pak for Watson AIOps, the About page that is accessed from the console toolbar does not show the correct version.
To determine the correct version, use the Cloud pak switcher to access the IBM Cloud Pak | Administration tool. On this tool, find the Cloud Pak deployment summary card and click View details. The side panel opens. Expand the entry for IBM Automation to view the installed instances. The details for the installed instance shows the current deployed version of the IBM Cloud Pak.
'Data cannot be displayed': error given for 'Defined Applications' and 'Favorite Applications' tiles on the Home page
On opening the console Home page, neither 'Defined Applications' nor 'Favorite Applications' is listed in its tile. However, the applications do exist, as they can be viewed under the 'Resource Management' section. The fix for this is to restart the aiops-base-ui pods.
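A minimal sketch for restarting those pods follows; the grep pattern, pod names, and namespace are placeholders.
# Locate the aiops-base-ui pods and delete them so that they restart.
oc get pods -n <namespace> | grep aiops-base-ui
oc delete pod <aiops-base-ui-pod-name> -n <namespace>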
User credential timeout starts in the backend
The first indications of a user credential timeout might be backend failures. For example, failure to load incident, alert, or policy lists. To resolve this issue if it occurs, log out and log back in again, or wait for the frontend logout to occur.
AI Model management and training
- Log parsing assigns messages to catch-all template instead of generating expected template
- Elasticsearch record count does not match record count published to Kafka topic
- Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade
- Log anomalies are not detected by natural language log anomaly detection algorithm
- Metric anomaly detection training does not run on schedule
- In Change risk training, Precheck indicates “Good data” but models fail to create
Log parsing assigns messages to catch-all template instead of generating expected template
If you use catch-all templates for mapping uncategorized messages during AI model training, you can encounter an issue where the log parsing assigns messages for an error to the catch-all templates instead of generating an expected template for that error. If this issue occurs, you might not see expected anomalies.
If you suspect this issue is occurring and you do not see expected anomalies, complete the following steps to manually verify your training templates, and remove any catch-all templates that were incorrectly generated (a query sketch follows the steps).
- Retrieve the normalized logs from your logtrain indices.
- Identify the logs that are error logs. Review those logs to determine the template mappings.
- Retrieve the identified templates from Elasticsearch.
- Use the error log contents and the template ID from the retrieved normalized logs to identify the template string within the retrieved templates.
- If the template string consists entirely of parameters, or a single word and parameters, the template might be a catch-all template. For example, the following string is an example of a catch-all template:
<>to <><><><><> <> <> <> <>-<>-<> <> <> <> <> <> <> <> <> <> <>
- Manually delete any catch-all templates.
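The following is a minimal sketch of the kind of Elasticsearch queries involved, assuming direct access to the Elasticsearch route; the index names, field name, and document ID are placeholders that depend on your deployment:
# List the logtrain indices (index pattern is an assumption)
curl -k -u <user>:<password> "https://<elasticsearch-route>/_cat/indices/*logtrain*?v"
# Retrieve normalized logs from a logtrain index
curl -k -u <user>:<password> "https://<elasticsearch-route>/<logtrain-index>/_search?size=100&pretty"
# Delete a confirmed catch-all template document (index name and ID are assumptions)
curl -k -u <user>:<password> -X DELETE "https://<elasticsearch-route>/<templates-index>/_doc/<template-doc-id>"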
Elasticsearch record count does not match record count published to Kafka topic
When you push a large training file (for example, 60 M records, such as logs or events) to Kafka through your configured connection, the number of records that are ingested and displayed on Elasticsearch might not match the number that was published to Kafka. The Elasticsearch record count might be lower than the Kafka count due to deduplication. If you encounter this issue, split large files into smaller batches and send them individually to Kafka (for example, 5 M records each). When you are pushing a batch, ensure that you wait for the ingest to complete and the associated records to display on Elasticsearch before you push the next batch of records.
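As a sketch, a large newline-delimited file can be split into smaller batches before sending; the file name and batch size here are examples only:
split -l 5000000 -d training-logs.json training-batch-
# Send each training-batch-* file to Kafka through your configured connection,
# waiting for the ingest to complete before sending the next batch.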
Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade
If you set a value for the EXPIRY_SECONDS environment variable and upgrade, the environment variable is not retained after the upgrade.
After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly detection alerts.
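A minimal sketch for setting the variable again from the command line, assuming it is set on the deployment that runs log anomaly detection (the deployment name is a placeholder; see Configuring expiry time for log anomaly detection alerts for the documented procedure):
oc set env deployment/<log-anomaly-deployment> EXPIRY_SECONDS=<seconds> -n <cp4waiops_namespace>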
Log anomalies are not detected by natural language log anomaly detection algorithm
In some cases a model that has been trained successfully is unable to detect certain log anomalies. The quality of the model is independent of whether it trained successfully, and model quality tends to improve as more training data is available. If the model is not detecting anomalies in your logs, consider training the model again but using additional days of training data to improve the model quality.
Metric anomaly detection training does not run on schedule
If you have metric anomaly detection training scheduled to run, such as daily, you can encounter an issue where the training does not run as scheduled. If the training job does not run on schedule, log in to the IBM Cloud Pak Automation console and click the Metric anomaly detection algorithm tile and then Train models.
In Change risk training, Precheck indicates “Good data” but models fail to create
On rare occasions, a Change risk model fails to create, even though Precheck data indicates that the data is good. This failure is caused by an insufficient number of problematic change risk tickets being available to create a good model. This problem resolves itself when enough tickets become available for the model. (For more information, see Closed change ticket count requirements).
To confirm that insufficient problem tickets is causing the failure, view the Change risk logs on the Luigi pods.
To retrieve the pod:
oc get pod | grep luigi-cr
View the logs for training Change risk models.
oc logs <pod-name> # Ex: luigi-cr-1b5ef57f-9053-4037-95ca-c1e8b8748fc5
Check whether the log contains the following message:
size of the problematic (aka labels) tickets is insufficient
If confirmed, ensure that enough problematic change tickets are available before training the model again.
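As a sketch, the check can be combined into a single command, assuming a single luigi-cr pod in the current project:
oc logs $(oc get pods -o name | grep luigi-cr) | grep "size of the problematic"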
ChatOps
- In Slack, a ChatOps communication to IBM Cloud Pak for Watson AIOps times out without establishing a connection
- In a Microsoft Teams ChatOps, the attach template logs feature does not work
- Incidents cannot be reopened or restored in ChatOps
- No Recommended runbooks found in incident overview
- No incident content viewable in Microsoft Teams on mobile devices
- IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details
- Updated ChatOps connection fails
In Slack, a ChatOps communication to IBM Cloud Pak for Watson AIOps times out without establishing a connection
When sending a ChatOps communication from Slack to IBM Cloud Pak for Watson AIOps, a known intermittent issue can occur: the communication between Slack and IBM Cloud Pak for Watson AIOps can time out after 3 seconds of no response. A potential solution is to reconfigure your connection in the connection onboarding. Ensure that your Slack Bot can access your IBM Cloud Pak for Watson AIOps instance promptly. A more permanent and robust solution to this issue is being devised.
In a Microsoft Teams ChatOps, the attach template logs feature does not work
If you have a Microsoft Teams ChatOps, clicking the Attach template logs button does not work and the logs are not sent to your Microsoft Teams channel for review. As an alternative, use the Preview logs button to view the template logs.
Incidents cannot be reopened or restored in ChatOps
When an incident is closed it is archived and can no longer be modified. If a new alert occurs that is related to the archived incident, a new incident is created instead of reopening the archived incident.
No Recommended runbooks found in incident overview
If a runbook that is recommended to remediate an incident is deleted from the runbook library, the Recommended runbooks link in a ChatOps notification is not removed. This can result in the ChatOps runbook section linking to an empty runbooks page in the incident overview.
No incident content viewable in Microsoft Teams on mobile devices
On mobile devices, when viewed in Microsoft Teams, incidents can appear with no viewable data. If this happens, use a computer to view the full incident data.
IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details
In a proactive ChatOps channel, you can click the Change request ticket URL to view its details in a ServiceNow instance. In some cases, there might be no details displayed in the IBM Change Risk Assessment tab.
Updated ChatOps connection fails
When the credentials for a ChatOps connection are updated, a failure in the caching mechanism can cause the old app credentials to be used rather than new credentials.
- If this issue occurs for a Slack ChatOps connection, a channel_not_found error displays for the connection.
- If this issue occurs for a Microsoft Teams ChatOps connection, a "the bot is not part of the conversation roster" error displays for the connection.
To use the updated credentials, restart the chatops-integrator
pod by running the following command in the IBM Cloud Pak for Watson AIOps project (namespace):
For a Slack connection:
oc rollout restart deployment $(oc get deploy -l app.kubernetes.io/component=chatops-slack-integrator -o jsonpath='{.items[*].metadata.name }')
For a Microsoft Teams connection:
oc rollout restart deployment $(oc get deploy -l app.kubernetes.io/component=chatops-teams-integrator -o jsonpath='{.items[*].metadata.name }')
Once the updated pod is running again, the new credentials are used. If errors continue, check that the Slack or Microsoft Teams application is a channel member of the channel ID that was input into the ChatOps connection form.
Incidents and alerts
- "An error occurred while fetching data from the server"
- Alerts tab shows "An unknown error occurred" error when all alerts are closed
- In the metric anomaly details chart PNG and/or JPG export does not appear to work
- Closed incidents are missing details and displaying a critical error
- Metric anomaly chart unexpectedly changes from zoom view to normal view
- Unable to add metric anomaly in Related alerts to chart
- Limitation of preview text for default recommended action
- Extra anomalies appear in Alert Viewer after 3.6.1 to 3.7 update
- Alert right-click menu items cannot be added if any actions have parameters with type 'array'
- 'Error 500' message displayed under Seasonality or Temporal correlation in Alert details side panel
- Metric search page chart lines are disjointed
"An error occurred while fetching data from the server"
When viewing the Incidents UI or when creating a policy to assign runbooks to alerts, you might see the message "An error occurred while fetching data from the server". Or in the Alert Viewer, you might see "An unknown error
occurred". If you encounter these error messages, complete the following steps to delete the aiops-ir-ui-api-graphql
pod. The pod is then automatically re-created, which should resolve the error.
- Log in to your cluster by running the oc login command.
oc login -u kubeadmin -p <password>
- Delete the aiops-ir-ui-api-graphql pod.
oc delete pod -l component=aiops-ir-ui-api-graphql -n <cp4waiops_namespace>
- Wait for the pod to restart (see the sketch after this list).
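A minimal sketch for watching the replacement pod come back up, assuming the same component label:
oc get pods -l component=aiops-ir-ui-api-graphql -n <cp4waiops_namespace> -w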
Alerts tab shows "An unknown error occurred" error when all alerts are closed
If you are viewing a closed incident that has all associated alerts resolved, you can encounter an error when you view the Alerts tab. This "An unknown error occurred" error displays when there are no associated alerts. You can ignore this error message as the incident and alerts are resolved and closed.
In the metric anomaly details chart PNG and/or JPG export does not appear to work
This issue can occur intermittently, especially in the Firefox browser. The export can be slower than normal, so it might take a little longer to complete. However, if nothing happens, try again later. Alternatively, use a different browser such as Google Chrome.
Same runbook status for multiple alerts in an incident
If more than one alert meets a runbook policy's conditions, the same runbook can be assigned to multiple alerts in an incident. From the incident overview page, you can select an alert and run an associated runbook. The runbook Status of the selected alert will be updated on the UI. However, the runbook status of other alerts in the incident might be updated with the same status. This is a known issue.
Closed incidents are missing details and displaying a critical error
Incidents with status of "Closed" are missing topology information on the incident Overview tab. The associated alerts of the closed incidents are also missing from Alerts tab. The following critical error is displayed on the Topology tab for closed incidents: "No resource exists with the specified identifier and time point".
Metric anomaly chart unexpectedly changes from zoom view to normal view
This can occur when a related alert is selected and added to a chart that you are zoomed in on. The chart resets to normal view. To resolve, zoom in again after the related alert is added to the chart.
Unable to add metric anomaly in Related alerts to chart
In some cases, when you click the checkbox in the Related alerts section, the related anomaly is not added to the metric anomaly chart.
Limitation of preview text for default recommended action
When an alert is generated from any default log anomaly detection models, the preview of a recommended action might contain partial text and not reflect the full view of the recommended action. The first 4000 characters are extracted from the original resolution or action document webpage where possible, from which nonreadable text such as URLs is excluded to form the content of the preview text.
Extra anomalies appear in Alert Viewer after update from 3.6.1
After an update from IBM Cloud Pak for Watson AIOps 3.6.1, you might notice that extra anomalies appear in the Alert Viewer. These anomalies are generated by the SimpleRobustBounds algorithm. You can confirm this by clicking the suspect alert, which opens Alert details > Information. If it is a SimpleRobustBounds anomaly, the component value in the sender field contains SimpleRobustBounds, and SimpleRobustBounds appears in algorithmNames in the insight.anomaly field.
To avoid or resolve this problem, run Metric anomaly detection training as soon as possible after the 3.7 update.
Alert right-click menu items cannot be added if any actions have parameters with type 'array'
When automation actions exist that have any parameters of type 'array', the Add menu item button under menu configuration does not work because this type is not supported. The issue is limited to actions of type Ansible and PowerShell, which allow parameters of type 'array'.
To avoid or resolve this problem, identify actions with 'array' type parameters, and convert them to non-array parameters or delete them.
'Error 500' message displayed under Seasonality or Temporal correlation in Alert details side panel
On upgrading from IBM Cloud Pak for Watson AIOps 3.7.x to 4.1.0, you might encounter an 'Error 500' message in the Alert Viewer side panel where seasonality or temporal correlation details should be displayed.
To avoid or resolve this problem, delete the associated alert seasonality policy or temporal grouping policy and rerun the AI training. For more information, see Managing AI modelling.
Metric search page chart lines are disjointed
In the Metric search page chart, the normalized forecast line is disjointed from the baseline data line. This is because the forecast data is normalized independently of the baseline data. Although the lines might not match up, the values shown in the tooltips are correct.
Policies
- Condition Values field changes String:alert to Value of:alert
- Breadcrumb navigation missing in policy editor
- Condition Matches field for numeric Operator selections
- Last run time and Matched count is updated in policies other than the trigger policy
- Double scroll bars on browser window
- Policy processing failing due to long policy conditions
- Upgrading: customizations to preset policies are lost
- No ServiceNow ticket created by incident creation policy
- Datarouting pods can fail and need to be restarted
- Using default values in automatically run runbooks not working
- Netcool alert not suppressed when X in Y suppression policy conditions are met
Condition Values field changes "String:" to "Value of:"
For example, in a policy condition, if the string "alert.id" is typed in the Values field and "String:alert.id" is then selected, it is changed to "Value of:alert.id".
To prevent this, avoid a string that exactly matches the keyword. In this example, use the following condition instead:

Note: This example is not an exact match of the string alert.id. This workaround finds all summaries containing alert. and .id.
Breadcrumb navigation missing in policy editor
In some cases where a policy name is long, the breadcrumb navigation in the top left of the policy edit session can be abbreviated. Clicking the breadcrumb still returns you to the Policy UI.
Condition "Matches" field for numeric "Operator" selections
When using a numeric Operator in a policy condition set (for example, greater than, less than, greater or equal), all options can be selected under Matches. However, always select Only when you use a numeric operator.
"Last run" time and "Matched" count updated in policies other than the trigger policy
In a case where an alert meets the incident-creation conditions of multiple policies, only one incident is created. However, all policies that proposed an incident, and the system incident creation policy, have the same Last run time on the Policies hub. Each of these policies also increments its Matched count by 1 in the Details tab of the side panel.
Double scroll bars on browser window
An extra scroll bar might appear on the right side of the browser window.
Policy processing failing due to long policy conditions
Policies that have conditions that are long (100,000+ characters) can cause policy processing to fail, resulting in no alerts or incidents being created. Failure can occur when temporal correlation training generates groups containing many alerts, and those alerts have long resource or type fields.
If this problem occurs, such policies should be disabled from the automation policy UI. Identify the policies by:
- Filtering for analytics policies.
- Sorting by last run time (to identify those that have been processed recently and are likely triggering the problem).
- Viewing the specification for each to see whether any have a long condition.
Any policies that are identified by these steps should be disabled.
Upgrading: customizations to preset policies are lost
On upgrading, any renamed preset (or default) policies revert to their default policy name. Additionally, any customizations made to the following two preset incident creation policies will be lost:
- Default incident creation policy for high severity alerts
- Default incident creation policy for all alerts
No ServiceNow ticket created by incident creation policy
An alert meets the incident-creation conditions of a policy, but no ServiceNow ticket is associated to the incident.
To avoid this, complete the following steps under actions in the policy editor:
- Click Assign and notify.
- Select In an existing channel.
Datarouting pods can fail and need to be restarted
After installation, data for display in the UI, such as in the policy list, can be missing or stale.
If this issue occurs, restart the datarouting pods. You can identify these pods by using the following command:
oc get pods | grep datarouting
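A minimal sketch for restarting the pods, assuming the pod names contain datarouting and that the pods are re-created automatically after deletion:
oc delete pod $(oc get pods -n <cp4waiops_namespace> | grep datarouting | awk '{print $1}') -n <cp4waiops_namespace>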
Using default values in automatically run runbooks not working
This problem can occur when you select a Default parameter value while creating a policy to assign runbooks to alerts. The useDefault value is not passed during automatic execution but is passed during manual execution. You can run the runbook manually from the Runbooks page.
Netcool alert not suppressed when X in Y suppression policy conditions are met
If you create an X in Y suppression policy that matches an alert originating from an IBM Tivoli Netcool/OMNIbus environment, the alert will not be suppressed.
Secure Tunnel
Secure Tunnel connector is not running after a restart with Podman
This issue can occur when you install the Secure Tunnel connector to a host machine on which Podman is installed. When the host machine is rebooted and the Secure Tunnel connector is checked by using the podman ps -a
command,
the Secure Tunnel connector container does not display running
status.
If this issue occurs, the podman-restart
service must be activated by using the systemctl
command:
systemctl start podman-restart
systemctl enable podman-restart
After entering the commands, check that podman-restart worked by using the following command:
systemctl status podman-restart
If the Connector is still not running, try restarting the host machine.
Runbook Automation
When alert.suppressed
value is used, runbook does not automatically run
Normally, you can select a runbook and configure it to run automatically: when an alert is converted to an incident, the runbook is assigned and runs automatically. However, if the parameter value alert.suppressed is used, the runbook does not run automatically because the value is read as a Boolean value rather than a string value. Therefore, you must run the runbook manually.
AIOps insights
- Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events
- AIOps insights dashboard fails to load even when data is available
- Events not showing up on Noise reduction chart
Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events
In the AIOps insights dashboard, the Noise reduction chart normally indicates the number of alerts, events, and incidents reported over a specific time period. However, the inclusion of historical data that contains longstanding, unresolved alerts can skew the data that is presented on the chart: the number of incidents that are presented can outnumber the number of alerts and events. Normally, the number of incidents is less than either, because events reduce to a smaller number of alerts and alerts reduce to a smaller number of incidents.
The anomalous incident number happens because the reduction time frame covers alerts and events that are generated in the selected time period (for example, 7 Days). However, the incidents are generated from all outstanding alerts, including unresolved historical alerts that occurred before the selected time period. So, in these circumstances, while the number of alerts and events is correct, the number of incidents is not.
AIOps insights dashboard fails to load even when data is available
Large amounts of data can cause the dashboard to fail to load or time out with the message Error – Metrics unavailable
displayed for each chart. The problem is a scaling issue. The AIOps insights dashboard is
not yet developed enough to handle huge amounts of data. A possible workaround is to increase resources for insights-api
and elasticsearch
pods. However, this approach might not be successful.
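A minimal sketch of one way to raise the limits, assuming hypothetical workload names and example values, and bearing in mind that manual resource adjustments can be reverted by operator reconciliation unless they are persisted through a custom profile:
# Deployment and StatefulSet names and resource values are assumptions
oc set resources deployment/<insights-api-deployment> --limits=cpu=2,memory=4Gi -n <cp4waiops_namespace>
oc set resources statefulset/<elasticsearch-statefulset> --limits=cpu=2,memory=8Gi -n <cp4waiops_namespace>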
Events not showing up on Noise reduction chart
The charts in AIOps insights cover a timeline no greater than 30 days. The dashboard reads the firstOccurenceTime
value from only within that period. If an alert was created outside of that timeline, and deduplicated,
it is not added to the eventCount
in the AIOps Insights Noise reduction chart. In this scenario, the eventCount
for the alert increments in IBM Cloud Pak for Watson AIOps, but not in the Events segment of the Noise reduction chart.