Known issues and limitations
Review the known issues for IBM Cloud Pak® for AIOps.
For more information about common issues, see the troubleshooting documentation.
- Install and upgrade
- Backup and restore
- Access control
- Identity Management
- Observers and integrations
- IBM Cloud Pak for AIOps and IBM Netcool Operations Insight
- Applications and topologies
- Infrastructure automation
- UI console
- AI model management and training
- ChatOps
- Incidents and alerts
- Policies
- Secure Tunnel
- Runbook Automation
- AIOps Insights
- Ticketing
Install and upgrade
- Unable to reach inventory service
- Limitation on number of instances
- Manual resource adjustments are not persisted
- Services fail to connect to Cassandra
- The ibm-aiops-orchestrator pod throws an OOMKilled error
- Automatic approval required for installation
- Elasticsearch health status yellow after restoring from a backup
- ChatOps Microsoft Teams integration does not work with a proxy server
- Kafka topics cannot be listed with oc get kafkatopics
Unable to reach inventory service
If your system is installed in the same namespace as IBM Cloud Pak® for Network Automation, the connection to the inventory service breaks. This is caused by a clash with another service with the same name.
Workaround: Use the config.yaml file to enforce the correct values for INVENTORY_SERVICE_HOST and INVENTORY_SERVICE_PORT.
Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: noi-topology-sizing
  namespace: namespace
data:
  asm: |
    ui-api:
      containers:
        ui-api:
          env:
          - name: INVENTORY_SERVICE_HOST
            value: noi-topology-inventory.namespace.svc
          - name: INVENTORY_SERVICE_PORT
            value: "9178"
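After you update the file, apply it to the project (namespace) where the topology services are installed, for example:
oc apply -f config.yaml -n <namespace>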
Limitation on number of instances
IBM Cloud Pak for AIOps and Infrastructure Automation can co-exist on the same cluster, but you cannot have multiple instances of IBM Cloud Pak for AIOps or Infrastructure Automation on the same cluster.
Manual adjustments are not persisted
Custom patches, labels, and manual adjustments to IBM Cloud Pak for AIOps resources (such as increased CPU and memory values) are lost when an event such as upgrade, pod restart, resource editing, or node restart triggers a reconciliation. Reconciliation causes any manually implemented adjustments to be reverted to their original default values. Depending on the parameters that you want to adjust, you might be able to use a custom profile to persist your changes. For more information about custom profiles, see Custom profiles.
Services fail to connect to Cassandra
After you install IBM Cloud Pak for AIOps for a production environment deployment, various services might not be available due to connection issues with Cassandra. To resolve this issue if it occurs, restart Cassandra and the schema creation pods.
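For example, a minimal restart sketch, assuming that IBM Cloud Pak for AIOps is installed in the aiops namespace and that the Cassandra and schema creation pods can be identified by a cassandra name match (adjust the namespace and name patterns for your deployment):
# List the Cassandra and schema creation pods (exact names vary by deployment)
oc get pods -n aiops | grep -i cassandra
# Delete the pods so that they are re-created and the dependent services can reconnect
oc delete pod -n aiops <cassandra-pod-name> <schema-creation-pod-name>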
The ibm-aiops-orchestrator pod throws an OOMKilled error
If your environment has many secrets and ConfigMaps, when the ibm-aiops-orchestrator (lead operator) attempts to build its cache, the operator can exceed its memory allocation and cause a Kubernetes out-of-memory error for the container. This error can prevent the IBM Cloud Pak for AIOps installation from reconciling, blocking the installation from completing.
If you encounter this issue, the operator requires more memory resources to build its cache. Override the subscription resource to increase the memory limits for the pod and avoid the out-of-memory issue.
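For example, a minimal sketch of such an override that uses the OLM Subscription config field; the subscription name, namespace, and memory value shown here are assumptions that you must adapt to your deployment:
# Find the subscription for the AIOps orchestrator operator (the name varies by deployment)
oc get subscriptions -n <aiops-namespace> | grep -i orchestrator
# Patch the subscription to raise the operator pod memory limit (example value only)
oc patch subscription <orchestrator-subscription-name> -n <aiops-namespace> --type merge -p '{"spec":{"config":{"resources":{"limits":{"memory":"2Gi"}}}}}'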
Automatic approval required for installation
The use of manual approval strategies for InstallPlans in a project (namespace) can affect the IBM Cloud Pak for AIOps installation.
For instance, if you use manual approval for any of your InstallPlans to install operators in All Namespaces mode (cluster scope), the manual approval can affect your install. The installation of IBM Cloud Pak for AIOps requires automatic approval to be used.
Elasticsearch health status yellow after restoring from a backup
When you restore an Elasticsearch backup to a new single-node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.
ChatOps Microsoft Teams integration does not work with a proxy server
If you have an offline (air-gapped) deployment of IBM Cloud Pak for AIOps or an environment that uses a proxy server, then you cannot use the ChatOps Microsoft Teams integration. The use of a proxy with the ChatOps Microsoft Teams integration is not supported.
Kafka topics cannot be listed with oc get kafkatopics
Kafka topics are created by using the KafkaTopic custom resource or API. Only topics that are created by custom resource are shown when you run oc get kafkatopics. To see a full list of all the Kafka topics, complete the following steps:
- Install kcat.
- Get the waiops-mustgather.sh script. For more information, see Installing the IBM Cloud Pak for AIOps MustGather tool.
- Run waiops-mustgather.sh -V kafka-topics
Backup and restore
Backup and restore of IBM Cloud Pak for AIOps with older versions of IBM Fusion does not work with Portworx
If you are using Portworx as your storage provider, then IBM Fusion must be v2.9.0, or backup and restore will fail.
Access control
- Sharing topology URLs overrides group restrictions
- Automation Analyst role unused in IBM Cloud Pak for AIOps
- Users from a user group remain after user group is deleted
- UI redirects to unexpected page when logging in after a session timeout
- Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page
Sharing topology URLs overrides group restrictions
When users with different group profiles share topology URLs, group restrictions can be bypassed. For example, if user one can see and view a group-based topology, including its members, while user two is restricted from seeing that group, that restriction is ignored when user one shares a URL with user two.
Workaround: The administrator can create a redaction policy to ensure that details are hidden for the group members.
Automation Analyst role unused in IBM Cloud Pak for AIOps
By default, an Automation Analyst role is displayed within the IBM Cloud Pak for AIOps console Access control page when you are assigning a role to a user. This default role is used within the IBM® Automation family of offerings, which includes IBM Cloud Pak for AIOps. However, this role is not used within IBM Cloud Pak for AIOps.
This role does not include or provide any permissions within IBM Cloud Pak for AIOps and should not be assigned to users within IBM Cloud Pak for AIOps.
Users from a user group remain after user group is deleted
When you delete a user group, the users that were included in the group remain in your list of users. Any role that is inherited through the deleted user group is removed from the users. If the users were assigned roles individually, they continue to have those roles and can continue to log in to the UI console and complete tasks. If the users that were in the deleted user group need to be removed completely, an administrator needs to manually remove the users. Users can be removed by clicking the Delete icon for the user's entry within the list of users on the Access control Users tab.
UI redirects to unexpected page when logging in after a session timeout
After a session timeout occurs and a user logs in to the UI console again, the user can be redirected to a different page than the page that they were on when the timeout occurred. For instance, a user that was working on the AI Model Management training page when their session timed out might be redirected to a graphql playground page after logging back in. This redirect occurs because the UI uses the last request URL that included the expired token to identify where to redirect the user when the user logs back in. If this redirect occurs, the user needs to manually return to the expected page in the UI to continue working.
Users in a user group are not listed under the Manage assignees pane of the Incidents and alerts page
When you have users within a user group and view the Manage assignees pane of the Incidents and alerts page, some of those users might not be listed. This error can occur when the users from the LDAP user group are not individually onboarded. To verify whether a user is onboarded, go to the Access control > Users tab and check whether the user is listed. If the user is not listed, that user must first log in to the console, which validates their roles and permissions. After logging in, the user is displayed in the list of users and on the Manage assignees pane.
The Manage assignees pane is viewable from the list of all incidents. Select an incident and then click Manage assignees. After you select an existing user group, the included users are listed.
Identity Management
The IBM Cloud Pak foundational services Identity Management (IM) service is used by IBM Cloud Pak for AIOps. This service includes the following known issues and limitations:
Note: Some descriptions of the listed known issues in Identity Management are shortened.
- Login failure in Platform UI console while upgrading foundational services version 3.22 or version 3.23 to foundational services version 4.x.x.
- LDAP user names are case-sensitive.
- The OpenShift group does not synchronize when a user is added or removed from an LDAP group.
- The OpenShift users are not removed when you remove them from the LDAP group.
- You cannot onboard an OpenShift user group to Identity Management because the groups property of the user.openshift.io API is deprecated.
For more information about all the known issues and limitations that are related to Identity Management, see Known issues in foundational services.
Observers and integrations
- Kubernetes Observer job fails to restart after OOM
- ServiceNow Observer UI displays superfluous character
- 'Failed to read certificate' error
- Duplication of resources in the topology if certain observer job parameters are changed after a job has been run
- Incomplete historical data processing in the event of integration pods restarting
- File and Rest observer topology service location URL not accessible
- Integration console displays special characters incorrectly
- Turbonomic integration with IBM Cloud Pak for AIOps affects other integrations in Turbonomic
- No notification in IBM Cloud Pak for AIOps on Turbonomic actions closed without execution
- New Relic observer does not support dashboard tokens for new users
- All dates and time are in US-en format
- AppDynamics historical start date and time cannot be older than 4 hours
- AppDynamics live mode aggregation interval is 1 minute
- The observer-service pod is in a crash loop due to a ghost vertex
- Alerts for Instana without associated topologies
- Alerts for Instana topology not mapping properly
- Instana metric collection API rate limit exceeded error
- IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists
- Delay query time for integrations
- When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run
- Log Anomaly 8k limit on field mapping in details field of the alert schema
- Log integration does not start when multiple log integrations are active
- After a restore, data from integrations is not processed
- Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled
- Dynatrace integration pod restarted and does not retrieve all historical data
- ServiceNow user account gets locked out after a few hours
- Scale resources when running log anomaly training on large data
- The integration status for Elk, Custom Logs, Mezmo, and Falcon LogScale sometimes show 'not running' even though the Flink job and gRPC pod are running correctly
- Log data integrations status is "Done" even though historical data is still loading
- IBM Tivoli Netcool/Impact integration stops event processing with exceptions
- Impact integration fails for Impact server with non-default cluster name
- Connector experienced a failure due to a Bad Request for connection when data flow was enabled
- GitHub connector issues with similar tickets and adding assignees to mappings
- ServiceNow ticket contains too much text
- ServiceNow user account locked out
- ServiceNow observer status given as Unknown for an extended period
- Integrations unable to pause and resume both inbound and outbound data flow
- Dynatrace topology does not support proxy target system
- Error messages in the agent log when the Dynatrace integration is disabled
- Dynatrace topology unable to create a second instance on the same cluster
- Output to Db2 via the connector is not keeping up
- When using the Db2 integration, data may not be in sync when using an update trigger in a policy
- Db2 integration shows a status of Running regardless of whether correct login details have been specified
- Integrations stuck in Initializing state
- ServiceNow incidents are missing on historical pull
- Topology status processing of Instana events slows over time
- Instana connector not correctly generating event clears for events incoming
- New Relic observer job fails with an authorization error
Kubernetes Observer job fails to restart after OOM
Kubernetes Observer jobs with very large payloads can encounter an OOM (out-of-memory) error, after which they may fail to restart. The observer appears offline, but a health check fails to flag any errors.
Workaround: Restart the observer if it appears as offline in the UI.
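For example, a minimal restart sketch, assuming that IBM Cloud Pak for AIOps is installed in the aiops namespace and that the Kubernetes Observer pod name contains kubernetes-observer (adjust both for your deployment):
# Find the Kubernetes Observer pod (the exact name varies by deployment)
oc get pods -n aiops | grep -i kubernetes-observer
# Delete the pod so that it is re-created and the observer comes back online
oc delete pod -n aiops <kubernetes-observer-pod-name>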
ServiceNow Observer UI displays superfluous characters
If the Entity_mapping field is updated from the Observer UI, superfluous curly brackets and quotation marks are displayed in the Resource type-mapping field.
Workaround: This is a cosmetic issue and can be ignored.
'Failed to read certificate' error
This error can occur when an observer attempts to create an SSL certificate and the endpoint server does not respond.
In the example error message below, a vmvcenter.crt certificate error occurs because the endpoint server does not respond.
Failed to read certificate for field [certificate].
The file 'vmvcenter.crt' could not be found under /opt/ibm/netcool/asm/security/.
Workaround: Ensure that the endpoint server is running correctly.
Duplication of resources in the topology if certain observer job parameters are changed after a job has been run
Certain resource parameters are used to uniquely identify a resource. If one of these parameters is changed after the initial job run, then any subsequent job run will result in duplicate records. For example, if the parameter of 'hostname' is replaced with 'Ipaddress' after a topology has been created, a subsequent discovery will consider the resource as new, and create a duplicate record.
The following resource parameters uniquely identify a resource. Changing them after the initial job has been run will result in duplicate records.
Workaround: If you need to modify these values, do not modify the existing job. Instead, create a new job.
Observer | Job parameter |
---|---|
ALM | n/a |
AppDynamics | account |
AWS | region , dataTenant |
Ansible AWX | host, user |
Azure | data_center |
BigFix Inventory | data_center |
Big Cloud Fabric | proxy-hostname , proxy-username , bcf-controllers |
Ciena Blue Planet | data_center , tenant |
Cisco ACI | tenant_name |
DNS | addressTypes , server , port , recurse |
Docker | endPoint.port |
Dynatrace | datatenant , hostname |
File | provider , file |
GitLab | datatenant , hostname |
GoogleCloud | project_id |
HPNFVD | datacenter , username , cnf_job |
IBM Cloud | instance , username , region |
ITNM | instance |
Jenkins | jenkins_observation_namespace |
Juniper CSO | cso_central_ms_url , user_domain_name , domain_project_tenant_name |
Juniper Contrail | api_server_url , os_project_name , os_tenant_name |
Kubernetes | data_center , namespace |
NewRelic | accountName , accountId |
OpenStack | data_center , os_project_name |
Rancher | accessKey , clusterId |
REST | provider |
SDC ONAP | host , username |
ServiceNow | instance_url , username |
SevOne | datatenant , hostname |
TADDM | api_url , username |
Viptela | data_center |
VMware NSX | data_center |
VMware vCenter | data_center |
Zabbix | data_center |
Incomplete historical data processing in the event of integration pods restarting
If you create an integration to collect historical data for Metric Anomaly AI Training, you might come across an issue where the integration pod restarts, but does not retrieve all historical data for training. As a result, you might suffer data loss.
An integration pod can restart due to outages, the target system crashing, or pod crashes in the environment. This issue can occur intermittently, depending on the number of metrics that are selected for the integration and the amount of data to be retrieved.
File and Rest observer topology service location URL not accessible
When creating an edge via either the File or REST observer, the POST request returns a Topology service location URL that is not accessible. The URL cannot be used to manage the edge because the relevant API is not exposed.
Workaround: None.
Integration console displays special characters incorrectly
If you use special characters in the Name and Description fields while creating or editing an integration, the console Integrations page might display the special characters incorrectly. Nevertheless, the integration is saved.
Turbonomic integration affects other integrations in Turbonomic
The Turbonomic integration with IBM Cloud Pak for AIOps enables IBM Cloud Pak for AIOps to be notified, through the enabled webhook workflow, of actions that are created or executed in Turbonomic. However, Turbonomic allows only one webhook workflow per action. Therefore, other integrations that are enabled in Turbonomic, like ServiceNow, might not get any notification when actions are created or executed in Turbonomic.
No notification in IBM Cloud Pak for AIOps on Turbonomic actions closed without execution
IBM Cloud Pak for AIOps does not receive any notification from Turbonomic for actions that are closed without being executed. For example, an action related to an erroneous condition that is no longer occurring gets automatically closed in Turbonomic. But its corresponding IBM Cloud Pak for AIOps alert remains open indefinitely and must be cleared manually from the console.
New Relic observer does not support dashboard tokens for new users
For new users of the New Relic observer, the observer does not work as it no longer supports the New Relic One dashboard token. However, it will continue to work for existing users who are using the old token that was generated previously through the old dashboard.
All dates and time are in US-en format
When you are scheduling data collection for an integration, all dates and times are presented in the US-en formats:
- All dates are configured and presented in the mm/dd/yyyy format.
- All times are configured and presented in the hh:mm AM/PM 12-hour clock format.
You cannot switch the date or time format.
AppDynamics historical start date and time cannot be older than 4 hours
The historical start date and time is configurable, but if you set it to more than 4 hours in the past, the integration ignores it and retrieves only the past 4 hours of data.
AppDynamics live mode aggregation interval is 1 minute
In live mode, the only aggregation interval allowed is 1 minute.
The observer-service pod is in a crash loop due to a ghost vertex
If you notice that the topology observer-service pod is not functioning correctly and that restarting the pod does not correct the issue, a ghost vertex might need to be removed. To remove the vertex, you need to traverse an edge to the vertex, and then delete the vertex. To traverse to the vertex, use the type vertex and definesType edge.
- Run the following command to find the ID for the type vertex.
  oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/types?_filter=keyIndexName=ASM::entityType::mgmtArtifact::ASM_OBSERVER_JOB' -u <username>:<password> --insecure
  Where:
  <username> - Your Topology IBM Cloud Pak for AIOps API username
  <password> - Your Topology IBM Cloud Pak for AIOps API password
- Run the following command to use the definesType edge to get the ID for the vertex that is causing the issue.
  oc exec -it <topology pod> -- curl -X GET --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType' -u <username>:<password> --insecure
  Where:
  <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
  <username> - Your Topology IBM Cloud Pak for AIOps API username
  <password> - Your Topology IBM Cloud Pak for AIOps API password
- Run the following command to delete the vertex.
  oc exec -it <topology pod> -- curl -X DELETE --header 'Accept: application/json' --header 'X-TenantID: 01abea99-8dff-7f71-bef3-09136b6a4ff0' 'https://localhost:8080/1.0/topology/resources/<type vertex ID>/references/out/definesType?_delete=nodes&_delete_self=false&_node_label=vertex' -u <username>:<password> --insecure
  Where:
  <type vertex ID> - The ID for the type vertex that you retrieved in step 1.
  <username> - Your Topology IBM Cloud Pak for AIOps API username
  <password> - Your Topology IBM Cloud Pak for AIOps API password
Delay query time for integrations
If you set an integration to retrieve Live data for continuous AI training and anomaly detection or Live data for initial AI training, you might need to configure a delay to offset the query time window to provide a time buffer for preventing the partial retrieval of real-time data. You cannot configure this delay within the UI console. You must use a command line to configure the delay. For more information about configuring this delay, see Delay configuration in data integrations.
Alerts for Instana without associated topologies
In some cases, Instana alerts do not have associated topologies. In most cases, this happens because the resource that originated the event is no longer available in Instana. For example, a pod that triggers an Instana event can be redeployed by the underlying Kubernetes engine.
Alerts for Instana topology not mapping properly
In some cases, alerts are not mapped correctly to a corresponding Instana topology node. For example, alerts generated from log anomaly detection or metric anomaly detection (or other sources) might not show as associated with an Instana topology node.
As a workaround, define your own match rules to correlate with the source data. To define a match rule, click Resource management, then click Settings > Topology configuration, and finally click Configure on the Rules tile. When you configure the match token values to use, the values depend on the data that you are sending to Instana.
Instana metric collection API rate limit exceeded error
The recommended rate limit is double the number of resources. There might be situations where the limits need to be increased. When a more precise limit is required, use the following formulas to determine the limit to use:
number-of-metric-API-calls-per-hour ~= (number-of-selected-technologies x 2) x (snapshots-for-selected-technologies / 30) x (60 / collection-interval)
number-of-topology-API-calls-per-hour ~= (number-of-application-perspectives x (60 / collection-interval)) +(number-of-services x (60 / collection-interval))
number-of-events-API-calls-per-hour = 60
total= number-of-metric-API-calls-per-hour + number-of-topology-API-calls-per-hour + number-of-events-API-calls-per-hour
Note: Each plugin can have a different number of metrics collected. The mean value across these is used, which is 2 collection cycles per plugin. If the environment is unbalanced, for instance you have mostly hosts that define most metrics, then the formula might underestimate the required limit.
To determine the number of resources (snapshots) for each infrastructure plugin API, use the following:
api/infrastructure-monitoring/snapshots?plugin=technology_name
Example:
api/infrastructure-monitoring/snapshots?plugin=host
For more information about Instana APIs, see Instana API.
The following example curl commands allow you to retrieve the number of:
- snapshots-for-selected-technologies (such as host)
  curl -k -s --request GET 'https://<instana server hostname>/api/infrastructure-monitoring/snapshots?plugin=host' --header 'Authorization: apiToken <api token>' | jq '.items|length'
- number-of-application-perspectives
  curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/applications' --header 'Authorization: apiToken <api token>' | jq '.items|length'
- number-of-services
  curl -k -s --request GET 'https://<instana server hostname>/api/application-monitoring/services' --header 'Authorization: apiToken <api token>' | jq '.items|length'
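For example, with hypothetical counts of 3 selected technologies, 300 snapshots across those technologies, 20 application perspectives, 50 services, and a 5-minute collection interval, the formulas give:
number-of-metric-API-calls-per-hour ~= (3 x 2) x (300 / 30) x (60 / 5) = 720
number-of-topology-API-calls-per-hour ~= (20 x (60 / 5)) + (50 x (60 / 5)) = 840
number-of-events-API-calls-per-hour = 60
total = 720 + 840 + 60 = 1620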
IBM Cloud Pak for AIOps cannot close some alerts when an Instana integration exists
If you have an Instana integration created and have Instana 221 (SaaS or self-hosted), you might encounter an issue where IBM Cloud Pak for AIOps cannot close some alerts. Instead, you might need to check the status for the associated event within the Instana dashboard and clear the alert manually. For more information, see Troubleshooting integrations: Instana Event integration.
When creating a Kubernetes observer job that is named 'weave_scope', 'load', 'kubeconfig', or 'local', the job fails to run
If you create a Kubernetes observer job with the name weave_scope, load, kubeconfig, or local, the job always fails to run. When this error occurs, you can view an error icon in the schedule column for the job. To avoid this issue, do not use these names for the observer job.
Log Anomaly 8k limit on field mapping in details field of the alert schema
The datalayer imposes an 8 KB size limit on the details field in the alert schema. The details field is populated by the log anomaly event, which provides the relevant information to display in Slack when viewing alerts in ChatOps. Whenever the details field size exceeds 8 KB, the returned JSON object is truncated. Therefore, when the user clicks view alerts to retrieve the alerts related to an incident, the expected results are not seen and an error is recorded.
The current fields under the details objects are:
end_timestamp: int
original_group_id: str
causality: dict
detected_at: float
source_application_id: str
log_anomaly_confidence: float
log_anomaly_model: List[str]
prediction_error: dict
error_templates: List[int]
count_vector: List[int]
text_dict: dict
application_group_id: str
application_id: str
model_version: str
severity_from_model: int
description: str
Log integration does not start when multiple log integrations are active
If you have many active log integrations, such as multiple Kafka, ELK, Splunk, and Falcon LogScale integrations, and you create and enable another Falcon LogScale integration, you might notice that the integration status is stuck in an error or restarting state. This state can occur even after the integration was operating as expected.
This issue can occur if you exceed the limit for the number of jobs that can run on the underlying service, which results in insufficient resources being available to start the integration. To resolve this issue, complete one or more of the following tasks:
- Increase the replica count of your task managers.
- Increase the task manager count per replica.
- Change the parallelism of your integrations.
- Cancel other integrations
After a restore, data from integrations is not processed
If you have an integration that you are restoring, the status for the integration can be in error after the restore process completes. To resolve this status, edit and save your integrations on the Integrations page in the IBM Cloud Pak for AIOps console. Editing the integration regenerates the associated Flink job, which updates the status.
Historical data from ServiceNow instance gets collected only when the historical data flow is reenabled
If you enable historical data flow for a ServiceNow integration, you might notice that the historical data is not collected from ServiceNow. For instance, when you check the grpc-snow pod, you can see ticket data available, but when you check the Flink job or Elasticsearch, you notice that no data was collected. If this issue occurs, turning the historical data flow off and back on can cause the data to begin to be collected.
Dynatrace integration pod restarted and does not retrieve all historical data
If you have a Dynatrace integration created and pull historical data with multiple metrics for Metric Anomaly AI Training, you can encounter an issue where the Dynatrace pod restarts, but does not complete retrieving the expected historical data for training. This issue can occur intermittently, depending on the number of metrics that are selected for the integration and the amount of data to be retrieved.
If this potential out-of-memory or out-of-resources issue occurs, consider creating separate integrations to monitor different and smaller sets of metrics. By splitting the integrations, you can reduce the amount of data to be retrieved through the initial integration that can cause this issue.
ServiceNow user account gets locked out after a few hours
If there is an active ServiceNow integration with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script called "SNC User Lockout Check", which locks users out after more than 5 failed attempts (including any failed API calls).
If you check the Incidents and alerts page, you will also see an alert that says "ServiceNow instance authentication failed".
When this problem occurs, unlock the user in ServiceNow. Then change the password in the ServiceNow integration and save. When authentication fails in the ServiceNow integration, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.
Scale resources when running log anomaly training on large data
In some cases, log anomaly training fails on large data because of an out-of-memory (OOM) condition or a problem with Elasticsearch shards. The solution is to scale up the resources to handle large data training.
For more information about shard management, see About indices and shards. For more information about increasing ES Resources, see Log anomaly training pods CPU and Memory resource management.
The integration status for Elk, Custom Logs, Mezmo, and Falcon LogScale sometimes shows 'not running' even though the Flink job and gRPC pod are running correctly
On creating an integration, the Flink job retrieves data normally and the gRPC pod is running without error. However, the console shows that the integration status is 'not running'.
Log data integrations status is "Done" even though historical data is still loading
When a log data integration (Falcon LogScale, ELK, Mezmo, Custom, Splunk) is running in Historical data for initial AI training mode, and a custom regex is added in the field_mapping section, the data processing can take a long time. Although the Data collection status might be shown on the UI as Done, data could still be being processed and written to Elasticsearch in the background.
To speed up this process, you can increase the Base parallelism number that is associated with that integration. For more information, see Increasing data streaming capacity.
IBM Tivoli Netcool/Impact integration stops event processing with exceptions
If you have an IBM Tivoli Netcool/Impact integration, you can encounter an issue where the integration temporarily stops processing during the sending of an event stream to IBM Cloud Pak for AIOps.
This issue can occur when you have an IBM Cloud Pak for AIOps policy that triggers an IBM Tivoli Netcool/Impact policy when certain types of events are received. If this issue occurs and stops the event processing, the Impact integration logs or Impact policylogger logs can include messages that are similar to the following example exceptions:
[6/14/23, 11:38:45:816 UTC] 0000005d ConnectorMana W failed to send status update
...
[6/14/23, 11:38:45:815 UTC] 000023ca StandardConne W configuration stream terminated with an error
...
[6/14/23, 11:38:45:816 UTC] 000023cc GRPCCloudEven W consume stream terminated with an error: channel=cp4waiops-cartridge.lifecycle.output.connector-requests
If you encounter this issue, you might need to restart the impact-connector pod to begin the processing of the event stream again.
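For example, a minimal restart sketch, assuming that IBM Cloud Pak for AIOps is installed in the aiops namespace and that the pod name contains impact-connector (adjust both for your deployment):
# Find the impact-connector pod (the exact name varies by deployment)
oc get pods -n aiops | grep -i impact-connector
# Delete the pod so that it is re-created and event processing resumes
oc delete pod -n aiops <impact-connector-pod-name>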
IBM Tivoli Netcool/Impact integration fails for IBM Tivoli Netcool/Impact server with non-default cluster name
If the IBM Tivoli Netcool/Impact cluster uses a non-default cluster name (the default is "NCICLUSTER"), validation of the integration can fail. The IBM Tivoli Netcool/Impact server may report DynamicBindingException errors in the impactgui.log:
com.micromuse.common.nameserver.DynamicBindingException: DynamicBindingException: Service [NCICLUSTER] not in nameserver.
To resolve the issue, wait for the backend IBM Tivoli Netcool/Impact server to finish initializing before starting or restarting the Impact GUI server.
If IBM Tivoli Netcool/Impact is running fix pack 7.1.0.26 or later, you can also resolve the issue by setting the nameserver.defaultcluster property in the GUI server. Add the following line to $IMPACT_HOME/etc/nameserver.props:
impact.nameserver.defaultcluster=CLUSTERNAME
where CLUSTERNAME is the name of the IBM Tivoli Netcool/Impact cluster.
IBM Cloud Pak for AIOps and IBM Netcool Operations Insight
- IBM Cloud Pak for AIOps Strimzi Kafka topics created without replication
- IBM Cloud Pak for AIOps pods not starting after a cluster restart
- NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator
- Rare issue: unable to deploy log anomaly detection model
IBM Cloud Pak for AIOps Strimzi Kafka topics created without replication
IBM Cloud Pak for AIOps supports multiple replications of Kafka topics for large production installations, such as for data redundancy. The IBM Cloud Pak for AIOps console can automatically create Kafka topics when integrations are created.
When a topic is dynamically created by the IBM Cloud Pak for AIOps console, the replication is set to 1 in the controller. As such, Kafka topics that are created during installation can have multiple replicas, but topics that are created dynamically do not.
If you are implementing a production (large) deployment of IBM Cloud Pak for AIOps, you might lose data if your Kafka pods fail or restart. If the data flow is enabled in your Kafka integration when the Kafka pods go down, you might experience a gap in the data that your integration generated during that down period. Upgrades or updates to workers can cause a Kafka broker restart.
You can manually modify the Kafka topic replication inside the Kafka container from a value of 1 to 3 to mitigate any potential data loss from this issue.
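A minimal sketch of one way to do this with the standard Kafka partition-reassignment tool, run inside a Kafka broker container; the pod name, topic name, broker IDs, bootstrap address, and tool paths are assumptions that you must adapt to your cluster:
# Open a shell in a Kafka broker pod (the name varies by deployment)
oc exec -it <kafka-broker-pod> -- bash
# Confirm the current replication factor of the dynamically created topic
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic-name>
# Create a reassignment plan that lists three broker IDs as replicas for each partition
cat > /tmp/increase-replication.json <<EOF
{"version":1,"partitions":[{"topic":"<topic-name>","partition":0,"replicas":[0,1,2]}]}
EOF
# Apply the plan to raise the replication factor to 3
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/increase-replication.json --execute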
IBM Cloud Pak for AIOps pods not starting after a cluster restart
When the cluster is restarted, all nodes have a STATUS of Ready. Most pods return to a STATUS of Running except for some IBM Cloud Pak for AIOps pods.
One potential cause is that Elasticsearch must be up and running before the IBM Cloud Pak for AIOps pods can start.
Restart the Elasticsearch pod to get all pods back to a STATUS of Running.
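For example, a minimal sketch, assuming the default aiops namespace and the Elasticsearch pod name that is used elsewhere in this topic (adjust both for your deployment):
# Delete the Elasticsearch pod so that it is re-created by its controller
oc delete pod aiops-ibm-elasticsearch-es-server-all-0 -n aiops
# Watch the IBM Cloud Pak for AIOps pods return to a STATUS of Running
oc get pods -n aiops -w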
NOIHybrid incorrectly listed in provided APIs for IBM Netcool Operations Insight operator
NOIHybrid is incorrectly included in the Provided APIs list for the IBM Netcool Operations Insight operator. This list is displayed in the Red Hat OpenShift Container Platform web console under Installed Operators > Netcool Operations Insight > Operator Details. Do not use NOIHybrid APIs.
Rare issue: unable to deploy log anomaly detection model
Very occasionally, after log anomaly detection training completes successfully, an error similar to the following is displayed on the AI model management UI training page when you attempt to deploy the model.
Error
Model deployment failed
Within the error textbox, you will also see the text "Forbidden".
If you investigate the aiops-ai-model-ui pod logs, you will also see the following error.
ForbiddenError: invalid csrf token
If this occurs, first refresh the browser and try to deploy again.
If that does not remedy the situation, then log out and log back in, and then try to deploy the model again.
Connector experienced a failure due to a Bad Request for connection when data flow was enabled
When data flow is enabled for the Splunk connector, the connection fails with a Bad Request for connection error.
Workaround: Check that the models are deployed correctly. Then deploy the models in the AI hub UI.
GitHub connector issues with similar tickets and adding assignees to mappings
The following known issues have been observed with the GitHub connector:
- In the Incident Overview > Add tickets to this incident panel, GitHub issues are missing the Updated by information. Additionally, searching for GitHub similar past resolution tickets from the Source drop-down menu in this panel will not display any tickets.
- The GitHub connector might be missing from the list of integrations in training modules such as Similar tickets and Change risk.
- The default issue mappings in a GitHub integration do not have assignees. However, if assignees are added to the mappings, issues are not created in GitHub.
ServiceNow ticket contains too much text
If the ServiceNow change request, incident, or problem contains a large amount of text, such as work notes with close to 150,000 characters, the ticket is dropped and a warning is logged in the pod log. Dropping the ticket means that change risk is not calculated for that ticket and that the ticket is not used for similar ticket detection.
ServiceNow user account locked out
If there is an active ServiceNow integration with data collection enabled and the ServiceNow credentials change, the ServiceNow user account can get locked out. ServiceNow has an automatic login locking script that is called "SNC User Lockout Check", which locks users out after more than five failed attempts (including any failed API calls).
If you check the Incidents and alerts page, you also see an alert that says "ServiceNow instance authentication failed".
When this problem occurs, unlock the user in ServiceNow. Then, change the password in the ServiceNow integration and save. When authentication fails in the ServiceNow integration, there is a 1-minute wait time before you can access it, to prevent a lockout from occurring quickly.
ServiceNow observer status given as Unknown for an extended period
There is currently a known issue whereby the status of the ServiceNow observer in the UI may be shown as Unknown for an extended period (up to 15 minutes) following a successful initialization before finally changing to Running.
This issue is intermittent. It does not happen every time the ServiceNow observer is started, and it does not affect data collection.
Integrations unable to pause and resume both inbound and outbound data flow
Integrations with bi-directional data flow do not completely pause the flow of data both inbound to IBM Cloud Pak for AIOps and outbound from IBM Cloud Pak for AIOps when the data flow is disabled in the UI.
Currently, when data flow is disabled, only the inbound data flow from the event source to IBM Cloud Pak for AIOps is disabled. Outbound data (in the form of actions) is still pushed from IBM Cloud Pak for AIOps to the event source. This can cause some alerts to be out of sync between the two systems.
Workaround: You can resynchronize the alerts between Netcool and IBM Cloud Pak for AIOps by using the following steps:
- Disable the existing Netcool connector dataflow in the UI. This is to ensure that the connector releases any file locks before deletion.
- Get the existing Netcool connection ID.
- Delete the Netcool connection from the UI.
- Clear the AIOpsAlertId and AIOpsState columns in Netcool alerts.status table.
- Close AIOps Alerts using the connectionId as a filter.
- Install a new Netcool connection.
Getting the existing Netcool connection ID
- Get the connector name.
  oc get connectorconfiguration -l aiops.connector.type=netcool-connector
- Set the connectionname variable with the connection name and the namespace variable with the AIOps namespace.
  connectionname="netcool"
  namespace="aiops"
  oc project $namespace
  connconfig=$(oc get connectorconfiguration --no-headers | grep "$connectionname" | awk '{print $1}')
  connconfiguid=$(oc get connectorconfiguration $connconfig -o jsonpath='{.metadata.uid}')
- Make a note of the connection ID value, which will be used in the subsequent steps. Make sure the variable is not empty before proceeding to the next step.
  echo $connconfiguid
Clearing the AIOpsAlertId and AIOpsState columns in the Netcool alerts.status table
- Log in to the ObjectServers using the NCO_SQL ($OMNIHOME/bin/nco_sql) utility and execute the following commands:
  -- Clear AIOps columns
  update alerts.status set AIOpsAlertId='',AIOpsState='';
  go
- Exit the NCO_SQL utility.
Closing AIOps alerts using the connectionId as a filter
- Find one of the ir-core-ncodl-std pods.
  irCorePod=$(oc get pods --no-headers | grep 'ir-core-ncodl-std' | awk '{print $1}' | head -n 1)
- Get into the pod terminal.
  oc exec -it $irCorePod -- /bin/bash
- Update the following command and replace <connectionId> with the value from the previous command. This command will list all alerts matching the filter.
  curl --insecure -X GET -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27<connectionId>%27"
  Example command with a4e3c212-5f84-4be3-989b-d3f293f0183e as the sender.connectionId filter.
  curl --insecure -X GET -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27a4e3c212-5f84-4be3-989b-d3f293f0183e%27"
- Execute the following command to close alerts matching the sender.connectionId filter.
  curl --insecure -X PATCH -H "Content-Type:application/json" -H "Accept:application/json" \
    --user "${API_AUTHSCHEME_STATICKEY_USERNAME}:$(cat ${API_AUTHSCHEME_STATICKEY_PASSWORD_FILE})" \
    "https://localhost:10011/irdatalayer.aiops.io/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/alerts?filter=sender.connectionId%3D%27a4e3c212-5f84-4be3-989b-d3f293f0183e%27" -d '{"state": "closed"}'
- Alerts from the old connection should now be closed.
Dynatrace topology does not support proxy target system
The Dynatrace topology feature in the Dynatrace metrics, events, and topology integration does not currently support a proxy target system.
Error messages in the agent log when the Dynatrace integration is disabled
There is currently a known issue whereby you may see the following error messages if you disable the Dynatrace integration:
java.lang.ArrayIndexOutOfBoundsException
Skipping concurrent rule based metrics collection
Dynatrace topology unable to create a second instance on the same cluster
Currently Dynatrace topology only allows you to create a single local deployment in a cluster. If you attempt to create a second Dynatrace topology instance using another name, the attempt will fail and no new pods will be created in the cluster.
Output to Db2 via the connector is not keeping up
There is currently a known issue with the IBM Db2 integration whereby, even at a very low alert rate, the output to the Db2 instance using the connector is not able to keep up. If you are expecting an alert rate of around 700/s, you will see the data in your Db2 instance appear with a delay of seconds.
Workaround: Increase the resource management and HPA on both the IntegrationRuntime and Db2 connector pods.
When using the Db2 integration, data may not be in sync when using an update trigger in a policy
There is currently a known issue with the IBM Db2 integration whereby the data in the INCIDENTS_REPORTER_STATUS table may not be in sync with AIOps updates if you are using an update trigger in the policy.
When there is just a create trigger in the policy, the data is valid. But if you are using an update trigger, the updates are happening asynchronously, so the latest update can be overridden with the previous update if they are happening at the same time.
There is no workaround currently available.
Db2 integration shows a status of Running regardless of whether correct login details have been specified
There is currently a known issue with the IBM Db2 integration whereby the Data collection status of the Db2 integration always displays a green tick to indicate that it is running even if incorrect login details have been specified when creating or updating the IBM Db2 integration.
There is no workaround currently available. If the Db2 connector does not return any table or the Db2 database appears not to be receiving incident or alert data, check the log files to see whether you have specified the Db2 login details correctly, and update the Username and Password fields in the IBM Db2 integration accordingly.
Integrations stuck in Initializing state
There is currently a known issue whereby integrations occasionally appear to be stuck in the Initializing state.
This issue will resolve itself after a delay.
ServiceNow incidents are missing on historical pull
If you do a historical pull of incidents on GitHub or Jira first and then ServiceNow after, you might notice that the ServiceNow incidents are missing.
The problem is due to changes in the ServiceNow incident schema. To resolve this issue, delete the snow incident index, which all the ticket systems use. Use the following steps to fix the issue with the schema:
- Open a terminal window.
- Run the following 4 commands to enable the port forwarding:
  export EL_SECRET_NAME=`oc get AIOpsEdge aiopsedge -o jsonpath='{.spec.elasticsearchSecret}'`
  export EL_USER=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.username|base64decode}}"`
  export EL_PWD=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.password|base64decode}}"`
  oc port-forward aiops-ibm-elasticsearch-es-server-all-0 9200:9200
- Open another terminal window and run the following 4 commands to delete the snow incident index:
  export EL_SECRET_NAME=`oc get AIOpsEdge aiopsedge -o jsonpath='{.spec.elasticsearchSecret}'`
  export EL_USER=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.username|base64decode}}"`
  export EL_PWD=`oc get secret $EL_SECRET_NAME -o go-template --template="{{.data.password|base64decode}}"`
  curl -X DELETE --user $EL_USER:$EL_PWD https://localhost:9200/snowincident/ -k
- Re-run the ServiceNow historical pull to populate the index again. Your incidents now show up and can be used for similar incident training. If ServiceNow is used first, then historical pulls to GitHub and Jira do work.
Topology status processing of Instana events slows over time
There is currently a known issue whereby the Topology Status pod may run more slowly over time due to the way Instana events are being processed by IBM Cloud Pak for AIOps, until ultimately being unable to keep up with incoming alerts. As a consequence, the Instana Event Collector should not currently be used and a webhook should be used for the collection of events from Instana instead.
For details about creating a webhook integration, see Creating Generic Webhook integrations.
For details of how to configure a webhook as an alternative way of collecting Instana events, see the Utilise AIOps Generic Webhook with JSONata mapping to ingest Instana alerts blog.
Instana connector not correctly generating event clears for events incoming
The deduplicationKey that is generated by the Instana connector is dependent on the resourceName field. There is a possibility that, for a given Instana event, different values for resourceName can be generated based on the timing of the topology call. This can result in duplicate alerts as well as orphaned alerts, as the closes do not match the initial alert that was created.
Workaround: The Instana Event Collector should not currently be used and a webhook should be used for the collection of events from Instana instead.
For details about creating a webhook integration, see Creating Generic Webhook integrations.
For details of how to configure a webhook as an alternative way of collecting Instana events, see the Utilise AIOps Generic Webhook with JSONata mapping to ingest Instana alerts blog.
New Relic observer job fails with an authorization error
You might notice that the New Relic observer job fails due to an authorization error. The issue occurs due to changes in the New Relic API.
The error can resemble the following example:
ERROR [2025-02-06 16:02:51,209] [pool-12-thread-1] c.i.i.t.o.n.j.NewRelicLoadJob - Failed to validate connection: javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized
There is no workaround currently available.
Applications and topologies
- Composite resources with differing geolocation markers are plotted separately in the Resource map
- Resources with a _compositeId value have 'Related Services' and 'Related resource groups' tabs disabled
- JVM heap out-of-memory (OOM) failures when loading large number of resources
- Critical error message displayed when attempting to render an application
- Different date and time in Cloud Pak for AIOps console and ChatOps between users
- Deleting a tag template can cause out-of-memory errors
- The Find Path tool ignores filters
- Fault localization and blast radius are not producing accurate results
- High volumes of data can cause Spark workers to run out of memory
- Probable cause is not producing accurate results
- Azure observer missing subnet relationship in topology
- Topologies not visible on Incident Topology page after resource merge
- Openstack observer missing edge-runsOn connectivity in topology
- Topology viewer user interface crashes after the update manager is displayed
Composite resources with differing geolocation markers are plotted separately in the Resource map
On rare occasions a composite resource may contain more than one geolocation marker. All of these will be plotted on the Resource map. If one of these locations falls outside the displayed map area, its status is not displayed.
Workaround: None. Be aware of this quirk when viewing composite resources on the Resource map.
Resources with a _compositeId value have 'Related Services' and 'Related resource groups' tabs disabled
Group and service information cannot be fetched for resources that are part of a composite.
This defect is only encountered when viewing the related service or related resource group details of a resource that is part of a composite. This resource will have the 'Related services' and 'Related resource groups' tabs disabled.
Workaround: none
JVM heap out-of-memory (OOM) failures when loading large number of resources
When running topology loads in quick succession, it is possible to experience some OOM errors and undesired topology pod restarts, even though the pods will continue the processing after restarting.
This error can occur when running resource loads of several millions in a large deployment, and could slow down the loading process. The following type of error message can be seen in the pod logs:
WARN [2022-10-25 15:43:31,906] [JanusGraph Session-io-4] c.d.o.d.i.c.c.CqlRequestHandler - Query ‘[4 values] SELECT column1,value FROM janusgraph.graphindex WHERE key=:key AND column1>=:slicestart AND column1<:sliceend LIMIT :maxrows [key=0x02168910cfd95b7e3bc74006a4a8a73a79c71255a0726573...<truncated>, slicestart=0x00, sliceend=0xff, maxrows=2147483647]’ generated server side warning(s): Read 5000 live rows and 1711 tombstone cells for query SELECT value FROM janusgraph.graphindex WHERE key = 02168910cfd95b7e3bc74006a4a8a73a79c71255a07265736f757263e5 AND column1 > 003924701180012871590290012871500290 AND column1 < ff LIMIT 5000; token 9157578746393928897 (see tombstone_warn_threshold) JVMDUMP039I Processing dump event “systhrow”, detail “java/lang/OutOfMemoryError” at 2022/10/25 15:43:32 - please wait. JVMDUMP032I JVM requested System dump using ‘/tmp/cassandra-certs/core.20221025.154332.1.0001.dmp’ in response to an event
Cause: Not enough headroom exists between JVM memory limit and the pod memory limit, usually because one was increased without also increasing the other.
Workaround: Ensure that any changes in heap size maintain enough headroom between these settings.
Example: In this example (for a topology size1) the pod limits are set to 3.6 GB while the maximum memory for the JVM (-Xmx) is set to 3 GB, thereby leaving 0.6 GB of headroom free for use by the OS.
size1:
  enableHPA: false
  replicas: 2
  jvmArgs: "-Dcom.ibm.jsse2.overrideDefaultTLS=true -Xms1G -Xmx3G"
  resources:
    requests:
      memory: "1200Mi"
      cpu: "2.0"
    limits:
      memory: "3600Mi"
      cpu: "3.0"
Critical error message displayed when attempting to render an application
This problem occurs when all of the groups, within the application you are attempting to render, have no members. When the application is selected in Application management, it does not render and a critical error message is displayed on the UI.
Avoid creating applications with no members. If an application with no members was created for test purposes only, then ignore this error.
Different date and time in Cloud Pak for AIOps console and ChatOps between users
The date and time format for an Incident in the IBM Cloud Pak for AIOps console Application management tool and the associated ChatOps notification can be different between users. The format and time zone that is used in the Cloud Pak for AIOps console and ChatOps notification is set to the user's locale. If different users are in different time zones, the displayed date and time are different in the Cloud Pak for AIOps console and ChatOps notification.
Deleting a tag template can cause out-of-memory errors
If a tag is applied to a large number (that is, thousands) of topology resources, then deleting the tag template can cause out-of-memory errors with the topology-merge pod.
Avoid creating tag templates that use tags that occur with such frequency. Also, do not tag thousands of resources with the same tag, and avoid using such tags in a group.
The Find Path tool ignores filters
The topology path tool fails to launch with filters applied.
Launch the path tool without filters, then manually apply the filter settings on the path page.
Probable cause is not producing accurate results
The correlation algorithms for probable cause currently require the use of a Kubernetes model with service-to-service relationships, or the use of dependency relationships between non-Kubernetes resources.
Complete the following steps to create the required relationships for Kubernetes. This procedure configures Topology Manager to overlay relationships provided by the File observer onto the Kubernetes topology.
Note: The Kubernetes observer must be configured and loading data.
- Log in to the IBM Cloud Pak for AIOps console.
- From the main navigation, expand Operate and click Topology viewer.
- From the topology navigation toolbar, expand Settings and click Topology configuration.
- On the Rules tile, click Configure to navigate to the Rules administration page.
- On the Merge tab, click New to create a New merge rule.
  In this scenario, data that is provided by the File observer will be used to add the relationships.
- Specify the following information on the New merge rule page:
  - Rule name: k8-file-service.
  - Set the rule Status to Enabled.
  - Add the uniqueId property to the set of Tokens.
  - Expand the Conditions section, select File and Kubernetes from the set of available Observers, and click Add.
  - Specify service for Resource types and click Add.
  - Click Save to save the new Merge rule.
- Locate the services that you want to relate together and make a note of their source-specific uniqueId, such as 05f337a1-5783-43bb-9323-dfba941455c7 (shipping) and ae076382-3df9-46cb-97e9-a0342d219efb (web).
- Create a file for the File Observer that contains the service-dependsOn-service relationships necessary for the correlation algorithms to work.
  The following example creates two services, web and shipping, and states that web dependsOn shipping. Repeat this as required to relate your services together.
  V:{"uniqueId": "05f337a1-5783-43bb-9323-dfba941455c7", "name": "shipping", "entityTypes": ["service"]}
  V:{"uniqueId": "ae076382-3df9-46cb-97e9-a0342d219efb", "name": "web", "entityTypes": ["service"]}
  E:{"_fromUniqueId":"ae076382-3df9-46cb-97e9-a0342d219efb", "_edgeType":"dependsOn", "_toUniqueId":"05f337a1-5783-43bb-9323-dfba941455c7"}
- Load this file into Topology Manager to relate the services. For more information, see Configuring File Observer jobs.
  If your topology changes, then re-create and reload the file as required. A similar process can be followed for non-Kubernetes sources.
High volumes of data can cause Spark workers to run out of memory
If your environment handles high volumes (10+ million) of alerts or events, your Spark workers can run out of ephemeral storage. If you encounter this issue, restart the affected Spark workers. This issue can also occur if you are running multiple jobs, which can cause the file system to fill, for example with log or JAR files.
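The following commands are a minimal sketch for restarting the affected workers; the pod name pattern and namespace are placeholders and can differ in your deployment. Deleted pods are typically re-created automatically by their controller, which clears the accumulated ephemeral storage.
oc get pods -n <cp4aiops_namespace> | grep spark
oc delete pod <spark-worker-pod> -n <cp4aiops_namespace>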
Azure observer missing subnet relationship in topology
For the Azure Observer, a subnet can be intermittently missing the relationship with an IP address in the topology for a resource. While the relationship can be intermittently missing, both the subnet and IP address vertices remain available in the topology.
Topologies not visible on Incident Topology page after resource merge
When resources from two observer sources have been merged using the topology merge functionality, the topology is no longer displayed in the Incident view. This known issue affects only the Incident view, and the topology is still present in all other views.
OpenStack observer missing edge-runsOn connectivity in topology
After running an OpenStack observer job, the expected edge-runsOn connectivity between ComputeHost and Hypervisor elements is not shown in Resource Management > Resources.
Topology viewer user interface crashes after the update manager is displayed
You cannot use the update manager feature in the Topology viewer. To work around this issue, change your Topology viewer user preferences to auto render changes on refresh, which prevents the update manager from appearing.
Infrastructure Automation
- Kubernetes permissions are missing for user roles for using Managed services and the Service catalog
- Non-LDAP users cannot access Infrastructure Management
- Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported
- Users are redirected to the Administration panel when logging back into the UI
- Managed services secrets and certificates cannot be customized during installation
- Database fails to reset when error occurs during database creation for Infrastructure Management
- The cam-tenant-api pod is not in a ready state after installing the iaconfig CR
- Print or export as PDF entire tables does not work as expected
- Infrastructure Management log display in the UI is removed
- Infrastructure Automation Test deploy fails
- On Infrastructure Management appliances, an Ansible playbook deployment fails
- After restoring from a backup, the Managed services deployment fails
- Infrastructure Management fails to save container provider after changing to a new token
- Infrastructure Automation install fails on FIPS enabled Power cluster
- Embedded Terraform feature not supported with FIPS enabled OpenShift Container Platform cluster
- Auto-generated Service Dialog contains the wrong values for variables of type Integer and Boolean
Kubernetes permissions are missing for user roles for using Managed services and the Service catalog
If you install Infrastructure Automation, you, or an administrator, must add the required Kubernetes permissions to user roles before your users can begin to access and use Managed services or the Service catalog.
As an administrator, add the following permissions to your user roles:
Role | Required permission for Infrastructure Automation |
---|---|
Automation Administrator | Administer Kubernetes resources |
Automation Operator | Manage Kubernetes resources |
Automation Developer | Edit Kubernetes resources |
Automation Analyst | View Kubernetes resources |
For more information about how to add permissions to a role, see Managing roles for Infrastructure Automation.
Non-LDAP users cannot access Infrastructure Management
Non-LDAP authenticated users cannot be used with Single Sign-On for Infrastructure Management. This is a limitation. If you are logged in to the Infrastructure Automation UI console with a non-LDAP user and attempt to start Infrastructure Management, it fails with the following error:
OpenID Connect Provider error: Error in handling response type.
Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core are not supported
Installation of Infrastructure Management in IBM Cloud Pak for AIOps does not support Red Hat Advanced Cluster Management and IBM Cloud Pak for Multicloud Management core. You can continue to use the Kubernetes cluster life-cycle templates and services to create a Kubernetes cluster and import the cluster to an existing installation of Red Hat Advanced Cluster Management, if an installation is available. Deploying hybrid applications is also not supported by Infrastructure Automation.
Users are redirected to the Administration panel when logging back into the UI
When you are working within Infrastructure Automation and log out and then log back in, you can be redirected to the Administration panel instead of the Infrastructure Automation home page or other page that you were previously using. If this occurs, you can use the Cloud Pak switcher in the upper right of the UI console to switch to the Infrastructure Automation home page and then return to the page that you were previously using.
Database fails to reset when error occurs during database creation for Infrastructure Management
If you are creating the database for the Infrastructure Management appliance and you encounter an error, such as the database creation failing to complete successfully, you might not be able to continue with your setup without redeploying. For instance, if the creation fails, resetting the database to clean up your database and deployment can also fail. To resolve this issue, you need to redeploy the Infrastructure Management appliance image before reattempting to create the database.
The cam-tenant-api
pod is not in a ready state after installing the iaconfig CR
After you install Infrastructure Automation, you can encounter an error where the cam-tenant-api
pod displays as running, but not in a ready state. When this error occurs, you can see the following message:
[ERROR] init-platform-security - >>>>>>>>>> Failed to configure Platform Security. Will retry in 60 seconds <<<<<<<<<<<<< OperationalError: [object Object]
If this error occurs, delete the cam-tenant-api
pod to cause the pod to restart and attempt to enter a ready state.
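For example (the pod name suffix and namespace are placeholders), locate and delete the pod; it is then re-created and attempts to enter a ready state:
oc get pods -n <namespace> | grep cam-tenant-api
oc delete pod <cam-tenant-api-xxxx> -n <namespace>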
Print or export as PDF entire tables does not work as expected
If you are using the Firefox browser and you select Print or export as PDF on the Compute > Containers > Projects page to print or export the entire table of data, the print or export might not work as expected. For instance, some data, such as table rows, might be missing. If you encounter this issue, try a different browser for printing or exporting the data.
Infrastructure Management log display in the UI is removed
Log display support in the UI is removed for Infrastructure Management. As an alternative for viewing these logs, use standard Kubernetes methods, such as oc logs commands, viewing the output in Red Hat OpenShift Container Platform or Kubernetes, or setting up a log aggregator for your cluster.
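For example (the pod and namespace names are placeholders), you can list the Infrastructure Management pods and then view the logs of a specific pod:
oc get pods -n <infrastructure-management-namespace>
oc logs <infrastructure-management-pod> -n <infrastructure-management-namespace>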
You can still see the log tabs (Collect Logs, IA:IM Log, Audit Log, and Production Log) on the Settings > Application Settings Diagnostic page. However, instead of displaying the log information, the following message is displayed: Logs for this IA:IM Server are not available for viewing.
Infrastructure Automation Test deploy fails
In Infrastructure Automation, a Test deploy that is started from a Service Overview page fails.
On Infrastructure Management appliances, an Ansible playbook deployment fails
When you attempt to deploy an Ansible playbook on an Infrastructure Management appliance through an embedded Ansible deployment, the playbook deployment can fail with the following error:
<35.237.119.31> ESTABLISH SSH CONNECTION FOR USER: ubuntu
fatal: [35.237.119.31]: FAILED! => {
"msg": "Unable to create local directories(/home/manageiq/.ansible/cp): [Errno 13] Permission denied: b'/home/manageiq'"
}
If you encounter this error, log in to the appliance as the root user and then deploy the playbook again:
1. Run the following command:
   mkdir -p /home/manageiq
2. Run the following command:
   chown manageiq:manageiq /home/manageiq
3. Deploy the Ansible playbook again.
After restoring Managed services from a backup, the Managed services deployment fails
After you restore Managed services (cam
) from a backup, the deployment instance fails with a socket hang up
error.
If this error occurs, restart the cam-iaas
pod by running the following command:
oc delete pod <cam-iaas-xxxx> -n <namespace>
Where <namespace>
is the project (namespace) where Infrastructure Automation is installed, and <cam-iaas-xxxx>
is the name of the cam-iaas
pod to restart.
With this restart, the service deployment can complete successfully.
Infrastructure Management fails to save container provider after changing to a new token
Infrastructure Management fails to save changes to a container provider after updating the access token when the Metrics collection is enabled.
Workaround: After updating and validating the new token in the Edit provider dialog box, switch to the Metrics tab and validate the existing endpoint. The Save button is now enabled.
Infrastructure Automation install fails on FIPS enabled Power cluster
There is an intermittent issue with the installation of standalone Infrastructure Automation on a FIPS enabled Linux on Power (ppc64le) cluster. This occurs when AllNamespace or OwnNamespace mode is used. A problem with the events operator pod causes the installation to fail.
Embedded Terraform feature not supported with FIPS enabled OpenShift Container Platform cluster
The Embedded Terraform feature in Infrastructure Automation is not yet supported in a FIPS enabled OpenShift Container Platform cluster.
Auto-generated Service Dialog contains the wrong values for variables of type Integer and Boolean
You might notice that the auto-generated Service Dialog contains the wrong types for Integer and Boolean variables.
The workaround for this issue is to manually edit the auto-generated Service Dialog. For a variable of type Integer, set the Value type field to Integer and the Validation field to Yes. For a variable of type Boolean, delete the field and replace it with a checkbox.
UI console
- Tour icon disappears when browsing to another console page while a guided tour is running
- 'Data cannot be displayed': error given for 'Defined Applications' and 'Favorite Applications' tiles on the Home page
- User credential timeout starts in the backend
- Unable to access Identity Providers link in console
- Slow loading pages
- Alert Viewer page shows 400 Error
Tour icon disappears when browsing to another console page while a guided tour is running
When you start a guided tour and navigate away from the page to the IBM Automation Home page, the Tour icon might not display on the toolbar. This behavior can occur when an IBM Cloud Pak for AIOps tour is still running. Only one guided tour can run at a time. To resolve this issue, return to an IBM Cloud Pak for AIOps page, click on the Tour icon and close a tour. When you return to the IBM Automation Home page, the Tour icon reappears.
'Data cannot be displayed': error given for 'Defined Applications' and 'Favorite Applications' tiles on the Home page
On opening the console home page, neither 'Defined Applications' nor 'Favorite Applications' is listed in its tile, even though the applications exist and can be viewed under the 'Resource Management' section. To fix this issue, restart the aiops-base-ui pods, as shown in the following example.
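The following commands are a sketch only; the namespace is a placeholder, and the deleted pods are re-created automatically:
oc get pods -n <cp4aiops_namespace> | grep aiops-base-ui
oc delete pod <aiops-base-ui-pod> -n <cp4aiops_namespace>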
User credential timeout starts in the backend
The first indications of a user credential timeout might be backend failures. For example, failure to load incident, alert, or policy lists. To resolve this issue if it occurs, log out and log back in again, or wait for the frontend logout to occur.
Unable to access Identity Providers link in console
You might notice that you are unable to access the Identity Providers link located in the UI console. You might see an error such as HTTP ERROR 431. If this issue occurs, configure the LDAP connection. For more information, see Identity Management (IM).
Slow loading pages
If you are experiencing slow loading of Cloud Pak for AIOps pages, it might be because the server TLS certificate is not trusted by your browser. Some browsers (for example, Microsoft Edge and Google Chrome) prevent caching of resources when the server’s certificate is untrusted. This means all static resources associated with the page are fully reloaded on every refresh, significantly slowing down page loads.
To resolve the issue, use a certificate that is signed by a certificate authority that your client devices trust. This can be a certificate signed by a well-known certificate authority, or by an internal certificate authority that is preconfigured on your client devices. For more information, see Using a custom certificate.
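To check which certificate authority signed the certificate that your browsers currently see, you can inspect the console route with openssl; the host name is a placeholder:
openssl s_client -connect <cp4aiops-console-route>:443 -servername <cp4aiops-console-route> -showcerts </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer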
Alert Viewer page shows 400 Error
You might notice that you are unable to access the Alert Viewer page. You might see a 400
error. This error is caused when the cookies for the IBM Cloud Pak for AIOps domain in your browser get too large for the data layer service
to consume.
To resolve this issue, clear the cookies in your browser and then reload the page.
AI Model management and training
- Log parsing assigns messages to catch-all template instead of generating expected template
- Elasticsearch record count does not match record count published to Kafka topic
- Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade
- Log anomalies are not detected by natural language log anomaly detection algorithm
- Metric anomaly detection training does not run on schedule
- In Change risk training, Precheck indicates “Good data” but models fail to create
- Alerts for the Log Anomaly - Golden Signals algorithm are not generated when inference log data contains name and value pairs
- Counts against template patterns are not updated in the training UI
- Similar tickets training in IBM Cloud Pak for AIOps on Linux
Log parsing assigns messages to catch-all template instead of generating expected template
If you use catch-all templates for mapping uncategorized messages during AI model training, you can encounter an issue where the log parsing assigns messages for an error to the catch-all templates instead of generating an expected template for that error. If this issue occurs, you might not see expected anomalies.
If you suspect this issue is occurring and you do not see expected anomalies, complete the following steps to manually verify your training templates and remove any catch-all templates that were incorrectly generated.
1. Retrieve the normalized logs from your logtrain indices (see the example queries after these steps).
2. Identify the logs that are error logs. Review those logs to determine the template mappings.
3. Retrieve the identified templates from Elasticsearch.
4. Use the error log contents and the template ID from the retrieved normalized logs to identify the template string within the retrieved templates.
5. If the template string consists entirely of parameters, or a single word and parameters, the template might be a catch-all template. For example, the following string is an example of a catch-all template:
   <>to <><><><><> <> <> <> <>-<>-<> <> <> <> <> <> <> <> <> <> <>
6. Manually delete any catch-all templates.
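The following queries are a sketch of steps 1 and 3 only, assuming command-line access to the Elasticsearch route; the route host, credentials, and index names are placeholders that vary by deployment:
curl -sk -u <es-user>:<es-password> "https://<elasticsearch-route>/_cat/indices?v" | grep -Ei 'logtrain|template'
curl -sk -u <es-user>:<es-password> "https://<elasticsearch-route>/<logtrain-index>/_search?size=100&pretty"
curl -sk -u <es-user>:<es-password> "https://<elasticsearch-route>/<template-index>/_search?size=100&pretty"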
Elasticsearch record count does not match record count published to Kafka topic
When you push a large training file (for example, 60 million records, such as logs or events) to Kafka through your configured integration, the number of records that are ingested and displayed in Elasticsearch might not match the number pushed to Kafka. The Elasticsearch record count can be lower than the Kafka count because of deduplication. If you encounter this issue, split large files into smaller batches and send them to Kafka individually (for example, 5 million records each). When you push a batch, wait for the ingest to complete and the associated records to display in Elasticsearch before you push the next batch of records.
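As an illustration only (the file name and batch size are hypothetical), a large newline-delimited training file can be split into smaller batches before each batch is sent through your integration:
split -l 5000000 -d training-data.json batch_
This creates files such as batch_00 and batch_01; send one batch at a time and wait for its records to appear in Elasticsearch before sending the next.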
Log anomaly detection EXPIRY_SECONDS environment variable not retained after an upgrade
If you set a value for the EXPIRY_SECONDS environment variable and upgrade, the environment variable is not retained after the upgrade.
After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly detection alerts.
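As a sketch only (the deployment name, namespace, and value are placeholders; see Configuring expiry time for log anomaly detection alerts for the exact resource to update), the variable can be reapplied with oc set env:
oc set env deployment/<log-anomaly-deployment> EXPIRY_SECONDS=<seconds> -n <cp4aiops_namespace>
oc set env deployment/<log-anomaly-deployment> --list -n <cp4aiops_namespace> | grep EXPIRY_SECONDS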
Log anomalies are not detected by natural language log anomaly detection algorithm
In some cases a model that has been trained successfully is unable to detect certain log anomalies. The quality of the model is independent of whether it trained successfully, and model quality tends to improve as more training data is available. If the model is not detecting anomalies in your logs, consider training the model again but using additional days of training data to improve the model quality.
Metric anomaly detection training does not run on schedule
If you have metric anomaly detection training scheduled to run, such as daily, you can encounter an issue where the training does not run as scheduled. If the training job does not run on schedule, log in to the IBM Cloud Pak for AIOps console and click the Metric anomaly detection algorithm tile and then Train models.
In Change risk training, Precheck indicates “Good data” but models fail to create
On rare occasions, a Change risk model fails to create, even though Precheck data indicates that the data is good. This failure is caused by an insufficient number of problematic change risk tickets being available to create a good model. This problem resolves itself when enough tickets become available for the model. (For more information, see Closed change ticket count requirements).
To confirm that an insufficient number of problematic tickets is causing the failure, view the Change risk logs on the training pods.
To retrieve the pod:
oc get pod | grep training-cr
View the logs for training Change risk models.
oc logs <pod-name> # Ex: training-cr-1b5ef57f-9053-4037-95ca-c1e8b8748fc5
Check whether the log contains the following message:
size of the problematic (aka labels) tickets is insufficient
If confirmed, ensure that enough problematic change tickets are available before training the model again.
Alerts for the Log Anomaly - Golden Signals algorithm are not generated when inference log data contains name and value pairs
In IBM Cloud Pak for AIOps 4.8.1, alerts might not be generated for the Log Anomaly - Golden Signals algorithm when inference log data contains name and value pairs. These pairs are tokens with the key=value
pattern. They might
prevent anomalies from being matched to their respective templates.
For example, training generates the following log template:
<> exe="/usr/bin/dbus-daemon" sauid=UNKNOWN_VAR hostname=? addr=? terminal=?'
Then, during inference, incoming log data is matched against the template from training. If the incoming log data contains tokens with the key=value
pattern like in the following example, the logs are classified as unmatched.
[3557470.922719] exe=\"/usr/bin/dbus-daemon\" sauid=103 hostname=? addr=? terminal=?'
[3557971.107893] exe=\"/usr/bin/dbus-daemon\" sauid=103 hostname=? addr=? terminal=?'
Alerts from this set of logs are not displayed.
Counts against template patterns are not updated in the training UI
When the log anomaly detection - golden signals algorithm generates alerts in IBM Cloud Pak for AIOps 4.8.1, the counts against those template patterns are not updated in the training UI table. Enable historic alert storing in Elasticsearch to access alert counts.
Workaround: Enable the components by editing the installation. Use the following commands to access the installation through the command line. Replace <installation-name>
with the name of the installation:
oc get installation
oc edit installation <installation-name>
Or access the installation with the Red Hat® OpenShift® console. Go to Operators > Installed Operators > IBM Cloud Pak for AIOps > IBM Cloud Pak for AIOps and edit
the YAML file for aiops-installation
.
After you access the installation, edit it to include the following values:
spec:
  automationFoundation: {}
  license:
    accept: true
  pakModules:
    - config:
        - name: ir-core-operator # Find the config item with this name, or add this item if it does not exist
          spec:
            issueresolutioncore:
              customSizing:
                deployments:
                  - name: datarouting
                    replicas: 1 # Use 3 for large deployment
                  - name: esarchiving
                    replicas: 1
      enabled: true
      name: applicationManager # Find the pakModules item with this name, or add this item if it does not exist
After the components are enabled, new alerts are stored to Elasticsearch indices, and the alert counts for subsequent alerts are updated correctly in the training UI table.
Similar tickets training in IBM Cloud Pak for AIOps on Linux
Similar tickets training is not available in IBM Cloud Pak for AIOps on Linux. You can manually add tickets from your ticketing integrations in the incident overview.
ChatOps
- In Slack, a ChatOps communication to IBM Cloud Pak for AIOps times out without establishing a connection
- In a Microsoft Teams ChatOps, the attach template logs feature does not work
- Incidents cannot be reopened or restored in ChatOps
- No Recommended runbooks found in incident overview
- No incident content viewable in Microsoft Teams on mobile devices
- During a ChatOps secure tunnel creation an 'installation failed' message displays
In Slack, a ChatOps communication to IBM Cloud Pak for AIOps times out without establishing a connection
When sending a ChatOps communication from Slack to IBM Cloud Pak for AIOps, a known intermittent issue can occur: the communication between Slack and IBM Cloud Pak for AIOps can time out after 3 seconds of no response. A potential solution is to reconfigure your connection in the connection onboarding, and ensure that your Slack Bot can reach your IBM Cloud Pak for AIOps instance promptly. A more permanent and robust solution to this issue is being devised.
In a Microsoft Teams ChatOps, the attach template logs feature does not work
If you have a Microsoft Teams ChatOps, clicking the Attach template logs button does not work and the logs are not sent to your Microsoft Teams channel for review. As an alternative, use the Preview logs button to view the template logs.
Incidents cannot be reopened or restored in ChatOps
When an incident is closed it is archived and can no longer be modified. If a new alert occurs that is related to the archived incident, a new incident is created instead of reopening the archived incident.
No Recommended runbooks found in incident overview
If a runbook that is recommended to remediate an incident is deleted from the runbook library, the Recommended runbooks link in a ChatOps notification is not removed. This can result in the ChatOps runbook section linking out to an empty runbooks page in the incident overview.
No incident content viewable in Microsoft Teams on mobile devices
On mobile devices, when viewed in Microsoft Teams, incidents can appear with no viewable data. Where this happens, switch to using a computer to see the full incident data.
During a ChatOps secure tunnel creation an 'installation failed' message displays
When you create a ChatOps integration and it fails, wait for a few minutes to see whether the installation retries. If it does not, create the integration again.
Incidents and alerts
- Active incident count is wrong on the Resource management page
- "An error occurred while fetching data from the server"
- Alerts tab shows "An unknown error occurred" error when all alerts are closed
- Closed incidents are missing details and displaying a critical error
- Metric anomaly chart unexpectedly changes from zoom view to normal view
- Unable to add metric anomaly in Related alerts to chart
- Limitation of preview text for default recommended action
- Metric search page chart lines are disjointed
- Some alerts not cleared even with a resolution event
- Alert views are unusable without ID and SUMMARY columns
- Filter on short ID in the incident table doesn't work
Active incident count is wrong on the Resource management page
The Resource management page displays a number of incidents in the Active incidents column instead of displaying one incident with a number of alerts. The Application viewer displays the correct information, however.
This error can occur when related groups or applications are erroneously linked to the incident count.
Workaround: Verify the correct number of incidents and alerts on the Application viewer.
"An error occurred while fetching data from the server"
When viewing the Incidents UI or when creating a policy to assign runbooks to alerts, you might see the message "An error occurred while fetching data from the server". Or in the Alert Viewer, you might see "An unknown error
occurred". If you encounter these error messages, complete the following steps to delete the aiops-ir-ui-api-graphql
pod. The pod is then automatically re-created, which should resolve the error.
1. Log in to your cluster by running the oc login command.
   oc login -u kubeadmin -p <password>
2. Delete the aiops-ir-ui-api-graphql pod.
   oc delete pod -l component=aiops-ir-ui-api-graphql -n <cp4aiops_namespace>
3. Wait for the pod to restart.
Alerts tab shows "An unknown error occurred" error when all alerts are closed
If you are viewing a closed incident that has all associated alerts resolved, you can encounter an error when you view the Alerts tab. This "An unknown error occurred" error displays when there are no associated alerts. You can ignore this error message as the incident and alerts are resolved and closed.
Same runbook status for multiple alerts in an incident
If more than one alert meets a runbook policy's conditions, the same runbook can be assigned to multiple alerts in an incident. From the incident overview page, you can select an alert and run an associated runbook. The runbook Status of the selected alert will be updated on the UI. However, the runbook status of other alerts in the incident might be updated with the same status. This is a known issue.
Closed incidents are missing details and displaying a critical error
Incidents with status of "Closed" are missing topology information on the incident Overview tab. The associated alerts of the closed incidents are also missing from Alerts tab. The following critical error is displayed on the Topology tab for closed incidents: "No resource exists with the specified identifier and time point".
Metric anomaly chart unexpectedly changes from zoom view to normal view
This issue can occur when a related alert is selected and added to a chart that you are zoomed in on. The chart then resets to the normal view. To resolve the issue, zoom in again after the related alert is added to the chart.
Unable to add metric anomaly in Related alerts to chart
In some cases, when you click the checkbox in the Related alerts, it does not add the related anomaly to the metric anomaly chart.
Limitation of preview text for default recommended action
When an alert is generated from any of the default log anomaly detection models, the preview of a recommended action might contain only partial text and not reflect the full recommended action. Where possible, the first 4000 characters are extracted from the original resolution or action document webpage, and nonreadable text such as URLs is excluded to form the preview text.
Metric search page chart lines are disjointed
In the Metric search page chart, the normalized forecast line is disjointed from the baseline data line. This is because the forecast data is normalized independently from the baseline data. Although the lines might not match up, the values shown in the tooltips are correct.
Some alerts not cleared even with a resolution event
In scenarios where large amounts of historical event data are ingested into the system, it's possible that problems and resolutions can be processed out of order, sometimes resulting in alerts not clearing as expected. To avoid this issue, try ingesting smaller batches of event data into the system.
Alert views are unusable without ID and SUMMARY columns
When creating views in the Alert Viewer, you must include the ID and SUMMARY columns in the view. Otherwise, the view will be unusable and can only be deleted by using the API.
Filter on short ID in the incident table doesn't work
You cannot use the short ID from the incident table as a filter condition under Other properties. Instead, use the full incident "id", which can be found in the raw data format in the incident details side panel.
Policies
- Condition Values field changes "String:" to "Value of:"
- Breadcrumb navigation missing in policy editor
- Condition Matches field for numeric Operator selections
- Last run time and Matched count is updated in policies other than the trigger policy
- Double scroll bars on browser window
- Policy processing failing due to long policy conditions
- No ServiceNow ticket created by incident creation policy
- Datarouting pods can fail and need to be restarted
- Using default values in automatically run runbooks not working
- Netcool alert not suppressed when X in Y suppression policy conditions are met
- Cloud Pak for AIOps Impact policy fails to run when 'No input' is selected
- Policy triggered based on 'incident-updated' runs more times than expected
- Parameter mapping issues in IBM Tivoli Netcool/Impact policies
- Can't use alert.details.name with "any of" in a policy condition
- Policy is triggered against alerts that don't match the condition
- WebSphere resolution action recommendation policy showing a failed status
Condition Values field changes "String:" to "Value of:"
For example, in a policy condition, if the string "alert.id" is typed in the Values field, and then "String:alert.id" is selected, it is changed to "Value of:alert.id".
To prevent this, avoid a string that exactly matches the keyword. In this example, use the following condition instead:

Note: This example is not an exact match of the string alert.id. This workaround finds all summaries containing alert. and .id.
Breadcrumb navigation missing in policy editor
In some cases where a policy name is long, the breadcrumb navigation in the top left of the policy edit session can be abbreviated. Clicking the breadcrumb still returns you to the Policy UI.
Condition "Matches" field for numeric "Operator" selections
When using a numeric Operator in a policy condition set (for example, greater than, less than, greater or equal), all options can be selected under Matches. However, when you use a numeric operator, always select Only.
"Last run" time and "Matched" count updated in policies other than the trigger policy
In a case where an alert meets the incident-creation conditions of multiple policies, only one incident is created. However, all policies that proposed an incident, as well as the system incident creation policy, have the same Last run time on the Policies hub. Each of these policies also increments its Matched count by 1 in the Details tab of the side panel.
Double scroll bars on browser window
An extra scroll bar might appear on the right side of the browser window.
Policy processing failing due to long policy conditions
Policies that have conditions that are long (100,000+ characters) can cause policy processing to fail, resulting in no alerts or incidents being created. Failure can occur when temporal correlation training generates groups containing many alerts, and those alerts have long resource or type fields.
If this problem occurs, disable such policies from the automation policy UI. Identify the policies by:
- Filtering for analytics policies.
- Sorting by last run time to identify the policies that were processed recently and are likely triggering the problem.
- Viewing the specification for each policy to see whether any have a long condition.
Disable any policies that these steps identify.
No ServiceNow ticket created by incident creation policy
An alert meets the incident-creation conditions of a policy, but no ServiceNow ticket is associated with the incident.
To avoid this, complete the following steps under actions in the policy editor:
- Click Assign and notify.
- Select In an existing channel.
Datarouting pods can fail and need to be restarted
After installation, data for display in the UI, such as in the policy list, can be missing or stale.
If this issue occurs, restart the datarouting pods. You can identify these pods by using the following command:
oc get pods | grep datarouting
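Then delete each datarouting pod that the command returns; the pod name and namespace are placeholders, and the deleted pods are re-created automatically:
oc delete pod <datarouting-pod> -n <cp4aiops_namespace>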
Using default values in automatically run runbooks not working
This problem can occur by selecting a Default parameter value when creating a policy to assign runbooks to alerts. The useDefault
value is not passed during automatic execution but is passed during manual execution.
You can execute the runbook manually from the Runbooks page.
Netcool alert not suppressed when X in Y suppression policy conditions are met
If you create an X in Y suppression policy that matches an alert originating from an IBM Tivoli Netcool/OMNIbus environment, the alert will not be suppressed.
Cloud Pak for AIOps Impact policy fails to run when 'No input' is selected
When enabling a Cloud Pak for AIOps Impact policy to Invoke an IBM Tivoli Netcool/Impact policy, the policy fails to run when No input is selected under the policy parameter mapping options. To avoid this, select one of the other mapping options. For example, select Send the full alert.
Policy triggered based on 'incident-updated' runs more times than expected
If you have a Cloud Pak for AIOps policy that is triggered based on incident-updated, in some cases, the policy might run more times than expected. This is because not all updates can always be processed atomically and the incident might then be updated multiple times. In turn, the incident-updated trigger will be activated more than once.
Parameter mapping issues in Netcool/Impact policies
The following known issues have been observed when creating policies to invoke IBM Tivoli Netcool/Impact:
- The policy cannot be created with a parameter mapping option of No Input selected. The policy is instead saved with a Send the full alert mapping.
- A Netcool/Impact policy cannot be edited to change the parameter mapping option. The change is not persisted. To work around this problem, create a new Netcool/Impact policy with the new parameter mapping option and disable the current policy.
- When a Netcool/Impact policy is created with a trigger entity of Incident and a parameter mapping option of Send the full incident, the mapping changes to the Customize option if the policy is edited.
Can't use alert.details.name with "any of" in a policy condition
When alert.details is selected in the property field or the value field, the Details name field is an optional input where you can minimize the scope to a singular key within the alert's details. However, Details name cannot be used in a policy condition with the Matches option of "any of".
Policy is triggered against alerts that don't match the condition
This issue can occur when you compare a NULL value in the policy condition. The policy is triggered because a condition of NULL = NULL can match when the parameters that are referenced in the policy are not present.
To avoid this problem, you can ensure that the alert property is not equal to NULL in the condition set. For example, see the following policy conditions for alert.details.name:

WebSphere resolution action recommendation policy showing a failed status
The preset policy "WebSphere resolution action recommendation policy" might initially show a failed status in the policy table. If you encounter this issue, the policy status should self-correct after a period of time.
Secure Tunnel
Secure Tunnel connector is not running after a restart with Podman
This issue can occur when you install the Secure Tunnel connector to a host machine on which Podman is installed. When the host machine is rebooted and the Secure Tunnel connector is checked by using the podman ps -a
command,
the Secure Tunnel connector container does not display running
status.
If this issue occurs, the podman-restart
service must be activated by using the systemctl
command:
systemctl start podman-restart
systemctl enable podman-restart
After entering the commands, check that podman-restart worked by using the following command:
systemctl status podman-restart
If the Connector is still not running, try restarting the host machine.
Runbook Automation
When alert.suppressed
value is used, runbook does not automatically run
Normally, you can select a runbook and configure it to run automatically: when an alert is converted to an incident, the runbook is assigned and runs automatically. However, if the parameter value alert.suppressed
is used, the
runbook does not run automatically as it reads this as a Boolean value rather than a string value. Therefore, it is necessary to manually run the runbook.
AIOps Insights
- Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events
- AIOps Insights dashboard fails to load even when data is available
- Events not showing up on Noise reduction chart
Number of Incidents reported in Noise reduction chart is inconsistent with number of Alerts and Events
In the AIOps Insights dashboard, the Noise reduction chart normally indicates the number of alerts, events, and incidents reported over a specific time period. However, the inclusion of historical data containing longstanding, unresolved alerts can skew the data that is presented on the chart: the number of incidents that are presented can outnumber the number of alerts and events. Normally, the incident count is less than either, because events reduce to a smaller number of alerts and alerts reduce to a smaller number of incidents.
The anomalous incident number happens because the reduction time frame covers alerts and events that are generated in the selected time period (for example, 7 Days). However, the incidents are generated from all outstanding alerts, including historical alerts that are not resolved: alerts that occurred before the selected time period. So, in these circumstances, while the number of alerts and events is correct, the number of incidents is not.
AIOps Insights dashboard fails to load even when data is available
Large amounts of data can cause the dashboard to fail to load or time out with the message Error – Metrics unavailable
displayed for each chart. The problem is a scaling issue. The AIOps Insights dashboard is
not yet developed enough to handle huge amounts of data. A possible workaround is to increase resources for insights-api
and elasticsearch
pods. However, this approach might not be successful.
Events not showing up on Noise reduction chart
The charts in AIOps Insights cover a timeline no greater than 30 days. The dashboard reads the firstOccurenceTime
value from only within that period. If an alert was created outside of that timeline, and deduplicated, it is not
added to the eventCount
in the AIOps Insights Noise reduction chart. In this scenario, the eventCount
for the alert increments in IBM Cloud Pak for AIOps, but not in the Events segment of the Noise reduction
chart.
Ticketing
- IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details
- Select incident state transitions are not permitted in ServiceNow
- Data synchronization in ServiceNow is not working consistently
IBM Change Risk Assessment tab in ServiceNow not displaying change risk assessment details
In a proactive ChatOps channel, you can click the Change request ticket URL to view its details in a ServiceNow instance. However, in some cases in version 4.1.1 of the ServiceNow App, details might not be displayed in the IBM Change Risk Assessment tab.
To avoid this issue, update the ServiceNow App to version 4.2.1 or higher.
Select incident state transitions are not permitted in ServiceNow
Cloud Pak for AIOps enforces certain state transitions for its incidents. To keep incident data synchronized with ServiceNow, the following state transitions are not permitted and must also be restricted in ServiceNow:
- New -> On Hold
- On Hold -> New
- On Hold -> Resolved
- On Hold -> Closed
- On Hold -> Cancelled
An incident must be set to In Progress before it can go On Hold. An On Hold incident can only transition to In Progress.
Data synchronization in ServiceNow is not working consistently
If you have a ServiceNow integration, you can encounter an issue where updates to records in ServiceNow are not displaying for those incidents and alerts within Cloud Pak for AIOps.
To address this issue, run the following command from the namespace where Cloud Pak for AIOps is installed:
oc set env deployment/$(oc get deploy -l app.kubernetes.io/component=chatops-orchestrator -o jsonpath='{.items[*].metadata.name }') GUNICORN_TOTAL_WORKERS=1
If this issue occurs, review the following tasks to ensure your integration and processes are set up correctly: