Troubleshooting runtime issues

You might encounter an issue during normal operation of WebSphere Automation, such as a failed health investigation. Learn how to fix the most common runtime issues.

The following issues might cause a health investigation to fail. When an investigation fails, download the archive file for the investigation by using the WebSphere Automation UI or the REST API. Open the archive and examine the analysis.log file for errors.

FIPS-enabled server fails to register with WebSphere Automation after installing a JDK fix pack with a SecureRandom SHA2DRBG for provider not available error message

Installing a Java SDK runtime fix pack on a registered WebSphere Application Server or WebSphere Application Server Liberty server that is configured to be compliant with Federal Information Processing Standards (FIPS) might fail with the following error.

Stack Dump = java.lang.RuntimeException: SecureRandom SHA2DRBG for provider <provider_name> not available

To resolve this issue, configure the registered server according to the instructions in the following articles.

No contact with server, or contact lost with registered server

WebSphere Automation uses the usage metering feature of WebSphere Application Server and WebSphere Application Server Liberty to register servers and maintain regular contact with them. If WebSphere Automation is unable to contact a server, or if contact with a registered server is lost for more than six hours, WebSphere Automation displays visual indicators in the UI to make you aware of the situation. If the loss of contact is not due to a known and acceptable reason, try the following steps to diagnose and resolve the issue.

Ensure usage metering is correctly configured on the target server.
For more information, see Setting up security monitoring.
Restart the server.
If the server was removed from WebSphere Automation, or if the usage metering feature was disabled, enable the feature and restart the target server. This action leads to the server being registered with WebSphere Automation again. For more information, see Unregistering servers.
Check network connectivity.
If the usage metering feature is properly configured but contact is not established, check the logs on the target server for indications of network connectivity issues that might block communication between the server and WebSphere Automation.

Investigation not created for an Instana alert

This problem can be caused by the host not having any registered servers. If the investigation manager finds no registered servers for the host, the following error message is written in the investigation-manager log file:

Investigation cannot be started because no assets are registered with the example.com host.

Out-of-memory error for the memory analysis runner job

(In version 1.3 or later) java.lang.OutOfMemoryError: Java heap space
(In version 1.2) JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError"

By default, the memory request and memory limit of the memory analysis runner job are set to 4 GB. These settings are sufficient for the runner job to analyze most heap dumps. If you see this error message, the analyzer did not have enough memory to analyze the heap dump. You can allocate more memory to the memoryAnalysisRunner setting in WebSphereHealth custom resources. For more information, see WebSphereHealth custom resource. Alternatively, you can edit the WebSphereHealth instance with the following command:

oc edit WebSphereHealth <instance-name> -n <namespace>

The default instance name is wsa-health. The default namespace is wasautomation.

Note: The memory defaults to 4Gi with a limit of 4Gi. You can increase the memory to a larger value, such as 20Gi, as in the following example. Set the memory request and memory limit to the same value. The Java VM uses the amount of memory that is specified by the limit to calculate the maximum heap size, but Kubernetes guarantees only that the process can obtain the amount of memory that is specified by the request.

spec:
  analysisManager:
    Image: …
    memoryAnalysisRunner:
      resources:
        limits:
          cpu: '1'
          memory: 20Gi
        requests:
          cpu: 500m
          memory: 20Gi

Note: When you allocate more resources to memoryAnalysisRunner, make sure that the worker nodes can handle the requests.
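As a non-interactive alternative to oc edit, the same change can be sketched as a JSON merge patch. The helper name build_memory_patch is hypothetical. The field path is taken from the preceding example, and the instance name and namespace in the usage comment are the defaults that are named in this topic, so adjust them for your installation.

```shell
# Hypothetical helper: prints a JSON merge patch that sets the memory request
# and the memory limit of memoryAnalysisRunner to the same value.
build_memory_patch() {
  size="$1"   # for example, 20Gi
  printf '{"spec":{"analysisManager":{"memoryAnalysisRunner":{"resources":{"limits":{"memory":"%s"},"requests":{"memory":"%s"}}}}}}\n' "$size" "$size"
}

# Usage (requires a logged-in oc session; names are the documented defaults):
#   oc patch WebSphereHealth wsa-health -n wasautomation --type merge \
#     -p "$(build_memory_patch 20Gi)"
```

A merge patch changes only the fields that it names, so the existing cpu values are expected to remain untouched.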

Failed to identify server on host example.com

Failed to identify the server on host example.com

This error can be caused by several problems. To resolve the issue, try the following steps:

Ensure that all of the prerequisites for managed servers are met. For more information, see Managed server requirements.
Test the connection between WebSphere Automation and the managed server.
Troubleshoot setup issues.

The MyCustomRole role includes an invalid permission: can_view_websphere_inventory

The can_view_websphere_inventory permission was removed in version 1.2. If a custom role that was created in version 1.1 includes this permission, the role is invalid, and you must use the API to fix it:

Get the API key from the cpd UI.
From the cpd console, click User > Profile and settings, then click the API key button.

Get a bearer token to use for API calls:

curl -k -X POST -H 'Content-Type: application/json' -d '{"username":"<user_name>","api_key":"<api_key>"}' https://$(oc get route -n wasautomation -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/icp4d-api/v1/authorize

Get a list of the roles. This list is needed to get the extension name and JSON metadata, which is used in a subsequent step to modify the broken custom role:

curl -X GET -k -v -H "Authorization: Bearer <bearer_token>" --header "Content-Type: application/json" --header "Accept: application/json" https://$(oc get route -n wasautomation -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/api/v1/usermgmt/v1/roles

Example:

curl -X GET -k -v -H "Authorization: Bearer eyJhbGciOiJSUz..." --header "Content-Type: application/json" --header "Accept: application/json" https://$(oc get route -n wasautomation -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/api/v1/usermgmt/v1/roles

Response (truncated):

{"rows":[{"id":"f60b72c3-ae7e-4860-8f98-649e316af6d2","key":"f60b72c3-ae7e-4860-8f98-649e316af6d2","doc":{"_id":"f60b72c3-ae7e-4860-8f98-649e316af6d2","extension_id":"_ce_703424172539772929","extension_name":"f60b72c3-ae7e-4860-8f98-649e316af6d2","role_name":"mycustomrole","description":"","permissions":["can_view_websphere_inventory"]...],"messageCode":"success","message":"success"}

For each custom role that contains the can_view_websphere_inventory permission, remove that permission and replace it with the can_view_application_runtime_security permission.

curl -X PUT -k -v -H "Authorization: Bearer <bearer_token>" --header "Content-Type: application/json" --header "Accept: application/json" -d '{"role_name":"<role_name>","description":"","permissions":["can_view_application_runtime_security"]}' https://$(oc get route -n wasautomation -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/api/v1/usermgmt/v1/role/<extension_name>

Example:

curl -X PUT -k -v -H "Authorization: Bearer eyJhbGciOiJSUz..." --header "Content-Type: application/json" --header "Accept: application/json" -d '{"role_name":"mycustomrole","description":"","permissions":["can_view_application_runtime_security"]}' https://$(oc get route -n wasautomation -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/api/v1/usermgmt/v1/role/f60b72c3-ae7e-4860-8f98-649e316af6d2

Response (truncated):

{"id":"f60b72c3-ae7e-4860-8f98-649e316af6d2","messageCode":"success","message":"success"}
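When many custom roles exist, finding the affected ones by eye in the roles response is error-prone. The following sketch uses jq to list the affected roles and to build a replacement payload for each one. The function names are hypothetical, and the sketch assumes that the full response has the same rows[].doc shape as the truncated example in this topic. Unlike the PUT example, the generated payload keeps any other permissions that the role already has.

```shell
# Hypothetical helper: reads the roles response on stdin and prints
# "<extension_name><TAB><role_name>" for each role with the removed permission.
affected_roles() {
  jq -r '.rows[]
    | select(any(.doc.permissions[]?; . == "can_view_websphere_inventory"))
    | [.doc.extension_name, .doc.role_name] | @tsv'
}

# Hypothetical helper: reads the roles response on stdin and prints one PUT
# payload per affected role, with the removed permission swapped for
# can_view_application_runtime_security and duplicates dropped.
fixed_role_payloads() {
  jq -c '.rows[].doc
    | select(any(.permissions[]?; . == "can_view_websphere_inventory"))
    | {role_name, description,
       permissions: (.permissions
         | map(if . == "can_view_websphere_inventory"
               then "can_view_application_runtime_security" else . end)
         | unique)}'
}

# Usage: save the output of the GET request to roles.json, then run
#   affected_roles < roles.json
#   fixed_role_payloads < roles.json
```

Review each generated payload before you send it with the PUT request for the matching extension name.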

Fix installation fails with a Connection error: read operation timed out error message

If the installation of a fix fails with this error in the runbook.log file, click Install fix in the UI at a later time to restart the installation of the fix.

Fix installation fails on WebSphere Application Server Liberty for a non-root user with Installation Manager in group mode on Linux or UNIX

This error occurs because the InstallationManager.dat file that WebSphere Automation accesses is not located in the non-root user's home directory as expected. To work around this problem, create a symbolic link in the non-root user's home directory that points to the actual location of the InstallationManager.dat file. Refer to the following example.

ln -s /<my_group_name>/InstallationManager_AppData/etc/.ibm/registry/InstallationManager.dat \
/home/<non-root-username>/etc/.ibm/registry/InstallationManager.dat
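The ln command fails if the etc/.ibm/registry directory does not yet exist under the home directory. The following sketch wraps the same link creation in a hypothetical helper, link_im_registry, that first checks the source file and creates the parent directory. Whether the parent directory is missing on your system is an assumption to verify.

```shell
# Hypothetical helper: link_im_registry <actual_dat_file> <home_directory>
# Creates <home_directory>/etc/.ibm/registry if needed, then creates the
# symbolic link to the actual InstallationManager.dat file.
link_im_registry() {
  actual="$1"
  target_dir="$2/etc/.ibm/registry"
  [ -f "$actual" ] || { echo "not found: $actual" >&2; return 1; }
  mkdir -p "$target_dir" || return 1
  ln -s "$actual" "$target_dir/InstallationManager.dat"
}

# Usage (run as the non-root user; replace the placeholders first):
#   link_im_registry \
#     "/<my_group_name>/InstallationManager_AppData/etc/.ibm/registry/InstallationManager.dat" \
#     "/home/<non-root-username>"
```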

Errors in running status or synchronization of node agent after a fix pack is installed

After you use WebSphere Automation to install a fix pack on a node in WebSphere Application Server Network Deployment, you might see one of the following problems:

An incorrect running status for the node in the admin console
An incorrect synchronization status for the node in the admin console

An error similar to the following one in the SystemOut.log file:

ADMD0026W: The version of the deployment manager (9.0.5.11) is earlier than that of this node (node1, 9.0.5.12).

These problems occur because the fix pack version of the node is higher than the version of the deployment manager host. To resolve the problem, manually update the deployment manager host to a version equal to or higher than the fix pack version.
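To confirm the mismatch from the log, the two versions can be extracted from the ADMD0026W message and compared. This is a minimal sketch with hypothetical helper names. It assumes that the message has the format shown in this topic and that your sort command supports the -V (version sort) option, as GNU sort does.

```shell
# Hypothetical helper: reads log lines on stdin and prints
# "<deployment_manager_version> <node_version>" for each ADMD0026W message.
admd0026w_versions() {
  sed -n 's/.*ADMD0026W:.*deployment manager (\([0-9.]*\)).*this node ([^,]*, \([0-9.]*\)).*/\1 \2/p'
}

# Hypothetical helper: succeeds when dotted version $1 is lower than $2.
# Assumes a sort command with the -V option.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage:
#   admd0026w_versions < SystemOut.log
#   version_lt 9.0.5.11 9.0.5.12 && echo "update the deployment manager first"
```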

Installation of the fix cannot proceed error message

If the Installation of the fix cannot proceed error message appears, it might be caused by one of the following problems:

A communication problem might exist between WebSphere Automation and IBM Fix Central.
A configuration problem might exist that prevents WebSphere Automation from authenticating with IBM Fix Central.
A user privilege problem might exist that prevents WebSphere Automation from installing the fix on the managed server.

Check the configurations to ensure that they are correct. If the configurations are correct and a communication problem is suspected, try initiating the fix again after approximately one hour.

Problem with request to install fix error message

If the Problem with request to install fix error message appears, it is because more than one fix installation was initiated on a particular host. Only one fix can be installed on a particular host at a time. Wait until the current fix installation process is complete before you attempt to install another fix on that host.

Installation of a fix on a Windows server stalls

If the process of installing a fix on a Windows server stalls for an unreasonable amount of time, restart the Windows server and then restart the installation of the fix.

Install-fix pod is stalled in ContainerStatusUnknown state while installing fixes

It is possible for a fix installation to stall and show Installing fix in the WebSphere Automation UI, and for the install-fix pod to remain in ContainerStatusUnknown state. While in this condition, subsequent installation attempts on the same host do not proceed and result in the following error message.

WIORM0806E: Other fixes are being installed on the host 'myhost.com'. Try again later.

To check your pod status, run the oc get pod command. Look for the ContainerStatusUnknown state.

oc get pod | grep install-fix
install-fix-f6054b58-f20d-4351-8c44-7c1efd93f2d5-9m89j    0/1    ContainerStatusUnknown   1    48m
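A plain grep for install-fix also matches healthy pods. As a sketch, the following hypothetical helper filters the pod list on both the name prefix and the status column. It assumes the default column layout of oc get pod, in which the status is the third column.

```shell
# Hypothetical helper: reads "oc get pod --no-headers" output on stdin and
# prints the names of install-fix pods in ContainerStatusUnknown state.
stalled_install_fix_pods() {
  awk '$1 ~ /^install-fix-/ && $3 == "ContainerStatusUnknown" { print $1 }'
}

# Usage:
#   oc get pod --no-headers | stalled_install_fix_pods
```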

To work around this issue, you must delete the stalled installation that never proceeds past the in-progress status and then delete the related install-fix job.

Delete the stalled installation with Swagger UI or CLI commands.

To delete the stalled installation with Swagger UI, find its installationId value and then use the value in a DELETE operation. For general instructions on how to use Swagger UI, see WebSphere Automation "How To" Series #10: How to view WebSphere Automation REST APIs using Swagger UI.
1. Find the installation on a host that is still in in-progress state with a GET operation on /installations that uses the hostName and status query parameters.
2. Delete those installations with a DELETE operation that uses the installationId value.
```
DELETE /installations/{installationId}
```
To delete the stalled installation with CLI commands, first get the token and URL values and then use the WebSphere Automation REST APIs through CLI to delete the installation.
1. Get the token value. The Viewing the REST API topic describes how to acquire the token value for an authorized user profile.
  1. Get the password for the administrator account.
```
oc -n WSA_INSTANCE_NAMESPACE get secret ibm-iam-bindinfo-platform-auth-idp-credentials -o jsonpath='{.data.admin_password}' | base64 -d && echo
```
    WSA_INSTANCE_NAMESPACE is the namespace of the instance where WebSphere Automation is installed; if the default value was chosen at installation, the value is websphere-automation.
  2. Replace <password> in the following command with the value returned from the command in the previous step, and use the correct value for WSA_INSTANCE_NAMESPACE.
```
curl -k -X POST -H 'Content-Type: application/json' -d '{"username":"cpadmin","password":"<password>"}' https://$(oc get route -n WSA_INSTANCE_NAMESPACE -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/icp4d-api/v1/authorize | jq -r .token
```
2. Copy the result of the curl command into a TOKEN variable.
3. Get the URL value to use in the curl commands. Add the prefix https:// and the suffix /websphereauto/secvul/apis/v1 to the result of the following command.
```
oc get route -n WSA_INSTANCE_NAMESPACE -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}'
```
  To set a URL variable on Linux, you can use the following command.
```
URL=https://$(oc get route -n WSA_INSTANCE_NAMESPACE -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}')/websphereauto/secvul/apis/v1
```
4. Get the installationId for the installation that is stuck in the in-progress state on a specific host. You can use the following command after you replace HOSTNAME.
```
curl -k -X GET "${URL}/installations?hostName=HOSTNAME&status=in-progress" -H "accept: application/json" -H "Authorization: Bearer $TOKEN" | jq . | grep id
```
5. Delete that installation. Replace INSTALLATIONID in the command with the installationId value from the previous step.
```
curl -k -X DELETE "${URL}/installations/INSTALLATIONID" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
```
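The token and URL steps can also be captured directly in shell variables instead of being copied by hand. The following sketch uses hypothetical function names and reads the namespace from a WSA_INSTANCE_NAMESPACE environment variable, falling back to the default that is named in the steps. It assumes a password without double quotation marks or backslashes, because the password is inserted into the JSON body unescaped.

```shell
# Namespace of the WebSphere Automation instance (default from this topic).
NS="${WSA_INSTANCE_NAMESPACE:-websphere-automation}"

# Hypothetical helper: prints the host name of the ibm-nginx-svc route.
wsa_host() {
  oc get route -n "$NS" \
    -o jsonpath='{.items[?(@.spec.to.name=="ibm-nginx-svc")].spec.host}'
}

# Hypothetical helper: wsa_token <password> prints the bearer token.
wsa_token() {
  curl -k -s -X POST -H 'Content-Type: application/json' \
    -d "{\"username\":\"cpadmin\",\"password\":\"$1\"}" \
    "https://$(wsa_host)/icp4d-api/v1/authorize" | jq -r .token
}

# Hypothetical helper: prints the base URL for the REST API calls.
wsa_api_url() {
  printf 'https://%s/websphereauto/secvul/apis/v1\n' "$(wsa_host)"
}

# Usage (requires a logged-in oc session, curl, and jq):
#   TOKEN=$(wsa_token '<password>')
#   URL=$(wsa_api_url)
```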

Delete the install-fix job.

Get the job name with the oc get job command.

oc get job | grep install-fix
install-fix-306dfd00-cb07-456c-91bb-7d3be8e5c0d7   0/1      14h   14h

Delete the related install-fix job with the oc delete job job_name command.
```
oc delete job install-fix-306dfd00-cb07-456c-91bb-7d3be8e5c0d7
```

iFix installation on a target server with an AIX operating system fails with error chmod: A flag or octal number is not correct

This error is related to the use of Ansible on the AIX operating system when both the connection user and the become_user are unprivileged. To prevent this problem from recurring, do the following steps:

Add --from-literal=ansible_pipelining=true to your secret.
Disable requiretty in your /etc/sudoers file for all managed hosts.
You can do this by commenting out the Defaults requiretty line, as shown in the following example.
```
#Defaults requiretty
```
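To verify the change on a managed host, you can search the sudoers files for requiretty settings that are still active. This is a minimal sketch with a hypothetical helper name. It treats commented-out lines and negated !requiretty settings as disabled, and the list of files to check is an assumption because additional fragments might exist on your hosts.

```shell
# Hypothetical helper: active_requiretty <sudoers_file>...
# Prints Defaults lines that still enable requiretty and succeeds when at
# least one such line is found. Commented-out lines and !requiretty are ignored.
active_requiretty() {
  grep -nE '^[[:space:]]*Defaults.*requiretty' "$@" | grep -v '!requiretty'
}

# Usage (run as a user that can read the file):
#   if active_requiretty /etc/sudoers; then
#     echo "requiretty is still enabled"
#   fi
```

Edit the sudoers file with visudo so that syntax errors are caught before the file is saved.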

Failure to invoke a webhook after a memory leak is detected

Unexpected updates to the Instana alerts schema can cause this problem. WebSphere Automation uses a JSON schema to validate the JSON that is sent from Instana. The schema that WebSphere Automation uses is set in the wsa-schema-instana-alerts config map. Ensure that the environment variable $WSA_INSTANCE_NAMESPACE is set to your WebSphere Automation instance namespace.

Retrieve the default Instana Alerts schema from the default ConfigMap as a local file named instana-alerts-custom.json.

oc get cm wsa-schema-instana-alerts -n $WSA_INSTANCE_NAMESPACE -o "jsonpath={.data['instanaAlerts\.json']}" > instana-alerts-custom.json

Make the necessary changes to the instana-alerts-custom.json JSON file.

Create the custom ConfigMap.

oc create cm wsa-schema-instana-alerts-custom -n $WSA_INSTANCE_NAMESPACE --from-file=instanaAlerts.json=instana-alerts-custom.json
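A syntax error in the edited schema file is easy to introduce and hard to spot after the ConfigMap is created. As a sketch, the following hypothetical helper checks that the file still parses as JSON before you run the oc create cm command. It uses jq when available and otherwise falls back to the json.tool module of python3. It checks syntax only, not whether the schema matches the Instana alerts.

```shell
# Hypothetical helper: json_is_valid <file>
# Succeeds when the file parses as JSON. Uses jq if it is installed,
# otherwise python3.
json_is_valid() {
  if command -v jq >/dev/null 2>&1; then
    jq . "$1" >/dev/null 2>&1
  else
    python3 -m json.tool "$1" >/dev/null 2>&1
  fi
}

# Usage:
#   json_is_valid instana-alerts-custom.json || echo "fix the JSON syntax first"
```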

Bulletin import job fails in an air gap installation

In an air gap installation, the wsa-secure-bulletins-import pods might fail to complete. For example, if you run the following command:

oc get pods | grep import

You might see output with errors:

wsa-secure-bulletins-import-1.6.0-8526l                    0/1     Error       0               2d15h
wsa-secure-bulletins-import-1.6.0-b7jld                    0/1     Error       0               2d15h
wsa-secure-bulletins-import-1.6.0-c4cxf                    0/1     Error       0               2d15h
wsa-secure-bulletins-import-1.6.0-dsmdg                    0/1     Error       0               2d15h
wsa-secure-bulletins-import-1.6.0-fgj7p                    0/1     Error       0               2d21h
wsa-secure-bulletins-import-1.6.0-kw9qm                    0/1     Error       0               2d15h
wsa-secure-bulletins-import-1.6.0-t5sgl                    0/1     Error       0               2d15h

If so, delete the bulletins import job.

oc delete job wsa-secure-bulletins-import-1.6.0

A new bulletins import job is created.
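The job name contains the product version, so it differs between releases. The following sketch derives the job name from the failed pod names instead of hard-coding it. The helper name is hypothetical, and the sketch assumes that each pod name is the job name followed by a hyphen and a generated suffix, as in the sample output.

```shell
# Hypothetical helper: reads "oc get pods --no-headers" output on stdin and
# prints the unique job names of bulletins import pods in Error state.
failed_import_jobs() {
  awk '$1 ~ /^wsa-secure-bulletins-import-/ && $3 == "Error" {
         sub(/-[^-]+$/, "", $1)   # strip the generated pod suffix
         print $1
       }' | sort -u
}

# Usage:
#   for job in $(oc get pods --no-headers | failed_import_jobs); do
#     oc delete job "$job"
#   done
```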

The timeout was exceeded accessing the Operate > Application runtimes page

This problem occurs when the WebSphereSecure UI cannot contact the Platform UI instance by using the certificates that are provisioned by Red Hat OpenShift. When the application fails to communicate, the request times out and the browser displays the following error:

timeout of 20000ms exceeded

To solve this problem, delete the WebSphereSecure UI deployment (<instance-name>-secure-ui) to restart the application.

For example, if the WebSphere Automation instance name is wsa, delete the wsa-secure-ui deployment.

Delete the related deployment with the oc delete deployment deployment_name command.

oc delete deployment <instance-name>-secure-ui -n <namespace>

In a FIPS-enabled environment, installations of fixes or memory leak investigation do not progress

If you use an SSH key pair that was generated on a non-FIPS system in a FIPS-enabled installation of WebSphere Automation, fix installations or memory leak investigations might not progress. The pod log might stop at the following lines:

[07/28/23 18:27:00:516 UTC] 1    com.ibm.ws.automation.core.runbook.runner.RunbookRunnerCLI INFO start Request received to execute runbook: install-fix against server: server1.example.com (correlationId: 65301e97-5754-4001-afbe-0c669d6774ff)
[07/28/23 18:27:00:607 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook Here is the standard output of the command:

[07/28/23 18:27:00:613 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook was:
[07/28/23 18:27:00:613 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook   hosts:
[07/28/23 18:27:00:613 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook     server1.example.com:
[07/28/23 18:27:00:613 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook       ansible_user: root
[07/28/23 18:27:00:625 UTC] 1    com.ibm.ws.automation.core.runbook.runner.AnsibleRunner INFO runRunbook Agent pid 41

This problem occurs because the ssh-keygen command on a non-FIPS system uses the MD5 digest algorithm to generate keys. On a FIPS-enabled system, the MD5 digest algorithm is disabled. SSH key pairs with no passphrase are not affected.

When you run WebSphere Automation on a FIPS-enabled cluster, choose one of the following options to use a passphrase-protected SSH key pair on a FIPS-enabled system.

Generate a new passphrase-protected SSH key pair on a FIPS-enabled system.

Convert the existing private key to a FIPS-compatible format:

$ openssl pkcs8 -topk8 -v2 aes128 -in <INPUT FILENAME> -out <OUTPUT FILENAME>
Enter pass phrase for id_rsa:   <PASSPHRASE OF EXISTING KEY>
Enter Encryption Password:      <PASSPHRASE FOR NEW KEY>
Verifying - Enter Encryption Password:    <PASSPHRASE FOR NEW KEY>
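To see which private key files might need conversion, you can inspect the PEM headers. This sketch rests on an assumption that is not stated in this topic: that the affected keys are passphrase-protected keys in the traditional PEM format, which carry a Proc-Type: 4,ENCRYPTED header, whereas the output of the openssl pkcs8 -topk8 command starts with BEGIN ENCRYPTED PRIVATE KEY. The helper name is hypothetical.

```shell
# Hypothetical helper: key_format <private_key_file>
# Prints "legacy-encrypted" for a traditional PEM key with a passphrase,
# "pkcs8-encrypted" for an encrypted PKCS#8 key, and "other" otherwise.
key_format() {
  if grep -q '^Proc-Type: 4,ENCRYPTED' "$1"; then
    echo legacy-encrypted
  elif grep -q '^-----BEGIN ENCRYPTED PRIVATE KEY-----' "$1"; then
    echo pkcs8-encrypted
  else
    echo other
  fi
}

# Usage:
#   key_format ~/.ssh/id_rsa
```

A result of other covers unencrypted keys and formats that this sketch does not recognize, so treat it as inconclusive rather than as proof that the key is unaffected.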

Email notifications are not sent, Can't verify identity of server error message in the wsa-secure-vulnerability-notifier pod log

Starting in WebSphere Automation 1.7, a more secure JavaMail service is used than in previous versions. The service verifies the identity of the mail server when it establishes the TLS connection. If the certificate does not match the hostname of the mail server, a secure connection cannot be established and no emails are sent.

To disable hostname verification when sending emails, set the mail.smtps.ssl.checkserveridentity property to false on the Notifications page.

Log in to WebSphere Automation as a user with Manage notifications privileges.
Click menu icon > Operate > Application runtimes and then open the Notifications page.
On the Notifications page, under Email server, click Configure server.
Click the Add button under Add additional fields.
Add a mail.smtps.ssl.checkserveridentity parameter, and set its value to false.
Click Save.

Runbook logs contain file access permission errors

WebSphere Automation playbooks require that the user profile that is defined for the ansible_user connection variable has read access to the following WebSphere Application Server traditional files and their parent directories.

app_server_root/properties/profileRegistry.xml
app_server_root/properties/version/installed.xml

If the profile that is defined for the ansible_user connection variable cannot read the files, you might see errors similar to the following in the runbook logs:

ValueError: File not found or no permissions to access app_server_root/properties/version/installed.xml

Permission denied: 'app_server_root/properties/profileRegistry.xml'

To grant read access permissions to the files, use the operating system tools to change the file permissions. For example:

chmod +r /opt/IBM/WebSphere/AppServer/properties/profileRegistry.xml
chmod +r /opt/IBM/WebSphere/AppServer/properties
chmod +r /opt/IBM/WebSphere/AppServer/properties/version/installed.xml
chmod +r /opt/IBM/WebSphere/AppServer/properties/version
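To check the result, you can test read access while logged in as the user that is defined for the ansible_user connection variable. The following sketch uses a hypothetical helper name, and the paths in the usage comment assume the same app_server_root as the chmod example.

```shell
# Hypothetical helper: report_unreadable <path>...
# Prints one line for each file or directory that the current user cannot read.
report_unreadable() {
  for path in "$@"; do
    [ -r "$path" ] || echo "not readable: $path"
  done
}

# Usage (run as the ansible_user; no output means all paths are readable):
#   report_unreadable \
#     /opt/IBM/WebSphere/AppServer/properties \
#     /opt/IBM/WebSphere/AppServer/properties/profileRegistry.xml \
#     /opt/IBM/WebSphere/AppServer/properties/version \
#     /opt/IBM/WebSphere/AppServer/properties/version/installed.xml
```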

For more background, see Granting write permission for profile-related tasks in the WebSphere Application Server documentation, step 8 in particular. The use case in that topic is different, so do not follow its instructions verbatim.

No iFix option for fixing vulnerabilities in a Java SDK runtime

In the Prepare fix dialog, no interim fixes (iFixes) are shown as options to fix vulnerabilities in a Java SDK runtime. You must select Fix Pack as the Fix type on the Choose global options page, and then choose a fix pack to install on the Choose fixes page.