Liberty collective troubleshooting
You might encounter several common issues when you work with Liberty collectives. The issues typically relate to the configuration of the collective controller, a collective member, or the host system. Browse the following list to find your issue and learn how to resolve it.
- Issues involving connection
- CWWKX0217E: No MBean is currently registered with the given object_name
- CWWKX0215E: There was a problem with the user name or password provided
- CWWKX8057I: The collective member is unable to establish a connection to any of the collective controllers. Configured controllers: [host_name:port_number]
- Error: Connection refused: connect
- java.net.SocketException error
- Issues involving start and stop commands
- Starting or stopping the servers remotely causes a Java™ not found error
- CTGRI0000E: Could not establish a connection to the target machine with the authorization credentials that were provided
- CTGRI0001E: The application could not establish a connection to host_name
- CTGRI0026E A connection could not be completed to host_name during the specified timeout interval
- CWWKX6027E: The collective controller initialization did not succeed. The socket bind did not succeed for host host_name and port port_number. The port might already be in use or the host does not match the system configuration.
- CWWKX7204E: Cannot connect to host host_name with the credentials provided
- Issues involving collective administration
For fixes to other issues, see Runtime environment known issues and restrictions.
Issues involving connection
- CWWKX0217E: No MBean is currently registered with the given object_name
- Message:
Error: CWWKX0217E: No MBean is currently registered with the given ObjectName 'WebSphere:feature=collectiveController,type=CollectiveRegistration,name=CollectiveRegistration'
Cause: The MBean might not be available yet. Check the server logs to see if the MBean has reported ready.
The collective repository might not be running. Check to see if the collective repository has started.
If the target is a collective controller, verify that the replica set is active. If most of the collective controller replicas are not started, this message is displayed. Start the remaining replicas.
The server configuration might be incomplete. Make sure that the server is properly configured.
- CWWKX0215E: There was a problem with the user name or password provided.
- Message:
Error: CWWKX0215E: There was a problem with the user name or password provided. The server responded with code 401 and message 'Unauthorized'
Cause: The user name and password might be incorrect. Make sure that the user name and password are correct for the target server.
The user might not be granted the Administrator role. Make sure that the user is granted the Administrator role, or choose a different user.
The security configuration for the target server might be incomplete. Make sure that the security configuration is defined and the security service reports as ready (CWWKS0008I).
- CWWKX8057I: The collective member is unable to establish a connection to any of the collective controllers. Configured controllers: [host_name:port_number]
- Message:
CWWKX8057I: The collective member is unable to establish a connection to any of the collective controllers. Configured controllers: [test.ibm.com:8889]
Cause: The servers might not be running. Verify that the collective controller and member servers are running.
If the servers are running, determine whether the SSL configuration in the server.xml of the controller or a member changed recently. If this CWWKX8057I message occurs in the member, then the controller is more likely to have an incorrect SSL configuration. If this message occurs in the controller, then a member is more likely to have an incorrect SSL configuration. The problem can occur when a configuration does not use a quickStartSecurity element.
To fix a problem with the SSL configuration, check the following in the server.xml file:
By default, <sslDefault sslRef> points to defaultSSLSettings. Changing <sslDefault sslRef> in the collective controller configuration to point to something other than defaultSSLSettings, such as <sslDefault sslRef="LDAPSSLSettings"/>, causes the CWWKX8057I error unless the configuration has clientAuthenticationSupported="true" and the Liberty server trusts any SSL peer that has a client certificate.
SSL settings for HTTPS must trust the collective certificates. Changing <sslDefault sslRef> to point to something other than defaultSSLSettings without clientAuthenticationSupported="true" can unbind the defaultSSLSettings from their HTTPS configuration. The collective SSL settings are part of the default configuration.
For more information, see Mapping management roles for Liberty, Configuring LDAP user registries in Liberty, and Configuring an httpEndpoint to use an SSL configuration other than the default.
- Error: Connection refused: connect
- Message:
Error: Connection refused: connect
Cause: The host and port might be incorrect. Make sure that the host and port are correct for the target server.
The server might not be running. Make sure that the server is running.
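Before digging into the Liberty configuration, it can help to confirm that anything at all is listening on the target host and port. The following is a minimal sketch, not part of Liberty: the function name can_connect and the 3-second timeout are assumptions chosen for illustration. It only checks that a TCP connection can be opened; it says nothing about SSL or credentials.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not perform an SSL
    handshake or any Liberty-specific check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect("controllerHost", 9443) returning False is consistent with a wrong host or port, or with a server that is not running.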
- java.net.SocketException error
- Message:
java.net.SocketException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)(possibly others...)
Cause: The truststore and truststore password might be incorrect. Make sure that the truststore path, truststore password, and contents of the truststore are correct.
Issues involving start and stop commands
- Starting or stopping the servers remotely causes a Java not found error
- Message:
Starting or stopping the servers remotely (by using ClusterManager.startCluster or ServerCommands.startServer for example) encounters the following error:
{stderr=java: javaCmd 14: serverCmd 32: ./server 873: FSUM7351 not found, stdout=, returnCode=127}
Solution: The member servers need a server.env file that specifies a JAVA_HOME variable.
- CTGRI0000E: Could not establish a connection to the target machine with the authorization credentials that were provided.
- Message:
CTGRI0000E Could not establish a connection to the target machine with the authorization credentials that were provided.
Cause: Authentication fails using user name or password:
- Make sure that the user name and password are correct in the target server's server.xml <hostAuthConfig> element.
- Update the host authentication configuration by using the collective updateHost command.
Authentication fails using SSH keys:
- Make sure that the user name and password are correct in the target server's server.xml
- CTGRI0001E: The application could not establish a connection to host_name.
- Message:
{ExceptionMessage=ConnectException caught while performing stopCluster operation on member webp1a.ibm.com,/P1A/WebSphere_LP/usr,memberA1: java.net.ConnectException: CTGRI0001E The application could not establish a connection to webp1a.ibm.com., Exception=java.net.ConnectException: CTGRI0001E The application could not establish a connection to webp1a.ibm.com.}
Cause: Starting or stopping the servers remotely by using commands such as ClusterManager.startCluster or ServerCommands.startServer can cause the error.
Message CTGRI0001E, along with message CTGRI0026E, can indicate that too many concurrent SSH connections are made to a host. Possible causes are:
- Autonomics such as scaling controller
- Running ClusterManager.startCluster, ServerCommands.startServer, or other system management commands on a number of servers on a single host that exceeds the maximum number of concurrent unauthenticated connections to the SSH daemon.
Solution:
Confirm that the RPC mechanism (such as SSH) is started. Also confirm that the configured settings, such as host and port, are correct.
If your environment uses SSH, change the settings in the SSH configuration file. The SSH configuration MaxStartups setting has a default of 10 concurrent unauthenticated connections. Changing the MaxStartups setting in the SSH configuration file, /etc/ssh/sshd_config, can solve the problem. The MaxStartups setting specifies the maximum number of concurrent unauthenticated connections to the SSH daemon. Additional connections are dropped until authentication succeeds or the LoginGraceTime expires for a connection. You can enable random early drop by specifying the three colon-separated values start:rate:full (for example, 10:30:60). sshd(8) refuses connection attempts with a probability of rate/100 (30%) if there are currently start (10) unauthenticated connections. The probability increases linearly and all connection attempts are refused if the number of unauthenticated connections reaches full (60). The following sample SSH configuration file settings specify MaxStartups and other settings that can alleviate connection problems:
ClientAliveInterval 60
ClientAliveCountMax 3
MaxSessions 100
MaxStartups 100:30:200
LoginGraceTime 180
For more information about Secure Shell (SSH) protocol and changing /etc/ssh/sshd_config settings, see Setting up RXA for Liberty collective operations.
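To get a feel for how a start:rate:full value behaves, the random early drop rule described above can be modeled in a few lines. This is a sketch of the documented behavior only, not sshd's actual code; the function name drop_probability is an assumption, and the sketch handles only the three-value form of MaxStartups, not a single number.

```python
def drop_probability(unauthenticated: int, max_startups: str) -> float:
    """Estimate the chance (0.0 to 1.0) that sshd refuses a new connection.

    Hypothetical helper: models the start:rate:full description of
    MaxStartups, where max_startups is a string such as "10:30:60".
    """
    start, rate, full = (int(part) for part in max_startups.split(":"))
    if unauthenticated < start:
        # Below the start threshold, no connections are dropped.
        return 0.0
    if unauthenticated >= full:
        # At or above the full threshold, every attempt is refused.
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100
```

With the sample value 100:30:200, this model refuses nothing below 100 pending connections, about 30% of attempts at 100, and every attempt at 200, which leaves much more headroom for bulk start and stop operations than 10:30:60.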
- CTGRI0026E A connection could not be completed to host_name during the specified timeout interval.
- Message:
CTGRI0026E A connection could not be completed to webp1a.ibm.com during the specified timeout interval.
Cause: Too many concurrent SSH connections to a host can cause this error.
Solution: See the solution for message CTGRI0001E.
- CWWKX6027E: The collective controller initialization did not succeed. The socket bind did not succeed for host host_name and port port_number. The port might already be in use or the host does not match the system configuration.
- Message:
CWWKX6027E: The collective controller initialization did not succeed. The socket bind did not succeed for host * and port 10,010. The port might already be in use or the host does not match the system configuration.
Solution: Ensure that the host value that is specified in the collective controller configuration is correct. For example, if the collective controller resides on myhost.com, check the server.xml file of the controller to ensure that the host value is correct:
<variable name="defaultHostName" value="myhost.com" />
The example message shows an asterisk (*) for host, suggesting that the host value probably did not cause the problem. The likely cause of the problem is a port conflict.
Ensure that the port number in the message is not already in use. At a command line on the host computer where the collective controller resides, run netstat -a to see a list of port numbers and the status of the connections. If the port number is in use, the list contains an entry such as the following for port 10,010:
TCP 127.0.0.1:10010 myhost:0 LISTENING
To fix this port conflict, open the server.xml file of the collective controller in an editor and add a statement that sets replicaPort to a port number that is not in use on the computer. Any of the following statements can set a replicaPort value:
<collectiveController replicaPort="10011"/>
<collectiveController replicaHost="myhost.com" replicaPort="10011"/>
<collectiveController replicaPort="${prop.controller_1.replica}"/>
Set the variable for the port number, which has the name prop.controller_1.replica in this statement but which can have any variable name you choose, in a bootstrap.properties file or in a <variable name="name" value="value"/> XML tag.
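As an alternative to reading netstat output, a short script can test whether a candidate replicaPort value is free before you put it in server.xml. The following is a minimal sketch under stated assumptions: the function name port_is_free is hypothetical, the check covers IPv4 TCP only, and it must run on the host where the collective controller resides. A port below 1024 can also report as not free simply because the user lacks the privilege to bind it.

```python
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can bind to host:port right now.

    Hypothetical helper for illustration. An empty host means all local
    IPv4 interfaces, similar to the asterisk (*) in the CWWKX6027E message.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use"; can also be a
            # permission error or an address that is not local.
            return False
```

For example, if port_is_free(10010) returns False and port_is_free(10011) returns True, 10011 is a reasonable candidate for replicaPort. The result is only a snapshot, because another process can claim the port after the check.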
- CWWKX7204E: Cannot connect to host host_name with the credentials provided.
- Message:
localhost,C:/wlp,member1 stop operation resulted in an Exception: ConnectException caught while performing stopCluster operation on member localhost,C:/wlp,member1: java.net.ConnectException: CWWKX7204E: Cannot connect to host localhost with the credentials provided.
Solution: Make sure that the cluster member authentication information is set correctly and that all Remote Execution and Access (RXA) requirements are met. Many RXA operations require access to resources that are not generally accessible by standard user accounts. See Setting up RXA for Liberty collective operations.
Issues involving collective administration
- Collective remove command did not remove an auto-scaled cluster member from the collective
- Message:
Specified server Cluster1 was not found at location ${wlp.install.dir}/Cluster1.zip/wlp/usr No collective resources were removed. Specified userDir ${wlp.install.dir}/Cluster1.zip/wlp/usr was not found
Solution: The standard practice is to not manually remove auto-scaled cluster members from a collective. The scaling controller autonomically manages the collective. See Configuring provisionable clusters for Liberty elasticity.
However, if you need to remove a cluster member that a scaling controller added to a collective, you can set the WLP_USER_DIR and JAVA_HOME environment variables and then run the collective remove command to remove the cluster member from the collective.
- Set WLP_USER_DIR to the location of the auto-scaled cluster member that you want to remove.
export WLP_USER_DIR=/wlp.usr/defaultStackGroup.Cluster1/d16d298c-ea7f-4782-bef6-2e54fb80de20/Cluster1.zip/wlp/usr
- If Java cannot be run from the location of the auto-scaled cluster member, set JAVA_HOME so that you can run a Liberty command to remove the cluster member.
export JAVA_HOME=/wlp.jre/jre.18.zip/
- Stop the cluster member.
/wlp/wlp.855.zip/wlp/bin/server stop Cluster1
- Run the collective remove command to remove the cluster member from the collective.
/wlp/wlp.855.zip/wlp/bin/collective remove Cluster1 --host=controllerHost --port=9443 --user=admin --password=password
- Set