Liberty collective troubleshooting

You might encounter a number of common issues when working with Liberty collectives. These issues typically relate to the configuration of the collective controller, member, or host system. Browse the following list to find your issue and its resolution.

Issues involving connection
Issues involving start and stop commands
Issues involving collective administration
- Collective remove command did not remove an auto-scaled cluster member from the collective

For fixes to other issues, see Runtime environment known issues and restrictions.

Issues involving connection

CWWKX0217E: No MBean is currently registered with the given object_name

Message:

Error: CWWKX0217E: No MBean is currently registered with the given ObjectName 'WebSphere:feature=collectiveController,type=CollectiveRegistration,name=CollectiveRegistration'

Cause:

The MBean might not be available yet. Check the server logs to see if the MBean has reported ready.

The collective repository might not be running. Check to see if the collective repository has started.

If the target is a collective controller, verify that the replica set is active. If most of the collective controller replicas are not started, this message is displayed. Start the remaining replicas.

The server configuration might be incomplete. Make sure that the server is properly configured.

CWWKX0215E: There was a problem with the user name or password provided.

Message:

Error: CWWKX0215E: There was a problem with the user name or password provided. The server responded with code 401 and message 'Unauthorized'

Cause:

The user name and password might be incorrect. Make sure that the user name and password are correct for the target server.

The user might not be granted the Administrator role. Make sure that the user is granted the Administrator role, or choose a different user.

The security configuration for the target server might be incomplete. Make sure that the security configuration is defined and the security service reports as ready (CWWKS0008I).
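
The check for the CWWKS0008I message can be scripted. The following minimal sketch searches a server log for that message ID. The function name security_service_ready and the log path in the usage comment are hypothetical; the sketch assumes that the server writes its messages to a messages.log file in its logs directory.

```shell
# Report whether a Liberty server has logged that its security service is ready.
# $1 = path to the server's messages.log file (assumed location; adjust as needed)
security_service_ready() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "Cannot read log file: $log_file" >&2
        return 2
    fi
    # CWWKS0008I is the message ID that indicates the security service is ready.
    grep -q 'CWWKS0008I' "$log_file"
}

# Example call (hypothetical path):
# security_service_ready /opt/wlp/usr/servers/controller1/logs/messages.log && echo "ready"
```

The function returns 0 when the message is present, 1 when it is absent, and 2 when the log file cannot be read.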

CWWKX8057I: The collective member is unable to establish a connection to any of the collective controllers. Configured controllers: [host_name:port_number]

Message:

CWWKX8057I: The collective member is unable to establish a connection to any of the collective controllers. Configured controllers: [test.ibm.com:8889]

Cause:

The servers might not be running. Verify that the collective controller and member servers are running.

If the servers are running, determine whether the SSL configuration in the server.xml of the controller or a member changed recently. If this CWWKX8057I message occurs in the member, then the controller is more likely to have an incorrect SSL configuration. If this message occurs in the controller, then a member is more likely to have an incorrect SSL configuration. The problem can occur when a configuration does not use a quickStartSecurity element.

To fix a problem with the SSL configuration, check the following in the server.xml file:

- Ensure that the Administrator role is configured. When you map management roles, compare the quickStartSecurity element settings to those for a basic or LDAP registry.
- Review and update the SSL configuration as needed:
1. Check any default SSL configuration. Look for <sslDefault sslRef="LDAPSSLSettings"></sslDefault> to see whether the server.xml file sets a default SSL configuration, and determine whether it is necessary to specify one SSL configuration as the default.
  If the <sslDefault sslRef="LDAPSSLSettings"></sslDefault> configuration is not necessary, remove it so that the server can use two or more SSL configurations (for example, one for the collective and one for LDAP).
  
  If the default configuration is necessary, keep the <sslDefault sslRef="LDAPSSLSettings"></sslDefault> line and proceed to steps 2 and 3 to add client authentication and import certificates to the default SSL configuration. See Configuring LDAP user registries in Liberty.
2. Enable client authentication. Configure the HTTPS port for the server with clientAuthenticationSupported="true".
  For example, a collective controller that uses the LDAP SSL settings as its default for security must have clientAuthenticationSupported="true" in its ssl element to work with a collective member.
3. Import necessary certificates.
  You can use the keytool command to import certificates.
  
  If you are using the PKCS12 keystore, the default truststore must include certificates from the collectiveTrust.p12 file.
  
  If you are using the JKS keystore, the default truststore must include certificates from the collectiveTrust.jks file.
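
The certificate import in step 3 might look like the following minimal sketch. It assumes that the keytool command from a Java installation is on the PATH. The function name import_collective_trust, the DRY_RUN switch, and the paths in the usage comment are hypothetical. The sketch omits password options, so keytool prompts for the keystore passwords.

```shell
# Copy all entries from a collective truststore into the default truststore.
# $1 = source keystore (for example, a collectiveTrust.p12 file)
# $2 = destination truststore
# $3 = keystore type, PKCS12 (default) or JKS
import_collective_trust() {
    src=$1
    dest=$2
    store_type=${3:-PKCS12}
    # Build the keytool command as positional parameters so that paths
    # with spaces are passed through intact.
    set -- keytool -importkeystore \
        -srckeystore "$src" -srcstoretype "$store_type" \
        -destkeystore "$dest" -deststoretype "$store_type"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"      # print the command without running it
    else
        "$@"           # run keytool; it prompts for both passwords
    fi
}

# Example call (hypothetical paths):
# import_collective_trust \
#     /opt/wlp/usr/servers/controller1/resources/collective/collectiveTrust.p12 \
#     /opt/wlp/usr/servers/controller1/resources/security/trust.p12
```

Set DRY_RUN=1 to review the command before it changes the destination truststore. Pass JKS as the third argument for JKS keystores.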

Background information:

By default, <sslDefault sslRef> points to defaultSSLSettings. Changing <sslDefault sslRef> in the collective controller configuration to point to something other than defaultSSLSettings, such as <sslDefault sslRef="LDAPSSLSettings"/>, causes the CWWKX8057I error unless the configuration has clientAuthenticationSupported="true" and the Liberty server trusts any SSL peer that has a client certificate.

SSL settings for HTTPS must trust the collective certificates. Changing <sslDefault sslRef> to point to something other than defaultSSLSettings without clientAuthenticationSupported="true" can unbind the defaultSSLSettings from their HTTPS configuration. The collective SSL settings are part of the default configuration.

For more information, see Mapping management roles for Liberty, Configuring LDAP user registries in Liberty, and Configuring an httpEndpoint to use an SSL configuration other than the default.

Error: Connection refused: connect

Message:

Error: Connection refused: connect

Cause:

The host and port might be incorrect. Make sure that the host and port are correct for the target server.

The server might not be running. Make sure that the server is running.

java.net.SocketException error

Message:

java.net.SocketException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)(possibly others...)

Cause:

The truststore and truststore password might be incorrect. Make sure that the truststore path, truststore password, and contents of the truststore are correct.
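
One way to verify the truststore path, password, and contents is to list the truststore with keytool. The following minimal sketch assumes that keytool is on the PATH; the function name check_truststore and the path in the usage comment are hypothetical. Because no password option is passed, keytool prompts for the password, and an incorrect password causes the listing to fail.

```shell
# List the certificates in a truststore after confirming that the file exists.
# $1 = path to the truststore file
check_truststore() {
    store=$1
    if [ ! -r "$store" ]; then
        echo "Truststore not found or not readable: $store" >&2
        return 1
    fi
    # keytool prompts for the truststore password and prints each entry.
    keytool -list -keystore "$store"
}

# Example call (hypothetical path):
# check_truststore /opt/wlp/usr/servers/member1/resources/security/trust.p12
```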

Issues involving start and stop commands

Starting or stopping the servers remotely causes a Java not found error

Message:

Starting or stopping the servers remotely (for example, by using ClusterManager.startCluster or ServerCommands.startServer) results in the following error:

{stderr=java: javaCmd 14: serverCmd 32: ./server 873: FSUM7351 not found, stdout=, returnCode=127}

Solution:

The member servers need a server.env file that specifies a JAVA_HOME variable.
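
The following minimal sketch writes a JAVA_HOME entry to the server.env file of a member server. The function name set_member_java_home and the paths in the usage comment are hypothetical; the sketch assumes that the server.env file belongs in the server configuration directory.

```shell
# Set JAVA_HOME in the server.env file of a member server.
# $1 = server configuration directory, $2 = Java installation directory
set_member_java_home() {
    server_dir=$1
    java_home=$2
    env_file=$server_dir/server.env
    if [ -f "$env_file" ]; then
        # Drop any earlier JAVA_HOME line so that the file holds one value.
        # grep exits with 1 when nothing remains, which is not an error here.
        grep -v '^JAVA_HOME=' "$env_file" > "$env_file.tmp" || true
        mv "$env_file.tmp" "$env_file"
    fi
    printf 'JAVA_HOME=%s\n' "$java_home" >> "$env_file"
}

# Example call (hypothetical paths):
# set_member_java_home /opt/wlp/usr/servers/memberA1 /opt/java/jre
```

Run the function on each host where a member server reports the error, and then retry the remote start or stop operation.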

CTGRI0000E: Could not establish a connection to the target machine with the authorization credentials that were provided.

Message:

CTGRI0000E Could not establish a connection to the target machine with the authorization credentials that were provided.

Cause:

Authentication fails using user name or password:

Make sure that the user name and password are correct in the target server's server.xml <hostAuthConfig> element.
Update the host authentication configuration by using the collective updateHost command.

Authentication fails using ssh keys:

Check permissions on:
- ~/.ssh should be 0700
- ~/.ssh/authorized_keys should be 0600
If you use SELinux, ~/.ssh and all of its children must also have the correct security context. Use restorecon -R to restore the context.
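
The permission checks can be applied in one step. The following minimal sketch sets the modes that are listed above and, when the restorecon command is available, also resets the SELinux labels. The function name fix_ssh_permissions is hypothetical.

```shell
# Apply the expected modes to an SSH directory and its authorized_keys file.
# $1 = SSH directory (defaults to ~/.ssh of the current user)
fix_ssh_permissions() {
    ssh_dir=${1:-$HOME/.ssh}
    chmod 700 "$ssh_dir" || return 1
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys" || return 1
    fi
    # restorecon exists only on systems with SELinux tooling installed.
    if command -v restorecon >/dev/null 2>&1; then
        restorecon -R "$ssh_dir" || echo "restorecon reported a problem for $ssh_dir" >&2
    fi
}

# Example call, run as the user that the collective controller logs in as:
# fix_ssh_permissions
```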

CTGRI0001E: The application could not establish a connection to host_name.

Message:

{ExceptionMessage=ConnectException caught while performing stopCluster operation on member webp1a.ibm.com,/P1A/WebSphere_LP/usr,memberA1: java.net.ConnectException: 
CTGRI0001E The application could not establish a connection to webp1a.ibm.com., Exception=java.net.ConnectException: CTGRI0001E The application could not establish a connection to webp1a.ibm.com.}

Cause:

Starting or stopping the servers remotely by using commands such as ClusterManager.startCluster or ServerCommands.startServer can cause the error.

Message CTGRI0001E, along with message CTGRI0026E, can indicate that too many concurrent SSH connections are made to a host. Possible causes are:

- Autonomic features, such as the scaling controller
- Running ClusterManager.startCluster, ServerCommands.startServer, or other system management commands on so many servers on a single host that the number of connections exceeds the maximum number of concurrent unauthenticated connections to the SSH daemon

Solution:

Confirm that the RPC mechanism (such as SSH) is started. Also confirm that the configured settings, such as host and port, are correct.

If your environment uses SSH, change the settings in the SSH configuration file, /etc/ssh/sshd_config. The MaxStartups setting specifies the maximum number of concurrent unauthenticated connections to the SSH daemon and has a default of 10. Additional connections are dropped until authentication succeeds or the LoginGraceTime expires for a connection. Increasing the MaxStartups value can solve the problem.

You can enable random early drop by specifying the three colon-separated values start:rate:full (for example, 10:30:60). sshd(8) refuses connection attempts with a probability of rate/100 (30%) if there are currently start (10) unauthenticated connections. The probability increases linearly and all connection attempts are refused if the number of unauthenticated connections reaches full (60).

The following sample SSH configuration file settings specify MaxStartups and other settings that can alleviate connection problems:

ClientAliveInterval 60
ClientAliveCountMax 3
MaxSessions 100
MaxStartups 100:30:200
LoginGraceTime 180

For more information about Secure Shell (SSH) protocol and changing /etc/ssh/sshd_config settings, see Setting up RXA for Liberty collective operations.
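
To see how a start:rate:full value behaves, the random early drop rule can be worked through as arithmetic. The following minimal sketch applies a linear interpolation that is consistent with the description of MaxStartups; the function name drop_probability is hypothetical, and the exact rounding that sshd applies might differ slightly.

```shell
# Estimate the percentage of new connection attempts that sshd refuses.
# $1 = MaxStartups value in start:rate:full form
# $2 = current number of unauthenticated connections
drop_probability() {
    awk -v spec="$1" -v n="$2" 'BEGIN {
        split(spec, m, ":")
        start = m[1] + 0; rate = m[2] + 0; full = m[3] + 0; n = n + 0
        if (n < start) {
            p = 0                 # below start: nothing is refused
        } else if (n >= full) {
            p = 100               # at or above full: everything is refused
        } else {
            # rises linearly from rate at start to 100 at full
            p = rate + (100 - rate) * (n - start) / (full - start)
        }
        printf "%d\n", p
    }'
}

drop_probability 10:30:60 35      # prints 65
drop_probability 100:30:200 150   # prints 65
```

With the sample setting 100:30:200, no connection attempts are refused until 100 unauthenticated connections are pending, which leaves more room for system management commands that open many connections at once.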

CTGRI0026E A connection could not be completed to host_name during the specified timeout interval.

Message:

CTGRI0026E A connection could not be completed to webp1a.ibm.com during the specified timeout interval.

Cause:

Too many concurrent SSH connections to a host can cause this error.

Solution:

See the solution for message CTGRI0001E.

CWWKX6027E: The collective controller initialization did not succeed. The socket bind did not succeed for host host_name and port port_number. The port might already be in use or the host does not match the system configuration.

Message:

CWWKX6027E: The collective controller initialization did not succeed. The socket bind did not succeed for host * and port 10,010. The port might already be in use or the host does not match the system configuration.

Solution:

Ensure that the host value that is specified in the collective controller configuration is correct. For example, if the collective controller resides on myhost.com, check the server.xml file of the controller to ensure that the host value is correct:

<variable name="defaultHostName" value="myhost.com" />

The example message shows an asterisk (*) for host, suggesting that the host value probably did not cause the problem. The likely cause of the problem is a port conflict.

Ensure that the port number in the message is not already in use. At a command line on the host computer where the collective controller resides, run netstat -a to see a list of port numbers and the status of the connections. If the port number is in use, the list contains an entry such as the following for port 10010 (shown as 10,010 in the message):

TCP 127.0.0.1:10010 myhost:0 LISTENING
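
Because netstat output can be long, it helps to filter it for the port in question. The following minimal sketch reads netstat output from standard input and keeps only listening entries for one port. The function name listeners_on_port is hypothetical; the sketch assumes that the local address ends with a colon or period followed by the port number, which varies by operating system.

```shell
# Keep only the listening entries for a given port.
# Reads netstat output on standard input; $1 = port number
listeners_on_port() {
    port=$1
    # Match ":port" or ".port" at the end of an address, then a LISTEN state.
    grep -E "[:.]${port}[[:space:]].*LISTEN"
}

# Example call; the -n option keeps addresses and ports numeric:
# netstat -an | listeners_on_port 10010
```

The function returns a nonzero exit status when no listening entry matches, which indicates that the port is not in use.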

To fix this port conflict, open the server.xml file of the collective controller in an editor and add a statement that sets replicaPort to a port number that is not in use on the computer. Any of the following statements can set a replicaPort value:

<collectiveController replicaPort="10011"/>
<collectiveController replicaHost="myhost.com" replicaPort="10011"/>
<collectiveController replicaPort="${prop.controller_1.replica}"/>

The last statement reads the port number from a variable. The variable is named prop.controller_1.replica in this example, but you can choose any variable name. Define the variable in a bootstrap.properties file or in a <variable name="name" value="value"/> XML tag.

CWWKX7204E: Cannot connect to host host_name with the credentials provided.

Message:

localhost,C:/wlp,member1 stop operation resulted in an Exception: ConnectException caught while performing stopCluster operation on member localhost,C:/wlp,member1: java.net.ConnectException: 
CWWKX7204E: Cannot connect to host localhost with the credentials provided.

Solution:

Make sure that the cluster member authentication information is set correctly and that all Remote Execution and Access (RXA) requirements are met. Many RXA operations require access to resources that are not generally accessible by standard user accounts. See Setting up RXA for Liberty collective operations.

Issues involving collective administration

Collective remove command did not remove an auto-scaled cluster member from the collective

Message:

Specified server Cluster1 was not found at location ${wlp.install.dir}/Cluster1.zip/wlp/usr
No collective resources were removed.
Specified userDir ${wlp.install.dir}/Cluster1.zip/wlp/usr was not found

Solution:

The standard practice is to not manually remove auto-scaled cluster members from a collective. The scaling controller autonomically manages the collective. See Configuring provisionable clusters for Liberty elasticity.

However, if you need to remove a cluster member that a scaling controller added to a collective, you can set the WLP_USER_DIR and JAVA_HOME environment variables and then run the collective remove command to remove the cluster member from the collective.

Set WLP_USER_DIR to the location of the auto-scaled cluster member that you want to remove.

export WLP_USER_DIR=/wlp.usr/defaultStackGroup.Cluster1/d16d298c-ea7f-4782-bef6-2e54fb80de20/Cluster1.zip/wlp/usr

If Java cannot be run from the location of the auto-scaled cluster member, set JAVA_HOME so that you can run a Liberty command to remove the cluster member.
```
export JAVA_HOME=/wlp.jre/jre.18.zip/
```

Stop the cluster member.

/wlp/wlp.855.zip/wlp/bin/server stop Cluster1

Run the collective remove command to remove the cluster member from the collective.
```
/wlp/wlp.855.zip/wlp/bin/collective remove Cluster1 --host=controllerHost --port=9443 --user=admin --password=password
```
See Removing members from a Liberty collective.