Cassandra query timeout notification from IBM Control Center

If your system is configured for monitoring by IBM® Control Center, a notification is sent when a Cassandra query fails because the results are not received within the RPC timeout value. The administrator must intervene manually to troubleshoot the issue with Cassandra.

Symptoms

A notification is received from IBM Control Center with the following message:

Cassandra queries are timing out

This message is delivered with the following details:

nodeId: Originating data center and node that was trying to query the Cassandra node, in the format gm/data center name/hostname. If the query originated from Sterling B2B Integrator, the format is gm/data center name/gmca/hostname
location: Data center name
dataObject: Data objects (DAO) or operation that was being performed when it failed
reason: Exception string from Cassandra
requiredConsistency: Consistency level that the query required
expectedResponses: Number of nodes from which a response was expected (value retrieved from the Cassandra exception)
actualResponses: Number of nodes that responded (value retrieved from the Cassandra exception)
status: WARNING
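
The detail fields above can be compared programmatically, for example to see how many replicas failed to answer. The following is a minimal sketch, assuming the notification details are available as plain "key: value" text lines. That delivery format, the helper names, and every sample value are hypothetical and are not taken from IBM Control Center.

```python
def parse_details(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    details = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            details[key.strip()] = value.strip()
    return details


def missing_responses(details):
    """Return expectedResponses minus actualResponses, or None if unavailable."""
    try:
        return int(details["expectedResponses"]) - int(details["actualResponses"])
    except (KeyError, ValueError):
        return None


# Hypothetical sample; the field names come from the list above,
# the values are invented for illustration.
sample = """\
nodeId: gm/dc1/host01.example.com
location: dc1
dataObject: ExampleDAO
reason: example exception text
requiredConsistency: LOCAL_QUORUM
expectedResponses: 2
actualResponses: 1
status: WARNING
"""

details = parse_details(sample)
print(details["nodeId"], "missing responses:", missing_responses(details))
```

A positive difference suggests that at least one Cassandra node that should have answered did not, which is consistent with the causes described in the next section.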

Causes

The Cassandra cluster might be offline for any of the following reasons:

Server down

The server might not be running.

Incorrect configuration

The number of Cassandra nodes that are configured in the global.properties file is not equal to the number of Cassandra nodes that are deployed in your cluster.
One or more nodes might be incorrectly configured in the global.properties file.

Network problem

There is a network partition between the Global Mailbox instance and one or more Cassandra nodes.

Environment

Windows, UNIX, or Linux®.

Diagnosing the problem

If you suspect that the Cassandra cluster is offline, you can search the messages.log file for the CBXMD0040E error code at the time of the failure:

Go to the <install_directory>/usr/servers/defaultServer/logs directory.
Open the messages.log file.

Examine the log for events with the error ID CBXMD0040E and the following error message:

An error has occurred while trying to connect to Cassandra.

The format of the error message is as follows:

[time stamp] [thread ID] [logging class] [logging level] [error ID]:
 [error message]

The following example shows how such an event appears in the messages.log file:

[mm/dd/yy hh:mm:ss:ms PDT] 00000063 
com.ibm.mailbox.database.dao.cassandra.CassandraDAO 
CBXMD0040E: An error has occurred while trying to connect to 
Cassandra.
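
Searching a large messages.log file by hand can be slow. The following is a minimal sketch of the search, assuming the log is readable as UTF-8 text. The function names are hypothetical, and the time stamp in the sample line is invented. Because an event can wrap across several lines, the sketch matches on the error ID only.

```python
from pathlib import Path

ERROR_ID = "CBXMD0040E"


def find_error_events(lines, error_id=ERROR_ID):
    """Yield (line_number, text) for each line that contains the error ID."""
    for number, line in enumerate(lines, start=1):
        if error_id in line:
            yield number, line.rstrip("\n")


def scan_log(path, error_id=ERROR_ID):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return list(find_error_events(handle, error_id))


# Illustrative lines only; the first follows the example event above.
sample = [
    "[01/02/24 10:15:30:123 PDT] 00000063 "
    "com.ibm.mailbox.database.dao.cassandra.CassandraDAO "
    "CBXMD0040E: An error has occurred while trying to connect to Cassandra.\n",
    "[01/02/24 10:15:31:000 PDT] 00000064 some unrelated message\n",
]

for number, text in find_error_events(sample):
    print(number, text)
```

To scan a real file, call scan_log with the path to messages.log in the logs directory named in the steps above, then compare the time stamps of the matches with the time of the notification.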

Resolving the problem

If any of the events in the messages.log file indicate that the Cassandra cluster is offline, verify that your Cassandra cluster is correctly configured:

Collect information that defines the topology of your Cassandra cluster deployment:

Tip: If you do not have records that specify your Cassandra cluster topology, you can use the nodetool program to determine the number of Cassandra nodes in your Global Mailbox system:

Verify that JAVA_HOME is set to the location of IBM JDK 8.

From the command line, run bin/nodetool status.

The following example shows the output of the nodetool status command:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.16.186  55.68 KB   256     49.1%             86282e2e-e4a2-4643-a077-0ca6ea32e138  rac1
UN  9.23.16.184  41.22 KB   256     47.9%             5ca91d43-9154-4b22-b1bb-4b432d0bdf43  rac1
Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.25.148  45.19 KB   256     49.2%             215a6b13-fcc2-47ce-bf4d-cd7bfa8fc52c  rac1
UN  9.23.16.187  71.97 KB   256     53.7%             db01c093-96a8-4ee3-8a42-ede9a4ef41b5  rac1

To determine the total number of Cassandra nodes in your cluster, count the entries in the Address column. The Address column lists the IP address of each Cassandra node in the cluster.
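
Counting the addresses can also be scripted. The following is a minimal sketch that parses output in the layout shown above, assuming the output was captured to a string. The function name is hypothetical, and the parser relies only on the "Datacenter:" headings and on rows that start with a two-letter status code followed by an IP address.

```python
import ipaddress


def parse_nodetool_status(output):
    """Map each data center name to a list of (status_code, address) pairs."""
    nodes = {}
    datacenter = None
    for line in output.splitlines():
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            nodes[datacenter] = []
            continue
        fields = line.split()
        # Node rows start with a two-letter code such as UN (Up/Normal).
        if datacenter and len(fields) >= 2 and len(fields[0]) == 2 and fields[0].isalpha():
            try:
                ipaddress.ip_address(fields[1])
            except ValueError:
                continue  # not a node row
            nodes[datacenter].append((fields[0], fields[1]))
    return nodes


# Output copied from the example above.
sample = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.16.186  55.68 KB   256     49.1%             86282e2e-e4a2-4643-a077-0ca6ea32e138  rac1
UN  9.23.16.184  41.22 KB   256     47.9%             5ca91d43-9154-4b22-b1bb-4b432d0bdf43  rac1
Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.25.148  45.19 KB   256     49.2%             215a6b13-fcc2-47ce-bf4d-cd7bfa8fc52c  rac1
UN  9.23.16.187  71.97 KB   256     53.7%             db01c093-96a8-4ee3-8a42-ede9a4ef41b5  rac1
"""

nodes = parse_nodetool_status(sample)
total = sum(len(rows) for rows in nodes.values())
not_up_normal = [addr for rows in nodes.values() for code, addr in rows if code != "UN"]
print("total nodes:", total)
print("nodes not Up/Normal:", not_up_normal)
```

Any address in the second list belongs to a node that is down or is changing state, which is worth checking before you compare the node count with the configuration.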

Verify that the number of Cassandra nodes that are configured in the global.properties file matches the number of nodes in your deployment:
1. In the <shared_files_dir>/config/ directory, locate the global.properties file.
  Important: The <shared_files_dir> is the base directory that you specified during installation.
2. Open the global.properties file in a text editor.
3. Count the configured nodes. The com.ibm.mailbox.database.cassandra.host.1 property identifies the initially configured Cassandra node, and a com.ibm.mailbox.database.cassandra.host.<n> property identifies each additional node. Ensure that the number of configured nodes equals the number of Cassandra nodes in your cluster.
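
The count in step 3 can be sketched as follows. This is an illustration only: it assumes that each entry uses the simple key=value form on a single line, the function name is hypothetical, and the sample excerpt (including the port value) is invented rather than copied from a real global.properties file.

```python
import re

HOST_KEY = re.compile(r"^com\.ibm\.mailbox\.database\.cassandra\.host\.(\d+)$")


def configured_hosts(properties_text):
    """Return {index: host} for every cassandra.host.<n> entry."""
    hosts = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        match = HOST_KEY.match(key.strip())
        if sep and match:
            hosts[int(match.group(1))] = value.strip()
    return hosts


# Hypothetical excerpt for illustration.
sample = """\
# Cassandra connection settings
com.ibm.mailbox.database.cassandra.host.1=9.23.16.186
com.ibm.mailbox.database.cassandra.host.2=9.23.16.184
com.ibm.mailbox.database.cassandra.rpc.port=9160
"""

# Addresses reported by nodetool for the deployed cluster.
deployed = {"9.23.16.186", "9.23.16.184", "9.23.25.148", "9.23.16.187"}

hosts = configured_hosts(sample)
missing = sorted(deployed - set(hosts.values()))
print("configured:", len(hosts), "deployed:", len(deployed))
print("deployed but not configured:", missing)
```

The set comparison works only when the file lists IP addresses. If the file uses host names, compare the counts instead or resolve the names first.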
Ensure that each Cassandra node is correctly configured in global.properties:
1. Verify that the correct IP address or host name is specified for each Cassandra node:
  
  com.ibm.mailbox.database.cassandra.host.1
  
  The IP address or host name of the first Cassandra node that was configured.
  
  com.ibm.mailbox.database.cassandra.host.<n>
  
  The IP address or host name of each additional Cassandra server that is configured in the cluster. Each Cassandra node in the cluster is represented by a unique integer value, <n>, for example, 2.
2. Verify that the correct RPC port is specified for the com.ibm.mailbox.database.cassandra.rpc.port property.
  Important: If you update the configuration of the global.properties file:
  1. Save the global.properties file.
  2. Copy the updated global.properties file to each data center in your Global Mailbox system.
If your Cassandra cluster is correctly configured in the global.properties file, check the status of the network for your Global Mailbox system.
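
One quick network check is to test whether each configured Cassandra host accepts a TCP connection on the RPC port from the Global Mailbox node. The following is a minimal sketch of that idea, not an IBM-provided tool. The function names and the command-line usage are hypothetical. A successful connection shows only that the port is reachable, not that Cassandra is healthy.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, port, timeout=3.0):
    """Return {host: reachable} for every host in the list."""
    return {host: can_connect(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    # Hypothetical usage: python check_nodes.py <rpc_port> <host> [<host> ...]
    if len(sys.argv) >= 3:
        rpc_port = int(sys.argv[1])
        for host, ok in check_nodes(sys.argv[2:], rpc_port).items():
            print(host, "reachable" if ok else "unreachable")
```

Run the check from each Global Mailbox node, with the hosts and RPC port taken from the global.properties file. A host that is reachable from one data center but not from another points to a network partition.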

Verifying the resolution

A notification is received from IBM Control Center with the following message:

Cassandra queries are completing

This message is delivered with the following details:

nodeId: Originating data center and node that was trying to query Cassandra node in the format gm/data center name/hostname or, if the query originated from Sterling B2B Integrator, the format is gm/data center name/gmca/hostname
location: Data center name
dataObject: Data objects (DAO) or operation that was being performed when it failed
requiredConsistency: Consistency level that the query required
precedingFailureCount: Number of failures for this problem that occurred before the success event was raised
status: UP