Cassandra query timeout notification from IBM Control Center
If your system is configured for monitoring by IBM® Control Center, a notification is sent when a query fails because the results are not received within the RPC timeout value. The administrator must manually intervene to troubleshoot the issue with Cassandra.
Symptoms
A notification is received from IBM Control Center with the following message:
Cassandra queries are timing out
This message is delivered with the following details:
- nodeId
- Originating data center and node that was trying to query the Cassandra node, in the format gm/data center name/hostname.
- location
- Data center name
- dataObject
- Data objects (DAO) or operation that was being performed when it failed
- reason
- Exception string from Cassandra
- requiredConsistency
- Desired consistency level that was expected
- expectedResponses
- Number of nodes from which a response was expected (value retrieved from Cassandra exception)
- actualResponses
- Number of nodes that responded (value retrieved from Cassandra exception)
- status
- WARNING
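Read together, expectedResponses and actualResponses show how far the query fell short of the required consistency level. The following Python sketch is illustrative only: the `TimeoutDetails` class, the `missing_responses` helper, and all sample values are hypothetical and are not part of IBM Control Center or Global Mailbox. Only the field names come from the list above.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetails:
    """Hypothetical container for the notification detail fields."""
    node_id: str               # nodeId
    location: str              # data center name
    data_object: str           # DAO or operation that failed
    reason: str                # exception string from Cassandra
    required_consistency: str  # requiredConsistency
    expected_responses: int    # expectedResponses
    actual_responses: int      # actualResponses
    status: str = "WARNING"


def missing_responses(details: TimeoutDetails) -> int:
    """Return how many expected node responses did not arrive in time."""
    return max(details.expected_responses - details.actual_responses, 0)


# Sample values are invented for illustration.
sample = TimeoutDetails(
    node_id="gm/dc1/host1",
    location="dc1",
    data_object="ExampleDAO",
    reason="example timeout exception",
    required_consistency="QUORUM",
    expected_responses=2,
    actual_responses=1,
)
print(missing_responses(sample))  # prints 1
```

A nonzero result suggests that at least one Cassandra node did not answer in time, which points toward the causes that are listed in the next section.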
Causes
The Cassandra cluster might be offline for any of the following reasons:
- Server down
- The server might not be running.
- Incorrect configuration
- The number of Cassandra nodes that are configured in the global.properties file is not equal to the number of Cassandra nodes that are deployed in your cluster.
- One or more nodes might be incorrectly configured in the global.properties file.
- Network problem
- There is a network partition between the Global Mailbox instance and one or more Cassandra nodes.
Environment
Windows, UNIX, or Linux®.
Diagnosing the problem
If you suspect that the Cassandra cluster is offline, you can search the
messages.log file for the CBXMD0040E error code at the time of
the failure:
- Go to the <install_directory>/usr/servers/defaultServer/logs directory.
- Open the messages.log file.
- Examine the log for events with the error ID CBXMD0040E and the following error
message:
An error has occurred while trying to connect to Cassandra.
The format of the error message is as follows:
[time stamp] [thread ID] [logging class] [logging level] [error ID]: [error message]
The following example event includes the information that is logged by the messages.log file:
[mm/dd/yy hh:mm:ss:ms PDT] 00000063 com.ibm.mailbox.database.dao.cassandra.CassandraDAO CBXMD0040E: An error has occurred while trying to connect to Cassandra.
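The search for CBXMD0040E events can also be scripted. The following Python sketch is a minimal illustration, not an IBM-supplied tool: the function names are hypothetical, and the second sample line is invented to show a non-matching event. It matches only on the error ID text and does not parse the other fields.

```python
from pathlib import Path

ERROR_ID = "CBXMD0040E"


def find_connection_errors(lines):
    """Return the log lines that contain the CBXMD0040E error ID."""
    return [line.rstrip("\n") for line in lines if ERROR_ID in line]


def scan_log(path):
    """Read a messages.log file and return its CBXMD0040E events."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_connection_errors(handle)


# First line is the example event from this topic; second line is invented.
sample = [
    "[mm/dd/yy hh:mm:ss:ms PDT] 00000063 "
    "com.ibm.mailbox.database.dao.cassandra.CassandraDAO CBXMD0040E: "
    "An error has occurred while trying to connect to Cassandra.",
    "[mm/dd/yy hh:mm:ss:ms PDT] 00000064 example.Logger I unrelated event",
]
print(len(find_connection_errors(sample)))  # prints 1
```

To scan a real log, pass the path of the messages.log file in the logs directory to `scan_log` and compare the time stamps of the returned events with the time of the notification.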
Resolving the problem
If any of the events in the messages.log file indicate that the Cassandra
cluster is offline, verify that your Cassandra cluster is correctly configured:
- Collect information that defines the topology of your Cassandra cluster deployment.
Tip: If you do not have records that specify your Cassandra cluster topology, you can use the nodetool program to determine the number of Cassandra nodes in your Global Mailbox system:
- Verify that JAVA_HOME is set to the location of IBM JDK 8.
- From the command line, run bin/nodetool status.
The following example is an output of the nodetool command:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.16.186  55.68 KB  256     49.1%             86282e2e-e4a2-4643-a077-0ca6ea32e138  rac1
UN  9.23.16.184  41.22 KB  256     47.9%             5ca91d43-9154-4b22-b1bb-4b432d0bdf43  rac1
Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  9.23.25.148  45.19 KB  256     49.2%             215a6b13-fcc2-47ce-bf4d-cd7bfa8fc52c  rac1
UN  9.23.16.187  71.97 KB  256     53.7%             db01c093-96a8-4ee3-8a42-ede9a4ef41b5  rac1
- To determine the total number of Cassandra nodes that are in your cluster, check the Address column of the output. The Address column provides the IP address of each Cassandra node in the cluster.
- Verify that the number of Cassandra nodes that are configured in the
global.properties file is the same number of nodes in your deployment:
- In the <shared_files_dir>/config/ directory, locate the global.properties file.
Important: The <shared_files_dir> is the base directory that you specified during installation.
- Open the global.properties file in a text editor.
- Count the number of nodes that are configured by identifying the com.ibm.mailbox.database.cassandra.host.1 property for the initially configured Cassandra node and the com.ibm.mailbox.database.cassandra.host.<n> property for each additional node. Ensure that the number of configured Cassandra nodes matches the number of nodes in your cluster.
- Ensure that each Cassandra node is correctly configured in
global.properties:
- Verify that the correct IP address or host name is specified for each Cassandra node:
- com.ibm.mailbox.database.cassandra.host.1
- The IP address or host name of the first Cassandra node that was configured.
- com.ibm.mailbox.database.cassandra.host.<n>
- The IP address or host name of each additional Cassandra server that is configured in the cluster. Each Cassandra node in the cluster is represented by a unique integer value, <n>, for example, 2.
- Verify that the correct RPC port is specified for the com.ibm.mailbox.database.cassandra.rpc.port property.
Important: If you update the configuration of the global.properties file:
- Save the global.properties file.
- Copy the updated global.properties file to each data center in your Global Mailbox system.
- If your Cassandra cluster is correctly configured in the global.properties file, check the status of the network for your Global Mailbox system.
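The node count comparison in the preceding steps can be sketched in Python. This is a minimal illustration under stated assumptions, not an IBM-supplied utility: the function names are hypothetical, the properties excerpt is invented, and the sketch assumes that each property is written as key=value on a single line. The status output is abbreviated from the nodetool example earlier in this topic.

```python
import re

HOST_KEY = re.compile(r"^com\.ibm\.mailbox\.database\.cassandra\.host\.(\d+)$")
# Node rows start with a two-letter status/state code (for example UN)
# followed by the node address.
NODE_ROW = re.compile(r"^[UD][NLJM]\s+(\S+)")


def nodetool_addresses(output):
    """Return the node addresses listed in the nodetool output."""
    addresses = []
    for line in output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            addresses.append(match.group(1))
    return addresses


def configured_hosts(properties_text):
    """Return {n: host} for each cassandra.host.<n> property."""
    hosts = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        match = HOST_KEY.match(key.strip())
        if match:
            hosts[int(match.group(1))] = value.strip()
    return hosts


def compare(output, properties_text):
    """Return (deployed count, configured count, unconfigured addresses)."""
    deployed = nodetool_addresses(output)
    configured = configured_hosts(properties_text)
    missing = [a for a in deployed if a not in configured.values()]
    return len(deployed), len(configured), missing


status_output = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID  Rack
UN  9.23.16.186  55.68 KB  256     49.1%  86282e2e-e4a2-4643-a077-0ca6ea32e138  rac1
UN  9.23.16.184  41.22 KB  256     47.9%  5ca91d43-9154-4b22-b1bb-4b432d0bdf43  rac1
"""
# Hypothetical excerpt of global.properties with one node left out.
properties = """\
com.ibm.mailbox.database.cassandra.host.1=9.23.16.186
"""
print(compare(status_output, properties))  # prints (2, 1, ['9.23.16.184'])
```

Because global.properties can specify host names instead of IP addresses, an address that is reported as unconfigured might still be present under its host name. Treat the list as a prompt for manual checking, and rely on the two counts for the comparison that the procedure requires.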
Verifying the resolution
A notification is received from IBM Control Center with the following message:
Cassandra queries are completing
This message is delivered with the following details:
- nodeId
- Originating data center and node that was trying to query the Cassandra node, in the format gm/data center name/hostname or, if the query originated from Sterling B2B Integrator, gm/data center name/gmca/hostname
- location
- Data center name
- dataObject
- Data objects (DAO) or operation that was being performed when it failed
- requiredConsistency
- Desired consistency level that was expected
- precedingFailureCount
- Number of failures for this problem that occurred before the success event was raised
- status
- UP