Member timeout and failure due to RDMA communication failure
Diagnosis and resolution of RDMA communication timeouts causing a member to sporadically fail and restart.
Symptoms
A member sporadically experiences failures during startup or normal day-to-day OLTP processing. The member is subsequently restarted successfully on its home host, or restarted light onto another host if restarting on the home host is not possible. Informational stack traceback files, dump files, and other data that is normally dumped are written into the diagpath and cf_diagpath.
Another possible symptom is that disk space in the db2dump file system is consumed faster than usual because this diagnostic data is being written.
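One way to watch for this symptom is to measure how much space a diagnostic directory occupies and how many subdirectories it holds. The following is a minimal Python sketch, not a Db2 tool. The function name `summarize_diag_dir` is hypothetical, the directory to scan is whatever path you pass in, and the `db2diag*.log` file-name pattern is an assumption about how the member log files are named on your system.

```python
import os
from pathlib import Path


def summarize_diag_dir(diag_dir):
    """Return (total_bytes, db2diag_log_bytes, subdirectory_count) for diag_dir."""
    root = Path(diag_dir)
    total_bytes = 0
    log_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # file was rotated or removed while scanning
            total_bytes += size
            # Assumption: member log files are named db2diag*.log
            if name.startswith("db2diag") and name.endswith(".log"):
                log_bytes += size
    # Count only the immediate subdirectories (candidate dump directories).
    subdir_count = sum(1 for entry in root.iterdir() if entry.is_dir())
    return total_bytes, log_bytes, subdir_count
```

Calling the function on the same directory at intervals and comparing the results gives a rough picture of how quickly the diagnostic data is growing.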
Diagnosis and resolution
To diagnose and resolve this problem, carry out the following steps:
- Check ~/sqllib_shared/db2dump/$m to see whether the member db2diag log file is abnormally large or whether the directory contains many diagnostic dump directories. The db2diag log file might contain messages from the function pdLogCaPrintf:

  2009-01-23-17.33.23.976179-300 I5463632A503       LEVEL: Severe
  PID     : 602310               TID  : 15422       PROC : db2sysc 0
  INSTANCE:                      NODE : 000         DB   :
  APPHDL  : 0-53                 APPID:
  AUTHID  :
  EDUID   : 15422                EDUNAME: db2agent ( ) 0
  FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
  DATA #1 : <preformatted>
  xport_send: Timed-out waiting for completion of dat_ep_post_rdma_write of an MCB

  2009-01-23-17.33.24.473642-300 I5464136A467       LEVEL: Severe
  PID     : 602310               TID  : 15422       PROC : db2sysc 0
  INSTANCE:                      NODE : 000         DB   :
  APPHDL  : 0-53                 APPID:
  AUTHID  :
  EDUID   : 15422                EDUNAME: db2agent ( ) 0
  FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
  DATA #1 : <preformatted>
  ClientXport.send (CARAR:) failed: 0x80090013
- Check the db2diag log file for Restart Light messages after the aforementioned diag messages. See Restart events that might occur in Db2 pureScale environments for more information about the various restart messages, including the Restart Light message.

  2009-08-27-23.37.52.416270-240 I6733A457          LEVEL: Event
  PID     : 1093874              TID  : 1           KTID : 2461779
  PROC    : db2star2
  INSTANCE:                      NODE : 001
  HOSTNAME: hostC
  EDUID   : 1
  FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
  MESSAGE : Idle process taken over by member
  DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
  996
  DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
  1
- After locating the pdLogCaPrintf messages, search for the diag message string CF RC=. For example, CF RC= 2148073491
- Take the numeric value adjacent to this string; in this example, it is 2148073491. This value represents the reason code from the network or communications layer.
- To find more details on this error, use the db2diag tool. For example, db2diag -cfrc 2148073491
- Ping the cluster caching facility to see if it is online. If the ping is successful, gather a db2support package by running db2support output_directory -d database_name -s on each cluster and contact IBM Technical Support.
- An RDMA trace might be requested by IBM Service for diagnosing such problems. See Running a trace for uDAPL over InfiniBand connections.
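The reason-code steps above can be sketched in Python as follows. This is an illustrative helper rather than part of Db2: the names `CF_RC_PATTERN`, `find_cf_reason_codes`, and `describe` are hypothetical, and the pattern assumes the reason code appears as a decimal number after the string CF RC=, as in the example. The sketch only builds the db2diag -cfrc command text; it does not run it. It also prints the hexadecimal form of each code, because 2148073491 is 0x80090013, which is the value shown in the ClientXport.send failure message in the sample log.

```python
import re

# Matches "CF RC=" followed by optional whitespace and a decimal value,
# as in the example "CF RC= 2148073491".
CF_RC_PATTERN = re.compile(r"CF RC=\s*(\d+)")


def find_cf_reason_codes(log_text):
    """Return the distinct CF RC values found in log_text, in first-seen order."""
    seen = []
    for match in CF_RC_PATTERN.finditer(log_text):
        code = int(match.group(1))
        if code not in seen:
            seen.append(code)
    return seen


def describe(code):
    """Return the hex form of a code and the db2diag command that looks it up."""
    return f"CF RC={code} (0x{code:08X}): db2diag -cfrc {code}"


if __name__ == "__main__":
    sample = "DATA #1 : <preformatted>\nCF RC= 2148073491\n"
    for rc in find_cf_reason_codes(sample):
        print(describe(rc))
```

To use it on real data, read the contents of a db2diag log file into a string and pass that string to `find_cf_reason_codes`, then run the printed db2diag -cfrc command for each code.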