Member timeout and failure due to RDMA communication failure

Diagnosis and resolution of RDMA communication timeouts causing a member to sporadically fail and restart.

Symptoms

A member sporadically fails during startup or during normal day-to-day OLTP processing. The member is subsequently restarted successfully on its home host, or restarted light on another host if it cannot be restarted on the home host. Informational stack traceback files, dump files, and other data that is normally dumped are written to the paths specified by diagpath and cf_diagpath.

Another possible symptom is that disk space in the db2dump file system is consumed faster than usual because this diagnostic data is being written.
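One way to quantify this symptom is to record the size of the diagnostic directory at intervals and compare the readings. The following is a minimal sketch, not a Db2 tool: the function name summarize_diag_dir is hypothetical, and the path is taken from the diagnosis steps on this page and might differ on your instance.

```python
import os


def summarize_diag_dir(path):
    """Return (total_bytes, subdirectory_count) for the tree rooted at path."""
    total_bytes = 0
    subdir_count = 0
    for root, dirs, files in os.walk(path):
        subdir_count += len(dirs)
        for name in files:
            full_path = os.path.join(root, name)
            try:
                total_bytes += os.path.getsize(full_path)
            except OSError:
                # The file was removed or is unreadable; skip it.
                pass
    return total_bytes, subdir_count


if __name__ == "__main__":
    # Assumed location; substitute the diagpath that your instance actually uses.
    diag_path = os.path.expanduser("~/sqllib_shared/db2dump")
    if os.path.isdir(diag_path):
        size, subdirs = summarize_diag_dir(diag_path)
        print(f"{diag_path}: {size} bytes in files, {subdirs} subdirectories")
```

Running the script periodically and comparing the byte counts and subdirectory counts gives a rough indication of how quickly diagnostic data is accumulating.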

Diagnosis and resolution

To diagnose and resolve this problem, carry out the following steps:
  • Check the ~/sqllib_shared/db2dump/ $m directory to see whether the member db2diag log file is abnormally large or whether the directory contains many diagnostic dump directories. The db2diag log file might contain messages from the function pdLogCaPrintf:
    2009-01-23-17.33.23.976179-300 I5463632A503       LEVEL: Severe
    PID     : 602310               TID  : 15422       PROC : db2sysc 0
    INSTANCE:                      NODE : 000         DB   :     
    APPHDL  : 0-53                 APPID:              
    AUTHID  :        
    EDUID   : 15422                EDUNAME: db2agent (     ) 0
    FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
    DATA #1 : <preformatted>
    xport_send: Timed-out waiting for completion of dat_ep_post_rdma_write of an MCB
    
    2009-01-23-17.33.24.473642-300 I5464136A467       LEVEL: Severe
    PID     : 602310               TID  : 15422       PROC : db2sysc 0
    INSTANCE:                      NODE : 000         DB   :      
    APPHDL  : 0-53                 APPID:                         
    AUTHID  :        
    EDUID   : 15422                EDUNAME: db2agent (    ) 0
    FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
    DATA #1 : <preformatted>
    ClientXport.send (CARAR:) failed: 0x80090013
    
  • Check the db2diag log file for Restart Light messages that follow the messages shown in the previous step. For more information about the various restart messages, including the Restart Light message, see Restart events that might occur in Db2 pureScale environments.
    2009-08-27-23.37.52.416270-240 I6733A457        LEVEL: Event
    PID     : 1093874              TID  : 1         KTID : 2461779
    PROC    : db2star2
    INSTANCE:                      NODE : 001
    HOSTNAME: hostC
    EDUID   : 1
    FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
    MESSAGE : Idle process taken over by member
    DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
    996
    DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
    1
  • After locating the pdLogCaPrintf messages, search for the message string CF RC=. For example: CF RC= 2148073491
  • Take the numeric value adjacent to this string; in this example, it is 2148073491. This value is the reason code from the network or communications layer.
  • To find more details about this error, use the db2diag tool. For example: db2diag -cfrc 2148073491
  • Ping the cluster caching facility to see whether it is online. If the ping is successful, gather a db2support package by running db2support output_directory -d database_name -s on each host in the cluster, and contact IBM Technical Support.
  • An RDMA trace might be requested by IBM Service to diagnose such problems. See Running a trace for uDAPL over InfiniBand connections.
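The reason-code lookup in the steps above can be partly scripted. The following is a minimal sketch, assuming the db2diag log text has already been read into a string. It collects the numbers that follow CF RC= and prints the corresponding db2diag -cfrc commands. It also converts hexadecimal values from "failed: 0x..." lines to decimal, because the sample value 0x80090013 equals 2148073491. Treating the two as the same code in general is an assumption. The function names and the sample text are hypothetical.

```python
import re

# The "CF RC=" string from the steps above, with optional whitespace before the number.
CF_RC_PATTERN = re.compile(r"CF RC=\s*(\d+)")
# Hexadecimal codes such as the one in "ClientXport.send (CARAR:) failed: 0x80090013".
HEX_FAILED_PATTERN = re.compile(r"failed:\s*(0x[0-9A-Fa-f]+)")


def extract_reason_codes(log_text):
    """Return the distinct decimal reason codes in log_text, in order of first appearance."""
    codes = []
    for line in log_text.splitlines():
        match = CF_RC_PATTERN.search(line)
        if match:
            value = int(match.group(1))
        else:
            match = HEX_FAILED_PATTERN.search(line)
            if not match:
                continue
            # Assumption: the hexadecimal value is the same code in another notation.
            value = int(match.group(1), 16)
        if value not in codes:
            codes.append(value)
    return codes


def cfrc_commands(codes):
    """Build one db2diag lookup command per reason code."""
    return [f"db2diag -cfrc {code}" for code in codes]


if __name__ == "__main__":
    # Illustrative text only; a real db2diag log contains full message records.
    sample = (
        "ClientXport.send (CARAR:) failed: 0x80090013\n"
        "CF RC= 2148073491\n"
    )
    for command in cfrc_commands(extract_reason_codes(sample)):
        print(command)
```

The script only prints the commands and does not run them, so each lookup can be reviewed and then run manually on a host where the db2diag tool is available.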