Identifying uDAPL over InfiniBand communication errors

Using diagnostic logs, operating system commands, and system traces, you can identify and resolve uDAPL communication errors.

Important: Starting from version 11.5.5, support for InfiniBand (IB) adapters as the high-speed communication network between members and CFs in Db2® pureScale® on all supported platforms is deprecated and will be removed in a future release. Use a Remote Direct Memory Access over Converged Ethernet (RoCE) network as the replacement.

After you issue the db2start command, activate the database on first connection, or restart a member, errors might occur, as shown in the following examples of messages in a db2diag log file:

2009-04-27-15.41.03.299437-240 I9450505A370       LEVEL: Severe
PID     : 651462               TID  : 258         KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
EDUID   : 258                  EDUNAME: db2sysc 0
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
ca_svr_connect: dat_evd_wait failed: 0xf0000

2009-04-27-15.41.03.363542-240 I9450876A367       LEVEL: Severe
PID     : 651462               TID  : 258         KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
EDUID   : 258                  EDUNAME: db2sysc 0
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
CAConnect: cmd_connect failed: 0x80090001

2009-04-27-15.41.03.421934-240 I9451244A1356      LEVEL: Severe
PID     : 651462               TID  : 258         KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000
EDUID   : 258                  EDUNAME: db2sysc 0
FUNCTION: Db2, Shared Data Structure Abstraction Layer , 
          SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:622
MESSAGE : CA RC= 2148073473
DATA #1 : String, 17 bytes
CAConnect failed.
DATA #2 : CAToken_t, PD_TYPE_SD_CATOKEN, 8 bytes
0x07000000003EE0B8 : 0000 0001 1064 AF90                        .....d..
DATA #3 : CA Retry Position, PD_TYPE_SAL_CA_RETRY, 8 bytes
0
CALLSTCK:
  [0] 0x0900000012FF274C sqleCaCeConnect__23SQLE_CA_CONN_ENTRY_DATAFCP7CATokenCl
      + 0x40C
  [1] 0x0900000012FF2CF8 sqleSingleCaCreateAndAddNewConnectionsToPool__
      21SQLE_SINGLE_CA_HANDLEFCUlT1Cb + 0x278
  [2] 0x0900000012FF9188 sqleSingleCaInitialize__21SQLE_SINGLE_CA_HANDLEFRC27SQLE_
      CA_CONN_POOL_NODE_INFOCUlP13SQLO_MEM_POOL + 0x448
  [3] 0x0900000013001C50 sqleCaCpAddCa__17SQLE_CA_CONN_POOLFsCPUl + 0x350
  [4] 0x00000001000118AC sqleInitSysCtlr__FPiT1 + 0x140C
  [5] 0x0000000100013008 sqleSysCtlr__Fv + 0x4A8
  [6] 0x0900000012E15C78 sqloSystemControllerMain__FCUiPFv_iPFi_vPPvCPi + 0xD58
  [7] 0x0900000012E177AC sqloRunInstance + 0x20C
  [8] 0x0000000100006ECC DB2main + 0xAEC
  [9] 0x0900000012C99048 sqloEDUMainEntry__FPcUi + 0xA8
The db2diag log file might also show messages similar to the following:
2009-04-27-15.41.04.595936-240 I9453087A387       LEVEL: Severe
PID     : 1249362              TID  : 258         KTID : 4395063 PROC : db2sysc 1
INSTANCE: db2inst1             NODE : 001
EDUID   : 258                  EDUNAME: db2sysc 1
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
xport_send: dat_ep_post_rdma_write of the MCB failed: 0x70000

2009-04-27-15.42.04.329724-240 I9505628A1358      LEVEL: Severe
PID     : 1249362              TID  : 258         KTID : 4395063 PROC : db2sysc 1
INSTANCE: db2inst1             NODE : 001
EDUID   : 258                  EDUNAME: db2sysc 1
FUNCTION: Db2, Shared Data Structure Abstraction Layer , 
          SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:622
MESSAGE : CA RC= 2148073485
DATA #1 : String, 17 bytes
CAConnect failed.
DATA #2 : CAToken_t, PD_TYPE_SD_CATOKEN, 8 bytes
0x07000000003EE0B8 : 0000 0001 1064 AFD0                        .....d..
DATA #3 : CA Retry Position, PD_TYPE_SAL_CA_RETRY, 8 bytes
894
CALLSTCK:
  [0] 0x0900000012FF274C sqleCaCeConnect__23SQLE_CA_CONN_ENTRY_DATAFCP7CATokenCl 
      + 0x40C
  [1] 0x0900000012FF2CF8 sqleSingleCaCreateAndAddNewConnectionsToPool__
      21SQLE_SINGLE_CA_HANDLEFCUlT1Cb + 0x278
  [2] 0x0900000012FF9188 sqleSingleCaInitialize__21SQLE_SINGLE_CA_HANDLEFRC27SQLE_
      CA_CONN_POOL_NODE_INFOCUlP13SQLO_MEM_POOL + 0x448
  [3] 0x0900000013001C50 sqleCaCpAddCa__17SQLE_CA_CONN_POOLFsCPUl + 0x350
  [4] 0x00000001000118AC sqleInitSysCtlr__FPiT1 + 0x140C
  [5] 0x0000000100013008 sqleSysCtlr__Fv + 0x4A8
  [6] 0x0900000012E15C78 sqloSystemControllerMain__FCUiPFv_iPFi_vPPvCPi + 0xD58
  [7] 0x0900000012E177AC sqloRunInstance + 0x20C
  [8] 0x0000000100006ECC DB2main + 0xAEC
  [9] 0x0900000012C99048 sqloEDUMainEntry__FPcUi + 0xA8
These messages indicate a communication error between a CF and a member. Follow these steps:
  1. Locate the pdLogCfPrintf messages and search for the message string CF RC=. For example, CF RC= 2148073491.
  2. Take the numeric value adjacent to this string; in this example it is 2148073491. This value represents the reason code from the network or communication layer.
  3. To find more details on this error, run the db2diag tool with the -cfrc parameter followed by the value. Example: db2diag -cfrc 2148073491.
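Steps 1 - 3 can be scripted. The following sketch pulls the reason code out of a saved db2diag excerpt with sed; the heredoc sample is illustrative, and the final db2diag -cfrc invocation is shown as a comment because it must run on a Db2 host.

```shell
# Extract the numeric reason code that follows "CF RC=" (steps 1-2).
# The heredoc stands in for a saved db2diag log excerpt.
rc=$(sed -n 's/.*CF RC= *\([0-9]\{1,\}\).*/\1/p' <<'EOF'
FUNCTION: Db2, RAS/PD component, pdLogCfPrintf, probe:876
MESSAGE : CF RC= 2148073491
EOF
)
echo "reason code: $rc"
# On a Db2 host, decode the code with: db2diag -cfrc "$rc"
```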
  4. If the system was recently enabled with uDAPL and InfiniBand, check your uDAPL configuration. For details, see Configuring the network settings of hosts in a Db2 pureScale environment on an InfiniBand network (AIX®).
  5. From each member host that shows the previously listed errors, ping the IB hostnames of the CFs; likewise, from the CF hosts, ping the IB hostnames of those members.
  6. If pinging the IB hostnames fails, verify that the port state is up by running the ibstat -v command. In the following example, the link is good because Logical Port State has a value of Active, Physical Port State has a value of Active, and Physical Port Physical State has a value of Link Up:
    $ ibstat -v
    ------------------------------------------------------------------------------
     IB NODE INFORMATION (iba0)
    ------------------------------------------------------------------------------
    Number of Ports:                        2
    Globally Unique ID (GUID):              00.02.55.00.02.38.59.00
    Maximum Number of Queue Pairs:          16367
    Maximum Outstanding Work Requests:      32768
    Maximum Scatter Gather per WQE:         252
    Maximum Number of Completion Queues:    16380
    Maximum Multicast Groups:               32
    Maximum Memory Regions:                 61382
    Maximum Memory Windows:                 61382
    Hw Version info:                        0x1000002
    Number of Reliable Datagram Domains:    0
    Total QPs in use:                       3
    Total CQs in use:                       4
    Total EQs in use:                       1
    Total Memory Regions in use:            7
    Total MultiCast Groups in use:          2
    Total QPs in MCast Groups in use:       2
    EQ Event Bus ID:                        0x90000300
    EQ Event ISN:                           0x1004
    NEQ Event Bus ID:                       0x90000300
    NEQ Event ISN:                          0x90101
    
    ------------------------------------------------------------------------------
     IB PORT 1 INFORMATION (iba0)
    ------------------------------------------------------------------------------
    Global ID Prefix:                       fe.80.00.00.00.00.00.00
    Local ID (LID):                         000e
    Local Mask Control (LMC):               0000
    Logical Port State:                     Active
    Physical Port State:                    Active
    Physical Port Physical State:           Link Up
    Physical Port Speed:                    2.5G
    Physical Port Width:                    4X
    Maximum Transmission Unit Capacity:     2048
    Current Number of Partition Keys:       1
    Partition Key List:
      P_Key[0]:                             ffff
    Current Number of GUID's:               1
    Globally Unique ID List:
      GUID[0]:                              00.02.55.00.02.38.59.00
    
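As a quick check of the three fields called out in step 6, you can run a small awk program over saved ibstat -v output. The following sketch uses a heredoc excerpt of the example output; on a real AIX host you could pipe ibstat -v directly into the same program.

```shell
# Check the three port-state fields from step 6 in saved "ibstat -v" output.
# The heredoc stands in for the real command output.
status=$(awk -F': *' '
    /Physical Port Physical State/ { link = $2; next }
    /Logical Port State/           { logical = $2 }
    /Physical Port State/          { physical = $2 }
    END {
        if (logical == "Active" && physical == "Active" && link == "Link Up")
            print "port looks healthy"
        else
            print "port problem: check adapter, cabling, and switch"
    }' <<'EOF'
Logical Port State:                     Active
Physical Port State:                    Active
Physical Port Physical State:           Link Up
EOF
)
echo "$status"
```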
  7. Check the Galaxy InfiniBand adapter card, InfiniBand switch, and cable connections for failures on the physical server.
  8. The AIX system error log might also show related messages. You can check the error log by running the errpt -a command.
  9. Ensure that the InfiniBand network interface, the host channel adapter, and the icm device are all in the Available state, as shown in the following example:
    $ lsdev -C | grep ib
    fcnet0      Defined   00-08-01 Fibre Channel Network Protocol Device
    fcnet1      Defined   00-09-01 Fibre Channel Network Protocol Device
    ib0         Available          IP over Infiniband Network Interface
    iba0        Available          InfiniBand host channel adapter
    icm         Available          Infiniband Communication Manager
    
    • If setup was performed correctly and the hardware is functioning correctly, all three devices are in the Available state.
    • If the network interface is not in the Available state, you can change the device state manually by using the following command:
      $ chdev -l ib0 -a state=up
      ib0 changed
      
    • If iba0 or icm is not in the Available state, check for errors on the device by running /usr/sbin/cfgmgr -vl iba0 or /usr/sbin/cfgmgr -vl icm as the root user.
    • If iba0 is not found or remains in the Defined state, confirm that the Host Channel Adapter is currently assigned to the host on the HMC.
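The device checks in step 9 can be automated in the same way. The following sketch scans saved lsdev output (a heredoc excerpt of the example above) and reports any of the three devices that is not in the Available state.

```shell
# Check from step 9 that ib0, iba0, and icm are all in the Available state.
# The heredoc stands in for `lsdev -C | grep ib` output on a real AIX host.
missing=$(awk '$1 == "ib0" || $1 == "iba0" || $1 == "icm" {
                   if ($2 != "Available") print $1
               }' <<'EOF'
fcnet0      Defined   00-08-01 Fibre Channel Network Protocol Device
fcnet1      Defined   00-09-01 Fibre Channel Network Protocol Device
ib0         Available          IP over Infiniband Network Interface
iba0        Available          InfiniBand host channel adapter
icm         Available          Infiniband Communication Manager
EOF
)
if [ -z "$missing" ]; then
    echo "all InfiniBand devices Available"
else
    # Follow step 9: chdev -l ib0 -a state=up, or cfgmgr -vl as root
    echo "not Available: $missing"
fi
```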
  10. Verify that the cf-server processes were running on the CF server hosts at the time of the error. If the CF hosts were not up, not initialized, or restarting at that time (that is, running db2instance -list at the time did not show the primary CF in PRIMARY state and the secondary in PEER state), check the cfdump.out*, cfdiag*.log, and core files for more details. However, if the CF servers were up and initialized at the time of the error, there might be a uDAPL communication problem.
  11. If the errors appeared after a db2start command or a CONNECT statement was issued, see CF server failure to determine whether a different failure caused these errors as a side effect.
  12. If this is not the case, a trace of the failing scenario is often useful for determining the cause of the error. If CF trace was enabled, dump it by running the following command, where fileName is a file name that you choose: db2trc cf dump fileName
  13. If CF trace was not already enabled, enable it by running the following command: db2trc cf on -m "*.CF.xport_udapl.*.*"
  14. IBM Service might additionally request an AIX system trace and AIX memory traces to facilitate problem determination.
  15. If CF trace on xport_udapl and any AIX system traces were recorded, collect this information. Run the db2support command to collect further diagnostic logs, run snap -Y as root on all hosts, and contact IBM Service for further help.
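The collection commands in steps 12 and 15 can be gathered into one script. The following sketch is illustrative only: the output directory and the database name (SAMPLE) are assumptions, each tool runs only if it is installed, and snap is attempted only on AIX, so the script degrades gracefully on hosts without Db2 or AIX tooling.

```shell
# Sketch of the step 12 and step 15 collection sequence (assumed paths).
OUTDIR=${OUTDIR:-/tmp/ib_diag_collect}
mkdir -p "$OUTDIR"

run_if_present() {
    # Run a command only when its binary exists on this host.
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipping (not installed): $*"
    fi
}

run_if_present db2trc cf dump "$OUTDIR/cf_trace.dmp"   # CF trace dump from step 12
run_if_present db2support "$OUTDIR" -d SAMPLE          # SAMPLE is a hypothetical database name
if [ "$(uname)" = "AIX" ]; then
    run_if_present snap -Y                             # AIX only; run as root
else
    echo "skipping (not AIX): snap -Y"
fi
```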