Identifying uDAPL over InfiniBand communication errors
Using diagnostic logs, operating system commands, and system traces, you can identify and resolve uDAPL communication errors.
After you enter the db2start command, during first connection activation of the database, or during a member restart, errors might occur, as shown in the following examples of messages in a db2diag log file:
2009-04-27-15.41.03.299437-240 I9450505A370 LEVEL: Severe
PID : 651462 TID : 258 KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 258 EDUNAME: db2sysc 0
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
ca_svr_connect: dat_evd_wait failed: 0xf0000
2009-04-27-15.41.03.363542-240 I9450876A367 LEVEL: Severe
PID : 651462 TID : 258 KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 258 EDUNAME: db2sysc 0
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
CAConnect: cmd_connect failed: 0x80090001
2009-04-27-15.41.03.421934-240 I9451244A1356 LEVEL: Severe
PID : 651462 TID : 258 KTID : 2674775 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 258 EDUNAME: db2sysc 0
FUNCTION: Db2, Shared Data Structure Abstraction Layer ,
SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:622
MESSAGE : CA RC= 2148073473
DATA #1 : String, 17 bytes
CAConnect failed.
DATA #2 : CAToken_t, PD_TYPE_SD_CATOKEN, 8 bytes
0x07000000003EE0B8 : 0000 0001 1064 AF90 .....d..
DATA #3 : CA Retry Position, PD_TYPE_SAL_CA_RETRY, 8 bytes
0
CALLSTCK:
[0] 0x0900000012FF274C sqleCaCeConnect__23SQLE_CA_CONN_ENTRY_DATAFCP7CATokenCl
+ 0x40C
[1] 0x0900000012FF2CF8 sqleSingleCaCreateAndAddNewConnectionsToPool__
21SQLE_SINGLE_CA_HANDLEFCUlT1Cb + 0x278
[2] 0x0900000012FF9188 sqleSingleCaInitialize__21SQLE_SINGLE_CA_HANDLEFRC27SQLE_
CA_CONN_POOL_NODE_INFOCUlP13SQLO_MEM_POOL + 0x448
[3] 0x0900000013001C50 sqleCaCpAddCa__17SQLE_CA_CONN_POOLFsCPUl + 0x350
[4] 0x00000001000118AC sqleInitSysCtlr__FPiT1 + 0x140C
[5] 0x0000000100013008 sqleSysCtlr__Fv + 0x4A8
[6] 0x0900000012E15C78 sqloSystemControllerMain__FCUiPFv_iPFi_vPPvCPi + 0xD58
[7] 0x0900000012E177AC sqloRunInstance + 0x20C
[8] 0x0000000100006ECC DB2main + 0xAEC
[9] 0x0900000012C99048 sqloEDUMainEntry__FPcUi + 0xA8
The db2diag log file might also show messages
similar to the following ones:
2009-04-27-15.41.04.595936-240 I9453087A387 LEVEL: Severe
PID : 1249362 TID : 258 KTID : 4395063 PROC : db2sysc 1
INSTANCE: db2inst1 NODE : 001
EDUID : 258 EDUNAME: db2sysc 1
FUNCTION: Db2, RAS/PD component, pdLogCaPrintf, probe:876
DATA #1 : <preformatted>
xport_send: dat_ep_post_rdma_write of the MCB failed: 0x70000
2009-04-27-15.42.04.329724-240 I9505628A1358 LEVEL: Severe
PID : 1249362 TID : 258 KTID : 4395063 PROC : db2sysc 1
INSTANCE: db2inst1 NODE : 001
EDUID : 258 EDUNAME: db2sysc 1
FUNCTION: Db2, Shared Data Structure Abstraction Layer ,
SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:622
MESSAGE : CA RC= 2148073485
DATA #1 : String, 17 bytes
CAConnect failed.
DATA #2 : CAToken_t, PD_TYPE_SD_CATOKEN, 8 bytes
0x07000000003EE0B8 : 0000 0001 1064 AFD0 .....d..
DATA #3 : CA Retry Position, PD_TYPE_SAL_CA_RETRY, 8 bytes
894
CALLSTCK:
[0] 0x0900000012FF274C sqleCaCeConnect__23SQLE_CA_CONN_ENTRY_DATAFCP7CATokenCl
+ 0x40C
[1] 0x0900000012FF2CF8 sqleSingleCaCreateAndAddNewConnectionsToPool__
21SQLE_SINGLE_CA_HANDLEFCUlT1Cb + 0x278
[2] 0x0900000012FF9188 sqleSingleCaInitialize__21SQLE_SINGLE_CA_HANDLEFRC27SQLE_
CA_CONN_POOL_NODE_INFOCUlP13SQLO_MEM_POOL + 0x448
[3] 0x0900000013001C50 sqleCaCpAddCa__17SQLE_CA_CONN_POOLFsCPUl + 0x350
[4] 0x00000001000118AC sqleInitSysCtlr__FPiT1 + 0x140C
[5] 0x0000000100013008 sqleSysCtlr__Fv + 0x4A8
[6] 0x0900000012E15C78 sqloSystemControllerMain__FCUiPFv_iPFi_vPPvCPi + 0xD58
[7] 0x0900000012E177AC sqloRunInstance + 0x20C
[8] 0x0000000100006ECC DB2main + 0xAEC
[9] 0x0900000012C99048 sqloEDUMainEntry__FPcUi + 0xA8
These messages indicate a communication error between a CF and
a member.
Follow these steps:
- Locate the pdLogCfPrintf messages and search for the message string CF RC=. For example, CF RC= 2148073491.
- Take the numeric value adjacent to this string; in this example it is 2148073491. This value represents the reason code from the network or communication layer.
- To find more details on this error, run the db2diag tool with the -cfrc parameter followed by the value, as in the sketch that follows this step. Example:
db2diag -cfrc 2148073491
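For example, a minimal sketch of this lookup, assuming the default diagnostic data path under the instance owner's sqllib/db2dump directory (your db2diag log file location can differ):
$ # List the reason codes that were logged for the failed connection attempts
$ grep "CF RC=" /home/db2inst1/sqllib/db2dump/db2diag.log
$ # Decode one of the reported values
$ db2diag -cfrc 2148073491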
- If the system was recently enabled with uDAPL and InfiniBand, check your uDAPL configuration. For details, see Configuring the network settings of hosts in a Db2® pureScale® environment on an InfiniBand network (AIX®).
- Ping the IB hostnames of the CFs from each member host that shows the previously listed errors, and ping the IB hostnames of those members from the CF hosts, as in the following sketch.
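For example, a sketch that uses the hypothetical IB hostnames cf1-ib0 and member0-ib0; substitute the hostnames from your own configuration:
$ # From the member host, ping the CF's IB hostname
$ ping -c 4 cf1-ib0
$ # From the CF host, ping the member's IB hostname
$ ping -c 4 member0-ib0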
- If pinging the IB hostnames fails, verify that the port state is up by running the ibstat -v command.
In the following example, the link should be good because Physical
Port Physical State has a value of Link Up, Logical Port State has
a value of Active, and Physical Port State has a value of Active:
$ ibstat -v
------------------------------------------------------------------------------
IB NODE INFORMATION (iba0)
------------------------------------------------------------------------------
Number of Ports:                        2
Globally Unique ID (GUID):              00.02.55.00.02.38.59.00
Maximum Number of Queue Pairs:          16367
Maximum Outstanding Work Requests:      32768
Maximum Scatter Gather per WQE:         252
Maximum Number of Completion Queues:    16380
Maximum Multicast Groups:               32
Maximum Memory Regions:                 61382
Maximum Memory Windows:                 61382
Hw Version info:                        0x1000002
Number of Reliable Datagram Domains:    0
Total QPs in use:                       3
Total CQs in use:                       4
Total EQs in use:                       1
Total Memory Regions in use:            7
Total MultiCast Groups in use:          2
Total QPs in MCast Groups in use:       2
EQ Event Bus ID:                        0x90000300
EQ Event ISN:                           0x1004
NEQ Event Bus ID:                       0x90000300
NEQ Event ISN:                          0x90101
------------------------------------------------------------------------------
IB PORT 1 INFORMATION (iba0)
------------------------------------------------------------------------------
Global ID Prefix:                       fe.80.00.00.00.00.00.00
Local ID (LID):                         000e
Local Mask Control (LMC):               0000
Logical Port State:                     Active
Physical Port State:                    Active
Physical Port Physical State:           Link Up
Physical Port Speed:                    2.5G
Physical Port Width:                    4X
Maximum Transmission Unit Capacity:     2048
Current Number of Partition Keys:       1
Partition Key List:
    P_Key[0]:                           ffff
Current Number of GUID's:               1
Globally Unique ID List:
    GUID[0]:                            00.02.55.00.02.38.59.00
- Check the Galaxy InfiniBand adapter card, InfiniBand switch, and cable connections for failures on the physical server.
- The AIX system error log might also show related messages. You can check the error log by running the errpt -a command, as in the following sketch.
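For example, a sketch that lists detailed error log entries and, as an assumption, filters them to the iba0 resource name; adjust the resource name to match your system:
$ # Show all detailed error log entries
$ errpt -a | more
$ # Show detailed entries for a specific resource only
$ errpt -a -N iba0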
- Ensure that the InfiniBand network interface, the host channel adapter, and the icm values are all Available, as shown in the following example:
$ lsdev -C | grep ib
fcnet0 Defined   00-08-01 Fibre Channel Network Protocol Device
fcnet1 Defined   00-09-01 Fibre Channel Network Protocol Device
ib0    Available          IP over Infiniband Network Interface
iba0   Available          InfiniBand host channel adapter
icm    Available          Infiniband Communication Manager
- If setup was performed correctly and the hardware is functioning correctly, all three values are Available.
- If the network interface is not Available, you can change the device state manually by using the following command:
$ chdev -l ib0 -a state=up
ib0 changed
- If iba0 or icm is not in the Available state, check for errors on the device by running /usr/sbin/cfgmgr -vl iba0 or /usr/sbin/cfgmgr -vl icm as the root user, as in the following sketch.
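A minimal sketch, run as the root user:
$ # Reconfigure each device and report any detected errors
$ /usr/sbin/cfgmgr -vl iba0
$ /usr/sbin/cfgmgr -vl icm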
- If iba0 is not found or remains in the Defined state, confirm that the Host Channel Adapter is currently assigned to the host on the HMC.
- Verify that the cf-server processes were running on the CF server hosts at the time of the error. If the CF servers were up and initialized, running db2instance -list at the time would have shown the primary CF in PRIMARY state and the secondary in PEER state; a sketch of this check follows this step. If the CF hosts were not up, not initialized, or were restarted at that time, check the cfdump.out*, cfdiag*.log, and core files for more details. However, if the CF servers were up and initialized at the time of the error, there might be a uDAPL communication problem.
- If a db2start command or a CONNECT statement was issued, see CF server failure to determine whether a different failure caused these errors to appear as a side effect.
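For example, a sketch of the state check; the exact output layout depends on your Db2 version:
$ # List all members and CFs with their current states
$ db2instance -list
$ # In the output, the primary CF should report the state PRIMARY
$ # and the secondary CF should report the state PEER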
- If this is not the case, a trace of the failing scenario is often useful for determining the cause of the error. If CF trace was enabled, dump it by running the following command:
db2trc cf dump fileName
where you define the value for the fileName parameter.
- To enable CF trace if it was not already enabled, run the following command:
db2trc cf on -m "*.CF.xport_udapl.*.*"
- IBM Service might additionally request an AIX system trace and AIX memory traces to facilitate problem determination.
- If CF trace on xport_udapl and any AIX system trace were recorded, collect this information. Run the db2support command to collect further diagnostic logs, run snap -Y as root on all hosts, and contact IBM Service for further help. An end-to-end sketch of the trace workflow follows this list.
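As a hedged end-to-end sketch of that trace workflow, assuming a hypothetical dump file name /tmp/cf_trace.dmp and that db2trc cf off disables CF trace on your version:
$ # Enable CF trace for the uDAPL transport layer
$ db2trc cf on -m "*.CF.xport_udapl.*.*"
$ # ... reproduce the failing db2start or connection attempt ...
$ # Dump the accumulated CF trace to a file
$ db2trc cf dump /tmp/cf_trace.dmp
$ # Disable CF trace again
$ db2trc cf off
$ # Collect diagnostic data for IBM Service
$ db2support
$ # As root on every host, collect AIX system data
$ snap -Y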