
DB2 pureScale: Severe messages logged in db2diag.log when the CF cannot access a page in time through the Write and Register command

Technical Blog Post


Abstract

DB2 pureScale: Severe messages logged in db2diag.log when the CF cannot access a page in time through the Write and Register command

Body

A couple of DB2 pureScale customers have recently observed SEVERE messages logged in db2diag.log in a pureScale environment, with no operational problems in the database associated with them.
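
To check whether your own db2diag.log contains such records, the db2diag tool can filter by severity and by the function name seen in the entries. A minimal sketch (adjust the time window and filter values to your environment):

 $ db2diag -level Severe -H 1d
 $ db2diag -g "funcname:=SAL_WriteAndRegister"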


The severe messages logged in db2diag.log are as follows:


 2017-07-10-11.40.16.000086+540 I202891A5897         LEVEL: Severe
 PID     : 23068794             TID : 433823         PROC : db2sysc 1
 INSTANCE: db2inst1               NODE : 001           DB   : inst1
 APPHDL  : 1-6610               APPID: *N1.db2inst1.170710024007
 AUTHID  : DB2inst1               HOSTNAME: MEMBER1
 EDUID   : 433823               EDUNAME: db2agent (inst1) 1
 FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF,
SAL_GBP_HANDLE::SAL_WriteAndRegister, probe:24519
 MESSAGE : CA RC= 2148073503
 DATA #1 : SAL CF Index, PD_TYPE_SAL_CF_INDEX, 8 bytes
 1
 DATA #2 : SAL CF Node Number, PD_TYPE_SAL_CF_NODE_NUM, 2 bytes
 128
 DATA #3 : Page Name, PD_TYPE_SD_PAGE_NAME, 16 bytes
 {pgNum:poolId:objId:oType} = {81798:3:60:1} (decimal)
 DATA #4 : Page header, PD_TYPE_SQLB_PAGE_HEAD, 48 bytes
 pageHead: {tbspSeed:3;obj:60;type:1} bphPPNum:81798 bphOPNum:81798
   begoff:                      48
   datlen:                   16336
   revnum:                    1280
   pagelsn:     0000000759A577F8
   flag:                        2a
   future32:                    0
   cbits1to31:               2852e
   cbits32to63:                  0
 DATA #5 : Page Address, PD_TYPE_SD_PAGE_ADDR, 8 bytes
 0x0a000303d5b14000
 DATA #6 : Page Size, PD_TYPE_SD_PAGE_SIZE, 8 bytes
 16384
 DATA #7 : Page Size, PD_TYPE_SD_PAGE_SIZE, 8 bytes
 0
 DATA #8 : XI Index, PD_TYPE_SD_XI_INDEX, 4 bytes
 2060517
 DATA #9 : Castout Class, PD_TYPE_SD_CASTOUT_CLASS, 2 bytes
 921
 DATA #10: CF Process ID, PD_TYPE_SD_CASTOUT_PROCESS_ID, 8 bytes
 0
 DATA #11: SAL CF encoded state, PD_TYPE_SAL_ENCODED_STATE, 8 bytes
 0x0002000200020004
 DATA #12: Castout Count, PD_TYPE_SD_CASTOUT_COUNT, 4 bytes
 0
 DATA #13: Directory Change Count, PD_TYPE_SAL_DIRCHANGE_COUNT, 4 bytes
 0
 DATA #14: Max LSN, PD_TYPE_SD_MAXLSN, 8 bytes
 31568787485
 DATA #15: LsnNat, PD_TYPE_SD_LSNNAT, 8 bytes
 31568787447
 DATA #16: LsnNat, PD_TYPE_SD_LSNNAT, 8 bytes
 31568787448
 DATA #17: Codepath, 8 bytes
 9:12:15:19:28:30:31:32:33
 DATA #18: SAL_GBP_HANDLE, PD_TYPE_SAL_GBP_HANDLE, 9200 bytes
 m_caStructName: db2.db2inst1.inst1.gbp
 m_caKey:
    m_caKey: Name=db2.db2inst1.inst1.gbp_caKey sample=0x2000200020004
    m_primaryWaiters (sample): 0
    m_waitForPrimaryDisabled: 0
    History:
       Last 2 out of 2 changes
           - 0x2000200020004 at 2017-07-10-00.26.01
           - 0x1000100000000 at 2017-07-10-00.26.01
 m_connectKey:
    m_caKey: Name=db2.db2inst1.inst1.gbp_connectKey sample=0x2000200020004
    History:
       Last 2 out of 2 changes
           - 0x2000200020004 at 2017-07-10-00.26.01
           - 0x1000100000000 at 2017-07-10-00.26.01
 m_disconnectKey:
    m_caKey: Name=db2.db2inst1.inst1.gbp_disconnectKey
sample=0x2000200020004
    History:
       Last 3 out of 3 changes
           - 0x2000200020004 at 2017-07-10-00.26.01
           - 0x2000100020000 at 2017-07-10-00.26.01
           - 0x1000100000000 at 2017-07-10-00.26.01
 m_reconstructKey:
    m_caKey: Name=db2.db2inst1.inst1.gbp_reconstructKey
sample=0x2000200020004
    History:
       Last 2 out of 2 changes
           - 0x2000200020004 at 2017-07-10-00.26.01
           - 0x1000100000000 at 2017-07-10-00.26.01
 m_catchupPercentComplete: 100 %
 m_catchupTimePassed(secs):  0
 m_pCatchupLotch: 0x0a000301cb7eba80: 00080001000000000000000076
SQLP_VALLOCK (SALCATCHUP: caStructType=1) (held: F)
 m_pWriteProtectLotch: 0x0a000301cb7ebc40: 000B0001000000000000000076
SQLP_VALLOCK (SALWRITEPRITECT: caStructType=1) (held: F)
 m_pSimplexLotch: 0x0a000301cb7ebe00: 00090001000000000000000076
SQLP_VALLOCK (SALSIMPLEX: caStructType=1) (held: F)
 m_pauseDurationSoAPIretryIsntSuperFastMilliSec: 500
 m_catchupRetryStartTimestamp: 2017-07-10-00.26.01


These severe error messages report "CA RC= 2148073503". Checking what that return code means:


 $ db2diag -cfrc 2148073503
 Input ZRC string '2148073503' parsed as 0x8009001F (-2146893793).

 Attempting to lookup value 0x8009001F (-2146893793) as a ZRC

 ZRC value to map: 0x97C9001F (-1748434913)

 ZRC class :
         Cluster caching facility transport common errors and warnings
 (Class Index: 23)
 Component:
         CF_XPORT_COMMON ; cluster caching facility xport common
 (Component Index: 201)
 Reason Code:
         31 (0x001F)

 Identifier:
         CF_XPORT_RECV_MRB_FAILURE
 Identifier (without component):
         SQLZ_RC_CF_XPORT_RECV_MRB_FAILURE

 Description:
         Receive MRB failure

 Associated information:
         Sqlcode -2051
 SQL2051N  There was a communication failure between a DB2 member and a
 CF. Reason code = "3". 
CF identifier: "".  Host name: "".

         Number of sqlca tokens : 3
         Diaglog message number: 1
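
As a side note, the decimal "CA RC" value written to db2diag.log is simply the unsigned decimal form of the hexadecimal ZRC shown above, and the low-order bytes carry the reason code. A quick sketch of the conversion with plain shell arithmetic (nothing DB2-specific):

 $ printf '0x%08X\n' 2148073503
 0x8009001F
 $ printf '%d\n' 0x001F
 31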


Checking SQL2051N RC=3:
 

https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.messages.sql.doc/doc/msql02051n.html
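
You can also look the message up locally from the command line instead of the Knowledge Center, using the standard db2 message help syntax:

 $ db2 "? SQL2051N"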

So we know that there was a communication failure between a DB2 member and a CF.

This message is returned when the database manager detects problems that interfere with communications between a DB2 member and a cluster caching facility (CF).


Now, the fact that the customer did not see any impact of this on normal database operations makes it difficult to analyze.

 

Usually, we have seen such errors when some type of timeout happens during member-to-CF communication.

CF_XPORT_RECV_MRB_FAILURE is seen when operations against the CF are delayed and worker threads fail to receive the MRB (Message Response Block) within the allowed retries.

It could also be caused by something like a uDAPL communication delay between a DB2 member and a CF, or by busy CF worker threads on the CF server (an insufficient number of CF worker threads for the requested workload).
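
If you suspect busy CF worker threads, you can at least confirm how many are configured. A sketch, assuming a pureScale instance where the CF server parameters appear in the database manager configuration (CF_NUM_WORKERS is normally set to AUTOMATIC):

 $ db2 get dbm cfg | grep -i CF_NUM_WORKERS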


Check the cfdiag logs to see if they provide more details about this problem.
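
The cfdiag log files live under the CF diagnostic data directory. A sketch for locating them; the actual path comes from your CF_DIAGPATH setting, and the file-name pattern may vary by release:

 $ db2 get dbm cfg | grep -i CF_DIAGPATH
 $ ls -ltr <cf_diagpath>/cfdiag*.log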


In this example case, we saw DMA_RC -2146762722 in cfdiag at the same time as the error message in db2diag.log.


 2017-07-10-11.40.02.0208306000+540 E123456789A256   LEVEL    : Error
 PID      : 4653300  TID    :      14393
 NODE     : 000
 HOSTNAME : CF1
 FUNCTION : CA trace, log_error
 MESSAGE  : CA server has encountered an error.
 DATA #1  :
 DMA_RC: (-2146762722)

 

 2017-07-10-11.40.16.0008748000+540 E123456789A372   LEVEL    : Event
 PID      : 4653300  TID    :      11309
 NODE     : 000
 HOSTNAME : CF1
 FUNCTION : CA svr_clstrmgr, conn_control
 MESSAGE  : Changed Connection Status : NOT ALLOWED
 DATA #1  :
 Node ID        : 2
 Instance ID    : 0
 EP             : 0X44DD7B30
 HCA ID         : 3
 Worker ID      : 3
 Client ID      : 4

 

 2017-07-10-11.40.16.0024466000+540 E123456789A320   LEVEL    : Event
 PID      : 4653300  TID    :      16192
 HOSTNAME : CF1
 FUNCTION : CA svr_clstrmgr, allowlist::reject_conn
 MESSAGE  : allowlist Table: Found node in list.
 DATA #1  :
 Sequence Number: 1256
 Node Type      : 1
 Node ID        : 2
 Instance ID    : 0


 2017-07-10-11.40.16.0027289000+540 E123456789A374   LEVEL    : Event
 PID      : 4653300  TID    :      16192
 HOSTNAME : CF1
 FUNCTION : CA svr_clstrmgr, UWorkMgr::assign_client_to_worker
 MESSAGE  : Changed Connection Status : ALLOWED
 DATA #1  :
 Node ID        : 2
 Instance ID    : 0
 EP             : 0X4470BFB0
 HCA ID         : 3
 Worker ID      : 5
 Client ID      : 5
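
To correlate the two sides, pull the CF-side error and connection-control records around the member-side timestamp. A rough sketch with plain grep (the path placeholder is illustrative; -B 6 shows the record header, including its timestamp, above each match):

 $ grep -B 6 -E "DMA_RC|Changed Connection Status" <cf_diagpath>/cfdiag*.log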


This translates to an RDMA error: SQL2051N, a communication failure between a DB2 member and a CF, with reason code = "1".

RC=1 means: The error occurred or was detected in the User-Level Direct Access Programming Library (uDAPL).

As can be seen in the cfdiag output above, connections were NOT ALLOWED for a few milliseconds and then the status changed back to ALLOWED.
So it looks like there is some intermittent issue with the RDMA switch (uDAPL communication).


Customers until now have not noticed any impact of this on DB2 operations because the problem exists for a very short time (milliseconds), but if it persists longer you may notice delays, connection drops, slow SQL processing, and so on in DB2. DB2 here is just the victim of the underlying RDMA timeout errors.

 

These problems should be reported to AIX support to investigate the root cause of uDAPL DMA Timeout errors.

