IBM Support

75 ways to demystify DB2 #81: Techtip : DB2 does not accept new connections if it is archiving a log file to vendor program and database is deactivating at the same time

Technical Blog Post


Abstract

75 ways to demystify DB2 #81: Techtip : DB2 does not accept new connections if it is archiving a log file to vendor program and database is deactivating at the same time

Body

When DB2 archiving a log file to vendor program (for example TSM), if TSM does not respond to DB2 (neither success nor failed), and if the database is deactivating during the "non-response" window, DB2 will get stuck in the deactivting process and hence cannot accept new connections to the database.

DB2 log manager does not get response from TSM, and because user's database is not explicitly activated, when the last connection leaves the database, the deactivate process begins. During the deactivate process, log reader will check if there are any logs waiting to be archived, because TSM did not respond to DB2's archive request, log reader will wait for the log to be archived and eventually get timed out. Because the database is in the process of deactivating, it cannot accept any new connections at this stage until the deactivating is finished. This is why user cannot connect to the database and the database appears to be "hung"

 

This issue has the following pattern:

1.    DB2 log manager starts a log archive request for TSM (this is vendor in our example) to archive one of its logs:

2014-09-01-00.12.11.806209-420 E11852947A377      LEVEL: Info
EDUID   : 7311                 EDUNAME: db2logmgr (XXXXX) 0
Started archive for log file S0133372.LOG.
 
2.    DB2 log manager does not get response from TSM, and because user's database is not explicitly activated, when the last connection leaves the database, the deactivate process begins. During the deactivate process, log reader will check if there are any logs waiting to be archived, because TSM did not respond to DB2's archive request, log reader will wait for the log to be archived and eventually get timed out. Because the database is in the process of deactivating, it cannot accept any new connections at this stage until the deactivating is finished. This is why user cannot connect to the database and the database appears to be "hung" :

2014-09-01-02.43.35.379463-420 I11867573A374      LEVEL: Info
EDUID   : 15760                EDUNAME: db2loggr (XXXXX) 0
MESSAGE : DB2 is waiting for log files to be archived.

2014-09-01-02.47.35.381139-420 I11869935A482      LEVEL: Info
EDUID   : 15760                EDUNAME: db2loggr (XXXXX) 0
DB2 was unable to confirm logs were archived.
Return code -2029059911, FirstArchNum 133372, FirstArchNum2 4294967295, HeadExtentID 133382

 
3.    DB2 sqlpterm() function is triggered after log reader is timed out, it tries to shut down the EDUs on the deactivating database, but it cannot shut down the log archive EDU because it is waiting for TSM's response for the log archive. Hence we see the following messages:

2014-09-01-02.52.39.468809-420 E11873692A466      LEVEL: Warning
EDUID   : 39250                EDUNAME: db2agent (XXXXX) 0
MESSAGE : Some daemon is taking too long to shut down.

2014-09-01-03.18.28.680576-420 E11904313A466      LEVEL: Warning
EDUID   : 39250                EDUNAME: db2agent (XXXXX) 0
MESSAGE : Some daemon is taking too long to shut down.

 
4.    Finally the archive fails after over 3 hours because of vendor error (in this case, TSM is the vendor):

2014-09-01-03.23.13.210702-420 E11934995A572      LEVEL: Error
EDUID   : 7311                 EDUNAME: db2logmgr (XXXXX) 0
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, probe:3160
MESSAGE : ZRC=0x86100024=-2045771740=SQLP_SHRLIB_VENDOR_DEV_ERR
          "A vendor device reported a shared library error."
DATA #1 : <preformatted>
Failed to archive log file S0133372.LOG to TSM chain 0 from /db2data/MBTS/db2inst1/NODE0000/SQL00003/SQLOGDIR/.
 
5.    Also, from procstack, we can clearly see that DB2 log manager is waiting for the vendor (the vendor is TSM):

---------- tid# 72089699 (pthread ID:   7311) ----------
0x090000000002db14  read(??, ??, ??) + 0x274
0x090000000828a2e4  sqloReadNamedPipe(0x90000000828a6dc, 0x400, 0x7000000113fd2a0, 0x0, 0x7000000113fd2b0, 0xffffffffffffffff) + 0x98
0x090000000828a3fc  @83@sqloReadVendorRC(SQLO_VENDOR_HANDLE* const,const int)(0x200000000000000, 0xffffffffffffffff) + 0x7c
0x090000000744dbe8  sqloInvokeVendorFunction(0x100000001, 0x500000005, 0xffffffffffffffff) + 0x130
0x09000000077377b0  sqluVendorCallWrapper(SQLO_VENDOR_HANDLE*,sqluFencedVendorCommBuf*,const unsigned int,sqluFencedVendorProcessState*,const int)(0x7700000c00427af, 0x7700000c004279a, 0x100000001, 0x440048020000002f, 0xffffffffffffffff) + 0x788
0x090000000aa74e6c  sqluFencedVendorCB::callWrapper(const unsigned int,const int)(0x7700000c004279a, 0x100000001, 0x90000000bcdc1cc) + 0x50
0x090000000aac38e0  sqlpInitVendorDevice(SQLPG_VENDOR_CB*,const char*,unsigned int,unsigned long)(??, ??, ??, ??) + 0x3ac
0x090000000aac96f0  sqlpgArchiveLogVendor(SQLP_DBCB*,SQLPG_VENDOR_CB*,unsigned long,char*,const char*,unsigned int,unsigned int*,unsigned int*,unsigned int*,unsigned long*,unsigned int*)(??, ??, ??, ??, ??, ??, ??, ??) + 0x358
0x090000000aac4ec4  sqlpgArchiveLogFile(SQLP_DBCB*,unsigned int,unsigned long,unsigned long,char*,unsigned long)(0x208f9, 0x208fc000208fc, 0x1, 0x2, 0x7000000113fe82f, 0x0) + 0xd08
0x090000000aac04e4  sqlpgArchivePendingLogs(SQLP_DBCB*,unsigned long,unsigned long,char*,unsigned int*,unsigned int*,unsigned long*,unsigned int*)(0x7700000c0000348, 0x208f9000208f9, 0x2, 0x0, 0x7000000113feb00, 0x18c6, 0x0, 0x0) + 0x198
0x090000000aabc938  sqlpgLogManager(char*,unsigned int)(??, ??) + 0xe88
0x0900000007f84f6c  sqloEDUEntry.fdpr.chunk.7(??) + 0x18
0x0900000000790d50  _pthread_body(??) + 0xf0

 

The root cause is with the vendor, TSM in this example. They would have to fix the no-response to DB2 archiving issue.

From DB2 side, there's some work arounds, although they cannot fix the issue, but can help reduce the Hang problem:

1. When this issue occurs, we can terminate the vendor program (in this case, TSM) by using the command "db2pd -db <db name> -fvp lam1 term ". This will force the vendor program to set  lam1(LOGARCHMETH1) to restart, and hence database can deactivate successfully.
2. Because this issue only happens when the database is deactivating, we can deliberately keep the database active by using command "db2 activate db <dbname>"
3. Change the log archive method to DISK instead of vendor program.

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm13286809