IBM Support

Why db2 was hung while waiting for log files to be archived to third party vendor ?

Technical Blog Post


Abstract

Why db2 was hung while waiting for log files to be archived to third party vendor ?

Body

Occasionally you may see DB2 hung and it may be because of a problem in the third party vendor used to archive db2 logs. We have tried to explain the problem, its identification and solution below. The third party vendor used in our example below is TSM, please note we have seen the below issue with Veritas Netbackup as well. 

You may find the following messages in db2diag.log while db2 is hung :

2018-04-19-18.20.19.542647+480 E1644436A430         LEVEL: Info
PID     : 10289678             TID : 31008          PROC : db2sysc 0
INSTANCE: db2inst1              NODE : 000           DB   : SAMPLE 
HOSTNAME: host1
EDUID   : 31008                EDUNAME: db2logmgr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, probe:3108
DATA #1 : <preformatted>
Started archive for log file S0180892.LOG.
The last message from db2logmgr process can be seen when it started archival of a log file. In this example, it is log file S0180892.LOG.

DB2 later logs following messages giving its current status through db2loggr process - DB2 is waiting for log files to be archived (please check the problem markers in BOLD) :
2018-04-19-20.08.08.970113+480 I1654942A420         LEVEL: Info
PID     : 10289678             TID : 31265          PROC : db2sysc 0
INSTANCE: db2inst1              NODE : 000           DB   : SAMPLE 
HOSTNAME: host1
EDUID   : 31265                EDUNAME: db2loggr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgWaitForLogArchives, probe:1780
MESSAGE : DB2 is waiting for log files to be archived.
2018-04-19-20.12.12.489858+480 I1658835A527         LEVEL: Info
PID     : 10289678             TID : 5467           PROC : db2sysc 0
INSTANCE: db2inst1              NODE : 000           DB   : SAMPLE
HOSTNAME: host1
EDUID   : 5467                 EDUNAME: db2loggr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgWaitForLogArchives, probe:1800
DATA #1 : <preformatted>
DB2 was unable to confirm logs were archived.
Return code -2029059911, FirstArchNum 108959, FirstArchNum2 4294967295, HeadExtentID 108972
2018-04-19-20.17.08.000204+480 E1659363A534         LEVEL: Warning
PID     : 10289678             TID : 101687         PROC : db2sysc 0
INSTANCE: db2inst1              NODE : 000           DB   : SAMPLE
APPHDL  : 0-18485              APPID: 10.xxx.xxx.xx.xxxxxx
AUTHID  : USER1                  HOSTNAME: host1
EDUID   : 101687               EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpWaitForEDUsToComplete, probe:2330
MESSAGE : Some daemon(s) is(are) taking too long to shut down.

….and finally DB2 stays hung indefinitely in this state and logs above message in db2diag.log often. 

Let us look at exactly what happened here :

1> The db2diag.log file contains last entry of log archival at  starting at 18.20 hrs. Between that time and 20.17 hrs,there were entries from db2loggr processes ("DB2 is waiting for log files to be archived.") as noted before, but there was not a single entry from any db2logmgr processes.  

2> The db2logmgr processes should be running (since databases were activated and db2loggr processes were running), and should be writing entries to db2diag.log from time to time ("Successfully archived log file ..." message, or various error messages when hitting problems).  In particular, if db2logmgr was archiving a log to TSM and that returned an error rc, we should see a sqlpgArchiveLogFile probe 3160 "Failed to archive log file ..." message in db2diag.log.

3> It is the same process (db2logmgr) that is responsible for archiving logs to TSM (third party vendor). The db2logmgr calls the Vendor code and then the db2loggr waits for the db2logmgr to post that the archive attempt succeeded/failed. However, in above example, the Vendor (TSM) mechanism did not return back to the db2logmgr with the expected success/failure code. 

4> Since the Vendor code has control, there is nothing that DB2 can do until the Vendor code returns control back to db2, in this example with an archive attempt failed due to problem within TSM library.

If it had returned (FAILED) with some error code  for a number of times (database configuration parameter NUMARCHRETRY, default is 5) then log file/s are archived to the FAILARCHPATH (if configured).   

5> But in the example "archive to TSM” was HUNG (not failed) because of the problem with TSM library, then db2logmgr would be hanging and won’t return to db2 at all for db2 to take any action on it. This hang condition in the vendor code must somehow be broken by TSM/vendor to allow DB2 to once again operate properly. 

6> Once TSM vendor code returns error with some RC code and "failure to archive” message back to db2, db2 will handle it and take next action i.e. retry archiving, send to FAILARCHPATH etc based on db config settings. 

Note, we have seen many such problems from various vendors used for log archival by customers so starting in Db2 Version 11.1.3.3, the archival of log files (LOGARCHMETH1/2) using VENDOR or TSM methods can now be configured with a timeout on Unix environments. If the transmission of log data between Db2 and the TSM or vendor API is unresponsive for the specified timeout period, then Db2 will interrupt the archive log attempt and follow normal log archive failure protocol. For details see "--VENDOR_ARCHIVE_TIMEOUT" in the 

Configuration parameters for database logging
 

Similar db2 hang in log archival is seen along with new connections being refused at the time. Please refer below links for more details : 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11140190