IBM Support

Why a Db2 LUW instance went down suddenly without any diagnostic dumps

Technical Blog Post



Body

Why did a Db2 LUW instance go down suddenly without any associated diagnostic dumps?

This has been explained in several documents, but the question is still being asked.

So I thought I would write a few lines about it.

First of all, we need to understand how Db2 dumps diagnostics when it experiences a crash or trap.

Db2 has its own signal handler for UNIX signals. The handler routine contains the logic that decides what to dump, based on which signal the Db2 instance process received.

A very common signal, raised when a crash is caused by a memory boundary violation, is SIGSEGV (signal 11).

Db2 can handle that kind of signal and dump the needed diagnostics.

But some UNIX signals, such as SIGKILL (signal 9), cannot be handled, by UNIX design.

That means that if the db2sysc process receives signal 9, it has no opportunity to dump any diagnostics, because the signal cannot be handled.

As a result, Db2 can go down silently, leaving few details or no dump at all.
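
The difference between the two signals can be seen from any process, not only Db2. The following is a minimal Python sketch, assuming a POSIX system and that it runs in the main thread; the function name can_install_handler is made up for this illustration and has nothing to do with Db2's own handler code. It simply asks the OS to register a handler for a signal and reports whether the request was accepted.

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to register a handler for signum."""
    def handler(received_signum, frame):
        # A real handler would write its diagnostics here.
        pass

    try:
        previous = signal.signal(signum, handler)
    except OSError:
        # The OS rejected the registration: this signal cannot be caught.
        return False
    # Put the earlier disposition back so the check has no side effects.
    if previous is not None:
        signal.signal(signum, previous)
    return True


print("SIGSEGV handler allowed:", can_install_handler(signal.SIGSEGV))
print("SIGKILL handler allowed:", can_install_handler(signal.SIGKILL))
```

On a POSIX system the registration for SIGSEGV is accepted and the one for SIGKILL is refused, which is why no process, Db2 included, gets a chance to run dump logic on signal 9.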

It is very easy to test this out.

Just create a Db2 instance and, while it is running, issue a kill -9 against the db2sysc process.

Messages like the following show up in the db2diag.log as the first set of messages when this issue is hit:

2018-12-11-12.18.39.959403-300 E16012A577           LEVEL: Severe
PID     : 21364950             TID : 258            PROC : db2wdog
INSTANCE: db2inst1             NODE : 000
HOSTNAME: myhost1
EDUID   : 258                  EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
MESSAGE : ADM0503C  An unexpected internal processing error has occurred. All
          DB2 processes associated with this instance have been shutdown.
          Diagnostic information has been recorded. Contact IBM Support for
          further assistance.

2018-12-11-12.18.40.004468-300 E16590A455           LEVEL: Error
PID     : 21364950             TID : 258            PROC : db2wdog
INSTANCE: db2inst1        NODE : 000
HOSTNAME: myhost1
EDUID   : 258                  EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:9064
DATA #1 : Process ID, 4 bytes
2818312
DATA #2 : Hexdump, 8 bytes
0x07000000003FDC58 : 0000 0102 0000 0009                        ........

This says that the db2wdog process found that process 2818312 (which is the db2sysc process; check earlier in the db2diag.log) was killed with signal 9 (the 0009 pattern).

Depending on the Db2 level, it can show up slightly differently too.

Example: 0201 0000 0900 0000
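
The two patterns look different mainly because of byte order. As a rough sketch, assuming (as the two samples suggest, not as documented fact) that the 8 bytes are two 32-bit integers and that the second one carries the signal number, they can be decoded like this. The function name decode_wdog_hexdump is made up for this illustration, and the meaning of the first integer is not covered here.

```python
import struct


def decode_wdog_hexdump(hex_text, byteorder):
    """Split an 8-byte hexdump string into two unsigned 32-bit integers.

    byteorder is "big" or "little". Treating the second integer as the
    signal number is an assumption based on the samples in this post.
    """
    raw = bytes.fromhex("".join(hex_text.split()))
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    fmt = ">II" if byteorder == "big" else "<II"
    return struct.unpack(fmt, raw)


# First sample, read as big-endian.
print(decode_wdog_hexdump("0000 0102 0000 0009", "big"))
# Second sample, read as little-endian.
print(decode_wdog_hexdump("0201 0000 0900 0000", "little"))
```

Read this way, both samples decode to the same pair of values, and the second value is 9 in both.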

The only time Db2 issues signal 9 to itself is when it needs to clean up the instance after a crash or trap, or sometimes even after a graceful shutdown.

To find out whether the signal 9 was issued by Db2 itself to clean up the leftover instance, check whether the signal 9 message is logged after the initial crash. The cleanup happens as an aftereffect of a crash or trap.

Example messages:

2018-12-11-12.18.40.006951-300 I18450A984           LEVEL: Event
PID     : 21364950             TID : 258            PROC : db2wdog
INSTANCE: db2inst1        NODE : 000
HOSTNAME: myhost1
EDUID   : 258                  EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
59375676

....

2018-12-11-12.18.40.046480-300 I19435A390           LEVEL: Event
PID     : 21364950             TID : 258            PROC : db2wdog
INSTANCE: db2inst1        NODE : 000
HOSTNAME: myhost1
EDUID   : 258                  EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqloCleanUpPosixIPCResources, probe:100
MESSAGE : Clean up POSIX resources attempt from engine.

Any other use of signal 9 originates from outside Db2.

Common sources of signal 9:
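
The ordering check described above can be sketched in a few lines of Python. This is only an illustration under several assumptions: records in db2diag.log are separated by blank lines, the first LEVEL: Severe record marks the initial failure, all timestamps share one UTC offset, and the function names are invented here. The sample text is abbreviated from the records shown in this post.

```python
import re

# Header line: timestamp, UTC offset, record id, then the level.
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})\S*\s+\S+\s+LEVEL:\s*(\w+)"
)


def split_records(log_text):
    """Return (timestamp, level, text) for each record with a valid header."""
    records = []
    for chunk in re.split(r"\n\s*\n", log_text.strip()):
        match = HEADER.match(chunk)
        if match:
            records.append((match.group(1), match.group(2), chunk))
    return records


def sigkill_is_cleanup(log_text):
    """True if the first 'Sending SIGKILL' record follows the first Severe record.

    Returns None when either kind of record is missing.
    """
    records = split_records(log_text)
    first_kill = next((ts for ts, _, text in records if "Sending SIGKILL" in text), None)
    first_failure = next((ts for ts, level, _ in records if level == "Severe"), None)
    if first_kill is None or first_failure is None:
        return None
    # Fixed-width timestamps compare correctly as plain strings.
    return first_failure < first_kill


SAMPLE = """
2018-12-11-12.18.39.959403-300 E16012A577           LEVEL: Severe
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
MESSAGE : ADM0503C  An unexpected internal processing error has occurred.

2018-12-11-12.18.40.006951-300 I18450A984           LEVEL: Event
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
"""

print(sigkill_is_cleanup(SAMPLE))
```

For the sample, the SIGKILL record comes after the Severe record, so the sketch treats it as Db2's own cleanup rather than the original cause.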

-    The OS hit an out of memory (OOM) condition and, to protect itself, issued kill -9 to the Db2 process.

-    The cluster manager decided to bring down the Db2 instance based on some criteria.

-    Somebody manually issued kill -9 against the db2sysc process.

-    Somebody accidentally ran db2_kill against the wrong instance.

Another very common question is: why can't Db2 log any details about who issued the signal 9? What is the source of the signal 9?

The answer is that Db2 has no control over, or awareness of, who issues a signal 9 from outside. That is not expected behavior for the Db2 engine, so it cannot keep track of the origin of an external signal 9.
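
A small experiment shows how little a parent process learns in this situation. The sketch below assumes a POSIX system; it is plain Python, not Db2 code, and the function name is invented for the illustration. It starts a child process, kills it with SIGKILL, and looks at what the parent gets back when it waits for the child.

```python
import signal
import subprocess
import sys


def signal_that_ended_child():
    """Start a sleeping child, kill it with SIGKILL, return the reported signal."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    returncode = child.wait()
    # On POSIX, a negative return code -N means the child died from signal N.
    return -returncode if returncode < 0 else None


print("child was ended by signal", signal_that_ended_child())
```

The parent learns which signal ended the child, much as db2wdog reports signal 9 for db2sysc, but the exit status carries nothing about the process or user that sent it.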

Then how can it be tracked?

That has to be done at the OS level, and it might not always be easy.

If any OS-level auditing is available, it should be used.

Otherwise, there are OS-specific options.

For example, on AIX there is probevue, which can be used as explained in this Technote:

/support/pages/node/239789

