Why a Db2 LUW instance went down suddenly without any diagnostic dumps
Biswarup (Bis) Mukherjee
This is explained in several documents, but the question still comes up regularly.
So, I thought I would write a few lines about it.
First of all, we need to understand how Db2 dumps diagnostics when it experiences a crash or trap.
Db2 has its own signal handler for UNIX signals. The handler routine contains the logic that decides what to dump, based on which signal the Db2 instance process received.
A very common signal, raised when a crash is caused by a memory boundary violation, is SIGSEGV (signal 11).
Db2 can handle that kind of signal and dump the needed diagnostics.
But there are some UNIX signals, such as SIGKILL (signal 9), that cannot be handled, by UNIX design.
That means that if the db2sysc process receives a signal 9, it has no opportunity to dump any diagnostics, because the signal cannot be handled.
As a result, Db2 can go down silently, leaving few details or no dump at all.
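The difference between a signal that can be handled and SIGKILL can be seen from any process, not just Db2. Below is a minimal sketch in Python, assuming a POSIX system and that it is run from the main thread. The helper name can_install_handler is made up for illustration and has nothing to do with Db2's real signal handler.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum."""
    try:
        # Try to register a do-nothing handler for the signal.
        previous = signal.signal(signum, lambda signo, frame: None)
    except OSError:
        # The OS refused the request: this signal cannot be caught.
        return False
    # Put the earlier handler back so the check leaves no trace.
    # (previous is None when the earlier handler was not installed from Python.)
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGKILL):
        print(sig.name, "can be handled:", can_install_handler(sig))
```

A handler for SIGTERM is accepted, while the request for SIGKILL is rejected by the OS. A process that cannot register a handler for signal 9 never gets a chance to run any dump logic when that signal arrives.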
It is very easy to test this out.
Just create a Db2 instance and, while it is running, issue a kill -9 against the db2sysc process.
Messages like the following will show up in db2diag.log as the first set of messages when this issue is hit:
This says that the db2wdog process found that process 2818312 (which is the db2sysc process; check earlier in db2diag.log)
was killed with signal 9 (the 0009 pattern).
Depending on the Db2 level, it can show up a bit differently too.
For example: 0201 0000 0900 0000
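The reason a watchdog process can report this at all is that, on UNIX, a parent learns from the OS how its child ended, even when the child itself could not log anything. Here is a small sketch of that idea in Python, assuming a POSIX system. The sleeping child is only a hypothetical stand-in for db2sysc, and the function names are made up; this is not how db2wdog is actually implemented.

```python
import signal
import subprocess
import sys


def run_and_kill_child():
    """Start a child process, send it SIGKILL, and return its wait status."""
    # Hypothetical stand-in for db2sysc: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    # On POSIX, a negative return code -N means "terminated by signal N".
    return child.wait(timeout=10)


def describe_exit(returncode):
    """Turn a subprocess return code into a readable message."""
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"child terminated by signal {sig.value} ({sig.name})"
    return f"child exited with status {returncode}"


if __name__ == "__main__":
    print(describe_exit(run_and_kill_child()))
```

The child writes nothing about its own death; only the parent knows that signal 9 was the cause. The parent is told which signal ended the child, but not who sent it.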
The only time Db2 issues signal 9 against itself is when it needs to clean up the instance after a crash or trap, or sometimes even after a graceful shutdown.
To find out whether the signal 9 was issued by Db2 itself to clean up the leftover instance, check whether the signal 9 message is logged after the initial crash. Such cleanup happens as an aftereffect of a crash or trap.
Example messages would be:
Any other use of signal 9 originates from outside Db2.
Common sources of signal 9:
- The UNIX system hit an out-of-memory (OOM) condition and, to protect the OS, issued kill -9 to a Db2 process.
- A cluster manager decided to bring down the Db2 instance based on some criteria.
- Somebody manually issued kill -9 against the db2sysc process.
- Somebody accidentally issued db2_kill against the wrong instance.
Another very common question is: why can't Db2 log any details about who issued the signal 9, that is, what the source of the signal 9 was?
The answer is that Db2 has no control over, or awareness of, who issues a signal 9 from outside. It is not behavior the Db2 engine expects, so it cannot keep track of the origin of an external signal 9.
Then how can that be tracked?
That has to be done at the OS level, and it might not always be easy.
If any OS-level auditing is available, it should be used.
Otherwise, depending on the OS, there are some other ways.
For example, on AIX there is probevue, which can be used as explained in this Technote.
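For the OOM case on Linux, the kernel log is one place to look. The sketch below is only an illustration and rests on assumptions: that the kernel log text (for example from dmesg or the system log) has already been collected into a string, and that the OOM killer lines follow the commonly seen "Out of memory: Killed process <pid> (<name>)" wording, or the older "Kill process" variant. The exact wording depends on the kernel version, and the function name find_oom_kills is made up.

```python
import re

# Assumed wording of the Linux OOM killer message; it varies by kernel version.
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text, process_name="db2sysc"):
    """Return the PIDs that the OOM killer reported killing for process_name."""
    pids = []
    for line in log_text.splitlines():
        match = OOM_LINE.search(line)
        if match and match.group(2) == process_name:
            pids.append(int(match.group(1)))
    return pids
```

If the PID returned here matches the db2sysc PID that db2wdog reported as killed, and the timestamps line up, the OOM killer is a likely source of the signal 9. An empty result does not rule out the other sources listed earlier.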