IBM Support

DB2 instance shutdown due to SIGKILL generated by the Out of memory (OOM) killer on Linux platform.

Troubleshooting


Problem

In the db2diag.log entries, the data "0900 0000" (signal 9, which is SIGKILL) indicates that a DB2 process (PID 1122) was killed with signal 9. The instance crashes if SIGKILL is sent to any DB2 engine process. DB2 does not send signal 9 to its own engine process, and because SIGKILL cannot be handled by a signal handler routine, no trap or core file can be generated.

The signal must therefore have been sent manually by a user, programmatically by a user application, or by the operating system. In the first case, the kill command might be recorded in the user's shell history file. Because the cause is external to DB2, DB2 cannot record which application, user, or operating system event brought the instance down. The watchdog process (db2wdog) handles the cleanup of the main engine process and all FMPs after an abnormal termination such as this signal 9. DB2 is the victim.

Symptom

DB2 Instance Shutdown.

The key entries in db2diag.log look like this:

-------------------------------------------------------------------------------------------------------------------------------------------------------
2010-10-09-01.06.16.347313+660 E13087E552 LEVEL: Severe
PID : 1120 TID : 46912711420224 PROC : db2wdog 0
INSTANCE: db2inst1 NODE : 000
EDUID : 2 EDUNAME: db2wdog 0
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
MESSAGE : ADM0503C An unexpected internal processing error has occurred.
ALL DB2 PROCESSES ASSOCIATED WITH THIS INSTANCE HAVE BEEN SHUTDOWN.

Diagnostic information has been recorded. Contact IBM Support for further assistance.

2010-10-09-01.06.17.119332+660 E13640E422 LEVEL: Error
PID : 1120 TID : 46912711420224 PROC : db2wdog 0
INSTANCE: db2inst1 NODE : 000
EDUID : 2 EDUNAME: db2wdog 0
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:21
DATA #1 : Process ID, 4 bytes
1122
DATA #2 : Hexdump, 8 bytes
0x00002AAAB77FC378 : 0201 0000 0900 0000
-------------------------------------------------------------------------------------------------------------------------------------------------------
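The DATA #2 hexdump can be read by hand, but a short sketch makes the decoding explicit. The Python below assumes that the last four bytes of the dump hold the signal number as a little-endian 32-bit integer, which is consistent with reading "0900 0000" as signal 9. The function name signal_from_hexdump is made up for illustration, and the meaning of the first four bytes is not interpreted here.

```python
SIGKILL = 9  # signal number of SIGKILL on Linux


def signal_from_hexdump(dump: str) -> int:
    """Return the signal number encoded in a db2diag.log hexdump line.

    Assumption: the last 4 bytes are a little-endian 32-bit signal number.
    """
    # Drop the "0x... :" address prefix if present, then remove the spaces.
    hex_part = dump.split(":", 1)[-1]
    raw = bytes.fromhex(hex_part.replace(" ", ""))
    if len(raw) < 4:
        raise ValueError("expected at least 4 bytes of data")
    return int.from_bytes(raw[-4:], byteorder="little")


if __name__ == "__main__":
    sig = signal_from_hexdump("0x00002AAAB77FC378 : 0201 0000 0900 0000")
    print(sig, "SIGKILL" if sig == SIGKILL else "other signal")
```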

Cause


When a Linux system runs out of memory, the kernel invokes the OOM (Out-Of-Memory) killer, which terminates processes to free enough memory to keep the system operational. In this scenario, the OOM killer killed process 1126 (db2sysc). This happens when all available memory, including disk swap space, has been allocated. Current memory and swap usage can be checked with the 'free' command.
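As a complement to 'free', the following sketch reads the same figures from /proc/meminfo. The helper names and the "swap exhausted" check (swap is configured but SwapFree is zero) are illustrative assumptions, not an IBM-provided diagnostic.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}.

    Most values are in kB; a few fields (for example HugePages_Total)
    are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_exhausted(info: dict) -> bool:
    """True when swap space is configured and none of it is free."""
    return info.get("SwapTotal", 0) > 0 and info.get("SwapFree", 0) == 0


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    for key in ("MemTotal", "MemFree", "SwapTotal", "SwapFree"):
        print(f"{key}: {info.get(key, 0)} kB")
    print("swap exhausted:", swap_exhausted(info))
```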
 

Environment

  • This issue only occurs for DB2 running on supported Linux platforms.

Diagnosing The Problem


Traces of the OOM killer can be found in the operating system log /var/log/messages or in the output of the dmesg command. An out-of-memory condition means that all available memory, including disk swap space, has been allocated.
The following excerpt from /var/log/messages is an example:
-------------------------------------------------------------------------------------------------------------------------------------------------------
Oct 9 01:06:09 lqportdb1 kernel: db2sysc invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Oct 9 01:06:09 lqportdb1 kernel: Call Trace: <ffffffff8015d94e>{oom_kill_process+87}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff8015dd82>{out_of_memory+299} <ffffffff8015f96b>{__alloc_pages+600}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff801612b8>{__do_page_cache_readahead+265} <ffffffff80137446>{del_timer_sync+12}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff802e4431>{schedule_timeout+146} <ffffffff88012c00>{:dm_mod:dm_any_congested+61}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff8015d079>{filemap_nopage+336} <ffffffff8016b068>{__handle_mm_fault+830}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff801455d8>{lock_hrtimer_base+37} <ffffffff802e78bf>{do_page_fault+2919}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff802e4870>{schedule_hrtimer+49} <ffffffff8014580d>{hrtimer_nanosleep+130}
Oct 9 01:06:09 lqportdb1 kernel: <ffffffff8010a883>{error_exit+0}
.
.
Oct 9 01:06:09 lqportdb1 kernel: Free swap = 0kB
Oct 9 01:06:09 lqportdb1 kernel: Total swap = 4194296kB
Oct 9 01:06:09 lqportdb1 kernel: Free swap: 0kB
Oct 9 01:06:09 lqportdb1 kernel: 2099200 pages of RAM
Oct 9 01:06:09 lqportdb1 kernel: 41113 reserved pages
Oct 9 01:06:09 lqportdb1 kernel: 69027 pages shared
Oct 9 01:06:09 lqportdb1 kernel: 191 pages swap cached
Oct 9 01:06:09 lqportdb1 kernel: Out of Memory: Kill process 1121 (db2syscr) score 79429 and children.
Oct 9 01:06:09 lqportdb1 kernel: Out of memory: Killed process 1126 (db2sysc).
-------------------------------------------------------------------------------------------------------------------------------------------------------
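Searching a large log for these records by eye is error-prone, so a small scan can help. The sketch below looks for the "Out of memory: Killed process" line shown in the excerpt above; the exact wording of kernel messages varies between kernel versions, so the pattern is an assumption based on this excerpt only, and the function name find_oom_kills is hypothetical.

```python
import re
import sys

# Matches e.g. "Out of memory: Killed process 1126 (db2sysc)."
KILLED_RE = re.compile(
    r"Out of memory: Killed process (\d+) \(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(lines):
    """Yield (pid, process_name, line) for each OOM kill record found."""
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            yield int(match.group(1)), match.group(2), line.rstrip("\n")


if __name__ == "__main__":
    # Reading /var/log/messages usually requires root privileges.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as log:
        for pid, name, line in find_oom_kills(log):
            print(f"OOM killer terminated PID {pid} ({name}): {line}")
```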

Resolving The Problem

As described above, the Linux OOM killer is the cause of the DB2 problem. As a first step, have a Linux system administrator review the system memory usage and verify that enough memory, including disk swap space, is available. Most Linux kernels now allow the OOM killer to be tuned. A Linux system administrator should review the system and determine the appropriate settings, such as vm.swappiness (refer to Q8 and Q9 in the IBM DB2 LUW Memory FAQ).
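Before changing any setting, it can be useful to record the current values. The following is a minimal sketch, assuming the standard procfs locations /proc/sys/vm/swappiness and /proc/<pid>/oom_score; the helper names are hypothetical, and the sketch only reads values, it does not recommend or apply any particular setting.

```python
import os
from pathlib import Path


def read_int(path) -> int:
    """Read a single integer from a procfs-style file."""
    return int(Path(path).read_text().strip())


def swappiness(proc_root="/proc") -> int:
    """Return the current vm.swappiness value."""
    return read_int(Path(proc_root) / "sys" / "vm" / "swappiness")


def oom_score(pid, proc_root="/proc") -> int:
    """Return the kernel's current OOM score for a process.

    A higher score means the process is more likely to be chosen
    by the OOM killer.
    """
    return read_int(Path(proc_root) / str(pid) / "oom_score")


if __name__ == "__main__":
    print("vm.swappiness =", swappiness())
    print("oom_score of this process =", oom_score(os.getpid()))
```

The proc_root parameter exists only so that the helpers can be pointed at a test directory instead of the live /proc.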

If there is no evidence of the OOM killer, troubleshoot further by using one of the Linux tools below to find the process that is sending SIGKILL.

a) auditctl
OR
b) Install and configure Red Hat Linux SystemTap,
then run the following script, taken from "SystemTap: kill() [Who killed my process?]":

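#! /usr/bin/env stap
/*
 * signal2.st: Track the sender of every SIGKILL issued through kill().
 *
 * Run as user 'root' using the following command line:
 *
 *     stap -o signal2.out signal2.st
 *
 * dalla
 */
probe syscall.kill
{
    /* sig and pid are the arguments of the kill() call being probed. */
    if (sig == 9) {
        /* execname(), pid() and tid() identify the sender; pid is the target. */
        printf("[%s - %d - %d] sent SIGKILL to pid %d\n",
               execname(), pid(), tid(), pid);
    }
}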


Sample Run

# stap -o signal2.out signal2.st

$ ps -elf|grep db2inst1
4 S root     28185 28125  0  80   0 - 56581 poll_s 13:53 pts/1    00:00:00 sudo su - db2inst1
4 S root     28202 28185  0  80   0 - 55144 do_wai 13:53 pts/1    00:00:00 su - db2inst1
4 S db2inst1 28203 28202  0  80   0 - 29113 do_wai 13:53 pts/1    00:00:00 -bash
4 S root     28317     1  2  80   0 - 330940 futex_ 13:53 pts/1   00:00:00 db2wdog 0 [db2inst1]
4 S db2inst1 28319 28317  4  80   0 - 413578 futex_ 13:53 pts/1   00:00:00 db2sysc 0
(removed some output)

$ kill -9 28319
 
After the kill is issued from another session, the stap output file signal2.out shows that bash (PID 28203) sent SIGKILL to PID 28319:
 
[sshd - 27830 - 27830] sent SIGKILL to pid 27833
[dbus-daemon - 769 - 769] sent SIGKILL to pid 28147
[bash - 28203 - 28203] sent SIGKILL to pid 28319
[db2syscr - 28317 - 28318] sent SIGKILL to pid 28339
[db2syscr - 28317 - 28318] sent SIGKILL to pid 28329
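
On a busy system signal2.out can contain many entries, so it helps to filter for the PID of interest. The sketch below parses lines in the format printed by the script above, "[execname - pid - tid] sent SIGKILL to pid N"; the function name senders_for is hypothetical.

```python
import re
import sys

# Format produced by the printf() in signal2.st.
LINE_RE = re.compile(r"^\[(.+) - (\d+) - (\d+)\] sent SIGKILL to pid (\d+)$")


def senders_for(lines, target_pid):
    """Return [(execname, sender_pid, sender_tid)] for SIGKILLs sent to target_pid."""
    result = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and int(match.group(4)) == target_pid:
            result.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return result


if __name__ == "__main__":
    # Usage: python who_killed.py signal2.out 28319
    with open(sys.argv[1]) as out:
        for name, pid, tid in senders_for(out, int(sys.argv[2])):
            print(f"{name} (PID {pid}, TID {tid}) sent SIGKILL")
```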


Document Information

Modified date:
04 January 2019

UID

swg21449871