Problem Determination: Hangs

A hang is a situation in which the database stops responding to incoming requests, or stops processing existing requests. Typically, a hang occurs due to a shared resource (such as a lock or latch) already being held. Hangs can be either temporary, causing just a small slowdown, often indistinguishable from a performance problem, or permanent, occurring when the holder does not release the resource.

Latch

A latch is an internal mechanism for access control, used when multiple processes or threads are trying to access the same shared resource simultaneously. For example, if an EDU acquires an exclusive latch for a resource, other EDUs cannot access that same resource until the latch is freed by the holding EDU.

In IBM Db2® pureScale®, a cross-member latch which requires both the global lock portion and the local latch portion is called "lotch" (abbreviated from “lock or latch”)

A deadlatch is a type of hang (usually defect-related) caused by resource contention involving latches.

Approaching hangs

It is critical to generate diagnostic data for a hang issue while the issue is happening. Looking at a hang/performance issue as a postmortem without collecting information during the problem’s occurrence will almost always lead to a dead end.

A critical step of problem determination is to narrow down the scope of occurrence. One of the many possible approaches is:
  1. Examine the operating system first. Look for excessive I/O, network traffic, errors in the system error logs.
  2. Look at the Db2 instance scope. Try to determine if the problem is affecting the entire instance. Are instance level commands working well?
  3. Examine the Db2 database scope. Problems with all databases? Or some? EDU or application scope. All applications affected? Or some? What about non-agent EDUs such as prefetchers or page cleaners?
If you have established that the issue is not an operating system hang, a network outage and that you are most likely dealing with a database or application issue, these are the most important pieces to collect:
  1. Call stacks
  2. Data for active EDUs
  3. Traces, including performance traces and/or profiling information
  4. Lock information
  5. Access plans

A call stack is a snapshot for the state of a Db2 EDU at the time the stack dump was captured. Call stack files are very similar to trap files, and are also known as "stack trace backs" or "stacks".

The call stack is generated automatically if processing cannot continue because of an exception, or for serviceability reasons. You can also manually generate a call stack.

In the call stack, you will find the function sequence that was running when the error occurred, as well as information about the state of the process when the file is generated.

On UNIX/Linux, the call stack file is text-based.

On Windows, the call stack file is binary. In order to translate the binary file into text, use the formatting utility db2xprt.

Call stack files include:

  • Build date
  • Version number
  • Time of the dump
  • Signal or exception which generated the dump
  • Process and thread ID
  • Loaded libraries (commonly referred to as the “map”)
  • Address of signal handlers
  • Register dumps
  • Call stack – a detailed call stack
  • A dump of the operating system memory sets
  • Latch information for the EDU
  • Locks being waited on
  • Assembly code dump

Pay special attention to the call stack, latch information for the EDU, and locks being waited on.

First occurrence data capture (FODC)

The db2fodc executable included in Db2 collects a predetermined set of data, usually sufficient for hang problem determination. Use db2fodc whenever possible.

Note: Data collected by the db2fodc executable may not be useful for determining problems unrelated to Db2.

Automatic FODC is performed by Db2 automatically when an outage or error condition is detected. Manual FODC is invoked manually by the user. The FODC package is a directory containing diagnostic information collected by the manual or automatic FODC.

The behavior of the data collection is controlled in the following ways:
  • Arguments passed to both db2fodc.
  • Using the DB2FODC registry variable.
  • Customizing the data collections scripts.

To customize data collection scripts on UNIX, copy sqllib/bin/db2cos_hang to sqllib/adm/db2cos_hang, then modify the copied contents in sqllib/adm/db2cos_hang.

On Windows, modify the default script in sqllib/bin/db2cos_hang.bat.

On UNIX, db2fodc first tries to execute sqllib/adm/db2cos_hang, if it exists. If sqllib/adm/db2cos_hang is not found, sqllib/bin/db2cos_hang is used.

On Windows, sqllib/bin/db2cos_hang.bat is always launched.

A useful trick is to modify db2cos_hang by enabling no_wait="ON". This will disable sleep between iterations, allowing the script to finish faster. This can be helpful in time-sensitive situations.

Essential tools

The db2pd tool is a monitor and troubleshooting command that retrieves internal metadata and snapshot information from a running Db2 instance. For options, run db2pd -help. The db2pd command is completely non-intrusive and does not acquire latches. It provides fast retrieval without impacting the engine. It can even be run when the system is hanging.

You can perform read-only operations with db2pd. Operations modifying Db2 behavior have been moved to db2pdcfg.

The dsmtop tool provides dynamic, real-time monitoring of a running Db2 system. It replaces the deprecated db2top. The tool calls Db2 monitor APIs repeatedly in the background and displays the result in a “semi-graphic” console interface. It can help to calculate several common matrices, such as the buffer pool hit ratio. It also provides some basic performance analysis.

If the system is completely hung, all commands are stuck, and the usual ways to stop the instance (db2stop force ) are hanging too, you will have to kill the instance. This will forcefully bring down the instance.

To run to shut down the instance, run the following command:
UNIX/Linux
db2_kill
Windows
db2nkill.exe
Once the instance has been shut down, some time should be spent to ensure there are no stray Db2 processes or resources:
UNIX/Linux
ps –elf/ipcs –a 
Windows
Task Manager
If the Db2 engine process cannot be killed even using the most privileged operating system signals such as SIGKILL on UNIX/Linux or taskkill on Windows, then the process is likely stuck in the operating system kernel.

In this case, the problem is likely at the operating system level. You will need to contact operating system support and/or collect operating system dumps. Typically, the only way to get out of this state is to reboot the machine once the necessary operating system data has been collected.