Kernel panics with the message "GPFS deadman switch timer has expired and there are still outstanding I/O requests"

This problem can be detected by an error log entry labeled KERNEL_PANIC that contains the deadman switch message under PANIC MESSAGES or PANIC STRING.

For example:
GPFS Deadman Switch timer has expired, and there are still outstanding I/O requests
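
When scanning collected log text for this condition, a simple pattern match may be enough. The following is a minimal sketch, not a GPFS tool: the function name and pattern are hypothetical, and the pattern only assumes that the message can differ in letter case and in the comma, as the two spellings quoted in this topic do.

```python
import re

# Hypothetical pattern: tolerates the case and comma differences between
# the two spellings of the message quoted in this topic.
DEADMAN_PATTERN = re.compile(
    r"GPFS deadman switch timer has expired,? and there are still "
    r"outstanding I/O requests",
    re.IGNORECASE,
)


def is_deadman_panic(log_text: str) -> bool:
    """Return True if the text contains the deadman switch panic message."""
    return DEADMAN_PATTERN.search(log_text) is not None
```

The function takes the raw text of an error log entry and says nothing about how that text is collected.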

GPFS is designed to tolerate node failures through per-node metadata logging (journaling). The log file is called the recovery log. In the event of a node failure, GPFS performs recovery by replaying the recovery log for the failed node, thus restoring the file system to a consistent state and allowing other nodes to continue working. Prior to replaying the recovery log, it is critical to ensure that the failed node has indeed failed, as opposed to being active but unable to communicate with the rest of the cluster.

In the latter case, if the failed node has direct access (as opposed to accessing the disk through an NSD server) to any disks that are part of the GPFS file system, it is necessary to ensure that no I/O requests submitted from this node complete once the recovery log replay has started. To accomplish this, GPFS uses the disk lease mechanism. The disk lease mechanism guarantees that a node does not submit any more I/O requests once its disk lease has expired, and the surviving nodes use the disk lease timeout as a guideline for starting recovery.
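
The lease rule described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, its method names, and the lease duration are hypothetical and do not reflect the GPFS implementation or its configuration parameters.

```python
from dataclasses import dataclass


@dataclass
class DiskLease:
    """Toy model of a disk lease; not the GPFS implementation."""

    duration: float          # lease length in seconds (hypothetical value)
    renewed_at: float = 0.0  # time of the last successful renewal

    def renew(self, now: float) -> None:
        # A successful renewal restarts the lease period.
        self.renewed_at = now

    def expired(self, now: float) -> bool:
        return now >= self.renewed_at + self.duration

    def may_submit_io(self, now: float) -> bool:
        # A node must not submit new I/O once its lease has expired.
        return not self.expired(now)
```

For example, a node that renews a 10-second lease at time 100 may submit I/O at time 105 but not at time 110 or later, unless it renews again in between.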

This situation is complicated by the possibility of 'hung I/O'. If an I/O request is submitted prior to the disk lease expiration, but for some reason (for example, a device driver malfunction) the I/O takes a long time to complete, it may complete after the recovery log replay has started. This situation would present a risk of file system corruption. To guard against such a contingency, when I/O requests are being issued directly to the underlying disk device, GPFS starts a kernel timer that is referred to as the deadman switch timer. The deadman switch timer goes off in the event of disk lease expiration and checks whether there are any outstanding I/O requests. If any I/O is pending, a kernel panic is initiated to prevent possible file system corruption.
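
The decision made when the deadman switch timer goes off can be sketched as follows. This is an illustrative model only: the function names and the representation of an I/O request as a (submit time, completion time) pair are assumptions made for the example, and the check is modeled at the moment of lease expiration, which may not match the exact timing inside the kernel.

```python
from typing import List, Optional, Tuple

# Each I/O is (submit_time, completion_time); completion_time is None
# while the request has not completed.
IoRequest = Tuple[float, Optional[float]]


def outstanding_at(requests: List[IoRequest], t: float) -> int:
    """Count requests submitted at or before time t that are not complete at t."""
    return sum(
        1
        for submitted, completed in requests
        if submitted <= t and (completed is None or completed > t)
    )


def deadman_action(requests: List[IoRequest], lease_expiry: float) -> str:
    """Return 'panic' if any I/O is still pending when the lease expires."""
    if outstanding_at(requests, lease_expiry) > 0:
        return "panic"
    return "continue"
```

With a lease that expires at time 10, a request submitted at time 5 that completes at time 6 is harmless, whereas a request submitted at time 5 that would complete only at time 40 is a hung I/O and leads to a panic.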

Such a kernel panic is not an indication of a software defect in GPFS or the operating system kernel; rather, it is a sign of one of the following:
  1. Network problems (the node is unable to renew its disk lease).
  2. Problems accessing the disk device (I/O requests take an abnormally long time to complete). See MMFS_LONGDISKIO.