What to do if a recovery group stops service when a disk hangs because of hardware failure

This topic describes the steps to take if a recovery group stops service because a disk hangs due to hardware failure.

Two types of requests are sent to a disk: general I/O read or write requests and pass-through query requests.

IBM Storage Scale has two configuration parameters that control the response when a disk hang occurs. The panicOnIOHang parameter is set to yes on storage servers by default. When a disk request hangs in the kernel for longer than the time defined by the ioHangDetectorTimeout parameter (300 seconds by default), IBM Storage Scale reboots the node automatically.
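
The current values can be displayed with mmlsconfig and changed with mmchconfig. The following is a minimal sketch; the node class name ece_nodeclass is an example only and must be replaced with the node class or node list of your storage servers:

# mmlsconfig panicOnIOHang
# mmlsconfig ioHangDetectorTimeout
# mmchconfig panicOnIOHang=no -N ece_nodeclass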

When a request to a disk hangs, you can see long waiters such as the following:

# mmdiag --waiters|grep NSPDServerIOWorkerThread
Waiting 159.2446 sec since 2022-04-21_06:20:57, monitored, thread 64855 NSPDServerIOWorkerThread: for I/O completion on disk sda

or

# mmdiag --waiters|grep DiscoverAndOpenNSPDThread
 Waiting 71.8359 sec since 2022-04-22_01:57:17, monitored, thread 257458 DiscoverAndOpenNSPDThread: for read SCSI world-wide name on disk /dev/sdh
Note: SCSI waiters might also appear on NVMe storage because GNR uses SCSI-to-NVMe emulation.
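
To spot such waiters before the hang detector expires, the mmdiag output can be polled periodically. The following is a minimal monitoring sketch, not part of the product; the threshold, the polling interval, and the file name /root/check_io_waiters.sh are arbitrary example values:

# cat /root/check_io_waiters.sh
#!/bin/bash
# Report NSPD I/O waiters that have been waiting longer than THRESHOLD seconds.
THRESHOLD=120
while true; do
    /usr/lpp/mmfs/bin/mmdiag --waiters 2>/dev/null |
        grep -E 'NSPDServerIOWorkerThread|DiscoverAndOpenNSPDThread' |
        awk -v t="$THRESHOLD" '$2+0 > t+0 {print strftime("%Y-%m-%d %H:%M:%S"), $0}'
    sleep 60
done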

When the request hang is detected and IBM Storage Scale reboots the node, the mmfs log contains messages similar to the following:

# cat /var/adm/ras/mmfs.log.previous |grep "Kernel I/O"
2022-04-21_06:23:38.892-0400: [E] Kernel I/O hang detected on /dev/sde: write sector 438921496 length 8 pending 305 seconds

or

# cat /var/adm/ras/mmfs.log.previous |grep "Kernel SCSI I/O"
2022-04-22_02:00:50.290-0400: [E] Kernel SCSI I/O hang detected on /dev/nvme1n1, reason: 'get the port addresses', pending 312 seconds
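
Both message variants can be matched with a single expression; the current log, mmfs.log.latest, can be searched in the same way for hangs that did not lead to a reboot:

# grep -E "Kernel (SCSI )?I/O hang detected" /var/adm/ras/mmfs.log.latest /var/adm/ras/mmfs.log.previous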

If a vmcore is generated, the vmcore-dmesg.txt file from the crash can also be used to check the reason for the reboot:

# cat vmcore-dmesg.txt |grep -i "kernel panic"
[33801.657931] <5>kp 20759: cxiPanic: Forcing kernel panic to clear hung I/O
[33801.657934] Kernel panic - not syncing: cxiPanic: Forcing kernel panic to clear hung I/O
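
With a default kdump setup the crash data is written under /var/crash, so the file can usually be located and searched directly; the path layout is an assumption based on the distribution's kdump defaults:

# grep -i "kernel panic" /var/crash/*/vmcore-dmesg.txt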

If the panicOnIOHang parameter is set to no, ECE does not reboot the node; it calls the user exit callback instead. The diskIOHang user exit callback event can be used to monitor the issue and perform user-defined operations.

Examples

  1. Create an executable script file:
    # cat /home/iohang
    #!/bin/bash
    # Add user-defined operations here.
    echo "$@" > /tmp/iohang.out.`date "+%Y-%m-%d_%H_%M_%S"`
    # chmod +x /home/iohang
  2. Register the user-defined command for the callback event diskIOHang (a sketch for listing or removing this registration follows Table 1):
    # mmaddcallback iohang --command=/home/iohang --event diskIOHang --parms "%diskName %reason"
  3. When a disk request hangs longer than the time defined by the ioHangDetectorTimeout parameter, the user exit callback is triggered. For this example, the following file is generated:
    # cat /tmp/iohang.out.2022-04-21_06_31_42
    /dev/sda Block I/O
    

    or

    # cat /tmp/iohang.out.2022-04-21_02_00_14
    /dev/sdg execute SCSI command Read (10)(0x28)
    
    Note: The diskName parameter returned by the callback is the name of the device on which the request hangs. The reason parameter is the type of request that was sent to the device.
Table 1. Request types that the reason parameter can return to the user executable script

Parameter   Types
reason      Block I/O
            read SCSI world-wide name
            get the port addresses
            get the medium rotation rate
            standard inquiry: get vendor/product information
            execute SCSI command %s(0x%02x), for example:
              execute SCSI command Inquiry (0x12)
              execute SCSI command Test Unit Ready (0x00)
              execute SCSI command Read (10)(0x28)
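
The callback registration from the example can be listed and, when it is no longer needed, removed with the standard callback commands; iohang is the callback identifier chosen in step 2 of the example:

# mmlscallback iohang
# mmdelcallback iohang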