What to do if a recovery group stops service when a disk hangs because of hardware failure
This topic describes the steps to take when a recovery group stops service because a disk hangs due to hardware failure.
Two types of requests are sent to a disk: general I/O read or write requests and pass-through query requests.
IBM Storage Scale has two configuration parameters that control the response when a disk hang occurs. On storage servers, the panicOnIOHang parameter is set to yes by default. When a disk request hangs in the kernel longer than the time defined by the ioHangDetectorTimeout parameter (300 seconds by default), IBM Storage Scale reboots the node automatically.
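To see how a node is currently configured, you can query each parameter individually. This is a sketch; the values shown are the documented defaults, so the output on your system may differ:
# mmlsconfig panicOnIOHang
panicOnIOHang yes
# mmlsconfig ioHangDetectorTimeout
ioHangDetectorTimeout 300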
When a request to a disk hangs, you can see long waiters as follows:
# mmdiag --waiters|grep NSPDServerIOWorkerThread
Waiting 159.2446 sec since 2022-04-21_06:20:57, monitored, thread 64855 NSPDServerIOWorkerThread: for I/O completion on disk sda
or
# mmdiag --waiters|grep DiscoverAndOpenNSPDThread
Waiting 71.8359 sec since 2022-04-22_01:57:17, monitored, thread 257458 DiscoverAndOpenNSPDThread: for read SCSI world-wide name on disk /dev/sdh
When the hang is detected and IBM Storage Scale reboots the node, the mmfs log contains messages such as the following:
# cat /var/adm/ras/mmfs.log.previous |grep "Kernel I/O"
2022-04-21_06:23:38.892-0400: [E] Kernel I/O hang detected on /dev/sde: write sector 438921496 length 8 pending 305 seconds
or
# cat /var/adm/ras/mmfs.log.previous |grep "Kernel SCSI I/O"
2022-04-22_02:00:50.290-0400: [E] Kernel SCSI I/O hang detected on /dev/nvme1n1, reason: 'get the port addresses', pending 312 seconds
If a vmcore is generated, the vmcore-dmesg.txt file from the crash can also be used to check the reboot reason:
# cat vmcore-dmesg.txt |grep -i "kernel panic"
[33801.657931] <5>kp 20759: cxiPanic: Forcing kernel panic to clear hung I/O
[33801.657934] Kernel panic - not syncing: cxiPanic: Forcing kernel panic to clear hung I/O
If the panicOnIOHang parameter is set to no, ECE does not reboot the node; it calls the user exit callback instead. The user exit callback event diskIOHang can be used to monitor the issue and perform user-defined operations.
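As a sketch of how you might switch a node to callback-based handling, the following changes the setting for a single node. The node name ece-node1 is a placeholder, and you should verify on your release whether the change takes effect immediately or requires restarting the daemon:
# mmchconfig panicOnIOHang=no -N ece-node1
# mmlsconfig panicOnIOHang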
Examples
- Create an executable script file:
# cat /home/iohang
#!/bin/bash
# Adding user-defined operations here
echo $@ > /tmp/iohang.out.`date "+%Y-%m-%d_%H_%M_%S"`
# chmod +x /home/iohang
- Register the user-defined command to the callback event diskIOHang (a verification sketch follows these examples):
# mmaddcallback iohang --command=/home/iohang --event diskIOHang --parms "%diskName %reason"
- When a disk request hangs longer than the time defined by the ioHangDetectorTimeout parameter, the user exit callback is triggered. For this example, the following file is generated:
# cat /tmp/iohang.out.2022-04-21_06_31_42
/dev/sda Block I/O
or
# cat /tmp/iohang.out.2022-04-21_02_00_14
/dev/sdg execute SCSI command Read (10)(0x28)
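To confirm that the callback is registered, you can list it by its identifier; this is a sketch, and the output format can vary by release:
# mmlscallback iohang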
Note: The diskName parameter returned by the callback is the name of the device on which the request hangs. The reason parameter is the type of request that was sent to the device.
| Parameter | Types |
|---|---|
| Reason | Block I/O |
| | read SCSI world-wide name |
| | get the port addresses |
| | get the medium rotation rate |
| | standard inquiry:get vendor/product information |
| | execute SCSI command %s(0x%02x). For example: execute SCSI command Inquiry (0x12), execute SCSI command Test Unit Ready (0x00), execute SCSI command Read (10)(0x28) |
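As an illustrative extension of the callback script shown earlier, the following sketch distinguishes the Block I/O reason from the pass-through reasons listed in the table and records each event with a timestamp. The log file path and the handling logic are assumptions for this example, not part of the product:
#!/bin/bash
# Hypothetical diskIOHang handler: the first argument is %diskName, the remaining arguments form %reason
disk="$1"
shift
reason="$*"
ts=$(date "+%Y-%m-%d_%H_%M_%S")
case "$reason" in
  "Block I/O")
    # General read or write request hang
    echo "$ts block I/O hang on $disk" >> /var/log/iohang.log
    ;;
  *)
    # Pass-through query request hang (SCSI world-wide name, port addresses, and so on)
    echo "$ts pass-through request hang on $disk: $reason" >> /var/log/iohang.log
    ;;
esac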