IBM Support

IBM Elastic Storage System: GNR slow disk detection has a new setting that makes it more sensitive, so that problematic disks are removed from service faster.

Troubleshooting


Problem

GNR uses a patented slow disk detection algorithm to find disks that are slow when compared to their peers.
The algorithm attempts to exploit knowledge of the disk’s hardware layout to categorize observations into separate groups when looking for slower disks.
As an example, this binning strategy allows the system to avoid comparing a disk that is mostly doing large random writes to a disk that is doing small sequential reads (this would not be a fair comparison).
However, changes in disk technology over the past decade have made this categorization less useful: capabilities such as nonvolatile caching add significant noise to GNR’s slow disk detection binning scheme, making it harder to classify observations properly.
The net result is that a visibly slow disk can be classified by the algorithm as “not-slow”, remain in service, and degrade the quality of service of the system.

Symptom

A disk appears visibly slow in mmfs.log or the recovery group event log, with excessive timeouts that appear to be isolated to that single disk. The disk remains in the system for a long time, and the relativePerformance field reported by mmlspdisk or mmvdisk pdisk list -L is still close to 1.0 (the slow threshold is 2.0). In the following example, the disk has logged 16 I/O timeouts, but its relative performance is only 1.068:
# mmvdisk pdisk list --rg rgR --pdisk e1d2s42 -L
pdisk:
replacementPriority = 1000
name = "e1d2s42"
device = "//c50f04n03/dev/sdlh,//c50f04n03/dev/sdxx,//c50f04n04/dev/sdlh(notEnabled),//c50f04n04/dev/sdxx(notEnabled)"
recoveryGroup = "rgR"
declusteredArray = "DA2"
state = "ok"
internalState = 00000.000
capacity = 8001524072448
freeSpace = 5252745003008
fru = "FRU.8TB.A"
location = "SHG1000179Y0MVR-2-42"
WWN = "naa.5000C500862D3787"
server = "c50f04n03.gpfs.net"
reads = 8814393
writes = 8999238
bytesReadInGiB = 4312.930
bytesWrittenInGiB = 4397.184
IOErrors = 0
IOTimeouts = 16
mediaErrors = 0
checksumErrors = 0
pathErrors = 0
relativePerformance = 1.068
bitErrorRate = 0.000e+00
rgIndex = 8
userLocation = "Enclosure SHG1000179Y0MVR Drawer 2 Slot 42"
hardware = "IBM-ESXS ST8000NM0095 E5 ECE4 ZA157KN50000R707LQFM"
hardwareType = Rotating 7200
nPaths = 2 active 4 total
sedSupported = Yes
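
The symptom described above (timeouts present, relativePerformance still under the 2.0 threshold) can be checked mechanically. The following is a minimal sketch, not an IBM-provided tool: the function name flag_suspect_pdisks is hypothetical, and it assumes the output keeps the "field = value" layout shown in the listing above.

```shell
# Hypothetical helper: reads `mmvdisk pdisk list ... -L` output on stdin and
# prints each pdisk that reports I/O timeouts while its relativePerformance
# is still below the slow threshold (default 2.0, taken from the text above).
flag_suspect_pdisks() {
  awk -v threshold="${1:-2.0}" '
    function report() {
      if (name != "" && timeouts > 0 && perf < threshold)
        printf "%s IOTimeouts=%d relativePerformance=%.3f\n", name, timeouts, perf
    }
    # A new "pdisk:" stanza starts: report the previous one and reset state.
    /^[[:space:]]*pdisk:/                 { report(); name = ""; timeouts = 0; perf = 0 }
    /^[[:space:]]*name = /                { gsub(/"/, "", $3); name = $3 }
    /^[[:space:]]*IOTimeouts = /          { timeouts = $3 + 0 }
    /^[[:space:]]*relativePerformance = / { perf = $3 + 0 }
    END { report() }
  '
}
```

For example, piping the command shown above into the helper (mmvdisk pdisk list --rg rgR --pdisk e1d2s42 -L | flag_suspect_pdisks) would print a line for e1d2s42, because it has 16 timeouts and a relative performance of 1.068. A flagged disk is only a candidate for investigation; the timeouts still need to be confirmed as isolated to that disk in the logs.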

Environment

This enhancement applies only to hard disk drives (HDDs), not to NVMe or SSD devices, running on a GNR platform.

Resolving The Problem

Starting in ESS 6.1.6.0 and ECE 5.1.7.0, a new mmchconfig parameter, nsdRAIDDiskPerformanceCollapseICTBins, enables an update to the slow disk detection algorithm that makes it more sensitive to extremely slow disks.
Beginning with ESS 6.1.9.0, the default is “yes”. Prior to ESS 6.1.9.0, the default was “no”, but the parameter could be set dynamically to “yes” to enable the feature.
When enabling this feature, it is recommended that the user apply it to the node class associated with the servers of a given recovery group.
With this feature enabled, there may be an increase in disks being flagged as slow, so a corresponding increase in the number of disk replacements is possible.
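
As an illustration of applying the setting to a node class, the sketch below builds the mmchconfig invocation and, by default, only prints it. The function name set_collapse_ict_bins, the APPLY variable, and the node class name used in the usage note are all hypothetical. Whether an additional option such as -i or -I is needed for the change to take effect immediately is not stated in this document, so check the mmchconfig documentation for your release.

```shell
# Hypothetical wrapper around mmchconfig for this one parameter.
# Usage: set_collapse_ict_bins NODECLASS [yes|no]
# Prints the command by default; runs it only when APPLY=1 is set.
set_collapse_ict_bins() {
  nodeclass="$1"        # node class containing the recovery group servers
  value="${2:-yes}"     # "yes" enables the more sensitive detection

  if [ -z "$nodeclass" ]; then
    echo "usage: set_collapse_ict_bins NODECLASS [yes|no]" >&2
    return 1
  fi
  case "$value" in
    yes|no) ;;
    *) echo "value must be yes or no" >&2; return 1 ;;
  esac

  cmd="mmchconfig nsdRAIDDiskPerformanceCollapseICTBins=${value} -N ${nodeclass}"
  if [ "${APPLY:-0}" = "1" ]; then
    $cmd                # run on a cluster node with administrative access
  else
    echo "$cmd"         # dry run: show what would be executed
  fi
}
```

Running set_collapse_ict_bins my_rg_servers previews the command for a placeholder node class named my_rg_servers, and prefixing the call with APPLY=1 executes it. Afterwards, mmlsconfig nsdRAIDDiskPerformanceCollapseICTBins can be used to check whether a value has been set.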

Document Location

Worldwide

Document Information

Modified date:
20 December 2023

UID

ibm16983561