APAR status
Closed as program error.
Error description
The NFS monitor periodically checks the health state of a running NFS instance. Sometimes the NFS service does not respond to some "alive" check commands, and this is interpreted as a potential "hung" state. Depending on the configuration in the mmsysmonitor.conf file, either a failover or just a warning is then triggered.
Local fix
The behavior for a detected potential "hung" state can be customized with the flag 'failoverunresponsivenfs' in the mmsysmonitor.conf file, section [nfs].
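As an illustration of the flag change described above, the following is a minimal sketch that sets 'failoverunresponsivenfs' in the [nfs] section of an INI-style file. It assumes the file can be parsed by Python's configparser; the helper name set_failover_flag and the sample file are hypothetical, and the real location and full contents of mmsysmonitor.conf are not given in this APAR. Note that configparser drops comments when it rewrites a file, so try this on a copy first.

```python
import configparser
import tempfile
from pathlib import Path


def set_failover_flag(conf_path: Path, failover: bool) -> None:
    """Set failoverunresponsivenfs in the [nfs] section of an INI-style file.

    Hypothetical helper: assumes the file is configparser-compatible.
    Comments in the original file are not preserved on rewrite.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path)
    if not parser.has_section("nfs"):
        parser.add_section("nfs")
    parser.set("nfs", "failoverunresponsivenfs", "true" if failover else "false")
    with conf_path.open("w") as handle:
        parser.write(handle)


if __name__ == "__main__":
    # Demonstrate on a throwaway sample file, not on a live configuration.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "mmsysmonitor.conf"
        sample.write_text("[nfs]\nfailoverunresponsivenfs = true\n")
        set_failover_flag(sample, failover=False)
        print(sample.read_text())
```

After changing the real file, the monitor has to be restarted (mmsysmoncontrol restart), and the same change has to be applied on all nodes.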
Problem summary
Problem description: The NFS monitor periodically checks the health state of a running NFS instance. Sometimes the NFS service does not respond to some "alive" check commands, and this is interpreted as a potential "hung" state. Depending on the configuration in the mmsysmonitor.conf file, either a failover or just a warning is then triggered.
Problem conclusion
Benefits of the solution: The fix extends the time span over which the internal checks are evaluated to up to a minute before a decision about a detected "hung" state is made. This is much more reliable than the previous approach, which decided after around 10-20 seconds.

Work Around: The behavior for a detected potential "hung" state can be customized with the flag 'failoverunresponsivenfs' in the mmsysmonitor.conf file, section [nfs]. The meaning of the flag value is:
"true" = set an ERROR event (nfs_not_active) if NFS does not respond to NULL requests and has no measurable NFS operation activity
"false" = set a DEGRADED event (nfs_unresponsive) if NFS does not respond to NULL requests and has no measurable NFS operation activity
The monitor needs to be restarted after a change (mmsysmoncontrol restart). The change must be made in the same way on all nodes.

Problem trigger: In some cases, high I/O load caused NFS v3 and/or v4 NULL requests to fail, and a subsequent internal statistics check reported no activity in the number of internal NFS operations. These checks are done within a timespan of several seconds to a minute. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover would not be advisable in this case. The monitor interprets the "unresponsiveness" as a potential "hung" state and triggers either a failover or a warning, depending on the configuration settings.

Symptom: Performance Impact/Degradation
Platforms affected: ALL Linux OS environments (CES nodes)
Functional Area affected: Systemhealth
Customer Impact: High Importance
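The decision rule described above can be sketched as follows. This is an illustrative model only, not the monitor's actual implementation: the Sample type, the function names, and the 60-second default window are assumptions chosen to mirror the text (failed NULL requests plus no measurable operation activity, evaluated over up to a minute, with the flag selecting the event).

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    """One hypothetical health check result."""
    timestamp: float        # seconds since an arbitrary start
    null_request_ok: bool   # did the NFS NULL request get a reply?
    op_count: int           # cumulative NFS operations counter


def looks_hung(samples: List[Sample], window_seconds: float = 60.0) -> bool:
    """True only if NFS was silent and idle across the whole window."""
    if len(samples) < 2:
        return False
    spans_window = samples[-1].timestamp - samples[0].timestamp >= window_seconds
    no_reply = not any(s.null_request_ok for s in samples)
    no_activity = samples[-1].op_count == samples[0].op_count
    return spans_window and no_reply and no_activity


def event_for(samples: List[Sample], failover_unresponsive_nfs: bool,
              window_seconds: float = 60.0) -> Optional[str]:
    """Map a potential "hung" state to the event named in the APAR."""
    if not looks_hung(samples, window_seconds):
        return None
    # "true" -> ERROR event, "false" -> DEGRADED event
    return "nfs_not_active" if failover_unresponsive_nfs else "nfs_unresponsive"


if __name__ == "__main__":
    stuck = [Sample(0, False, 100), Sample(30, False, 100), Sample(60, False, 100)]
    busy = [Sample(0, False, 100), Sample(60, False, 250)]
    print(event_for(stuck, failover_unresponsive_nfs=True))
    print(event_for(stuck, failover_unresponsive_nfs=False))
    print(event_for(busy, failover_unresponsive_nfs=True))
```

In this model, a node under high I/O load whose NULL requests fail but whose operation counter still advances is not reported, which corresponds to the temporary "unresponsive" case in which a failover would not be advisable.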
Temporary fix
Comments
APAR Information
APAR number
IJ18591
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
503
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-08-28
Closed date
2019-08-28
Last modified date
2019-08-28
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
IJ18744
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Business Unit: IBM Infrastructure w/TPS; Product: IBM Spectrum Scale; Platform: Platform Independent; Version: 503; Line of Business: Storage
Document Information
Modified date:
28 August 2019