APAR status
Closed as program error.
Error description
The NFS health monitor triggered a failover while a node was being expelled from the cluster. The NFS service monitor detected a potential hang. As a result, a failover was triggered even though the system was able to recover by itself after several minutes.
Local fix
The systemhealth monitor can be configured via a configuration option to signal a degraded state (nfs_unresponsive event) instead of triggering a failover (nfs_not_active event, error state).
Problem summary
The NFS health monitor triggered a failover while a node was being expelled from the cluster. The NFS service monitor detected a potential hang. As a result, a failover was triggered even though the system was able to recover by itself after several minutes.
Problem conclusion
This problem is fixed in 5.1.1 PTF 2.
To see all Spectrum Scale APARs and their respective fix solutions, refer to https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: The NFS statistics check period (the monitor expects the number of NFS operations to increase as a sign of a healthy NFS service) can now be configured through a new variable (maxwaittime) in the /var/mmfs/mmsysmon/mmsysmonitor.conf file. The default value is 70 seconds and can be increased as needed to avoid an unwanted failover when the cluster is in a state that needs more time to recover.
Work Around: The systemhealth monitor can be configured via a configuration option to signal a degraded state (nfs_unresponsive event) instead of triggering a failover (nfs_not_active event, error state).
Problem trigger: The NFS service monitor detected a potential hang, which means that the NFS NULL check failed and the number of internal NFS operations did not increase for a while (around 60 seconds). During that time NFS is in grace mode (previous clients are allowed to reclaim their locks) and therefore cannot let new clients start their I/O work. The systemhealth monitor did not take this grace time into account, although it should extend the waiting time.
Symptom: Performance Impact/Degradation
Platforms affected: ALL Linux OS environments (CES nodes)
Functional Area affected: System Health
Customer Impact: High Importance
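The hang check described under "Problem trigger" can be illustrated with a minimal Python sketch. This is not the mmsysmon implementation: the class, field, and parameter names (NfsSample, HangDetector, failover) are hypothetical, and the "healthy" return value is a placeholder. Only the maxwaittime default of 70 seconds and the event names nfs_not_active and nfs_unresponsive come from this APAR. The sketch assumes a hang is reported only when the NULL check fails and the operation counter has not increased for at least the configured wait time.

```python
from dataclasses import dataclass

# Illustrative only: all names here are hypothetical, not mmsysmon internals.
DEFAULT_MAX_WAIT_TIME = 70  # seconds, the default stated in this APAR


@dataclass
class NfsSample:
    timestamp: float     # seconds on a monotonic clock
    null_check_ok: bool  # result of the NFS NULL check
    op_count: int        # cumulative number of NFS operations


class HangDetector:
    def __init__(self, max_wait_time=DEFAULT_MAX_WAIT_TIME, failover=True):
        self.max_wait_time = max_wait_time
        # Assumed switch: failover=False stands for the workaround that
        # signals a degraded state instead of an error state.
        self.failover = failover
        self._last_progress = None
        self._last_ops = None

    def evaluate(self, sample):
        # Progress means the operation counter grew since the last good sample.
        progressed = self._last_ops is None or sample.op_count > self._last_ops
        if sample.null_check_ok or progressed:
            self._last_progress = sample.timestamp
            self._last_ops = sample.op_count
            return "healthy"
        # NULL check failed and no new operations: measure how long the stall lasts.
        stalled_for = sample.timestamp - self._last_progress
        if stalled_for < self.max_wait_time:
            return "healthy"
        return "nfs_not_active" if self.failover else "nfs_unresponsive"
```

With the default of 70 seconds, a stall that lasts 75 seconds is reported as nfs_not_active. The same stall is tolerated if max_wait_time is raised to, say, 180 seconds, which corresponds to increasing maxwaittime so that a cluster that is recovering (for example during the NFS grace period) does not fail over.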
Temporary fix
Comments
APAR Information
APAR number
IJ33367
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-06-22
Closed date
2021-06-22
Last modified date
2021-06-22
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Document Information
Modified date:
23 June 2021