IJ33367: ENHANCE NFS HEALTH STATE CHECK PERIOD WHILE IN GRACE MODE

APAR status

Closed as program error.

Error description

The NFS health monitor triggered a failover while a node was
being expelled from the cluster. The NFS service monitor had
detected a potential hang, and the failover was triggered even
though the system was able to recover by itself after several
minutes.

Local fix

The systemhealth monitor can be configured via
a configuration option to signal a degraded state
(nfs_unresponsive event) instead of triggering a
failover (nfs_not_active event, error state).

Problem summary

The NFS health monitor triggered a failover while a node was
being expelled from the cluster. The NFS service monitor had
detected a potential hang, and the failover was triggered even
though the system was able to recover by itself after several
minutes.

Problem conclusion

This problem is fixed in 5.1.1 PTF 2.
To see all Spectrum Scale APARs and their
respective fix solutions, refer to
https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution:
The NFS statistics check period (the time within which the
number of NFS operations is expected to increase, signaling a
healthy NFS service) can now be configured via a new variable
(maxwaittime) in the /var/mmfs/mmsysmon/mmsysmonitor.conf file.
The default value is 70 seconds and can be increased as needed
to avoid an unwanted failover if the cluster is in a state that
needs more time to recover.
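
As an illustration of how such a setting could be inspected, the following is a minimal Python sketch that reads maxwaittime from INI-style text and falls back to the 70-second default when the variable is absent. The [nfs] section name, the sample value of 120, and the helper name read_max_wait are assumptions made for this example; the APAR only names the variable, its default, and the file.

```python
import configparser

# Hypothetical excerpt of /var/mmfs/mmsysmon/mmsysmonitor.conf.
# The [nfs] section name and the value 120 are assumptions; only the
# maxwaittime variable and its 70-second default come from the APAR.
SAMPLE_CONF = """
[nfs]
maxwaittime = 120
"""

DEFAULT_MAX_WAIT_SECONDS = 70  # default stated in the APAR


def read_max_wait(conf_text: str, section: str = "nfs") -> int:
    """Return maxwaittime from the config text, or the default if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback also covers the case where the section itself is missing.
    return parser.getint(section, "maxwaittime",
                         fallback=DEFAULT_MAX_WAIT_SECONDS)


if __name__ == "__main__":
    print(read_max_wait(SAMPLE_CONF))  # value set in the sample text
    print(read_max_wait("[nfs]\n"))    # variable absent: default applies
```

On a real CES node the text would come from the file itself, and the section that holds the variable should be confirmed against the product documentation before editing.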

Work Around:
The systemhealth monitor can be configured via
a configuration option to signal a degraded state
(nfs_unresponsive event) instead of triggering a
failover (nfs_not_active event, error state).

Problem trigger:
The NFS service monitor detected a potential hang, which means
that the NFS NULL check failed and the number of internal NFS
operations did not increase for a while (around 60 seconds).
During that time NFS was in grace mode (which allows previous
clients to reclaim their locks) and was therefore unable to let
new clients start their I/O work.
The systemhealth monitor did not take this grace time into
account, although it should extend the waiting time.
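
The detection rule described above can be sketched in Python as follows. This is an illustration of the logic only, not the monitor's actual implementation: the Sample type, the classify function, and the failover_on_hang flag are hypothetical names, and the sample data is invented. Only the two event names and the 70-second default come from the APAR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One monitor observation (all fields are illustrative)."""
    time_s: float        # seconds since monitoring began
    null_check_ok: bool  # result of the NFS NULL check
    op_count: int        # cumulative internal NFS operation counter


def classify(samples: List[Sample], max_wait_s: float = 70.0,
             failover_on_hang: bool = True) -> str:
    """Return 'ok', 'nfs_unresponsive', or 'nfs_not_active'.

    The service counts as hung only if the latest NULL check failed and
    the operation counter has not increased for at least max_wait_s
    seconds. The samples list must not be empty.
    """
    latest = samples[-1]
    if latest.null_check_ok:
        return "ok"
    # Find the most recent time at which the counter increased.
    last_progress = samples[0].time_s
    for prev, cur in zip(samples, samples[1:]):
        if cur.op_count > prev.op_count:
            last_progress = cur.time_s
    if latest.time_s - last_progress < max_wait_s:
        return "ok"  # still inside the waiting period
    # Hypothetical flag: error state (failover) or degraded state only.
    return "nfs_not_active" if failover_on_hang else "nfs_unresponsive"


# Invented data: operations stall for 90 seconds, as in a grace period.
STALLED = [
    Sample(0, True, 100),
    Sample(30, False, 100),
    Sample(60, False, 100),
    Sample(90, False, 100),
]
```

With this data, a 70-second window reports nfs_not_active, while a window of 120 seconds reports no event, which is the effect of raising maxwaittime for a cluster that needs longer to recover.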

Symptom:
Performance Impact/Degradation

Platforms affected:
ALL Linux OS environments (CES nodes)

Functional Area affected:
System Health

Customer Impact:
High Importance

Temporary fix

Comments

APAR Information

APAR number: IJ33367
Reported component name: SPEC SCALE STD
Reported component ID: 5737F33AP
Reported release: 511
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2021-06-22
Closed date: 2021-06-22
Last modified date: 2021-06-22

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: SPEC SCALE STD
Fixed component ID: 5737F33AP

Applicable component levels


Document Information

Modified date:
23 June 2021
