Customizing failover behavior for unresponsive NFS

The NFS monitor raises an nfs_not_active error event if it detects a potential NFS hang. This event triggers a failover, which affects the overall performance of the system. However, the NFS might not actually be hung in such a situation: it can be fully working even if it does not respond to the monitor checks at that specific time. In these cases, it is better to raise a warning than to trigger a failover. You can configure the NFS monitor to send an nfs_unresponsive warning event instead of the nfs_not_active event when it detects a potential hang.

Important: The type of NFS workload that the cluster experiences determines whether this tunable setting is needed.

You can configure the nfs_unresponsive event in the monitor configuration file /var/mmfs/mmsysmon/mmsysmonitor.conf. The nfs section of this file contains the failoverunresponsivenfs flag, which controls this behavior.
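If you manage this setting from a script, the following Python sketch shows one way to set the flag. It assumes that mmsysmonitor.conf uses an INI-style layout with an [nfs] section header that Python's configparser can parse; the function names are hypothetical. Because configparser does not preserve comments when it rewrites a file, the sketch saves a backup copy first. If the file does not parse as INI, edit it manually instead.

```python
import configparser
import shutil

# File location as given in the documentation.
CONF_PATH = "/var/mmfs/mmsysmon/mmsysmonitor.conf"


def _load(path: str) -> configparser.ConfigParser:
    # Assumption: the file is INI-style. Interpolation is disabled so that
    # '%' characters in unrelated values are read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep option names exactly as written
    with open(path) as f:  # raises FileNotFoundError if the file is missing
        config.read_file(f)
    return config


def set_failover_unresponsive_nfs(enabled: bool, path: str = CONF_PATH) -> None:
    """Write failoverunresponsivenfs = true|false into the [nfs] section."""
    config = _load(path)
    if not config.has_section("nfs"):
        raise ValueError(f"no [nfs] section in {path}")
    # configparser drops comments when rewriting, so keep a copy of the original.
    shutil.copy2(path, path + ".bak")
    config.set("nfs", "failoverunresponsivenfs", "true" if enabled else "false")
    with open(path, "w") as f:
        config.write(f)


def get_failover_unresponsive_nfs(path: str = CONF_PATH) -> bool:
    """Read the flag back; raises configparser.NoOptionError if it is absent."""
    return _load(path).getboolean("nfs", "failoverunresponsivenfs")
```

For example, set_failover_unresponsive_nfs(False) selects the warning behavior. The system health monitor still has to be restarted before the change takes effect.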

Setting the failoverunresponsivenfs flag to false triggers the WARNING event nfs_unresponsive if the NFS does not respond to NULL requests or has no measurable NFS operation activity. Raising a warning event instead of an error event ensures that the NFS service is not interrupted, so the system avoids an unnecessary failover if later monitor cycles detect that the NFS is healthy again. However, if the NFS is actually hung, there is no automatic recovery, no matter how long it remains hung. It is the user's responsibility to check the system and to restart NFS manually, if needed.

Setting the failoverunresponsivenfs flag to true triggers the ERROR event nfs_not_active if the NFS does not respond to NULL requests or has no measurable NFS operation activity. Raising an error event instead of a warning event ensures that a failover is triggered whenever the system detects a potential NFS hang. However, with the flag set to true, a failover might be triggered even if the NFS server is not hung but only overloaded.
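To compare the two settings side by side, the following Python sketch models the decision described above. It is an illustrative model only, not the monitor's implementation: the potential_hang input stands in for the monitor's own detection (no response to NULL requests or no measurable NFS operation activity), and the type and function names are hypothetical.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Severity(Enum):
    WARNING = "WARNING"
    ERROR = "ERROR"


class HealthEvent(NamedTuple):
    name: str
    severity: Severity
    triggers_failover: bool


def select_nfs_event(potential_hang: bool, failover_unresponsive_nfs: bool) -> Optional[HealthEvent]:
    """Hypothetical model of which event is raised for a given flag value."""
    if not potential_hang:
        return None  # NFS looks healthy: no event in this monitor cycle
    if failover_unresponsive_nfs:
        # flag = true: error event and failover, even if NFS is only overloaded
        return HealthEvent("nfs_not_active", Severity.ERROR, triggers_failover=True)
    # flag = false: warning only. No failover and no automatic recovery;
    # the administrator checks NFS and restarts it manually if it is really hung.
    return HealthEvent("nfs_unresponsive", Severity.WARNING, triggers_failover=False)
```

The model makes the trade-off explicit: the flag only changes what happens once a potential hang is detected, not how the hang is detected.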

Note: You must restart the system health monitor by using the mmsysmoncontrol restart command to make the changes effective.
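If you script the configuration change, the restart can be issued from the same script. This minimal Python sketch assumes that mmsysmoncontrol is on the PATH (these commands are typically installed in /usr/lpp/mmfs/bin) and that the script runs with sufficient privileges; the function name is hypothetical.

```python
import subprocess


def restart_health_monitor(cmd: str = "mmsysmoncontrol") -> None:
    """Run '<cmd> restart' so that changes to mmsysmonitor.conf take effect."""
    # check=True raises subprocess.CalledProcessError on a nonzero exit code,
    # so a failed restart is not silently ignored.
    subprocess.run([cmd, "restart"], check=True)
```

Pass the full path, for example "/usr/lpp/mmfs/bin/mmsysmoncontrol", if the command is not on the PATH.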