Additional checks on file system availability for CES exported data

A CES cluster exports file systems to its clients by using NFS or SMB. These exports might be fully or partially located on the CES cluster directly, or might be remote-mounted from other storage systems. If such a mount is not available when the NFS or SMB service starts up or at run time, the system throws an error. Events set the NFS or SMB component to a DEGRADED or FAILED state if not all the necessary file systems are available.

The NFS and SMB monitoring checks that the file systems required by the declared exports are all available. If one or more of these file systems are unavailable, then they are marked as FAILED in the mmhealth node show filesystem -v command output, and the corresponding NFS or SMB component is set to a DEGRADED state. For NFS, the nfs_exports_down event is created initially. For SMB, the smb_exports_down event is created initially.
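The node-level check described above can be pictured with a minimal sketch. This is an illustrative model only, not the system health monitor's code: the function name check_export_filesystems, the returned dictionary layout, and the file system names are assumptions. Only the event names and the DEGRADED state come from the text.

```python
def check_export_filesystems(protocol, required, available):
    """Model the per-node export check for one protocol ("nfs" or "smb").

    required  -- file systems needed by the declared exports (hypothetical names)
    available -- file systems currently mounted on the node
    """
    missing = sorted(set(required) - set(available))
    if not missing:
        # "HEALTHY" here means only that this particular check passed.
        return {"state": "HEALTHY", "event": None, "failed_filesystems": []}
    # Per the text: missing file systems are marked FAILED and the
    # protocol component goes to DEGRADED with <protocol>_exports_down.
    return {
        "state": "DEGRADED",
        "event": f"{protocol}_exports_down",
        "failed_filesystems": missing,
    }
```

For example, calling check_export_filesystems("nfs", ["fs1", "remote_fs"], ["fs1"]) in this model reports the nfs_exports_down event with remote_fs listed as failed.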

Alternatively, the CES nodes can be set automatically to a FAILED state instead of a DEGRADED state if the required remote or local file systems are not mounted. This change in state can be made only by the Cluster State Manager (CSM). If the CSM detects that some of the CES nodes are in a DEGRADED state, then it can overrule the DEGRADED state with a FAILED state to trigger a failover of the CES IPs to a healthy node.
Note: This overrule is limited to the file-system-related events nfs_exports_down and smb_exports_down. Other events that cause a DEGRADED state are not handled by this procedure.

For NFS, the nfs_exports_down warning event is countered by an nfs_exported_fs_down error event from the CSM to mark the node as FAILED. Similarly, for SMB, the smb_exports_down warning event is countered by an smb_exported_fs_down error event to mark the node as FAILED.

After the CSM detects that all the CES nodes report an nfs_exports_down or smb_exports_down status, it clears the nfs_exported_fs_down or smb_exported_fs_down events to allow each node to rediscover its own state. This prevents a cluster outage if only one protocol is affected while others are still active. However, such a state might be unstable for a while, and the underlying problem must be fixed as soon as possible. When the file systems are mounted again, the SMB or NFS service monitors detect this and refresh their health state information.
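The overrule behavior can be summarized as a small decision function. The sketch below is a simplified model under stated assumptions, not the actual CSM logic: the name csm_overrule, the input format (a mapping of node name to its active events), and the node names are hypothetical, and timing and other events are ignored.

```python
# Event names from the text: node-level warning -> CSM-level error.
OVERRULE_EVENTS = {
    "nfs": ("nfs_exports_down", "nfs_exported_fs_down"),
    "smb": ("smb_exports_down", "smb_exported_fs_down"),
}


def csm_overrule(node_events, protocol, override_enabled=True):
    """Return {node: (state, event)} for one protocol.

    node_events      -- {node name: set of active event names} (assumed format)
    override_enabled -- models the csmsetmissingexportsfailed setting
    """
    warning, error = OVERRULE_EVENTS[protocol]
    affected = {node for node, events in node_events.items() if warning in events}

    # Overrule only when at least one node is affected AND at least one
    # node is unaffected. If every node is affected, the error events are
    # cleared so that the nodes fall back to their own DEGRADED state.
    overrule = override_enabled and bool(affected) and len(affected) < len(node_events)

    result = {}
    for node in node_events:
        if node not in affected:
            result[node] = ("HEALTHY", None)  # healthy with respect to this check only
        elif overrule:
            result[node] = ("FAILED", error)  # triggers CES IP failover
        else:
            result[node] = ("DEGRADED", warning)
    return result
```

In this model, a two-node cluster where only ces1 reports nfs_exports_down yields FAILED for ces1, whereas the same event on both nodes leaves both DEGRADED.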

This CSM feature can be configured as follows:
  1. Make a backup copy of the current /var/mmfs/mmsysmon/mmsysmonitor.conf file.
  2. Open the file with a text editor, and search for the [clusterstate] section to set the value of csmsetmissingexportsfailed to true or false:
    [clusterstate]
    ...
    
    # true = allow CSM to override NFS/SMB missing export events on the CES nodes (set to FAILED)
    # false = CSM does not override NFS/SMB missing export events on the CES nodes
    csmsetmissingexportsfailed = true
  3. Close the editor and restart the system health monitor using the following command:
    mmsysmoncontrol restart
  4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system health monitor on all nodes.
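Steps 1 and 2 of the procedure above could also be scripted. The following is a minimal sketch, assuming that each setting sits on its own key = value line inside its section, as in the excerpt shown. The helper name set_monitor_option and the .bak backup suffix are hypothetical choices, not part of the product.

```python
import re
import shutil
from pathlib import Path


def set_monitor_option(conf_path, section, key, value):
    """Back up conf_path, then set `key = value` inside [section].

    Returns True if a matching line was found and rewritten.
    Comment lines and all other settings are left untouched.
    """
    path = Path(conf_path)
    # Step 1: backup copy next to the original (suffix is an assumption).
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    lines = path.read_text().splitlines(keepends=True)
    in_section = False
    changed = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section the following lines belong to.
            in_section = stripped[1:-1].strip() == section
        elif in_section and re.match(rf"{re.escape(key)}\s*=", stripped):
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            lines[i] = f"{indent}{key} = {value}{newline}"
            changed = True

    if changed:
        path.write_text("".join(lines))
    return changed
```

A call such as set_monitor_option("/var/mmfs/mmsysmon/mmsysmonitor.conf", "clusterstate", "csmsetmissingexportsfailed", "false") would then be followed by mmsysmoncontrol restart, as in step 3. The sketch edits the line in place rather than rewriting the file through a parser so that the existing comments are preserved.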
Important: During the restart of a node, some internal checks are done by the system health monitor for a file system's availability if NFS or SMB is enabled. These checks detect if all the required file systems for the declared exports are available. There might be cases where file systems are not available or are unmounted at the time of the check. This might be a timing issue, or because some file systems are not automatically mounted. In such cases, the NFS service is not started and remains in a STOPPED state even if all relevant file systems are available at a later point in time.
This NFS startup check can be configured as follows:
  1. Make a backup copy of the current /var/mmfs/mmsysmon/mmsysmonitor.conf file.
  2. Open the file with a text editor, and search for the [nfs] section to set the value of preventnfsstartuponmissingfs to true or false:
    # NFS settings
    #
    [nfs]
    ...
    # prevent NFS startup after reboot/mmstartup if not all required filesystems for exports are available
    # true = prevent startup / false = allow startup
    preventnfsstartuponmissingfs = true
  3. Close the editor and restart the system health monitor using the following command:
    mmsysmoncontrol restart
  4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system health monitor on all nodes.
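After the file is distributed, it can be useful to confirm on each node that both settings have the intended values. The sketch below assumes that the file content can be read by Python's configparser, which is not confirmed for every entry in a real mmsysmonitor.conf file. The function name read_export_check_settings is hypothetical.

```python
import configparser


def read_export_check_settings(conf_text):
    """Return the two export-related settings as booleans.

    A value of None means that the section or option was not found.
    """
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates duplicate sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        "csmsetmissingexportsfailed": parser.getboolean(
            "clusterstate", "csmsetmissingexportsfailed", fallback=None
        ),
        "preventnfsstartuponmissingfs": parser.getboolean(
            "nfs", "preventnfsstartuponmissingfs", fallback=None
        ),
    }
```

The function takes the file content as a string, so it can be fed from open("/var/mmfs/mmsysmon/mmsysmonitor.conf").read() on a node or from a copy collected elsewhere.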