Gateway node failure and recovery

When the primary gateway node of a fileset fails, another gateway node takes over ownership of the fileset.

Gateway node failures are not catastrophic and do not result in loss of data or loss of the ability of AFM to communicate updates and revalidations to the home cluster.

AFM internally stores all the information necessary to replay at the home cluster the updates that were made in the cache. When a gateway node fails, its in-memory queue is lost. The queue is rebuilt in memory by the gateway node that takes over for the failed node; this process is called recovery. As an administrator, ensure that sufficient disk space is available in /var/mmfs/afm for a smooth recovery. Approximately 250 MB of disk space in /var/mmfs/afm is required for every one million files.
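For example, you can check the available space on the file system that contains /var/mmfs/afm on each gateway node (a generic operating system check, not an AFM-specific command):

   # On the gateway node: check free space available for recovery metadata.
   # Sizing guideline from above: roughly 250 MB per one million files in the fileset.
   df -h /var/mmfs/afm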

During recovery, outstanding cache updates are placed on the in-memory queue and the gateway node starts processing the queue. AFM collects the pending operations by running a policy scan in the fileset. AFM uses the policy infrastructure in IBM Spectrum Scale to engage all the nodes that mount the file system. Pending requests are placed in a special queue, called the priority queue, which is separate from the normal queue where regular requests are queued. After the priority queue is flushed to home, the cache and home are synchronized, recovery is complete, and the cache returns to the Active state. In some cases, files or directories that were deleted in the cache might not be deleted at home; these undeleted files or directories remain at home, but the recovery operations are unaffected.

The beginning and end of the AFM recovery process can be monitored by using the afmRecoveryStart and afmRecoveryEnd callback events.
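For example, you can register a callback for both events with the mmaddcallback command. In this sketch, the callback name afmRecoveryMonitor and the notification script /usr/local/bin/afm_recovery_notify.sh are assumptions for illustration:

   # Register a callback that runs a notification script when recovery starts or ends.
   mmaddcallback afmRecoveryMonitor --command /usr/local/bin/afm_recovery_notify.sh \
       --event afmRecoveryStart,afmRecoveryEnd --parms "%eventName %fsName %filesetName"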

Recovery is used only for single-writer (SW) and independent-writer (IW) mode filesets. It is triggered when the cache fileset attempts to move to the Active state, for example, when the fileset is accessed for the first time after the failure.

Recovery can run in parallel on multiple filesets, although only one instance can run on a fileset at a time. The time taken to synchronize the contents of cache with home after the recovery of a gateway node depends on the number of files in the fileset and the number of outstanding changes since the last failure.

In an environment with multiple AFM cache filesets, recovery is triggered on each affected fileset after a failure. You can configure afmMaxParallelRecoveries to limit the number of filesets, across all file systems in the cluster, on which recovery runs at the same time. Only that number of filesets is accessed for recovery at a time; after these recoveries are complete, the next set of filesets is accessed for recovery. By default, afmMaxParallelRecoveries is set to 0, and recovery runs on all filesets. Specifying afmMaxParallelRecoveries restricts the number of simultaneous recoveries and thus conserves hardware resources. For more information, see Configuration parameters for AFM.
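For example, to limit parallel recoveries to three filesets (an illustrative value), you can set the parameter with mmchconfig, assuming it is configured cluster-wide like other AFM tuning parameters:

   # Limit parallel AFM recoveries to three filesets at a time.
   mmchconfig afmMaxParallelRecoveries=3
   # Verify the current setting.
   mmlsconfig afmMaxParallelRecoveries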

Peer snapshots that were created in the cache and queued to home might be lost because of a gateway node failure. These peer snapshots cannot be recovered through AFM recovery. For details, see Peer snapshot -psnap.
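If a queued peer snapshot is lost in this way, one option is to create a new peer snapshot after the fileset returns to the Active state. The following sketch uses the mmpsnap command with the file system and fileset names from the example later in this topic:

   # Create a new peer snapshot for the fileset after recovery completes (illustrative names).
   mmpsnap fs1 create -j fileset_SW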

If there are no updates to any AFM-enabled fileset, the failure of a gateway node is harmless and application nodes do not experience delays. During recovery, application requests to all AFM filesets are momentarily blocked.

The following example indicates changes in the AFM fileset state during recovery:
node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name  Fileset Target                   Cache State  Gateway Node  Queue Length  Queue numExec
------------  --------------                   -----------  ------------  ------------  -------------
fileset_SW    nfs://node4/gpfs/fshome/fset001  FlushOnly    node2         0             0

node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name  Fileset Target                   Cache State  Gateway Node  Queue Length  Queue numExec
------------  --------------                   -----------  ------------  ------------  -------------
fileset_SW    nfs://node4/gpfs/fshome/fset001  Recovery     node2         0             0

node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name  Fileset Target                   Cache State  Gateway Node  Queue Length  Queue numExec
------------  --------------                   -----------  ------------  ------------  -------------
fileset_SW    nfs://node4/gpfs/fshome/fset001  Active       node2         0             3
For more information, see mmafmctl command.

An example of the messages in mmfs.log is as follows:

Thu Oct 27 15:28:15 CEST 2016: [N] AFM: Starting recovery for fileset 'fileset_SW' in fs 'fs1'
Thu Oct 27 15:28:15 CEST 2016: mmcommon afmrecovery invoked: device=fs1 filesetId=1....
Thu Oct 27 15:28:16 CEST 2016: [N] AFM: mmafmlocal /usr/lpp/mmfs/bin/mmapplypolicy….
Thu Oct 27 15:28:31.325 2016: [I] AFM: Detecting operations to be recovered...
Thu Oct 27 15:28:31.328 2016: [I] AFM: Found 2 update operations...
Thu Oct 27 15:28:31.331 2016: [I] AFM: Starting 'queue' operation for fileset 'fileset_SW' in filesystem 'fs1'.
Thu Oct 27 15:28:31.332 2016: [I] Command: tspcache fs1 1 fileset_SW 0 3 1346604503 38 0 43
Thu Oct 27 15:28:31.375 2016: [I] Command: successful tspcache fs1 1 fileset_SW 0 3 1346604503 38 0 43
Thu Oct 27 15:28:31 CEST 2016: [I] AFM: Finished queuing recovery operations for /gpfs/cache/fileset_SW

Failures during recovery

A fileset can enter the Recovery state but fail to complete recovery under certain conditions. In that case, the fileset moves to the Dropped or NeedsResync state, which indicates that recovery has failed.

The mmfs.log might contain lines similar to the following; remote error 28 is errno 28 (ENOSPC), which indicates that there is no space left on the device at home:

AFM: File system fs1 fileset adrSanity-160216-120202-KNFS-TC11-DRP encountered an error synchronizing with the remote cluster. Cannot synchronize with the remote cluster until AFM recovery is executed. remote error 28.

After a recovery fails, the next recovery is triggered after 120 seconds or when an operation is performed on the SW or IW fileset. After a successful recovery, modifications at the cache are synchronized with home and the fileset state is Active.

The following checks are needed from the administrator to ensure that the next recovery is successful; example commands for some of the checks follow this list:
  1. Check the inode and block quotas in the cache and at home.
  2. Ensure that home is accessible. If needed, remount the home file system and restart NFS at home.
  3. Ensure that the memory limit has not been reached. If it has been reached, increase afmHardMemThreshold.
  4. Check network connectivity with home.
  5. If recovery keeps failing because eviction is triggered when block quotas are exceeded, increase the block quotas or disable eviction so that recovery can complete.
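
The following commands sketch some of these checks. The file system and fileset names are taken from the earlier example, and the afmHardMemThreshold value is only illustrative:

   # Check 1: fileset quotas on the cache cluster (repeat at home for the home fileset).
   mmlsquota -j fileset_SW fs1
   # Check 3: raise the gateway memory threshold if it has been reached (illustrative value).
   mmchconfig afmHardMemThreshold=8G
   # Check 4: verify network connectivity and the NFS export from the cache cluster to home.
   ping -c 3 node4
   showmount -e node4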