Gateway node failure and recovery
When the primary gateway of a fileset fails, another gateway node takes over the ownership of the fileset.
Gateway node failures are not catastrophic and do not result in the loss of data or in AFM losing the ability to send updates and revalidations to the home cluster.
AFM internally stores all the information necessary to replay the updates made in the cache at the home cluster. When a gateway node fails, the in-memory queue is lost. The node that takes over for the failed gateway rebuilds the queue in memory. The process of rebuilding the queue is called recovery. As an administrator, ensure that sufficient disk space is available in /var/mmfs/afm for a smooth recovery. For the recovery of 1 million files or inodes, approximately 250 MB of disk space is required in /var/mmfs/afm.
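For example, you can verify the free space on the partition that holds the default location with a standard disk usage check on the gateway node:
df -h /var/mmfs/afm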
During recovery, outstanding cache updates are placed on the in-memory queue and the gateway starts processing the queue. AFM collects the pending operations by running a policy scan in the fileset. AFM uses the policy infrastructure in IBM Storage Scale to engage all the nodes that mount the file system. Pending requests are placed in a special queue that is called the priority queue, which is separate from the normal queue where regular requests are queued. After the priority queue is flushed to home, the cache and home are synchronized, recovery is complete, and the cache returns to the Active state. In some cases, files or directories that are deleted at cache might not be deleted at home and therefore remain at the home. However, the recovery operations remain unaffected.
The beginning and end of the AFM recovery process can be monitored by using the afmRecoveryStart and afmRecoveryEnd callback events.
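For example, the following commands register a notification script for both events; the script path /usr/local/bin/afm_notify.sh and the callback identifiers are illustrative:
mmaddcallback afmRecoveryStartCB --command /usr/local/bin/afm_notify.sh \
    --event afmRecoveryStart --parms "%eventName %fsName %filesetName"
mmaddcallback afmRecoveryEndCB --command /usr/local/bin/afm_notify.sh \
    --event afmRecoveryEnd --parms "%eventName %fsName %filesetName"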
Recovery is used only for single-writer mode and independent-writer mode filesets. It is triggered when the cache fileset attempts to move to the Active state, for example, when the fileset is accessed for the first time after the failure.
Recovery can run in parallel on multiple filesets, although only one instance can run on a fileset at a time. The time taken to synchronize the contents of cache with home after the recovery of a gateway node depends on the number of files in the fileset and the number of outstanding changes since the last failure.
To monitor the progress of synchronization during recovery, run the following command:
mmafmctl fs1 getstate -j filesetSW1 --write-stats
A sample output is as follows:
Fileset Name  Total Written Data  N/w Throughput  Total Pending Data  Estimated
              (Bytes)             (KB/s)          to Write (Bytes)    Completion time
------------  ------------------  --------------  ------------------  ---------------
filesetSW1    98359590            68              22620600            5 (Min)
- If the afmFastCreate parameter is set to yes or AFM to cloud object storage is enabled on a fileset, the --read-stats and --write-stats options show information such as N/w Throughput, Total Pending Data, and Estimated Completion time only during a recovery or resync operation. During regular operations, the --read-stats or --write-stats option shows only Total Written Data (see the example after these notes).
- During a recovery event, it might take some time for AFM to collect recovery data and queue operations to the AFM gateway node. The synchronization status is not shown until data is queued to the AFM gateway and the write operations are synchronized to the home.
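The --read-stats option works the same way for read statistics; for example, assuming the same fileset name as above:
mmafmctl fs1 getstate -j filesetSW1 --read-stats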
In an environment with multiple AFM caches, recovery is triggered after a failure. To limit the maximum number of AFM or AFM-DR filesets that can perform recovery at a time, set the afmMaxParallelRecoveries parameter. Only the specified number of filesets are accessed for recovery at a time; after recoveries are complete on these filesets, the next set of filesets is accessed for recovery. By default, afmMaxParallelRecoveries is set to 0 and the recovery process is run on all filesets. Setting afmMaxParallelRecoveries restricts the number of simultaneous recoveries and thus conserves hardware resources. For more information, see Configuration parameters for AFM, AFM-DR, and AFM to cloud object storage. An example is shown below.
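For example, to allow at most four filesets to run recovery at the same time (the value 4 is only illustrative):
mmchconfig afmMaxParallelRecoveries=4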
Peer snapshots that are created in cache and queued to home might get lost due to gateway node failure. These peer snapshots cannot be recovered through AFM recovery. For more information, see Peer snapshot -psnap.
If an AFM-enabled fileset has no updates, the failure of a gateway node is harmless and application nodes do not experience delays. During recovery, application requests to all AFM filesets are momentarily blocked. The following example shows the fileset state moving through the FlushOnly, Recovery, and Active states:
node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec
------------ -------------- ------------- ------------ ------------ -------------
fileset_SW nfs://node4/gpfs/fshome/fset001 FlushOnly node2 0 0
node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec
------------ -------------- ------------- ------------ ------------ -------------
fileset_SW nfs://node4/gpfs/fshome/fset001 Recovery node2 0 0
node2:/gpfs/cache/fileset_SW # mmafmctl fs1 getstate
Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec
------------ -------------- ------------- ------------ ------------ -------------
fileset_SW nfs://node4/gpfs/fshome/fset001 Active node2 0 3
For more information, see mmafmctl command.
An example of the messages in mmfs.log is as follows:
Thu Oct 27 15:28:15 CEST 2016: [N] AFM: Starting recovery for fileset 'fileset_SW' in fs 'fs1'
Thu Oct 27 15:28:15 CEST 2016: mmcommon afmrecovery invoked: device=fs1 filesetId=1....
Thu Oct 27 15:28:16 CEST 2016: [N] AFM: mmafmlocal /usr/lpp/mmfs/bin/mmapplypolicy….
Thu Oct 27 15:28:31.325 2016: [I] AFM: Detecting operations to be recovered...
Thu Oct 27 15:28:31.328 2016: [I] AFM: Found 2 update operations...
Thu Oct 27 15:28:31.331 2016: [I] AFM: Starting 'queue' operation for fileset 'fileset_SW' in filesystem 'fs1'.
Thu Oct 27 15:28:31.332 2016: [I] Command: tspcache fs1 1 fileset_SW 0 3 1346604503 38 0 43
Thu Oct 27 15:28:31.375 2016: [I] Command: successful tspcache fs1 1 fileset_SW 0 3 1346604503 38 0 43
Thu Oct 27 15:28:31 CEST 2016: [I] AFM: Finished queuing recovery operations for /gpfs/cache/fileset_SW
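To follow the recovery messages on the gateway node, you can search the current GPFS log; the pattern shown here is only illustrative:
grep 'AFM' /var/adm/ras/mmfs.log.latest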
Failures during recovery
A fileset can be in the Recovery state but fail to complete recovery under some conditions. The fileset then moves to the Dropped or NeedsResync state, which indicates that the recovery failed.
The mmfs.log might contain the following lines:
AFM: File system fs1 fileset adrSanity-160216-120202-KNFS-TC11-DRP encountered an error synchronizing with the remote cluster. Cannot synchronize with the remote cluster until AFM recovery is executed. remote error 28.
After a recovery fails, the next recovery is triggered after 120 seconds or when some operation is performed on the SW or IW fileset. After a successful recovery, modifications at the cache are synchronized with the home and the fileset state is Active. If recovery keeps failing, check the following conditions:
- Check the inode or block quota on cache and at home.
- Ensure that home is accessible. Remount home file system and restart NFS at home.
- Ensure that the gateway node memory threshold is not reached. If it is reached, increase the afmHardMemThreshold parameter (see the example after this list).
- Check network connectivity with home.
- If recovery keeps failing because eviction is triggered by exceeded block quotas, increase the block quotas or disable eviction so that recovery can proceed.
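For example, the fileset quotas at cache can be checked with mmlsquota, and the memory threshold can be raised with mmchconfig; the fileset name and the 10G value are only illustrative:
mmlsquota -j filesetSW1 fs1
mmchconfig afmHardMemThreshold=10G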
Customizing a storage location to store the AFM recovery data
When a recovery, resync, or reconcile event is triggered on an AFM fileset, AFM stores recovery data under the default /var/mmfs/afm location. If the /var partition is full or does not have sufficient storage space, AFM recovery might not complete.
- Set the cluster-level afmRecoveryDir configuration parameter by using the mmchconfig command to define a custom storage path for recovery, resync, and reconcile operations. Then, enable individual filesets by setting afmRecoveryUseFset=yes with the mmchfileset command.
- After the afmRecoveryDir parameter is set at the cluster level, any specific AFM fileset can be enabled to use the customized location by setting afmRecoveryUseFset on the fileset. This parameter lets you enable the custom location only for those filesets that have a large amount of data to process for synchronization during a recovery or resync event.
- If the afmRecoveryDir parameter is not set at the cluster level, AFM uses the default storage path, /var/mmfs/afm, for every fileset.
- If the afmRecoveryDir parameter is set at the cluster level and afmRecoveryUseFset is set at the fileset level, AFM uses the specified custom location for recovery, resync, and reconcile events.
- If the afmRecoveryDir parameter is set at the cluster level and afmRecoveryUseFset is not set on the AFM fileset, AFM recovery uses the <fs_path>/<fileset_path>/.ptrash directory for the recovery data of that fileset.
- Specify a custom storage location for recovery at the cluster level that has sufficient storage space. The valid value is a valid path name.
mmchconfig afmRecoveryDir=/bigstore -i
- Enable an existing individual AFM fileset to use the custom location that is specified in the afmRecoveryDir parameter.
mmchfileset fs1 ADR-1 -p afmRecoveryUseFset=yes,....
The system output is as follows:
Fileset ADR-1 changed.
- Unset the afmRecoveryUseFset parameter on an AFM fileset. Valid values for the afmRecoveryUseFset parameter are 'yes' and 'no'.
mmchfileset fs1 ADR-1 -p afmRecoveryUseFset=no
The system output is as follows:
Fileset ADR-1 changed.
- Create an AFM fileset that uses the custom location that is specified in the afmRecoveryDir parameter. The valid value of the afmRecoveryDir parameter is a valid path name.
mmcrfileset fs1 ADR-1 -p afmRecoveryUseFset=yes,....
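To verify the settings, you can display the AFM attributes of the fileset; this listing is a general check rather than a dedicated report of the recovery location:
mmlsfileset fs1 ADR-1 --afm -L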