AFM DR issues

This topic lists answers to common AFM DR questions.

Table 1. Common AFM DR questions and their resolutions
How do I flush requeued messages?

Sometimes, requests in the AFM message queue on the gateway node are requeued because of errors at the home cluster. For example, if space is not available at the home cluster for a new write, the queued write message fails and is requeued. The administrator can see the requeued message on the MDS. Add more space to the home cluster and run mmafmctl resumeRequeued so that the requeued messages are executed at home again. If an administrator does not run mmafmctl resumeRequeued, AFM executes the requeued messages in the regular order of message execution from the cache cluster to the home cluster. Running mmfsadm saferdump afm all on the gateway node displays the queued messages, and the requeued messages are marked in the dump. An example:

c12c4apv13.gpfs.net: Normal Queue: (listed by execution order) (state: Active)
c12c4apv13.gpfs.net: Write [612457.552962] requeued file3 (43 @ 293) chunks 0 bytes 0 0
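The inspect-and-resume steps above can be sketched as follows. This is a sketch only; the file system name fs1 and the fileset name drFileset are placeholders, not names from this document.

```shell
# Run on the gateway node that is the MDS for the fileset.
# Dump the AFM queues and look for messages marked "requeued".
mmfsadm saferdump afm all | grep -i requeued

# After fixing the root cause (for example, adding space at home),
# push the requeued messages back for execution.
# fs1 and drFileset are placeholder names for this sketch.
mmafmctl fs1 resumeRequeued -j drFileset

# Confirm that the queue is draining.
mmafmctl fs1 getstate -j drFileset
```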

Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?

Filesets that use a mapping target go into the Disconnected state if the NFS server of the MDS is unreachable, even when the NFS servers of all participating gateways are reachable. Check the NFS server of the MDS to fix this problem.
Why does a clean unmount of the secondary file system fail when caches use the GPFS protocol as a backend?

To unmount the secondary file system cleanly, first unmount it on the primary cluster, where it is remotely mounted, and then unmount the secondary file system itself. It might not be possible to unmount the remote file system from all nodes in the cluster until the relevant primary is unlinked or the local file system is unmounted.

A forced unmount, shutdown, or crash of the remote cluster panics the remote file system on the primary cluster and drops the queue; the next access to the fileset runs recovery. However, this does not affect the primary cluster.
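The unmount ordering described above can be sketched as follows. The device name secondaryfs is a placeholder; on the primary cluster, substitute the device name under which the secondary file system is remotely mounted.

```shell
# Sketch of the clean unmount order; "secondaryfs" is a placeholder
# device name, not taken from this document.

# 1. On the primary cluster: unmount the remotely mounted secondary
#    file system on all nodes (after unlinking the relevant primary
#    filesets, if needed).
mmunmount secondaryfs -a

# 2. On the secondary cluster: unmount the secondary file system itself.
mmunmount secondaryfs -a
```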

Why does the df command hang on the primary cluster?

On RHEL 7.0 or later, df does not support hidden NFS mounts. Because AFM uses regular NFS mounts on the gateway nodes, this change causes commands such as df to hang if the secondary becomes disconnected. The following workaround allows NFS mounts to remain hidden:

Remove the /etc/mtab symlink, create a new /etc/mtab file, and copy /proc/mounts to /etc/mtab during startup. With this workaround, the /etc/mtab file might go out of sync with /proc/mounts.
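A minimal sketch of this workaround, to be run once at startup (for example, from an init script or systemd unit) on the affected gateway nodes. It requires root, and, as noted above, the resulting /etc/mtab can drift out of sync with /proc/mounts.

```shell
# Replace the /etc/mtab symlink with a static copy of /proc/mounts
# so that NFS mounts stay hidden from commands such as df.
if [ -L /etc/mtab ]; then
    rm /etc/mtab                # remove the symlink
    cp /proc/mounts /etc/mtab   # snapshot the current mount table
fi
```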

What does the NeedsResync state imply?

The NeedsResync state does not necessarily indicate a problem. If the fileset is in this state during a conversion or recovery, the problem is fixed automatically by a subsequent recovery. You can run mmafmctl <fsname> getstate periodically to check whether the queue length is changing, and check the GPFS logs for errors, such as an unmounted file system.
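The monitoring described above can be sketched as a simple polling loop. The file system name fs1 and the 60-second interval are illustrative choices, not values from this document.

```shell
# Poll the fileset state; a changing queue length in the output
# indicates that AFM is making progress. fs1 is a placeholder name.
while true; do
    mmafmctl fs1 getstate
    sleep 60
done
```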

Is there a single command to delete all RPO snapshots from a primary fileset?

No. All RPO snapshots must be deleted manually.

Suppose there are more than two RPO snapshots on the primary. Where did these snapshots come from?

Check the queue, and check whether a recovery occurred in the recent past. The extra snapshots are deleted during subsequent RPO cycles.
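Deleting RPO snapshots one by one can be sketched as follows. The file system name fs1, the fileset name drFileset, and the snapshot name are all placeholders; take the real snapshot names from the mmlssnapshot listing.

```shell
# List the snapshots of the primary fileset; RPO snapshots appear
# in this listing. fs1 and drFileset are placeholder names.
mmlssnapshot fs1 -j drFileset

# Delete each RPO snapshot individually by name.
# "rpo-snapshot-name" stands in for a name from the listing above.
mmdelsnapshot fs1 rpo-snapshot-name -j drFileset
```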
How do I restore an unmounted AFM DR fileset that uses the GPFS™ protocol as a backend?

If the NSD mount on the gateway node is unresponsive, AFM DR does not synchronize data with the secondary, and the file system might be unmounted at the gateway node. A message similar to the following is written to mmfs.log:

AFM: Remote filesystem remotefs is panicked due to unresponsive messages on fileset <fileset_name>, re-mount the filesystem after it becomes responsive. mmcommon preunmount invoked. File system: fs1 Reason: SGPanic

After the secondary becomes responsive, you must restore the NSD mount on the gateway node.
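Restoring the NSD mount on the gateway node can be sketched as follows. The device name remotefs matches the remote file system named in the log message, and fs1 is the local primary file system; both are illustrative here.

```shell
# Run on the gateway node after the secondary becomes responsive.
# "remotefs" is the remote file system device from the log message.
mmmount remotefs

# Verify that AFM DR resumes synchronizing the fileset.
mmafmctl fs1 getstate
```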