AFM issues

The following table lists common questions about AFM and their resolutions.

Table 1. Common questions in AFM with their resolution
Question Answer / Resolution
How do I flush requeued messages?

Sometimes, requests in the AFM message queue on the gateway node are requeued because of errors at the home cluster. For example, if space is not available at the home cluster to perform a new write, a queued write message does not succeed and is requeued. The administrator can view the requeued messages on the primary gateway. Add more space to the home cluster and run mmafmctl resumeRequeued (see the example after the dump output below) so that the requeued messages are run at the home again. If an administrator does not run mmafmctl resumeRequeued, AFM runs the messages in the regular order of message execution from the cache cluster to the home cluster.

Running the mmfsadm saferdump afm all command on the gateway node displays the queued messages. The requeued messages are displayed in the dumps. An example:

c12c4apv13.gpfs.net: Normal Queue: (listed by execution order)
 (state: Active) c12c4apv13.gpfs.net: Write [612457.552962]
 requeued file3 (43 @ 293) chunks 0 bytes 0 0
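
For example, after adding space at the home, the requeued messages can be rerun from the cache cluster. The names fs1 and sw1 below are placeholder file system and fileset names:

# fs1 and sw1 are placeholders for the cache file system and fileset names
mmafmctl fs1 resumeRequeued -j sw1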
Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?
Filesets that use a mapping target go to the Disconnected state if the NFS server of the primary gateway is unreachable, even if the NFS servers of all participating gateways are reachable. Check the NFS server of the primary gateway to fix this problem.
How do I activate an inactive fileset?
Run the mmafmctl prefetch command without any options (the form that displays prefetch statistics); this activates an inactive fileset.
How do I reactivate a fileset in the Dropped state?
Run the mmafmctl prefetch command without any options (the form that displays prefetch statistics); this also reactivates a fileset in the Dropped state.
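For example, with placeholder file system and fileset names fs1 and sw1, the following command displays prefetch statistics and activates or reactivates the fileset:

# fs1 and sw1 are placeholder names
mmafmctl fs1 prefetch -j sw1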
How do I cleanly unmount the home file system if there are caches that use the GPFS protocol as the backend?
To cleanly unmount the home file system, first unmount it on the cache cluster where it is remotely mounted, and then unmount it at the home. Unmounting the remote file system from all nodes in the cache cluster might not be possible until the relevant cache fileset is unlinked or the local file system is unmounted.

A forced unmount, shutdown, or crash of the remote cluster results in a panic of the remote file system at the cache cluster, and the queue is dropped. The next access to the fileset runs recovery. However, this should not affect the cache cluster.
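
A minimal sketch of this unmount order, assuming the home file system is named homefs and is remotely mounted as remotefs on the cache cluster (placeholder names):

# On the cache cluster: unmount the remotely mounted home file system from all nodes
mmunmount remotefs -a
# On the home cluster: unmount the home file system
mmunmount homefs -a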

What should be done if the df command hangs on the cache cluster?

On RHEL 7.0 or later, df does not support hidden NFS mounts. As AFM uses regular NFS mounts on the gateway nodes, this change causes commands like df to hang if the secondary gets disconnected.

The following workaround allows NFS mounts to continue to be hidden:

Remove the /etc/mtab symbolic link, create a new regular file /etc/mtab, and copy /proc/mounts to /etc/mtab during startup. With this solution, the /etc/mtab file might go out of synchronization with /proc/mounts.
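One way to implement this workaround at startup (a sketch; the exact startup hook depends on the distribution):

# Replace the /etc/mtab symbolic link with a regular file seeded from /proc/mounts
rm -f /etc/mtab
cp /proc/mounts /etc/mtab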

What happens when the hard quota is reached in an AFM cache?
As in any file system that reaches the hard quota limit, requests fail with E_NO_SPACE.
When are inodes deleted from the cache?
After an inode is allocated, it is never deleted. The space remains allocated, and the inodes are reused.
If inode quotas are set on the cache, what happens when the inode quotas are reached?
Attempts to create new files fail, but cache eviction is not triggered. Cache eviction is triggered only when the block quota is reached, not the inode quota.
How can the cache use more inodes than the home?
One way is through file deletions and re-creations. If a file is renamed at the home site, the file in the cache is deleted and created again in the cache, so the file is assigned a different inode number at the cache site. Also, if a cache fileset is in LU mode or SW mode, changes made at the cache can cause it to use more inodes than the home.
Why does a fileset go to the Unmounted state even though the home is accessible from the cache cluster?
Sometimes the same home is used by multiple clusters. One cluster's set of filesets performing a quiesce can make the home unresponsive to the second cluster's filesets, which then show the home as unmounted.
What is the impact of not running the mmafmconfig command despite having a GPFS home?
Sparse file support is not present even if the home is GPFS. Recovery and many AFM functions do not work. Crashes can happen during readdir or lookup if the backend uses the NSD protocol and the remote mount is not available at the gateway node.
What should be done if there are cluster-wide waiters but everything looks normal, for example the home is accessible from the gateway nodes and applications are making progress on the cache fileset?
This can happen when the application produces requests at a faster pace than they can be flushed to the home. Check the I/O history (iohist) to check the disk rates.
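For example, one way to view recent disk I/O history on a gateway node is:

mmdiag --iohist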
A read seems to be stuck or in flight for a long time. What should be done?
Restart NFS at the home to see whether the error resolves. Check the status of the fileset by using the mmafmctl getstate command to see whether the fileset is in the Unmounted state.
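For example, with placeholder file system and fileset names fs1 and sw1:

mmafmctl fs1 getstate -j sw1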
The mmfs.log shows errors during read, such as error 233.
These are temporary issues during read. For example:
Tue Feb 16 03:32:40.300 2016: [E] AFM: Read file system fs1 fileset newSanity-160216-020201-KNFS-TC8-SW file IDs [58195972.58251658.-1.-1,R] name file-3G remote error 233
These errors go away automatically, and the read should be successful.
Can the home have different subdirectories exported by using unique FSIDs, while the parent directory is also exported by using an FSID?
This configuration is not recommended.
I have a non-GPFS home and applications running in the cache, and some requests are requeued with the following error:
SetXAttr file system fs1 fileset sw_gpfs file IDs [-1.1067121.-1.-1,N] name local error 124
mmafmconfig is not set up at the home. Running the mmafmconfig command at the home and relinking the cache should resolve this issue.
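A sketch of the fix, assuming the home export path is /home/export and the cache junction is /gpfs/fs1/sw_gpfs (both placeholder paths):

# At the home: enable AFM support on the exported path
mmafmconfig enable /home/export
# At the cache cluster: relink the fileset
mmunlinkfileset fs1 sw_gpfs
mmlinkfileset fs1 sw_gpfs -J /gpfs/fs1/sw_gpfs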
During the failover process, some gateway nodes might show error 233 in mmfs.log.
This error is harmless. The failover completes successfully.
Resync fails with a 'No buffer space available' error, but mmdiag --memory shows that memory is available.
Increase the afmHardMemThreshold value.
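For example (the value shown is only illustrative; choose a limit that suits the workload):

mmchconfig afmHardMemThreshold=10G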
How can I change the mode of a fileset?
The mode of an AFM client cache fileset cannot be changed from local-update mode to any other mode; however, it can be changed from read-only to single-writer (and vice versa), and from either read-only or single-writer to local-update. Complete the following steps to change the mode (an example command sequence follows the steps):
  1. Ensure that the fileset status is Active and that the gateway is available.
  2. Unmount the file system.
  3. Unlink the fileset.
  4. Run the mmchfileset command to change the mode.
  5. Mount the file system again.
  6. Link the fileset again.
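A sketch of the sequence, assuming file system fs1 and fileset sw1 (placeholder names) and a change from single-writer to read-only:

mmafmctl fs1 getstate -j sw1             # confirm the fileset is Active and a gateway is assigned
mmunmount fs1 -a                         # unmount the file system
mmunlinkfileset fs1 sw1                  # unlink the fileset
mmchfileset fs1 sw1 -p afmmode=ro        # change the AFM mode
mmmount fs1 -a                           # mount the file system again
mmlinkfileset fs1 sw1 -J /gpfs/fs1/sw1   # link the fileset again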
Why are setuid or setgid bits in a single-writer cache reset at home after data is appended?
The setuid or setgid bits in a single-writer cache are reset at home after data is appended to files on which those bits were previously set and synced. This is because over NFS, a write operation to a setuid file resets the setuid bit.
How can I traverse a directory that is not cached?
On a fileset whose metadata in all subdirectories is not cached, any application that optimizes by assuming that directories contain two fewer subdirectories than their hard link count does not traverse the last subdirectory. One such example is find; on Linux®, a workaround is to use find -noleaf to correctly traverse a directory that has not been cached.
What extended attribute size is supported?
For a gateway node whose Linux kernel version is below 2.6.32, the NFS maximum rsize is 32 KB, so AFM does not support an extended attribute size of more than 32 KB on that gateway.
What should I do when my file system or fileset is getting full?
The .ptrash directory is present in the cache and at the home. In some cases, where there is a conflict that AFM cannot resolve automatically, the file is moved to .ptrash at the cache or the home. In the cache, the .ptrash directory is cleaned up when eviction is triggered. At the home, it is not cleared automatically. When the administrator needs to clear space, the .ptrash directory must be cleaned up first.
How do I restore an unmounted AFM fileset that uses the GPFS™ protocol as the backend?
If the NSD mount on the gateway node is unresponsive, AFM does not synchronize data with the home, and the file system might be unmounted at the gateway node. The following message is written to mmfs.log:
AFM: Remote filesystem remotefs is panicked due to unresponsive messages on fileset <fileset_name>, re-mount the filesystem after it becomes responsive. mmcommon preunmount invoked. File system: fs1 Reason: SGPanic
After the home becomes responsive, you must restore the NSD mount on the gateway node.
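A sketch of restoring the mount, assuming the remote (home) file system is named remotefs and the gateway node is gw1 (placeholder names):

# Remount the remote file system on the gateway node only
mmmount remotefs -N gw1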