Monitoring fileset states for AFM

An AFM fileset can be in different states, depending on its mode and the state of its queue.

To view the current cache state, run the mmafmctl filesystem getstate command, or the mmafmctl filesystem getstate -j cache_fileset command.
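For example, with a file system named fs1, the state of all AFM filesets or of a single cache fileset can be queried as follows; fs1 and cache_fileset are placeholder names.

```
# Report the cache state of all AFM filesets in file system fs1
mmafmctl fs1 getstate

# Report the cache state of a single cache fileset
mmafmctl fs1 getstate -j cache_fileset
```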
See the following table for an explanation of each cache state:
Table 1. AFM states and their description

| AFM fileset state | Condition | Description | Healthy or Unhealthy | Administrator's action |
|---|---|---|---|---|
| Inactive | The AFM cache is created. | Operations have not been initiated on the cache cluster since the last daemon restart. | Healthy | None. |
| FlushOnly | Operations are queued. | Operations have not yet started to flush. | Healthy | None. This is a temporary state that moves to Active when a write is initiated. |
| Active | The AFM cache is active. | The cache cluster is ready for operations. | Healthy | None. |
| Dirty | The AFM cache is active. | Pending changes in the cache cluster have not yet been played at the home cluster. This state does not hamper normal activity. | Healthy | None. |
| Recovery | The cache is accessed after a primary gateway failure. | A new gateway node takes over the fileset as primary gateway after the old primary gateway failed. | Healthy | None. |
| QueueOnly | The cache is running an operation. | Operations such as recovery, resync, or failover are being executed; requests are queued but not flushed. | Healthy | None. This is a temporary state. |
| Disconnected | The primary gateway cannot connect to the NFS server at the home cluster. Occurs only in a cache cluster that is created over an NFS export. | When parallel data transfer is configured, this state reflects the connectivity between the primary gateway and the mapped home server, irrespective of the other gateway nodes. | Unhealthy | Correct the errant NFS servers on the home cluster. |
| Unmounted | The cache that uses NFS detects a change in the home cluster, sometimes during creation or in the middle of an operation, if the home exports are changed. | The home NFS is not accessible, the home exports are not exported properly, or the home export does not exist. | Unhealthy | Fix the NFS export issue as described in the Home setup section and retry access. Relink the cache cluster if it does not recover. After mountRetryInterval, the primary gateway retries the connection with home. |
| Unmounted | The cache that uses the GPFS protocol detects a change in the home cluster, sometimes during creation or in the middle of an operation. | There are problems accessing the local mount of the remote file system. | Unhealthy | Check the remote file system mount on the cache cluster and remount it if necessary. |
| Dropped | Recovery failed. | The local file system is full, space is not available on the cache or the primary cluster, or a policy failure occurred during recovery. | Unhealthy | Fix the issue and access the fileset to retry recovery. |
| Dropped | IW failback failed. | The local file system is full, space is not available on the cache or the primary cluster, or a policy failure occurred during recovery. | Unhealthy | Fix the issue and access the fileset to retry failback. |
| Dropped | A cache with active queue operations is forcibly unlinked. | All queued operations are de-queued; the fileset remains in the Dropped state and moves to the Inactive state when the unlinking is complete. | Healthy | None. This is a temporary state. |
| Dropped | The old gateway node starts functioning properly after a failure. | AFM internally transfers queues from one gateway node to another to handle gateway node failures. | Healthy | None. The system resolves this state on the next access. |
| Dropped | Cache creation, or the middle of an operation, when the home exports changed. | Export problems at home, such as: the home path is not exported on all NFS server nodes that interact with the cache clusters; the home cluster is exported after operations have started on the fileset, or the fsid on the home cluster is changed after fileset operations have begun; not all home NFS servers have the same fsid for the same export path. | Unhealthy | Fix the NFS export issue as described in the Home setup section and retry access. Relink the cache cluster if it does not recover. After mountRetryInterval, the primary gateway retries the connection with the home cluster. |
| Dropped | During recovery or normal operation, the gateway queue memory is exceeded. | If the gateway queue memory is exceeded, the queue can be dropped. The memory must be increased to accommodate all requests and bring the queue back to the Active state. | Unhealthy | Increase afmHardMemThreshold (tuning example after this table). |
| Expired | An RO cache that is configured to expire. | An event that occurs automatically after a prolonged disconnection, when the cached contents are no longer accessible. | Unhealthy | Fix the errant NFS servers on the home cluster. |
| NeedsFailback | An IW cache that must complete failback. | A failback that was initiated on an IW cache cluster is interrupted and incomplete. | Unhealthy | Failback is automatically triggered on the fileset, or the administrator can run failback again. |
| FailbackInProgress | Failback is initiated on an IW cache. | Failback is in progress and automatically moves to FailbackCompleted. | Healthy | None. |
| FailbackCompleted | The IW cache after failback. | Failback completed successfully on the IW cache cluster. | Healthy | Run mmafmctl failback --stop on the cache cluster (example after this table). |
| NeedsResync | The SW cache cluster when home is corrupted. | Occurs when the home cluster is accidentally corrupted. | Unhealthy | Run mmafmctl resync on the cache cluster (example after this table). |
| NeedsResync | Recovery on the SW cache. | A rare state that is possible only under error conditions during recovery. | Unhealthy | No administrator action is required. The system fixes this state during the subsequent recovery. |
| Stopped | Replication is stopped on the fileset. | The fileset stops sending changes to the gateway node. Mainly used during planned downtime. | Unhealthy | After the planned downtime, run mmafmctl <fs> start -j <fileset> to resume sending changes to the gateway node and continue replication (example after this table). |
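For the Dropped state that is caused by exceeding the gateway queue memory, the limit is controlled by the afmHardMemThreshold configuration attribute. A minimal tuning sketch, assuming the attribute is changed with mmchconfig and that 8G is only an illustrative value to be sized for the actual workload:

```
# Raise the AFM gateway queue memory limit (8G is an example value; size for your workload)
mmchconfig afmHardMemThreshold=8G
```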
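The administrator actions in the table that reference mmafmctl subcommands take the file system name and the fileset name as arguments. A minimal sketch, assuming a file system named fs1 and a cache fileset named fset1:

```
# NeedsResync on an SW cache: resynchronize the cache with the home cluster
mmafmctl fs1 resync -j fset1

# FailbackCompleted on an IW cache: complete the failback
mmafmctl fs1 failback -j fset1 --stop

# Stopped: resume replication after a planned downtime
mmafmctl fs1 start -j fset1
```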