Failover and failback

Failover

After a failure of the primary, applications can fail over to the secondary fileset. Performing a failover operation on the secondary fileset converts it to the primary and makes it writable. AFM DR failover has two modes: “As is” and “Revert to the latest RPO snapshot on the secondary”:
  • As is:

    The “as is” data corresponds to the last byte of data that was copied to the secondary cluster before the primary cluster failed. This is always the latest data available on the secondary cluster. It is the recommended option and also the default behavior.

  • Revert to the latest RPO snapshot on the secondary:

    Restoring data from the last RPO snapshot results in data being restored from the past. As a result, this process is not recommended in AFM DR because it might lead to data loss. For example, if the RPO interval is set to 24 hours and the failover occurs at the 23rd hour, restoring from the last snapshot discards all the data that was created in the 23 hours since that snapshot. This failover option can also take extra time because the data from the snapshot must be restored completely before the application can start using the fileset.
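As a rough illustration of this trade-off (the helper and the timestamps below are made up for illustration and are not part of the product), the worst-case data-loss window when reverting to the last RPO snapshot is simply the time elapsed since that snapshot:

```python
from datetime import datetime, timedelta

def revert_loss_window(last_rpo_snapshot: datetime, failure_time: datetime) -> timedelta:
    """Worst-case window of writes lost when a failover reverts the
    secondary to the last RPO snapshot: everything written after the
    snapshot (and already replicated before the failure) is discarded."""
    return failure_time - last_rpo_snapshot

# 24-hour RPO interval; the primary fails 23 hours after the last snapshot.
snap = datetime(2024, 1, 1, 0, 0)
fail = snap + timedelta(hours=23)

print(revert_loss_window(snap, fail))  # 23:00:00 -> up to 23 hours of writes lost
```

By contrast, the "as is" option loses only whatever had not yet been replicated at the moment of failure, which is why it is the default.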

Failback

There are two options in AFM DR to fail back to the original primary:
  • Failback Option 1: New primary

    Create the primary from scratch because all the data in the primary was lost. To create a primary from scratch, all the data needs to be copied to the original primary from the active primary by using whichever mechanism the administrator chooses.

  • Failback Option 2: Temporary loss of primary

    A temporary failure of the primary results in a failover of the application to the secondary. To reestablish the original primary:

    AFM DR uses the last RPO snapshot to calculate exactly what data changed on the secondary since the failover occurred. For this to work, a common snapshot must exist between the primary and secondary. The following steps give a simplified illustration of such a scenario:

    1. A PSNAP was created between the primary and secondary at 8 AM. At 8 AM, the snapshots on the primary and secondary contain exactly the same data.
    2. At 9 AM, the primary fails.
    3. At 9 AM, the applications fail over to the secondary.
    4. The primary is back up at 10 AM, 2 hours later.
    5. Restore the 8 AM snapshot on the primary fileset, discarding any changes that were not replicated before the failure.
    6. Calculate the data that changed on the secondary fileset since the 8 AM snapshot.
    7. Copy the changed data that is calculated in step 6 from the secondary fileset to the primary fileset. Now, the primary is in sync with the secondary fileset.
Note: For more information about the failback, see Failback procedures.
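The snapshot-based reconciliation above can be sketched as a toy model, where plain Python dictionaries stand in for fileset contents; nothing here is the AFM DR implementation, only the shape of the calculation:

```python
# Fileset contents at the common 8 AM PSNAP (identical on both sides).
psnap_8am = {"a.txt": "v1", "b.txt": "v1"}

# Secondary state at 10 AM, after serving the application during the outage.
secondary_now = {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}

# Restore the common snapshot on the primary, discarding its
# un-replicated post-snapshot writes.
primary = dict(psnap_8am)

# Calculate what changed on the secondary since the common snapshot:
# new or modified files, plus files deleted during the outage.
delta = {path: data for path, data in secondary_now.items()
         if psnap_8am.get(path) != data}
deleted = [path for path in psnap_8am if path not in secondary_now]

# Copy only that delta back to the primary.
primary.update(delta)
for path in deleted:
    del primary[path]

print(primary == secondary_now)  # True: primary is back in sync
```

Because only the delta since the common snapshot moves over the network, the failback cost is proportional to what changed, not to the size of the fileset.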

For a temporary primary failure, the PSNAP (peer-to-peer snapshot) mechanism can be used to periodically take synchronized snapshots of the primary and the secondary. The snapshots can be scheduled at fairly large intervals, for example, once or twice a day. Only the data that was created on the secondary since the last PSNAP needs to be copied back to the primary during a failback.

Note:
The amount of data copied back to the primary depends on the following:
  • How long the primary is unavailable.
  • The amount of data that is created since the primary failure.
  • How long after the last snapshot the failure occurred. For example, if the RPO snapshot is captured every 24 hours and the primary fails at the 23rd hour, the changes from the last 23 hours need to be recovered.
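As back-of-the-envelope arithmetic combining the factors above (the change rate and durations are invented for illustration), the failback copy volume grows with both the snapshot-to-failure gap and the outage length:

```python
def failback_copy_estimate(gb_per_hour: float,
                           hours_snapshot_to_failure: float,
                           hours_outage: float) -> float:
    """Rough upper bound on data copied back to the primary: changes
    replicated to the secondary between the last common snapshot and
    the failure, plus changes written on the secondary while it served
    the application during the outage."""
    return gb_per_hour * (hours_snapshot_to_failure + hours_outage)

# 24-hour RPO, failure at the 23rd hour, primary down for 2 hours,
# applications writing roughly 1 GB per hour.
print(failback_copy_estimate(1.0, 23, 2))  # 25.0 GB
```

Shortening the PSNAP interval shrinks the first term, which is why more frequent snapshots reduce failback time at the cost of more snapshot overhead.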