ApplicationSet-based application failover between managed clusters
Fail over an ApplicationSet-based application from a primary managed cluster to a secondary managed cluster to maintain application availability during a disaster or cluster failure.
Before you begin
- If the primary cluster is in a state other than Ready, check the actual
status of the cluster, because the reported status might take some time to update.
- Navigate to tab.
- Check the status of both managed clusters individually before performing a failover operation.
However, a failover operation can still be run as long as the cluster you are failing over to is in a Ready state.
- Run the following command on the Hub Cluster to check if lastGroupSyncTime is within an acceptable data loss window, when compared to the current time:
oc get drpc -o yaml -A | grep lastGroupSyncTime
Example output:
[...] lastGroupSyncTime: "2023-07-10T12:40:10Z"
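The comparison against the current time can be scripted. The following is a minimal sketch, not part of the product tooling: it assumes GNU date (for the -d option), and the helper names sync_age_seconds and within_window, as well as any window value you pass in, are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: measure how old a lastGroupSyncTime value is.
# Assumes GNU date; both function names are hypothetical helpers.

# Print the number of seconds between a timestamp and "now".
# $1: timestamp as printed by oc, for example 2023-07-10T12:40:10Z
# $2: optional current time in epoch seconds (defaults to the system clock)
sync_age_seconds() {
  local ts="$1"
  local now="${2:-$(date -u +%s)}"
  local synced
  synced="$(date -u -d "$ts" +%s)" || return 1
  echo $(( now - synced ))
}

# Succeed (exit 0) only if the last sync is no older than the window.
# $1: timestamp, $2: acceptable data loss window in seconds,
# $3: optional current time in epoch seconds
within_window() {
  local age
  age="$(sync_age_seconds "$1" "${3:-}")" || return 2
  [ "$age" -le "$2" ]
}
```

For example, within_window "2023-07-10T12:40:10Z" 900 succeeds only if that timestamp is at most 15 minutes old. The acceptable window itself depends on your replication schedule and recovery point objective; the sketch does not choose it for you.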
About this task
Failover is a process that transitions an application from a primary cluster to a secondary cluster in the event of a primary cluster failure. While failover allows the application to run on the secondary cluster with minimal interruption, an uninformed failover decision can have adverse consequences, such as complete data loss if replication from the primary to the secondary cluster has failed unnoticed. If a significant amount of time has gone by since the last successful replication, it is best to wait until the failed primary cluster is recovered.
lastGroupSyncTime is a critical metric that records when the last
successful replication occurred for all PVCs associated with an application. In essence, it indicates
the synchronization health between the primary and secondary clusters. Before initiating a
failover from one cluster to another, check this metric and initiate the failover only if
lastGroupSyncTime is within a reasonable time in the past.
Procedure
If volume synchronization does not occur after a failover and ApplicationSet-based applications continue running on both clusters, apply the following workaround:
From the hub cluster, delete the manifestwork resource that continues to run the
ApplicationSet-based applications on the cluster from which the applications were failed over. For example:
oc delete manifestwork -n rackm14 app-busybox-cephfs-1-rackm14-48b8c
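Before deleting anything, it can help to list the manifestwork resources in that namespace and confirm which one belongs to the application. The following is a rough sketch under stated assumptions: filter_manifestworks is a hypothetical helper that only filters text on standard input, and matching by a substring of the application name is an assumption based on the example name above, not a documented naming rule.

```shell
# Sketch: narrow a list of manifestwork names to those that mention an application.
# filter_manifestworks is a hypothetical helper; it does not call oc itself.
# $1: fixed substring to look for, for example the application name
filter_manifestworks() {
  grep -F -- "$1"
}

# Possible usage on the hub cluster (both values are placeholders):
#   oc get manifestwork -n <cluster-namespace> -o name | filter_manifestworks <app-name>
```

Review the filtered names by hand before running the delete command; the sketch intentionally stops short of deleting anything.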