Cleanup and data sync for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post failover
Problem

ApplicationSet based workload deployments to managed clusters are not garbage collected in cases where the hub cluster fails and is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster, and the cluster that the workload was failed over from rejoins the new recovered standby hub.
ApplicationSets that are DR protected with a regional DRPolicy hence start firing the VolumeSynchronizationDelay alert. Further, such DR protected workloads cannot be failed over to the peer cluster or relocated to the peer cluster, as data is out of sync between the two clusters.

Resolution
The workaround requires that the openshift-gitops operators can own the workload resources that are orphaned on the managed cluster that rejoined the hub, after a failover of the workload was performed from the new recovered hub. To achieve this, take the following steps:
1. Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace. Inspect the placement label value for the ApplicationSet in the field spec.generators.clusterDecisionResource.labelSelector.matchLabels. This is the name of the Placement resource, referred to below as <placement-name>.
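   For example, the label selector can be located in the ApplicationSet definition as follows; <applicationset-name> is a placeholder for the name of your ApplicationSet:

   oc get applicationset -n openshift-gitops <applicationset-name> -o yaml | grep -A3 'matchLabels'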
2. Ensure that there exists a PlacementDecision for the Placement referenced by the ApplicationSet:

   oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>

   This results in a single PlacementDecision that places the workload on the currently desired failover cluster.
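   To note the name and the decision-group-index label of this existing PlacementDecision (both are referenced in the next step), the same query can be repeated with full output, for example:

   oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name> -o yaml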
3. Create a new PlacementDecision for the ApplicationSet, pointing to the cluster where it should be cleaned up. For example:

   apiVersion: cluster.open-cluster-management.io/v1beta1
   kind: PlacementDecision
   metadata:
     labels:
       cluster.open-cluster-management.io/decision-group-index: "1" # Typically one higher than the same value in the existing PlacementDecision determined at step (2)
       cluster.open-cluster-management.io/decision-group-name: ""
       cluster.open-cluster-management.io/placement: cephfs-appset-busybox10-placement
     name: <placement-name>-decision-<n> # <n> should be one higher than the existing PlacementDecision as determined in step (2)
     namespace: openshift-gitops
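   Assuming the manifest above is saved to a file, for example new-placementdecision.yaml (a file name chosen here for illustration), it can be created on the hub cluster with:

   oc apply -f new-placementdecision.yaml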
4. Update the newly created PlacementDecision with a status subresource.

   decision-status.yaml:

   status:
     decisions:
       - clusterName: <managedcluster-name-to-clean-up> # This would be the cluster from where the workload was failed over, NOT the current workload cluster
         reason: FailoverCleanup

   oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge
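   To verify that the patch was applied, the recorded decisions can be read back, for example:

   oc get placementdecision -n openshift-gitops <placement-name>-decision-<n> -o jsonpath='{.status.decisions}'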
5. Watch and ensure that the Application resource for the ApplicationSet has been placed on the desired cluster.

   oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>

   In the output, check that the SYNC STATUS shows as Synced and the HEALTH STATUS shows as Healthy.
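   The same values can also be read from the Application status fields directly, for example:

   oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up> -o jsonpath='{.status.sync.status}{" "}{.status.health.status}'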
6. Delete the PlacementDecision that was created in step (3), so that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up> cluster.

   oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>
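   Once the PlacementDecision is removed, the Application generated for that cluster should eventually be deleted along with its workload resources; this can be confirmed by re-running the query from step (5) and checking that the resource is no longer found:

   oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>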
ApplicationSets that are DR protected with a regional DRPolicy stop firing the VolumeSynchronizationDelay alert.
BZ reference: [2268594]