Cleanup and data sync for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post failover

Troubleshoot stuck cleanup and data synchronization for DR-protected ApplicationSet-based workloads after a failed hub is recovered to a standby hub.

Problem

ApplicationSet-based workload deployments to managed clusters are not garbage collected when the hub cluster fails and is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster. The cluster that the workload was failed over from then rejoins the new recovered standby hub.

ApplicationSets that are DR protected with a regional DRPolicy hence start firing the VolumeSynchronizationDelay alert. Further, such DR-protected workloads cannot be failed over or relocated to the peer cluster, as data is out of sync between the two clusters.
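For example, you can confirm on the hub that data sync has stalled by checking the DRPlacementControl resources for the workload (shown here in the openshift-gitops namespace, where DRPCs for ApplicationSet-based workloads typically reside; the status field shown may vary by version):

    oc get drpc -n openshift-gitops
    oc get drpc -n openshift-gitops <drpc-name> -o jsonpath='{.status.lastGroupSyncTime}'

A lastGroupSyncTime that stops advancing for a workload indicates that its data is out of sync between the peers.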

Resolution

The workaround requires that the openshift-gitops operators can own the workload resources that are orphaned on the managed cluster that rejoined the hub after a failover of the workload was performed from the new recovered hub. To achieve this, take the following steps:

  1. Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace.
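    For example, assuming the ApplicationSet resource is named <applicationset-name> (a placeholder), its definition can be inspected with:

    oc get applicationset -n openshift-gitops <applicationset-name> -o yaml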
  2. Inspect the placement label value for the ApplicationSet in this field: spec.generators.clusterDecisionResource.labelSelector.matchLabels.

    This value is the name of the Placement resource, referred to below as <placement-name>.
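    For reference, the relevant part of the ApplicationSet spec typically looks like the following; the acm-placement ConfigMap reference and label key shown here reflect the standard ACM GitOps integration and may differ in your environment:

    spec:
      generators:
      - clusterDecisionResource:
          configMapRef: acm-placement
          labelSelector:
            matchLabels:
              cluster.open-cluster-management.io/placement: <placement-name>
          requeueAfterSeconds: 180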

  3. Ensure that a PlacementDecision exists for the Placement referenced by the ApplicationSet.
    oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>

    This results in a single PlacementDecision that places the workload in the currently desired failover cluster.
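    You can inspect this existing PlacementDecision to note its cluster.open-cluster-management.io/decision-group-index label and its name suffix, both of which are needed in the next step. The -decision-1 suffix shown here is illustrative and may differ in your environment:

    oc get placementdecision -n openshift-gitops <placement-name>-decision-1 -o yaml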

  4. Create a new PlacementDecision for the ApplicationSet pointing to the cluster where it should be cleaned up.

    For example:

    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: PlacementDecision
    metadata:
      labels:
        cluster.open-cluster-management.io/decision-group-index: "1" # Typically one higher than the same value in the existing PlacementDecision determined at step (3)
        cluster.open-cluster-management.io/decision-group-name: ""
        cluster.open-cluster-management.io/placement: <placement-name> # The Placement name determined at step (2)
      name: <placement-name>-decision-<n> # <n> should be one higher than the suffix of the existing PlacementDecision determined at step (3)
      namespace: openshift-gitops
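    Save this manifest to a file (for example, placementdecision.yaml, an arbitrary name) and create the resource on the hub:

    oc create -f placementdecision.yaml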
  5. Update the newly created PlacementDecision with a status subresource.
    decision-status.yaml:
    status:
      decisions:
      - clusterName: <managedcluster-name-to-clean-up> # This would be the cluster from where the workload was failed over, NOT the current workload cluster
        reason: FailoverCleanup
    oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge
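    You can verify that the status was applied before proceeding:

    oc get placementdecision -n openshift-gitops <placement-name>-decision-<n> -o jsonpath='{.status.decisions}'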
  6. Watch and ensure that the Application resource for the ApplicationSet has been placed on the desired cluster.
    oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>

    In the output, check if the SYNC STATUS shows as Synced and the HEALTH STATUS shows as Healthy.
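    A healthy result looks similar to the following, where the name is <applicationset-name>-<managedcluster-name-to-clean-up> and the columns are those printed for Argo CD Application resources:

    NAME                                                       SYNC STATUS   HEALTH STATUS
    <applicationset-name>-<managedcluster-name-to-clean-up>   Synced        Healthy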

  7. Delete the PlacementDecision that was created in step (4), so that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up> cluster.
    oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>
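    You can then confirm on <managedcluster-name-to-clean-up> that the workload resources have been removed, for example by listing resources in the workload namespace (substitute the namespace that your ApplicationSet deploys to):

    oc get all -n <workload-namespace>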

Once the orphaned resources are cleaned up and data sync resumes, ApplicationSets that are DR protected with a regional DRPolicy stop firing the VolumeSynchronizationDelay alert.

BZ reference: [2268594]