Metro-DR issues

Use this troubleshooting information to resolve issues when you work with Metro-DR.

For issues related to the upgrade of Metro-DR setup, see Installation and upgrade issues.

Failback issues

Problem statement
During unplanned failover, some applications can be in a "replication error" state after recovery of the failed site with fencing.
Resolution
  • If the application failover is not successful, then check the volumeattachments of the application by using the following command. See the example after this item.
    oc get volumeattachment -n <namespace name> | grep <pvc name of given application>
    If the volumeattachments exist, then try to delete them by using the following command.
    oc delete volumeattachment <volumeattachment name> -n <namespace name>
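    For example, assuming a hypothetical application namespace my-app and a hypothetical PVC named mysql-pvc, the sequence looks like this:
    oc get volumeattachment -n my-app | grep mysql-pvc
    oc delete volumeattachment <volumeattachment name from the previous output> -n my-app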

IBM Storage Fusion operator status is pending with errors

Problem statement
The IBM Storage Fusion operator status is pending with the following errors in version 2.7.2:
  • RequirementsNotMet
  • asyncreplications.dataprotection.isf.ibm.com
Resolution
Apply the following CRD YAML:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.8.0
  creationTimestamp: null
  name: asyncreplications.dataprotection.isf.ibm.com
spec:
  group: dataprotection.isf.ibm.com
  names:
    kind: AsyncReplication
    listKind: AsyncReplicationList
    plural: asyncreplications
    singular: asyncreplication
  scope: Namespaced
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        description: AsyncReplication is the Schema for the asyncreplications API
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: AsyncReplicationSpec defines the desired state of AsyncReplication
            properties:
              consistencyGroup:
                description: Foo is an example field of AsyncReplication. Edit asyncreplication_types.go
                  to remove/update
                type: string
              recoveryPointObjective:
                type: string
              remoteSite:
                type: string
              targetRole:
                type: string
            type: object
          status:
            description: AsyncReplicationStatus defines the observed state of AsyncReplication
            properties:
              currentRole:
                description: 'INSERT ADDITIONAL STATUS FIELD - define observed state
                  of cluster Important: Run "make" to regenerate code after modifying
                  this file'
                type: string
              phase:
                type: string
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}
status:
  acceptedNames:
    kind: ""
    plural: ""
  conditions: []
  storedVersions: []
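
For example, you can save the YAML to a file and apply it with oc; the file name here is only an illustration:
oc apply -f asyncreplications-crd.yaml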

Global Data Platform upgrade fails in site 1

Problem statement
  • When you perform either of the following upgrades on site 1 of the Metro-DR, the "Global Data platform service not active on all the nodes" error message is displayed:
    • IBM Storage Fusion 2.7.2 to 2.8.0
    • Global Data Platform 5.1.9.0 to 5.2.0.0
  • When you upgrade IBM Storage Fusion 2.7.2 to 2.8.0, the Global Data Platform upgrade from 5.1.9.0 to 5.2.0.0 on Metro-DR site 1 gets stuck with the error "ECE service not active on all the nodes".
Resolution
Run the following command to patch the Scale recovery group with the 'Active' condition:
oc patch recoverygroup rg201 -n ibm-spectrum-scale --subresource status --type='json' -p '[{ "op": "add", "path": "/status/conditions/-","value": { "type": "Active", "status": "True", "message": "The recovery group is active", "reason": "Active", "lastTransitionTime":"'"`date +%FT%T`Z"'"}}]'
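To confirm that the condition was added, you can inspect the recovery group status; this check assumes the same recovery group name rg201:
oc get recoverygroup rg201 -n ibm-spectrum-scale -o jsonpath='{.status.conditions}'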

Less descriptor disks

Problem statement
The site monitoring checks whether enough descriptor disks are assigned for each site. When there are multiple RGs per site, only one RG sees the descriptor disks. The other RG servers report disk_missing, which is interpreted as too few descriptor disks.
Resolution
Add the site_fs_desc_fail event to the "ignore list" to suppress the entire event. Run the following command once on the cluster:
mmchconfig mmhealth-events-ignore="site_fs_desc_fail" --force
Then run the following command on the cluster manager node:
mmsysmoncontrol restart

Latest tiebreaker version

After you upgrade the tiebreaker to the latest version, the Metro-DR CR does not show the latest version. You can ignore this known issue.

Application removals fail in the local tab of the failed site after recovery or reconnect

You can observe this issue only in case of failover or failback of applications on the surviving site. You can ignore this issue because the CR status is correct, and further relocation or failback of these applications happens properly.

Add and Remove DR does not work as expected when a remote site is down

When multiple DR protected applications exist in the local site and the other site goes down, the add and remove operations of disaster recovery take longer than usual.

Connection setup after OpenShift Container Platform cluster recovery

Problem statement
The OpenShift® Container Platform cluster can have problems and become unusable.
Resolution
After you recover the cluster, rejoin the connections. For the steps to clean up the connection and set up the connection between the two clusters again, see Connection setup after OpenShift Container Platform cluster recovery.

IBM Storage Fusion HCI System installation hangs because of disaster recovery

Resolution
If the minio pod does not come up for a long time, or the Metro-DR installation remains stuck even after the minio pod reaches the Running state, restart the Metro-DR deployment in the ibm-spectrum-fusion-ns namespace (or your Fusion namespace) to trigger reconciliation.
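For example, assuming the Metro-DR deployment is the operator deployment named isf-metrodr-operator-controller-manager (the name that appears in the pod listings later in this section), you can trigger the restart with:
    oc rollout restart deployment/isf-metrodr-operator-controller-manager -n ibm-spectrum-fusion-ns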

RamenDR operator crashes after the restoration of a failed site

Resolution
If the RamenDR operator crashes after the restoration of a failed site, then apply the following workaround:
  1. Retrieve s3CompatibleEndpoint from s3StoreProfiles of the local site.
  2. Retrieve the secret of the local site that is specified in s3SecretRef of s3StoreProfiles.
  3. Run the following command from the isf-minio pod in the ibm-spectrum-fusion-ns namespace to connect to the local Minio server.
    mc alias set myminio <s3CompatibleEndpoint retrieved in step 1> <AWS_ACCESS_KEY_ID parameter of the secret retrieved in step 2> <AWS_SECRET_ACCESS_KEY parameter of the secret retrieved in step 2>
  4. Run the following command to find all zero-byte files in the Minio store:
    mc ls myminio --recursive --insecure | grep -i VolumeReplicationGroup | grep 0B
  5. Run the following command to delete the zero-byte files that are retrieved in step 4:
    mc rm <path of files retrieved in step 4>
  6. Restart the ramen-dr-cluster-operator deployment in the ibm-spectrum-fusion-ns namespace. See the command sketch after these steps.
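One way to restart the deployment from the CLI, as a sketch:
    oc rollout restart deployment/ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns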
If the ReplicationState of the failed-over application's VolumeReplicationGroup does not change to secondary after the restoration of a failed site, then do the following workaround steps:
  1. Go to Red Hat® OpenShift Container Platform web management console.
  2. Go to CustomResourceDefinitions > volumereplicationgroups.ramendr.openshift.io > instances.
  3. Edit the YAML of each selected VolumeReplicationGroup and set spec.replicationState to secondary.
  4. Save the YAML. Alternatively, you can make the same change from the CLI, as shown in the sketch after these steps.
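A CLI sketch of the same change, with placeholder names for the VolumeReplicationGroup and its namespace:
    oc patch volumereplicationgroup <vrg name> -n <application namespace> --type=merge -p '{"spec":{"replicationState":"secondary"}}'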

Failovers of concurrent applications take a long time after network recovery

Problem statement
In a disaster recovery scenario, either of the Metro-DR sites is powered off or the network is disconnected. After the network recovers, failovers of many concurrent applications take a long time.
Resolution
Run the following steps before you initiate failover of applications on the surviving site. A CLI sketch of the same steps follows the list.
  1. In the Red Hat OpenShift Container Platform console, go to Workloads > ConfigMap.
  2. Open the ramen-dr-cluster-operator-config configmap in the ibm-spectrum-fusion-ns namespace.
  3. Add the MaxConcurrentReconcile parameter to this configmap with a value of 50.
  4. Save the changes and restart the ramen-dr-cluster-operator deployment in the ibm-spectrum-fusion-ns namespace.
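The equivalent CLI steps, as a sketch; the console workflow above is the documented path:
    oc edit configmap ramen-dr-cluster-operator-config -n ibm-spectrum-fusion-ns
    oc rollout restart deployment/ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns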

Connection between IBM Storage Scale pods

Problem statement
You are unable to ping between IBM Storage Scale pods after you reestablish the network connection between site 1 and site 2.
Workaround:
  1. In the CR spec, change the installSubmariner field to uninstall.
    oc edit mni
    
    spec:
        installSubmariner: uninstall
  2. Delete the namespace, if it is not already deleted.
  3. In the CR spec, change the installSubmariner field back to install.
    oc edit mni
    
    spec:
        installSubmariner: install
  4. If the installation is not successful, then install Submariner manually by using the following steps:
    1. Label the control nodes as gateways on both sites of Metro-DR.
      oc label node control-0.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true
      oc label node control-1.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true
      oc label node control-2.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true
    2. Log in to the Metro-DR operator pod.
      [root@hci-dev1 ~]# oc get po | grep metr
      isf-metrodr-operator-controller-manager-bddffd98-dhvgt            2/2     Running   0               13h
      [root@hci-dev1 ~]# oc rsh isf-metrodr-operator-controller-manager-bddffd98-dhvgt
    3. Deploy the broker for Submariner.
      subctl deploy-broker --operator-debug --kubeconfig=/tmp/site-1-kubeconfig
    4. Join both sites to the broker.
      subctl join --kubeconfig=/tmp/site-2-kubeconfig --operator-debug --pod-debug --clusterid=site-2 broker-info.subm
      subctl join --kubeconfig=/tmp/site-1-kubeconfig --operator-debug --pod-debug --clusterid=site-1 broker-info.subm
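    To verify the result after the join completes, you can check the gateway connections from the same pod; subctl show connections is a standard Submariner verification command and is not specific to this procedure:
      subctl show connections --kubeconfig=/tmp/site-1-kubeconfig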

Disaster recovery connections in installing state for hours

Diagnosis
You might encounter authentication or authorization issues while you connect to the remote cluster. To confirm that the issue is related to authorization, check whether you see the following messages in the Metro-DR pod logs:
  • Incorrect token, need to regenerate kubeconfig file
  • UnAuthorized
Cause
  • If the Metro-DR instance displays any of the following error messages, then apply the workaround in the Resolution section:
    oc get mdr -oyaml
    "Submariner is not installed"
  • oc get mni -oyaml
    "Failed to uninstall Submariner"
    OR
    "Submariner installation failed"
  • Submariner connectivity might get affected when any of the nodes goes to a bad state.

    If problems exist in pod connectivity between racks even after all nodes are back in working condition, then apply the workaround in the Resolution section.

Resolution
  1. Update site 2 password on site 1 and regenerate the kubeconfig file:
    1. Get the site 2 token from site 2 service account:
      [root@hci-dev1 ~]# oc get sa fusion-admin-controller-manager -oyaml
      secrets:
      - name: fusion-admin-controller-manager-dockercfg-vmp2m
      - name: fusion-admin-controller-manager-token-wj96d
      
      
      [root@hci-dev1 ~]# oc get secret fusion-admin-controller-manager-token-wj96d -oyaml
      data:
        token: site2Token 
    2. Update the site 2 token in secret isf-metrodr-config-secret on site 1:
      [root@hci-dev1 ~]# oc edit secret isf-metrodr-config-secret
      Data:
         Site2KubePassword: site2Token
      
    3. Log in to the Metro-DR pod on site 1 and delete the incorrect site 2 kubeconfig file:
      [root@hci-dev1 ~]# oc get po | grep metro
      isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8           2/2     Running             0             13h
      
      [root@hci-dev1 ~]# oc exec -ti isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 -c manager bash
      kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
      bash-4.4$ ls /tmp/
      ks-script-c9hd6mir  ks-script-jrnjxyf4	site-1-kubeconfig  site-2-kubeconfig
      bash-4.4$ rm -f /tmp/site-2-kubeconfig
      
  2. Update site 1 password on site 2 and regenerate the kubeconfig file:
    1. Get the site 1 token from site 1 service account:
      [root@hci-dev1 ~]# oc get sa fusion-admin-controller-manager -oyaml
      secrets:
      - name: fusion-admin-controller-manager-dockercfg-vmp2m
      - name: fusion-admin-controller-manager-token-wj96d
      
      
      [root@hci-dev1 ~]# oc get secret fusion-admin-controller-manager-token-wj96d -oyaml
      data:
        token: site1Token 
    2. Update the site 1 secret token in isf-metrodr-config-secret on site 2:
      [root@hci-dev1 ~]# oc edit secret isf-metrodr-config-secret
      Data:
         Site1KubePassword: site1Token
    3. Log in to the Metro-DR pod on site 2 and delete the incorrect site 1 kubeconfig file:
      [root@hci-dev1 ~]# oc get po | grep metro
      isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8           2/2     Running             0             13h
      
      [root@hci-dev1 ~]# oc exec -ti isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 -c manager bash
      kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
      bash-4.4$ ls /tmp/
      ks-script-c9hd6mir  ks-script-jrnjxyf4	site-1-kubeconfig  site-2-kubeconfig
      bash-4.4$ rm -f /tmp/site-1-kubeconfig
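After you regenerate both kubeconfig files, you can confirm that the authorization errors stop by checking the Metro-DR operator logs again. The pod name in this sketch is the one from the earlier listing and differs in your environment:
    oc logs isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 -c manager | grep -iE 'unauthorized|incorrect token'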

Issues with more than 3 quorum nodes on a cluster

Problem statement
The following error occurs only when the OpenShift Container Platform cluster contains more than 10 nodes for the initial installation in a Metro-DR setup:
mmchnode: The number of quorum nodes exceeds the maximum (8) allowed.
Cause
IBM Storage Scale works fine when both sites are healthy. However, because the quorum nodes are not balanced across both sites, when the site that has five quorum nodes goes down, the whole Scale cluster goes down.
Resolution
  1. Run the following command to get the quorum nodes:
    oc get nodes -l scale.spectrum.ibm.com/designation=quorum
  2. Run the following command to remove the label from two non-control nodes:
    oc label nodes <node_name> scale.spectrum.ibm.com/designation-
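For example, with two hypothetical non-control node names, the label removal looks like this; the trailing hyphen removes the label:
    oc label nodes compute-0.example.com scale.spectrum.ibm.com/designation-
    oc label nodes compute-1.example.com scale.spectrum.ibm.com/designation-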

Ramen DR pod goes into CrashLoopBackOff state when you enable DR for applications

Problem statement
The Ramen DR pod goes into CrashLoopBackOff state with an OOMKilled error when you enable DR for applications.
Cause
This issue might occur because of insufficient memory during the upgrade of IBM Storage Fusion from version 2.8.0 to 2.8.1.
Resolution
As a workaround, increase the memory values in the deployment configuration of the Ramen DR pod.
From:
    - resources:
        limits:
          cpu: 100m
          memory: 300Mi
        requests:
          cpu: 100m
          memory: 200Mi
to:
    - resources:
        limits:
          cpu: 100m
          memory: 600Mi
        requests:
          cpu: 100m
          memory: 500Mi
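A CLI alternative, as a sketch: this assumes the deployment is named ramen-dr-cluster-operator (the name used earlier in this section) and that its container is named manager; adjust both names to match your environment.
    oc set resources deployment/ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns -c manager --limits=cpu=100m,memory=600Mi --requests=cpu=100m,memory=500Mi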

Metro-DR goes into a critical state after scale up or scale down operation

Problem statement
The Metro-DR pod goes into a critical state after a scale up or scale down operation.
Resolution
Follow these steps on the sites to resolve this issue:
  1. Log in to the OpenShift Container Platform web console.
  2. Go to Administration > CustomResourceDefinitions.
  3. Click Filesystem CR.
  4. Go to YAML tab and copy the specification details.
  5. Similarly, check the RecoveryGroup CR and note the available RVGs dedicated for the sites.

    For example, rg1 and rg201 for the primary site and rg2 and rg202 for the secondary site.

  6. Check the number of NVMe disks attached to the storage nodes on both the sites. You can also run the lsblk command on the storage nodes to get the NVMe disk count.
  7. Calculate the Vdisk size.

    If there are two disks, then it is 100%.

    If there are four disks, then it is 50%.

    If there are six disks, then it is 33%.

    If there are eight disks, then it is 25%.

  8. Add the respective entries for the recovery group.

    For example, the entry order must be rg1, rg2, rg201 and rg202.

  9. Check that the Vdisk array count for the respective RVG is half of the disk count.

    For example:

    If there are two disks, then the Vdisk array should appear only once.

    If there are four disks, then the Vdisk array should appear twice.

  10. Adjust the Vdisk set array for the site RVG and create a set.
  11. Add the other site's RVG entries as they are in the Filesystem CR spec.
  12. Save the Filesystem CR.