Metro-DR issues
Use this troubleshooting information to resolve issues when you work with Metro-DR.
For issues related to the upgrade of a Metro-DR setup, see Installation and upgrade issues.
Failback issues
- Problem statement
- After an unplanned failover, some applications can remain in a "replication error" state after the failed site is recovered with fencing.
- Resolution
- If the application failover is not successful, check the volumeattachments of the application by using the following command:
oc get volumeattachment -n <namespace name> | grep <pvc name of given application>
If volumeattachments exist, delete them by using the following command:
oc delete volumeattachment <volumeattachment name> -n <namespace name>
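- A worked example of the same check, with a hypothetical namespace my-app, PVC my-app-pvc, and volumeattachment name (substitute the names from your environment):
oc get volumeattachment -n my-app | grep my-app-pvc
oc delete volumeattachment csi-1a2b3c4d5e6f -n my-app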
IBM Storage Fusion operator status is pending with errors
- Problem statement
- The IBM Storage Fusion operator status is pending with the following errors in version 2.7.2:
- RequirementsNotMet
- asyncreplications.dataprotection.isf.ibm.com
- Resolution
- Apply the following CRD YAML:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.8.0
  creationTimestamp: null
  name: asyncreplications.dataprotection.isf.ibm.com
spec:
  group: dataprotection.isf.ibm.com
  names:
    kind: AsyncReplication
    listKind: AsyncReplicationList
    plural: asyncreplications
    singular: asyncreplication
  scope: Namespaced
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        description: AsyncReplication is the Schema for the asyncreplications API
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: AsyncReplicationSpec defines the desired state of AsyncReplication
            properties:
              consistencyGroup:
                description: Foo is an example field of AsyncReplication. Edit asyncreplication_types.go to remove/update
                type: string
              recoveryPointObjective:
                type: string
              remoteSite:
                type: string
              targetRole:
                type: string
            type: object
          status:
            description: AsyncReplicationStatus defines the observed state of AsyncReplication
            properties:
              currentRole:
                description: 'INSERT ADDITIONAL STATUS FIELD - define observed state of cluster Important: Run "make" to regenerate code after modifying this file'
                type: string
              phase:
                type: string
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}
status:
  acceptedNames:
    kind: ""
    plural: ""
  conditions: []
  storedVersions: []
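- Save the YAML to a file and apply it on the affected cluster. The file name here is only an example:
oc apply -f asyncreplication-crd.yaml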
Global Data Platform upgrade fails in site 1
- Problem statement
- Whenever you do the following upgrades on site 1 of the Metro-DR, the "Global Data platform service not active on all the nodes" error message is displayed:
- IBM Storage Fusion 2.7.2 to 2.8.0
- Global Data Platform 5.1.9.0 to 5.2.0.0
- When you upgrade IBM Storage Fusion from 2.7.2 to 2.8.0, the Global Data Platform upgrade from 5.1.9.0 to 5.2.0.0 on Metro-DR site 1 gets stuck with the error "ECE service not active on all the nodes".
- Resolution
- Run the following command to patch the Scale recovery group with an 'Active' condition:
oc patch recoverygroup rg201 -n ibm-spectrum-scale --type=merge --subresource status --type='json' -p '[{ "op": "add", "path": "/status/conditions/-","value": { "type": "Active", "status": "True", "message": "The recovery group is active", "reason": "Active", "lastTransitionTime":"'"`date +%FT%T`Z"'"}}]'
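- To verify that the condition was added, you can inspect the recovery group status; rg201 here matches the example command above:
oc get recoverygroup rg201 -n ibm-spectrum-scale -o yaml | grep -A 3 'type: Active'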
Less descriptor disks
- Problem statement
- The site monitoring checks whether enough descriptor disks are assigned for each site. For multiple RGs per site, only one RG sees the descriptor disks. The other RG servers send a disk_missing event, which is interpreted as too few descriptor disks.
- Resolution
- As a resolution, add the site_fs_desc_fail event to the "ignore list" to suppress the entire event. Run the following command once on the cluster:
mmchconfig mmhealth-events-ignore="site_fs_desc_fail" --force
Then run the following command on the cluster manager node:
mmsysmoncontrol restart
Latest tiebreaker version
After you upgrade the tiebreaker to the latest version, the Metro-DR CR does not show the latest version. You can ignore this known issue.
Application removals fail in the local tab of the failed site after recovery or reconnect
You can observe this issue only in the case of a failover or failback of applications on the surviving site. You can ignore this issue because the CR status is correct, and further relocation or failback of these applications happens properly.
Add and Remove DR does not work as expected when a remote site is down
When multiple DR protected applications exist in the local site and the other site goes down, the add and remove operations of disaster recovery take longer than usual.
Connection setup after OpenShift Container Platform cluster recovery
- Problem statement
- The OpenShift® Container Platform cluster can have problems and become unusable.
- Resolution
- After you recover the cluster, rejoin the connections. For the steps to clean up the connection and set up the connection between the two clusters again, see Connection setup after OpenShift Container Platform cluster recovery.
IBM Storage Fusion HCI System installation hangs because of disaster recovery
- Resolution
- If the minio pod does not come up for a long time, or the Metro-DR installation remains stuck even after the minio pod goes to the running state, restart the Metro-DR deployment in ibm-spectrum-fusion-ns (or your Fusion namespace) to trigger reconciliation.
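- For example, you can restart the deployment from the CLI. The deployment name below is an assumption based on the operator pod name used elsewhere in this topic; verify it in your environment first:
oc get deployment -n ibm-spectrum-fusion-ns
oc rollout restart deployment/isf-metrodr-operator-controller-manager -n ibm-spectrum-fusion-ns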
RamenDR operator crashes after the restoration of a failed site
- Resolution
- If RamenDR operator crashes after the restoration of a failed site, then apply the following workaround:
- Retrieve s3CompatibleEndpoint from s3StoreProfiles of the local site.
- Retrieve the secret of the local site that is specified in s3SecretRef of s3StoreProfiles.
- Run the following command from the isf-minio pod in the ibm-spectrum-fusion-ns namespace to connect to the local Minio server:
mc alias set myminio <s3CompatibleEndpoint retrieved in step 1> <AWS_ACCESS_KEY_ID parameter of the secret retrieved in step 2> <AWS_SECRET_ACCESS_KEY parameter of the secret retrieved in step 2>
- Run the following command to find all zero-byte files in the Minio store:
mc ls myminio --recursive --insecure | grep -i VolumeReplicationGroup | grep 0B
- Run the following command to delete the zero-byte files that are found in the previous step:
mc rm <path of files found in the previous step>
- Restart the ramen-dr-cluster-operator deployment in the ibm-spectrum-fusion-ns namespace (see the command example after these steps).
- Go to the Red Hat® OpenShift Container Platform web management console.
- Go to the VolumeReplicationGroup resources.
- Edit the YAMLs of the selected VolumeReplicationGroups and change spec.replicationState to secondary.
- Save the YAMLs.
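- As referenced in the restart step above, a minimal CLI sketch for restarting the deployment, assuming the ramen-dr-cluster-operator deployment name used in this topic:
oc rollout restart deployment/ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns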
Failovers of concurrent applications take a long time after network recovery
- Problem statement
- In a disaster recovery scenario, when either of the Metro-DR sites is powered off or the network is disconnected, failovers of concurrent applications can take a long time after the network recovers.
- Resolution
- As a resolution, run the following steps before you initiate failover of applications on the
surviving site:
- In the Red Hat OpenShift Container Platform console, go to the ConfigMaps page.
- Open the ramen-dr-cluster-operator-config configmap in the ibm-spectrum-fusion-ns namespace.
- Add the MaxConcurrentReconcile parameter in this configmap with a value of 50.
- Save the changes and restart the ramen-dr-cluster-operator deployment in the ibm-spectrum-fusion-ns namespace.
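- If you prefer the CLI, a minimal sketch of the same steps; the layout of the configmap data can differ in your release, so verify the parameter placement before you save:
oc edit configmap ramen-dr-cluster-operator-config -n ibm-spectrum-fusion-ns
oc rollout restart deployment/ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns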
Connection between IBM Storage Scale pods
- Problem statement
- You cannot ping between IBM Storage Scale pods after you reestablish the network connection between site 1 and site 2.
- Workaround
- Change the installSubmariner field in the CR Spec to uninstall:
oc edit mni
Spec:
  installSubmariner: uninstall
- Delete the namespace, if it is not already deleted.
- Change the installSubmariner field in the CR Spec to install:
oc edit mni
Spec:
  installSubmariner: install
- If the installation is not successful, then try to install the submariner by using the following
manual steps:
- Label the control nodes as gateways on both sites of
Metro-DR.
oc label node control-0.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true oc label node control-1.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true oc label node control-2.isf-rackc.rtp.raleigh.ibm.com submariner.io/gateway=true
- Log in to MetroDR-operator pod.
[root@hci-dev1 ~]# oc get po | grep metr isf-metrodr-operator-controller-manager-bddffd98-dhvgt 2/2 Running 0 13h [root@hci-dev1 ~]# oc rsh isf-metrodr-operator-controller-manager-bddffd98-dhvgt
- Deploy the broker for
submariner.
subctl deploy-broker --operator-debug --kubeconfig=/tmp/site-1-kubeconfig
- Join to
broker.
subctl join --kubeconfig=/tmp/site-2-kubeconfig --operator-debug --pod-debug --clusterid=site-2 broker-info.subm subctl join --kubeconfig=/tmp/site-1-kubeconfig --operator-debug --pod-debug --clusterid=site-1 broker-info.subm
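- After the join completes, you can check the gateway connections from the operator pod. This is a minimal check that assumes the kubeconfig locations shown in the previous steps; adjust the flags to match your subctl version:
subctl show connections --kubeconfig=/tmp/site-1-kubeconfig
subctl show connections --kubeconfig=/tmp/site-2-kubeconfig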
Disaster recovery connections in installing state for hours
- Diagnosis
- You might encounter authentication or authorization issues while you connect to the remote
cluster. To confirm that the issue is related to authorization, check whether you see the following
messages in the Metro-DR pod logs:
- Incorrect token, need to regenerate kubeconfig file
- UnAuthorized
- Cause
- If the Metro-DR instance displays any of the following error messages, then do the following workaround:
- oc get mdr -oyaml shows "Submariner is not installed"
- oc get mni -oyaml shows "Failed to uninstall Submariner" or "Submariner installation failed"
- Submariner connectivity might get affected when any of the nodes go to a bad state.
If problems exist in pod connectivity between racks even after all nodes are back in working condition, then do the following workaround.
- Resolution
- Update site 2 password on site 1 and regenerate the kubeconfig file:
- Get the site 2 token from site 2 service account:
[root@hci-dev1 ~]# oc get sa fusion-admin-controller-manager -oyaml
secrets:
- name: fusion-admin-controller-manager-dockercfg-vmp2m
- name: fusion-admin-controller-manager-token-wj96d
[root@hci-dev1 ~]# oc get secret fusion-admin-controller-manager-token-wj96d -oyaml
data:
  token: site2Token
- Update the site 2 token in the isf-metrodr-config-secret secret on site 1:
[root@hci-dev1 ~]# oc edit secret isf-metrodr-config-secret
Data:
  Site2KubePassword: site2Token
- Log in to the Metro-DR pod on site 1 and delete the incorrect site 2 kubeconfig file:
[root@hci-dev1 ~]# oc get po | grep metro
isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 2/2 Running 0 13h
[root@hci-dev1 ~]# oc exec -ti isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 -c manager bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-4.4$ ls /tmp/
ks-script-c9hd6mir ks-script-jrnjxyf4 site-1-kubeconfig site-2-kubeconfig
bash-4.4$ rm -f /tmp/site-2-kubeconfig
- Update site 1 password on site 2 and regenerate the kubeconfig file:
- Get the site 1 token from site 1 service account:
[root@hci-dev1 ~]# oc get sa fusion-admin-controller-manager -oyaml
secrets:
- name: fusion-admin-controller-manager-dockercfg-vmp2m
- name: fusion-admin-controller-manager-token-wj96d
[root@hci-dev1 ~]# oc get secret fusion-admin-controller-manager-token-wj96d -oyaml
data:
  token: site1Token
- Update the site 1 token in the isf-metrodr-config-secret secret on site 2:
[root@hci-dev1 ~]# oc edit secret isf-metrodr-config-secret
Data:
  Site1KubePassword: site1Token
- Log in to the Metro-DR pod on site 2 and delete the incorrect site 1 kubeconfig file:
[root@hci-dev1 ~]# oc get po | grep metro
isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 2/2 Running 0 13h
[root@hci-dev1 ~]# oc exec -ti isf-metrodr-operator-controller-manager-7fd8c4fbc-wwkc8 -c manager bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-4.4$ ls /tmp/
ks-script-c9hd6mir ks-script-jrnjxyf4 site-1-kubeconfig site-2-kubeconfig
bash-4.4$ rm -f /tmp/site-1-kubeconfig
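- After the kubeconfig files are deleted, the operator regenerates them with the updated tokens. To confirm that the connection leaves the installing state, recheck the Metro-DR instance with the same command that is used in the diagnosis:
oc get mdr -oyaml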
Issues with more than 3 quorum nodes on a cluster
- Problem statement
- The following error occurs only when the OpenShift Container Platform contains more than 10 nodes for the initial
installation in a Metro-DR:
mmchnode: The number of quorum nodes exceeds the maximum (8) allowed.
- Cause
- Scale works fine when both sites are healthy. Because the quorum nodes are not balanced across both sites, when the site that has five quorum nodes goes down, the whole Scale cluster goes down.
- Resolution
- Run the following command to get the quorum nodes:
oc get nodes -l scale.spectrum.ibm.com/designation=quorum
- Run the following command to remove the label from two non-control nodes:
oc label nodes <node_name> scale.spectrum.ibm.com/designation-
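- To confirm the change, rerun the first command and check that the remaining quorum nodes are balanced across both sites; the count pipeline is only a convenience:
oc get nodes -l scale.spectrum.ibm.com/designation=quorum --no-headers | wc -l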
Ramen DR pod goes into Crashloopbackoff state when enabling DR for applications
- Problem statement
- The Ramen DR pod goes into Crashloopbackoff state with an OOMKilled error while enabling DR for applications.
- Cause
- This issue might occur due to a memory issue during the upgrade of IBM Storage Fusion from version 2.8.0 to 2.8.1.
- Resolution
- As a workaround, increase the memory values in the deployment configuration of the Ramen DR pod.
From:
resources:
  limits:
    cpu: 100m
    memory: 300Mi
  requests:
    cpu: 100m
    memory: 200Mi
To:
resources:
  limits:
    cpu: 100m
    memory: 600Mi
  requests:
    cpu: 100m
    memory: 500Mi
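- A minimal CLI alternative, assuming the ramen-dr-cluster-operator deployment in the ibm-spectrum-fusion-ns namespace as in the earlier procedures:
oc set resources deployment ramen-dr-cluster-operator -n ibm-spectrum-fusion-ns --limits=memory=600Mi --requests=memory=500Mi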
Metro-DR goes into a critical state after scale up or scale down operation
- Problem statement
- The Metro-DR pod goes into a critical state after a scale up or scale down operation.
- Resolution
- Follow the steps on the sites to resolve this issue:
- Log in to the OpenShift Container Platform web console.
- Go to the Filesystem custom resources.
- Click the Filesystem CR.
- Go to the YAML tab and copy the specification details.
- Similarly, check the RecoveryGroup CR and note the available RVGs
dedicated for the sites.
For example, rg1 and rg201 for the primary site and rg2 and rg202 for the secondary site.
- Check the number of NVMe disks attached to the storage nodes on both sites. You can also run the lsblk command on the storage nodes to get the NVMe disk count.
- Calculate the Vdisk size:
If there are two disks, then it is 100%.
If there are four disks, then it is 50%.
If there are six disks, then it is 33%.
If there are eight disks, then it is 25%.
- Add the respective entries for the recovery group.
For example, the entry order must be rg1, rg2, rg201 and rg202.
- Check that the Vdisk array count for the respective RVG is half of the disk count. For example:
If there are two disks, then the Vdisk array should appear only once.
If there are four disks, then the Vdisk array should appear twice.
- Make sure to adjust the Vdisk set array for the site RVG and create a set.
- Add the other site RVG entries as they are in the Filesystem CR spec.
- Save the Filesystem CR.
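- If you prefer to review the same CRs from the CLI, a minimal sketch; the namespace follows the examples in this topic, so adjust it for your cluster:
oc get filesystem -n ibm-spectrum-scale -o yaml
oc get recoverygroup -n ibm-spectrum-scale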