Metro-DR data fencing
Before you fail over applications to the surviving site, complete the following Metro-DR data fencing procedure.
About this task
Note: Fencing is needed only in the case of an unplanned failover.
Procedure
- Designate two additional Scale quorum nodes.
- Add the label scale.spectrum.ibm.com/designation=quorum to any two worker nodes.
For example:
oc label node <node> scale.spectrum.ibm.com/designation=quorum
oc label node compute-1-ru6.rackae2.mydomain.com scale.spectrum.ibm.com/designation=quorum
node/compute-1-ru6.rackae2.mydomain.com labeled
- Verify the node status.
oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
Sample output:
NAME                                 STATUS   ROLES    AGE   VERSION
compute-1-ru6.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
compute-1-ru7.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
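The labeling and verification can also be scripted. A minimal shell sketch, assuming the two chosen worker node names are passed as arguments (the script name is hypothetical):
# add-quorum-labels.sh: label two worker nodes as additional Scale quorum nodes.
# Usage: ./add-quorum-labels.sh <worker-node-1> <worker-node-2>
for node in "$1" "$2"; do
  oc label node "$node" scale.spectrum.ibm.com/designation=quorum
done
# Verify that both nodes are labeled and Ready.
oc get nodes -l scale.spectrum.ibm.com/designation=quorum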
- Uninstall Submariner from site-2 to disrupt admin network connectivity across the sites. Submariner goes down because the other site is unavailable.
- Edit the mni CR and set installSubmariner to uninstall. When the other site comes back up, Submariner gets uninstalled there. This step prevents a network connection from being accidentally established between the sites.
oc edit mni metrodr-network
Run the following command to check the edited YAML:
oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
Example YAML:
apiVersion: network.isf.ibm.com/v1
kind: MetroDRNetworkInstall
metadata:
  creationTimestamp: "2023-07-10T06:11:53Z"
  finalizers:
  - metrodrnetworkinstall.network.isf.ibm.com/finalizer
  generation: 6
  name: metrodr-network
  namespace: ibm-spectrum-fusion-ns
  ownerReferences:
  - apiVersion: metrodr.isf.ibm.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: MetroDR
    name: metrodrsite
    uid: b5501ab2-4865-4866-aa78-1ef8ca5199f1
  resourceVersion: "39680914"
  uid: 7a2b39e6-9d21-4f8c-9bc2-4e6ce4139415
spec:
  installSubmariner: uninstall
  vlanAdditionRequeDuration: 2m
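The same spec change can be applied non-interactively with oc patch instead of oc edit. A minimal sketch, using the CR name and namespace from the example YAML above:
# Set installSubmariner to uninstall without opening an editor.
oc patch mni metrodr-network -n ibm-spectrum-fusion-ns --type merge -p '{"spec":{"installSubmariner":"uninstall"}}'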
- Modify the network definition in the network CR on the surviving site.
- Take a backup of networks.operator.openshift.io/cluster.
oc get networks.operator.openshift.io/cluster -oyaml > /tmp/networks.operator.openshift.io_cluster.yaml
- Edit and remove the daemon-network entries.
oc edit networks.operator.openshift.io/cluster
- Delete the following route entries pointing to the other site:
name: daemon-network
namespace: ibm-spectrum-scale
rawCNIConfig: '{ "cniVersion": "0.3.1", "name": "daemon-network", "type": "bridge",
  "bridge": "br3201", "mtu": 9000, "ipam": { "type": "static", "routes": [
  { "dst": "192.168.192.0/18", "gw": "192.168.128.1" },
  { "dst": "192.168.192.200/32", "gw": "192.168.128.1" } ] } }'
type: Raw
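Before editing, it is worth confirming that the backup actually captured the daemon-network definition. A minimal sketch; the file path matches the backup command above:
# Fail loudly if the backup is missing, empty, or lacks the daemon-network entries.
test -s /tmp/networks.operator.openshift.io_cluster.yaml || echo "backup file missing or empty"
grep -n daemon-network /tmp/networks.operator.openshift.io_cluster.yaml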
- Restart the Scale pods for the changes to take effect. Note: Check that the Scale pods are restarted after this step and make a note of it.
- Do not break the route to the tiebreaker. Run the following commands to validate that daemon connectivity to the tiebreaker is available.
oc project ibm-spectrum-scale
oc get po
Example output:
NAME            READY   STATUS    RESTARTS      AGE
compute-1-ru5   2/2     Running   2             14d
compute-1-ru6   2/2     Running   9 (18h ago)   14d
compute-1-ru7   2/2     Running   9 (18h ago)   14d
control-1-ru2   2/2     Running   2             14d
control-1-ru3   2/2     Running   2             14d
control-1-ru4   2/2     Running   2             14d
oc rsh compute-1-ru5    (any Running node)
mmgetstate -a
Example output:
 Node number  Node name                                             GPFS state
           1  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           2  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           3  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           4  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           5  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           6  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae1   unknown
           7  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae2   active
           8  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae2   active
           9  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae2   active
          10  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae2   active
          11  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae2   active
          12  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae2   active
          13  gpfs-tiebreaker                                       active
ping gpfs-tiebreaker
Example output:
PING gpfs-tiebreaker (192.168.192.200) 56(84) bytes of data.
64 bytes from gpfs-tiebreaker (192.168.192.200): icmp_seq=1 ttl=64 time=0.239 ms
64 bytes from gpfs-tiebreaker (192.168.192.200): icmp_seq=2 ttl=64 time=0.218 ms
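To validate tiebreaker reachability from every core pod rather than a single one, a minimal loop sketch (it assumes ping is available in the pods, as in the example above):
# Ping the tiebreaker twice from each pod in the ibm-spectrum-scale namespace.
for pod in $(oc get po -n ibm-spectrum-scale -o name); do
  echo "== $pod =="
  oc exec -n ibm-spectrum-scale "$pod" -- ping -c 2 gpfs-tiebreaker
done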
- Run exit to leave the pod shell.
- Check for the presence of static routes on each Scale core pod. If they are present, manually remove the entry on each Scale core pod to immediately break the daemon connection to the failed site.
- Check the route entries on each Scale pod.
oc project ibm-spectrum-scale
oc get po
Example output:
NAME            READY   STATUS    RESTARTS      AGE
compute-1-ru5   2/2     Running   2             14d
compute-1-ru6   2/2     Running   9 (18h ago)   14d
compute-1-ru7   2/2     Running   9 (18h ago)   14d
control-1-ru2   2/2     Running   2             14d
control-1-ru3   2/2     Running   2             14d
control-1-ru4   2/2     Running   2             14d
oc rsh compute-1-ru5    (any Running node)
- Check the existing routes.
route -n
Kernel IP routing table
Destination      Gateway         Genmask           Flags   Metric   Ref   Use   Iface
0.0.0.0          10.135.0.1      0.0.0.0           UG      0        0     0     eth0
10.135.0.0       0.0.0.0         255.255.254.0     U       0        0     0     eth0
192.168.128.0    192.168.192.1   255.255.192.0     UG      0        0     0     net1
192.168.192.0    0.0.0.0         255.255.192.0     U       0        0     0     net1
192.168.192.200  192.168.192.1   255.255.255.255   UGH     0        0     0     net1
- Delete the daemon network route between the two sites.
route del -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
- Ensure that the route is deleted.
route -n
Example output:
Kernel IP routing table
Destination      Gateway         Genmask           Flags   Metric   Ref   Use   Iface
0.0.0.0          10.135.0.1      0.0.0.0           UG      0        0     0     eth0
10.135.0.0       0.0.0.0         255.255.254.0     U       0        0     0     eth0
192.168.192.0    0.0.0.0         255.255.192.0     U       0        0     0     net1
192.168.192.200  192.168.192.1   255.255.255.255   UGH     0        0     0     net1
- Run exit to leave the pod shell.
- Repeat steps 7.a through 7.d for all the compute nodes, or use the loop sketch below.
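Rather than repeating the rsh steps by hand, the route deletion and verification can be looped over the core pods. A minimal sketch; the route values match the example above, the loop iterates over all core pods (restrict the list if you only need the compute nodes), and || true keeps the loop going on pods where the route is already gone:
for pod in $(oc get po -n ibm-spectrum-scale -o name); do
  echo "== $pod =="
  # Delete the cross-site daemon route if it is present, then show what remains.
  oc exec -n ibm-spectrum-scale "$pod" -- route del -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0 || true
  oc exec -n ibm-spectrum-scale "$pod" -- route -n
done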
- Capture a Scale must-gather for future reference.
- Deploy the application on the surviving site.
- Check whether the application deployment works correctly.
- Fail over the applications and wait for the failover to complete.
- Recover the failed site.
- Delete the failed-over VMs on the failed site.
- Remove the added label scale.spectrum.ibm.com/designation=quorum from the two compute nodes.
oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
Sample output:
NAME                                 STATUS   ROLES    AGE   VERSION
compute-1-ru6.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
compute-1-ru7.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
Run the following commands to remove the added labels:
oc label node compute-1-ru6.rackae2.mydomain.com scale.spectrum.ibm.com/designation-
oc label node compute-1-ru7.rackae2.mydomain.com scale.spectrum.ibm.com/designation-
- Verify that the labels are removed from the compute nodes. The command must return no nodes.
oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
- Scale down the deployments and delete the VMs of all failed-over applications or VMs on the recovered site:
- Check for application deployments under each namespace.
Example command:
oc get deployments --namespace=<namespace>
oc get deployments --namespace=wordpressapp1-ns
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
wordpressapp1         0/0     0            0           21h
wordpressapp1-mysql   0/0     0            0           21h
- Scale down the application deployments to 0.
oc scale deployments/wordpressapp1 --namespace=wordpressapp1-ns --replicas 0
oc scale deployments/wordpressapp1-mysql --namespace=wordpressapp1-ns --replicas 0
- Run the following command to delete the VMs.
oc delete vm --all -n myvmnamespace
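The scale-down and VM deletion can also be done in one pass per namespace. A minimal sketch; the namespace list is an assumption (wordpressapp1-ns and myvmnamespace are the placeholders used above):
for ns in wordpressapp1-ns myvmnamespace; do
  # Scale every deployment in the namespace down to zero replicas.
  oc scale deployment --all --replicas=0 -n "$ns"
  # Delete all VirtualMachine resources in the namespace (no-op if none exist).
  oc delete vm --all -n "$ns"
done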
- Reapply network configurations on the surviving site.
- Reinstall Submariner.
- Get the status of Submariner.
oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
Sample YAML:
apiVersion: network.isf.ibm.com/v1
..
    reason: LastReconcileCycleSucceded
    status: "True"
    type: Available
  networkInstallStatus:
    messageCode:
    message: Submariner uninstalled successfully.
    progressPercentage: 100
- Edit the Submariner CR and set installSubmariner to install.
oc edit mni metrodr-network
spec:
  installSubmariner: install
- Check the installation status.
oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
Output:
message: Submariner installation completed successfully
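The edit and status check can also be scripted. A minimal sketch that applies the spec change with oc patch and polls for the success message shown above:
oc patch mni metrodr-network -n ibm-spectrum-fusion-ns --type merge -p '{"spec":{"installSubmariner":"install"}}'
# Poll until the install status reports success.
until oc get mni metrodr-network -n ibm-spectrum-fusion-ns -oyaml | grep -q "Submariner installation completed successfully"; do
  echo "waiting for Submariner install..."
  sleep 30
done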
- Add the daemon-network entries back as follows:
- Run the following oc command.
oc edit networks.operator.openshift.io/cluster
- Copy the daemon-network entries (removed in step 6) from the backup file.
For example:
name: daemon-network
namespace: ibm-spectrum-scale
rawCNIConfig: '{ "cniVersion": "0.3.1", "name": "daemon-network", "type": "bridge",
  "bridge": "br3201", "mtu": 9000, "ipam": { "type": "static", "routes": [
  { "dst": "192.168.192.0/18", "gw": "192.168.128.1" },
  { "dst": "192.168.192.200/32", "gw": "192.168.128.1" } ] } }'
type: Raw
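To locate the saved block for pasting back into the editor, a minimal sketch that prints the matching lines with some context from the backup file (the context sizes are a guess; widen them if the block is longer):
# Show the saved daemon-network definition from the backup taken in step 6.
grep -n -B 2 -A 6 daemon-network /tmp/networks.operator.openshift.io_cluster.yaml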
- Run the following command to check whether the daemon-network entries are added correctly.
oc get networks.operator.openshift.io/cluster -oyaml
- Restart the Scale core pods to update the configuration, for example with the rolling restart sketched below.
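A minimal rolling-restart sketch. Deleting the pods one at a time and waiting for each replacement is an assumption made here to avoid losing quorum, and it relies on the operator recreating each core pod with the same name:
for pod in $(oc get po -n ibm-spectrum-scale -o custom-columns=:metadata.name --no-headers); do
  oc delete pod "$pod" -n ibm-spectrum-scale
  sleep 15   # give the operator time to recreate the pod
  oc wait -n ibm-spectrum-scale --for=condition=Ready "pod/$pod" --timeout=300s
done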
- Check whether the routes are present on the Scale core pods; if they are not, add the route entries on each Scale pod.
oc project ibm-spectrum-scale
oc get po
Sample output:
NAME            READY   STATUS    RESTARTS      AGE
compute-1-ru5   2/2     Running   2             14d
compute-1-ru6   2/2     Running   9 (18h ago)   14d
compute-1-ru7   2/2     Running   9 (18h ago)   14d
control-1-ru2   2/2     Running   2             14d
control-1-ru3   2/2     Running   2             14d
control-1-ru4   2/2     Running   2             14d
oc rsh compute-1-ru5    (any Running node)
route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
route -n
Sample output:
Kernel IP routing table
Destination      Gateway         Genmask           Flags   Metric   Ref   Use   Iface
0.0.0.0          10.135.0.1      0.0.0.0           UG      0        0     0     eth0
10.135.0.0       0.0.0.0         255.255.254.0     U       0        0     0     eth0
192.168.128.0    192.168.192.1   255.255.192.0     UG      0        0     0     net1
192.168.192.0    0.0.0.0         255.255.192.0     U       0        0     0     net1
192.168.192.200  192.168.192.1   255.255.255.255   UGH     0        0     0     net1
- Run the following command to check whether the routes are present.
route -n
If the routes are not present, add them and verify using the following commands:
route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
route -n
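The same check-and-add can be looped over all core pods instead of one rsh session at a time. A minimal sketch; the grep guard adds the route only when it is missing:
for pod in $(oc get po -n ibm-spectrum-scale -o name); do
  echo "== $pod =="
  # Add the cross-site daemon route only if route -n does not already list it.
  oc exec -n ibm-spectrum-scale "$pod" -- sh -c 'route -n | grep -q "^192\.168\.128\.0" || route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0'
  oc exec -n ibm-spectrum-scale "$pod" -- route -n
done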
- Check the Scale pod status. The GPFS state of every node must be active.
oc project ibm-spectrum-scale
oc get po
Sample output:
NAME            READY   STATUS    RESTARTS      AGE
compute-1-ru5   2/2     Running   2             14d
compute-1-ru6   2/2     Running   9 (18h ago)   14d
compute-1-ru7   2/2     Running   9 (18h ago)   14d
control-1-ru2   2/2     Running   2             14d
control-1-ru3   2/2     Running   2             14d
control-1-ru4   2/2     Running   2             14d
oc rsh compute-1-ru5    (any Running node)
mmgetstate -a
Sample output:
 Node number  Node name                                             GPFS state
-------------------------------------------------------------------------------
           1  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae1   active
           2  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae1   active
           3  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae1   active
           4  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae1   active
           5  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae1   active
           6  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae1   active
           7  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae2   active
           8  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae2   active
           9  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae2   active
          10  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae2   active
          11  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae2   active
          12  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae2   active
          13  gpfs-tiebreaker                                       active
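To spot any node that is not yet active at a glance, a minimal sketch run against one core pod; it assumes mmgetstate is on the PATH for non-interactive exec, as it is in the rsh shell shown above:
# Print only data rows whose GPFS state column is not "active"; no output means all nodes are active.
oc exec -n ibm-spectrum-scale compute-1-ru5 -- mmgetstate -a | awk 'NF == 3 && $3 != "active"'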