Metro-DR data fencing

Before you fail over applications to the surviving site, complete the following Metro-DR data fencing steps.

About this task

Note: Fencing is needed only in the case of an unplanned failover.

Procedure

  1. Designate two additional Scale quorum nodes.
  2. Add the label scale.spectrum.ibm.com/designation=quorum to any two worker nodes.
    oc label node <node> scale.spectrum.ibm.com/designation=quorum
    For example:
    oc label node compute-1-ru6.rackae2.mydomain.com scale.spectrum.ibm.com/designation=quorum
    node/compute-1-ru6.rackae2.mydomain.com labeled
    
  3. Verify the node status.
    oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
    Sample output:
    NAME                                 STATUS   ROLES    AGE   VERSION
    compute-1-ru6.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
    compute-1-ru7.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
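    The labeling and verification in steps 2-3 can be sketched as a small helper. This is a sketch only; the node names in the usage example are the sample names from the output above.

```shell
# A minimal sketch of steps 2-3: label each given worker node for Scale
# quorum, then list the labeled nodes (node names are examples).
label_quorum_nodes() {
  local node
  for node in "$@"; do
    oc label node "$node" scale.spectrum.ibm.com/designation=quorum
  done
  oc get nodes -l scale.spectrum.ibm.com/designation=quorum
}
# Example:
# label_quorum_nodes compute-1-ru6.rackae2.mydomain.com compute-1-ru7.rackae2.mydomain.com
```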
    
  4. Uninstall Submariner from site-2 to disrupt the admin network connectivity across the sites.
    Submariner goes down due to the unavailability of the other site.
  5. Edit the mni CR and set installSubmariner to uninstall. When the other site comes back up, Submariner is uninstalled there as well. This step prevents a network connection from being accidentally re-established between the sites.
    oc edit mni metrodr-network
    Run the following command to check the edited YAML.
    oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
    Example YAML:
    
    apiVersion: network.isf.ibm.com/v1
    kind: MetroDRNetworkInstall
    metadata:
      creationTimestamp: "2023-07-10T06:11:53Z"
      finalizers:
      - metrodrnetworkinstall.network.isf.ibm.com/finalizer
      generation: 6
      name: metrodr-network
      namespace: ibm-spectrum-fusion-ns
      ownerReferences:
      - apiVersion: metrodr.isf.ibm.com/v1
        blockOwnerDeletion: true
        controller: true
        kind: MetroDR
        name: metrodrsite
        uid: b5501ab2-4865-4866-aa78-1ef8ca5199f1
      resourceVersion: "39680914"
      uid: 7a2b39e6-9d21-4f8c-9bc2-4e6ce4139415
    spec:
      installSubmariner: uninstall
      vlanAdditionRequeDuration: 2m
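    As an alternative to the interactive oc edit in step 5, the same spec change can be applied non-interactively with oc patch. This is a sketch; it assumes the CR name and namespace shown in the example YAML above.

```shell
# Non-interactive alternative to `oc edit` for changing installSubmariner
# (a sketch; CR name and namespace taken from the example YAML).
set_submariner_mode() {  # usage: set_submariner_mode uninstall|install
  oc patch mni metrodr-network -n ibm-spectrum-fusion-ns \
    --type merge -p "{\"spec\":{\"installSubmariner\":\"$1\"}}"
}
# Example: set_submariner_mode uninstall
```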
    
  6. Modify the network definition in the network CR on the surviving site.
    1. Take a backup of networks.operator.openshift.io/cluster.
      oc get networks.operator.openshift.io/cluster -oyaml >/tmp/networks.operator.openshift.io_cluster.yaml
    2. Edit the CR and remove the daemon-network entries.
      oc edit networks.operator.openshift.io/cluster
    3. Delete the following route entries pointing to the other site:
      name: daemon-network
      namespace: ibm-spectrum-scale
      rawCNIConfig: '{ "cniVersion": "0.3.1", "name": "daemon-network", "type": "bridge",
      "bridge": "br3201", "mtu": 9000, "ipam": { "type": "static","routes": [{ "dst":
      "192.168.192.0/18", "gw": "192.168.128.1"}, { "dst": "192.168.192.200/32", "gw":
      "192.168.128.1"}]}}'
      type: Raw
    4. Restart the Scale core pods for the changes to take effect.
      Note: Verify that the Scale pods restarted after this step.
    5. Do not break the route to the tiebreaker.
      Run the following commands to validate that the daemon connectivity to the tiebreaker is available.
      
      oc project ibm-spectrum-scale
      oc get po
      Example output:
      NAME                               READY   STATUS    RESTARTS      AGE
      compute-1-ru5                      2/2     Running   2             14d
      compute-1-ru6                      2/2     Running   9 (18h ago)   14d
      compute-1-ru7                      2/2     Running   9 (18h ago)   14d
      control-1-ru2                      2/2     Running   2             14d
      control-1-ru3                      2/2     Running   2             14d
      control-1-ru4                      2/2     Running   2             14d
      
      
      oc rsh compute-1-ru5 //any Running node
      mmgetstate -a
      Example output:
       Node number  Node name                                            GPFS state  
       -------------------------------------------------------------------------------
                  1  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 2  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 3  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 4  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 5  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 6  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae1  unknown
                 7  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae2  active
                 8  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae2  active
                 9  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae2  active
                10  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae2  active
                11  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae2  active
                12  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae2  active
                13  gpfs-tiebreaker                                      active
      
      ping gpfs-tiebreaker
      Example output:
      PING gpfs-tiebreaker (192.168.192.200) 56(84) bytes of data.
      64 bytes from gpfs-tiebreaker (192.168.192.200): icmp_seq=1 ttl=64 time=0.239 ms
      64 bytes from gpfs-tiebreaker (192.168.192.200): icmp_seq=2 ttl=64 time=0.218 ms
      
    6. Run exit to leave the pod shell.
  7. Check for the presence of static routes on each Scale core pod. If they are present, manually remove the entry on each Scale core pod to immediately break the daemon connection to the failed site.
    1. Check the route entries on each Scale pod.
      
      oc project ibm-spectrum-scale
      oc get po
      Example output:
      NAME                               READY   STATUS    RESTARTS      AGE
      compute-1-ru5                      2/2     Running   2             14d
      compute-1-ru6                      2/2     Running   9 (18h ago)   14d
      compute-1-ru7                      2/2     Running   9 (18h ago)   14d
      control-1-ru2                      2/2     Running   2             14d
      control-1-ru3                      2/2     Running   2             14d
      control-1-ru4                      2/2     Running   2             14d
      
      
      oc rsh compute-1-ru5 //any Running node
    2. Check existing routes.
      route -n
      Example output:
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
      0.0.0.0         10.135.0.1      0.0.0.0         UG    0      0        0 eth0
      10.135.0.0      0.0.0.0         255.255.254.0   U     0      0        0 eth0
      192.168.128.0   192.168.192.1   255.255.192.0   UG    0      0        0 net1
      192.168.192.0   0.0.0.0         255.255.192.0   U     0      0        0 net1
      192.168.192.200 192.168.192.1   255.255.255.255 UGH   0      0        0 net1
      
    3. Delete the daemon network route between the two sites.
      route del -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
    4. Ensure that the route is deleted.
      route -n
      Example output:
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
      0.0.0.0         10.135.0.1      0.0.0.0         UG    0      0        0 eth0
      10.135.0.0      0.0.0.0         255.255.254.0   U     0      0        0 eth0
      192.168.192.0   0.0.0.0         255.255.192.0   U     0      0        0 net1
      192.168.192.200 192.168.192.1   255.255.255.255 UGH   0      0        0 net1
      
    5. Run exit to leave the pod shell.
    6. Repeat steps 7.a through 7.d on all the compute nodes.
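    The repeat across compute nodes can be scripted in one pass instead of one rsh session per pod. This is a sketch: it assumes the compute pod names start with "compute-" as in the sample output, and the two-container core pods may require an explicit -c <container> flag on oc exec.

```shell
# Sketch: delete the inter-site daemon route on every compute core pod.
# Assumes pod names start with "compute-"; -c <container> may be needed.
remove_intersite_routes() {
  local ns=ibm-spectrum-scale pod
  for pod in $(oc get pods -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    case "$pod" in
      compute-*)
        oc exec -n "$ns" "$pod" -- \
          route del -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0 || true
        ;;
    esac
  done
}
# Example: remove_intersite_routes
```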
  8. Capture a Scale must-gather for any future reference.
  9. Deploy the application on the surviving site.
  10. Check whether the application deployment works correctly.
  11. Fail over the applications and wait for the failover to complete.
  12. Recover the failed site.
  13. Delete failed over VMs on the failed site.
  14. Remove the added label scale.spectrum.ibm.com/designation=quorum from the two compute nodes.
    oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
    Sample output:
    NAME                                 STATUS   ROLES    AGE   VERSION
    compute-1-ru6.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
    compute-1-ru7.rackae2.mydomain.com   Ready    worker   14d   v1.23.17+16bcd69
    
    Run the following commands to remove the added labels:
    oc label node compute-1-ru6.rackae2.mydomain.com scale.spectrum.ibm.com/designation-
    oc label node compute-1-ru7.rackae2.mydomain.com scale.spectrum.ibm.com/designation-
    
    
  15. Verify whether the labels are removed from the compute nodes.
    oc get nodes -l scale.spectrum.ibm.com/designation=quorum | grep compute
  16. Scale down the deployments and delete the VMs of all failed-over applications on the recovered site:
    1. Check for application deployments under each namespace:
      oc get deployments --namespace=<namespace>
      Example command and output:
      oc get deployments --namespace=wordpressapp1-ns
      NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
      wordpressapp1         0/0     0            0           21h
      wordpressapp1-mysql   0/0     0            0           21h
      
    2. Scale down application deployments to 0.
      
      oc scale deployments/wordpressapp1 --namespace=wordpressapp1-ns --replicas 0
      oc scale deployments/wordpressapp1-mysql --namespace=wordpressapp1-ns --replicas 0
      
    3. Run the following command to delete all VMs in the namespace.
      oc delete vm --all -n myvmnamespace
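    The scale-down in step 16.b can be scripted for a whole namespace instead of naming each deployment. A sketch; the namespace in the usage example is the sample one from above.

```shell
# Sketch: scale every deployment in a namespace to 0 replicas.
scale_down_namespace() {  # usage: scale_down_namespace <namespace>
  local dep
  for dep in $(oc get deployments -n "$1" -o name); do
    oc scale "$dep" -n "$1" --replicas=0
  done
}
# Example: scale_down_namespace wordpressapp1-ns
```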
  17. Reapply the network configurations on the surviving site.
    1. Reinstall Submariner.
      1. Get the status of Submariner.
        oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
        Sample YAML:
        
        apiVersion: network.isf.ibm.com/v1
            ..
            reason: LastReconcileCycleSucceded
            status: "True"
            type: Available
          networkInstallStatus:
            messageCode:
              message: Submariner uninstalled successfully.
            progressPercentage: 100
        
        
      2. Edit the Submariner CR and set installSubmariner to install.
        oc edit mni metrodr-network

        spec:
          installSubmariner: install
        
        
      3. Check installation status:
        oc get mni metrodr-network -oyaml -n ibm-spectrum-fusion-ns
        Output:
        message: Submariner installation completed successfully
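        The status check can be polled until the installation finishes instead of re-reading the full YAML. This is a sketch; the status field path is an assumption based on the sample YAML above.

```shell
# Sketch: poll the install progress until it reaches 100.
# The jsonpath below assumes the status layout shown in the sample YAML.
wait_for_submariner_install() {
  local pct=""
  while [ "$pct" != "100" ]; do
    pct="$(oc get mni metrodr-network -n ibm-spectrum-fusion-ns \
      -o jsonpath='{.status.networkInstallStatus.progressPercentage}')"
    [ "$pct" = "100" ] || sleep 30
  done
}
```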
    2. Restore the daemon-network entries:
      1. Run the following oc command.
        oc edit networks.operator.openshift.io/cluster
      2. Copy the daemon-network entries (removed in step 6) from the backup file.
        For example:
        name: daemon-network
        namespace: ibm-spectrum-scale
        rawCNIConfig: '{ "cniVersion": "0.3.1", "name": "daemon-network", "type": "bridge",
        "bridge": "br3201", "mtu": 9000, "ipam": { "type": "static","routes": [{ "dst":
        "192.168.192.0/18", "gw": "192.168.128.1"}, { "dst": "192.168.192.200/32", "gw":
        "192.168.128.1"}]}}'
        type: Raw
      3. Run the following command to check whether daemon-network entries are added correctly.
        oc get networks.operator.openshift.io/cluster -oyaml
      4. Restart the Scale core pods to update the configuration.
    3. Check whether the routes are present on the Scale core pods; if they are missing, add the route entries on each Scale core pod.
      
      oc project ibm-spectrum-scale
      oc get po
      Sample output:
      NAME                               READY   STATUS    RESTARTS      AGE
      compute-1-ru5                      2/2     Running   2             14d
      compute-1-ru6                      2/2     Running   9 (18h ago)   14d
      compute-1-ru7                      2/2     Running   9 (18h ago)   14d
      control-1-ru2                      2/2     Running   2             14d
      control-1-ru3                      2/2     Running   2             14d
      control-1-ru4                      2/2     Running   2             14d
      
      oc rsh compute-1-ru5 //any Running node
      
      route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
      route -n
      Sample output:
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
      0.0.0.0         10.135.0.1      0.0.0.0         UG    0      0        0 eth0
      10.135.0.0      0.0.0.0         255.255.254.0   U     0      0        0 eth0
      192.168.128.0   192.168.192.1   255.255.192.0   UG    0      0        0 net1
      192.168.192.0   0.0.0.0         255.255.192.0   U     0      0        0 net1
      192.168.192.200 192.168.192.1   255.255.255.255 UGH   0      0        0 net1
      
    4. Run the following command to check whether the routes are present.
      route -n
      If the routes are not present, add them and verify by using the following commands.
      route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
      route -n
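      The check-then-add in step 17.d can be combined into one idempotent helper, run inside a core pod shell. A sketch using the route entries shown above.

```shell
# Sketch: add the inter-site route only when it is missing, then print
# the routing table (run inside a Scale core pod shell).
ensure_intersite_route() {
  route -n | grep -q '^192\.168\.128\.0 ' || \
    route add -net 192.168.128.0 gw 192.168.192.1 netmask 255.255.192.0
  route -n
}
```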
    5. Check the Scale pod status. The GPFS state of every node must be active.
      oc project ibm-spectrum-scale
      oc get po
      Sample output:
      NAME                               READY   STATUS    RESTARTS      AGE
      compute-1-ru5                      2/2     Running   2             14d
      compute-1-ru6                      2/2     Running   9 (18h ago)   14d
      compute-1-ru7                      2/2     Running   9 (18h ago)   14d
      control-1-ru2                      2/2     Running   2             14d
      control-1-ru3                      2/2     Running   2             14d
      control-1-ru4                      2/2     Running   2             14d
      
      
      oc rsh compute-1-ru5 //any Running node
      mmgetstate -a
      Sample output:
      Node number  Node name                                            GPFS state  
      -------------------------------------------------------------------------------
                 1  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae1  active
                 2  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae1  active
                 3  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae1  active
                 4  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae1  active
                 5  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae1  active
                 6  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae1  active
                 7  compute-1-ru5.daemon.ibm-spectrum-scale.stg.rackae2  active
                 8  compute-1-ru6.daemon.ibm-spectrum-scale.stg.rackae2  active
                 9  compute-1-ru7.daemon.ibm-spectrum-scale.stg.rackae2  active
                10  control-1-ru2.daemon.ibm-spectrum-scale.stg.rackae2  active
                11  control-1-ru3.daemon.ibm-spectrum-scale.stg.rackae2  active
                12  control-1-ru4.daemon.ibm-spectrum-scale.stg.rackae2  active
                13  gpfs-tiebreaker                                      active
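      The final verification can be automated with a small filter over the mmgetstate -a output shown above. A sketch; it only parses the output format above and assumes nothing else about the cluster.

```shell
# Sketch: read `mmgetstate -a` output on stdin and succeed only when every
# numbered node row reports GPFS state "active" (header and separator
# lines are skipped because their first field is not a number).
check_all_active() {
  awk 'NF >= 3 && $1 ~ /^[0-9]+$/ && $NF != "active" { bad++ } END { exit bad ? 1 : 0 }'
}
# Example, inside a core pod:
# mmgetstate -a | check_all_active && echo "all nodes active"
```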