Shutting down and restarting IBM Fusion HCI rack with Global Data Platform

Use this procedure to gracefully shut down and restart the IBM Fusion HCI rack with Global Data Platform storage.

Before you begin

Install OpenShift® Command Line Interface (CLI):
  1. Log in to OpenShift Container Platform web console.
  2. Click ? in the title bar, and click Command Line Tools.

    The Command Line Tools page is displayed.

  3. On the Command Line Tools page, click Download oc for <your platform>.
  4. Save the file.
  5. Unpack the downloaded archive file.
  6. Move the oc binary to a directory on your path.
  7. Run the file to install the OpenShift CLI.

Procedure

  1. Capture a system health check before you bring down the rack. The saved output helps you distinguish preexisting issues from new ones after power-on.
    1. Ensure that no machine configuration change or update is in progress and that every node is in the Ready state.
    2. To verify, run the following commands:
      oc get co
      oc get clusterversion
    3. Run the following commands to list the pods that are not in Running or Completed state, and the nodes:
      
      oc get po -A | grep -v Running | grep -v Completed
      oc get nodes
    4. Change to ibm-spectrum-scale namespace:
      oc project ibm-spectrum-scale   
    5. Log in to a running pod. For example, compute-1-ru5 pod:
      oc rsh compute-1-ru5
    6. Run the following command to get the state of the GPFS daemon on one or more nodes.
      mmgetstate -a
    7. Run the following command to display the current configuration information for a GPFS cluster.
      mmlscluster
    Note: Save the health check output on a different system so that it remains available while the rack is down.
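
The health check commands above can be collected into files so that they are easy to copy to another system and compare after power-on. The following is a minimal sketch, not part of the product: the function names, the output directory, and the file names are hypothetical, and it assumes that running the mm commands non-interactively with oc exec is equivalent to running them in an oc rsh session.

```shell
#!/usr/bin/env bash
# Sketch: save the pre-shutdown health check to files for later comparison.
# Assumption: directory layout and file names are illustrative only.

# capture FILE CMD...: run CMD and store its stdout and stderr in FILE.
capture() {
  local out="$1"
  shift
  "$@" > "$out" 2>&1 || echo "command failed: $*" >> "$out"
}

# health_check DIR [POD]: POD is a running pod in the ibm-spectrum-scale
# namespace, for example compute-1-ru5. Without POD, the GPFS checks are skipped.
health_check() {
  local dir="$1" pod="${2:-}"
  mkdir -p "$dir" || return 1
  capture "$dir/co.txt" oc get co
  capture "$dir/clusterversion.txt" oc get clusterversion
  capture "$dir/nodes.txt" oc get nodes
  capture "$dir/pods-all.txt" oc get po -A
  # Same filter as in the procedure: drop pods that are Running or Completed.
  grep -v Running "$dir/pods-all.txt" | grep -v Completed > "$dir/pods-not-running.txt" || true
  if [ -n "$pod" ]; then
    capture "$dir/mmgetstate.txt" oc exec -n ibm-spectrum-scale "$pod" -- mmgetstate -a
    capture "$dir/mmlscluster.txt" oc exec -n ibm-spectrum-scale "$pod" -- mmlscluster
  fi
}

# Example (not run automatically):
#   health_check "./health-$(date +%Y%m%d-%H%M%S)" compute-1-ru5
```

After the function completes, copy the resulting directory to a system outside the rack.
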
  2. Check whether any Backup & Restore jobs are active. If a Backup & Restore job or an application sync is in progress, wait for it to complete. Also wait for any in-progress workload operations to complete. Before you proceed with the shutdown of the storage cluster, ensure that no data transfer is in progress for any job or application.
  3. Run the following steps based on whether your rack is stand-alone or is in a disaster recovery setup (Metro-DR):
    Stand-alone
    1. Run the following command to shut down:
      mmshutdown -a
    2. Run the following command to verify whether all nodes are down:
      
      mmgetstate -a 
    3. Exit from the pod:
      exit
    Metro-DR

    If you plan to shut down a site, ensure that you failover your applications to the other site.

    1. Shut down the Scale pods on the affected site by running the mmshutdown command directly in the pod terminal.
    2. Run exit to exit from the pod.
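
For the stand-alone case, the check that all nodes are down after mmshutdown -a can be automated by parsing the mmgetstate -a output. The following sketch assumes that data rows start with a numeric node number, followed by the node name, with the GPFS state in the last column; the function name all_nodes_down is hypothetical. Verify the column layout against your own output before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: check `mmgetstate -a` output for nodes that are not yet down.
# Assumption: data rows look like "<number> <node name> <state>".

# all_nodes_down: reads `mmgetstate -a` output on stdin.
# Exit status: 0 if every node is down, 1 if any node is in another state,
# 2 if no node rows were found in the input.
all_nodes_down() {
  awk '
    $1 ~ /^[0-9]+$/ {
      rows++
      if ($NF != "down") { print $2 " is " $NF; bad = 1 }
    }
    END {
      if (!rows) exit 2
      exit (bad ? 1 : 0)
    }'
}

# Example, inside the pod (not run automatically):
#   mmgetstate -a | all_nodes_down && echo "all nodes are down"
```
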
  4. Run the following storage commands to shut down the storage cluster.
    1. Switch the project to ibm-spectrum-scale-operator.
      oc project ibm-spectrum-scale-operator
    2. Scale the ibm-spectrum-scale-controller-manager deployment to zero replicas:
      oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager
    3. Switch the project to ibm-spectrum-scale.
      oc project ibm-spectrum-scale  
    4. Log in to compute-1-ru<x>:
      oc rsh compute-1-ru<x> 
  5. If you have enabled IBM Data Cataloging, place the service in an idle state on the Red Hat® OpenShift environment. For more information about the shutdown procedure in IBM Data Cataloging, see Graceful shutdown.
  6. Shut down the Red Hat OpenShift Container Platform cluster.
    1. If the cluster-wide proxy is enabled, export the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables on the bastion node from which you intend to run oc commands. To check whether the proxy is enabled, run the following command:
      oc get proxy cluster -o yaml
    2. Take an etcd backup. For <node_name>, use any one control node:
      oc debug node/<node_name>
      sh-4.15# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
    3. Copy the following etcd backup files to an external system:
      snapshot_.db and static_kuberesources_.tar.gz
      You can use the oc rsync command to copy the files to an external system. You need two terminals for this operation.
      1. Open terminal one.
      2. Run the following commands for etcd backup:
        
        oc debug node/<node_name> 
        sh-4.15# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
        In the oc debug node/<node_name> command, use any one control node.
      3. Run the following command and record the new pod name. This pod is the source pod, and the backup files reside inside it.

        oc debug
        Do not close terminal one.
      4. Open terminal two and run the following command to copy the file to the local folder:
        oc -n <namespace_of_debug_pod> rsync <source_podname_in_above_step>:/home/core/assets/backup/snapshot_.db <local_folder_path> 

        If required, add the namespace of the debug node pod location.

      5. Repeat the previous step to copy the other backup file to the external system.
      6. Close the terminal windows after all the files are copied.
      For more information about the procedure, see Copy local files to or from a remote directory.
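
Before you close the terminals, it can be worth confirming that both backup files arrived in the local folder and are not empty. The following is a minimal sketch; the function names are hypothetical, and the glob patterns simply match the two file name stems listed above with whatever suffix your backup produced.

```shell
#!/usr/bin/env bash
# Sketch: confirm that both etcd backup files were copied and are non-empty.
# Assumption: the local folder is the one passed to oc rsync.

# has_nonempty FILE...: succeeds if at least one of the files exists and has content.
has_nonempty() {
  local f
  for f in "$@"; do
    [ -s "$f" ] && return 0
  done
  return 1
}

# backup_complete DIR: reports which of the two backup files is missing or empty.
backup_complete() {
  local dir="$1" status=0
  has_nonempty "$dir"/snapshot_*.db || {
    echo "missing or empty: snapshot_*.db in ${dir}"
    status=1
  }
  has_nonempty "$dir"/static_kuberesources_*.tar.gz || {
    echo "missing or empty: static_kuberesources_*.tar.gz in ${dir}"
    status=1
  }
  return "$status"
}

# Example (not run automatically):
#   backup_complete <local_folder_path> && echo "etcd backup copied"
```
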
    4. Ensure that you stop the workloads before you shut down the nodes.
    5. Run the following commands to shut down the nodes:
      Ensure that the control node hosting the IBM Fusion operators is powered off last. Shutting down this node prematurely results in loss of access to both the IBM Fusion and OpenShift Container Platform user interfaces.
      Finally, shut down the OpenShift control plane nodes.
      
      for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}');
      do oc debug node/${node} -- chroot /host shutdown -h 1;
      done
      After 3 to 5 minutes, the Red Hat OpenShift Container Platform becomes inaccessible.

      This step brings down all the software on the rack. The rack is ready to be powered off.
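
The shutdown loop in this step visits nodes in whatever order oc get nodes returns them. One way to honor the requirement that the control node hosting the IBM Fusion operators goes last is to reorder the node list before looping. The following is a sketch under assumptions: the function names and the DRY_RUN variable are hypothetical, and you must determine the name of the node to keep for last yourself.

```shell
#!/usr/bin/env bash
# Sketch: issue the shutdown command to every node, keeping one node for last.
# Assumption: LAST_NODE is the control node that hosts the IBM Fusion operators.

# order_nodes LAST_NODE: reads node names on stdin, one per line, and prints
# them in the same order with LAST_NODE moved to the end.
order_nodes() {
  local last="$1"
  awk -v last="$last" '
    $0 == last { found = 1; next }
    { print }
    END { if (found) print last }'
}

# shutdown_in_order LAST_NODE: runs the same command as the procedure for each
# node. With DRY_RUN=1 it only prints the commands.
shutdown_in_order() {
  local last="$1" node
  for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | order_nodes "$last"); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: oc debug node/${node} -- chroot /host shutdown -h 1"
    else
      oc debug "node/${node}" -- chroot /host shutdown -h 1
    fi
  done
}

# Example (not run automatically):
#   DRY_RUN=1 shutdown_in_order <node_name>
```

Running with DRY_RUN=1 first lets you review the order before any node is actually shut down.
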

  7. Physically press the power off button of the nodes.
    Note:
    • This physical power off indicates to the Baseboard Management Controller (BMC) that you intend to keep the node powered down and prevents automatic restart.
    • The switches do not have a shutdown option; they can only be rebooted. When you power off the entire rack (unplugged), the switches shut down automatically. Similarly, when power is restored to the rack, the switches come up automatically.
  8. Power on the rack and verify that the cluster is up.
    1. Power on the rack.
    2. Go to each physical node and press the power button to power it on.
      Power on all control nodes first. After all control nodes are up, power on the compute nodes.
    3. After all the nodes are up and cluster operators are up (except image registry), run the following commands to ensure that the OpenShift cluster is up along with the IBM Fusion operators.
      oc get po -A | grep -v Running | grep -v Completed
      oc get co  
      oc get nodes
    4. For Global Data Platform, bring Scale back up by scaling the controller manager deployment to one replica:
      oc project ibm-spectrum-scale-operator
      oc scale --replicas=1 deployment ibm-spectrum-scale-controller-manager

      Wait a few minutes, and then check the cluster or storage dashboard.

    5. Run the following commands to ensure that the storage pods are up:
      Global Data Platform
      1. Switch namespace to ibm-spectrum-scale:
        oc project ibm-spectrum-scale
        
      2. Verify whether all pods are in running state in the ibm-spectrum-scale project:
        oc get pods 
      3. To run commands on a node, run the following rsh command:
        oc rsh compute-1-ru<x>
      4. Run the following command to get the state of the GPFS daemon on one or more nodes.
        mmgetstate -a
      5. Switch project to ibm-spectrum-scale-csi:
        oc project ibm-spectrum-scale-csi
      6. Verify whether all pods are in running state in the ibm-spectrum-scale-csi project. This might take some time.
        oc get pods
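
Because the storage pods can take some time to start, the oc get pods checks above can be wrapped in a polling loop instead of being rerun by hand. The following is a minimal sketch; the function names, the retry count, and the polling interval are hypothetical defaults, and it assumes the usual oc get pods column order of NAME, READY, STATUS.

```shell
#!/usr/bin/env bash
# Sketch: wait until every pod in a namespace is Running or Completed.
# Assumption: STATUS is the third column of `oc get pods --no-headers`.

# not_ready_pods: reads `oc get pods --no-headers` output on stdin and prints
# pods in any other state. Fails if such pods exist or if the input is empty.
not_ready_pods() {
  awk '
    $3 != "Running" && $3 != "Completed" { print $1 " " $3; n++ }
    END { exit ((n || NR == 0) ? 1 : 0) }'
}

# wait_for_pods NAMESPACE [TRIES] [INTERVAL_SECONDS]
wait_for_pods() {
  local ns="$1" tries="${2:-30}" interval="${3:-20}" i=0
  while [ "$i" -lt "$tries" ]; do
    if oc get pods -n "$ns" --no-headers 2>/dev/null | not_ready_pods > /dev/null; then
      echo "all pods in ${ns} are Running or Completed"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for pods in ${ns}"
  return 1
}

# Example (not run automatically):
#   wait_for_pods ibm-spectrum-scale && wait_for_pods ibm-spectrum-scale-csi
```
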
  9. Bring back IBM Data Cataloging to a running state.