Shutting down and restarting IBM Storage Fusion HCI System rack
Procedure to gracefully restart the IBM Storage Fusion rack.
Before you begin
- Log in to the OpenShift Container Platform web console.
- Click ? in the title bar, and then click Command Line Tools. The Command Line Tools page is displayed.
- On the Command Line Tools page, click Download oc for <your platform>.
- Save the file.
- Unpack the downloaded archive file.
- Move the oc binary to a directory on your PATH to install the OpenShift CLI.
Procedure
- Capture a health check report before you bring down the rack. It helps you distinguish preexisting issues from new ones after power-on.
Note: Save the health report to a different system.
- Global Data Platform
- For Global Data Platform storage, run the following oc commands to get the system health status before shutdown:
- Run the following commands to list the pods, cluster operators, and nodes:
oc get po -A | grep -v Running | grep -v Completed
oc get co
oc get nodes
- Change to the ibm-spectrum-scale namespace:
oc project ibm-spectrum-scale
- Log in to a running pod. For example, the compute-1-ru5 pod:
oc rsh compute-1-ru5
- Run the following command to get the state of the GPFS daemon on one or more nodes:
mmgetstate -a
- Run the following command to display the current configuration information for a GPFS
cluster.
mmlscluster
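The `mmgetstate -a` check above can be scripted so that a non-active node is not missed by eye. A minimal sketch, assuming the captured `mmgetstate -a` output is piped in; the node names in the sample are illustrative only:

```shell
# Print any GPFS node that is not in the "active" state.
inactive_gpfs_nodes() {
  # Table rows start with a numeric node number; the state is the last column.
  awk '$1 ~ /^[0-9]+$/ && $NF != "active" { print $2 }'
}

# Sample captured output (hypothetical nodes and states):
sample=' Node number  Node name       GPFS state
-------------------------------------------
       1      compute-1-ru5   active
       2      compute-1-ru6   down
       3      compute-1-ru7   active'

printf '%s\n' "$sample" | inactive_gpfs_nodes
```

An empty result means every node reported active before the shutdown.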
- Fusion Data Foundation
- For Fusion Data Foundation storage, run the following oc commands to get the system health status before shutdown:
- Check whether all cluster operators are Available and not in a Degraded state. To verify, run the following command:
oc get co
- Check whether any updates are Progressing on the cluster. To verify, run the following command:
oc get clusterversion
- Check that the Fusion Data Foundation cluster is in a healthy state:
- In the Red Hat® OpenShift web console, click Storage > Data Foundation.
- In the Status card of the Overview tab, click Storage system.
- In the notification window, click the Storage system link.
- In the Status card of the Block and File tab, verify that the storage cluster is in a healthy state.
- In the Details card, verify the cluster information.
- Check whether any Backup & Restore jobs are active. If a Backup & Restore job or application sync is in progress, wait for it to complete, and wait for any in-progress workload operations to finish. Before you proceed with the shutdown of the storage cluster, ensure that no job or application has data transfers in progress.
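The cluster operator check above can also be automated. A sketch that filters `oc get co` output for anything not healthy, assuming the default table layout (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE); the sample operator states are hypothetical:

```shell
# Print operators that are not Available, are Progressing, or are Degraded.
unhealthy_cluster_operators() {
  # Columns 3, 4, 5 are AVAILABLE, PROGRESSING, DEGRADED; skip the header row.
  awk 'NR > 1 && ($3 != "True" || $4 == "True" || $5 == "True") { print $1 }'
}

# Illustrative sample of `oc get co` output:
sample='NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.15.0    True        False         False      10d
image-registry   4.15.0    True        True          False      5m
storage          4.15.0    False       False         True       2m'

printf '%s\n' "$sample" | unhealthy_cluster_operators
```

On a live cluster you would pipe `oc get co` into the function; an empty result means it is safe to proceed.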
- Run the following steps based on whether your rack is stand-alone or in a disaster recovery setup (Metro-DR):
- Stand-alone
-
- Run the following command to shut down:
mmshutdown -a
- Run the following command to verify whether all nodes are down:
mmgetstate -a
- Exit from the pod:
exit
- Metro-DR
-
If you plan to shut down a site, ensure that you failover your applications to the other site.
- Shut down the Scale pods on the affected site by running mmshutdown directly in the pod terminal.
- Run exit to exit from the pod.
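In either case, the `mmgetstate -a` verification can be checked mechanically. A sketch that succeeds only when every GPFS node reports the "down" state; node names are illustrative:

```shell
# Exit 0 only if no table row in `mmgetstate -a` output shows a state
# other than "down".
all_gpfs_nodes_down() {
  ! awk '$1 ~ /^[0-9]+$/ && $NF != "down"' | grep -q .
}

# Sample output after a successful mmshutdown (hypothetical nodes):
after_shutdown='       1      compute-1-ru5   down
       2      compute-1-ru6   down'

if printf '%s\n' "$after_shutdown" | all_gpfs_nodes_down; then
  echo "all nodes down"
fi
```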
- Run the following commands to shut down the storage cluster:
- Global Data Platform cluster
-
- Switch the project to ibm-spectrum-scale-operator:
oc project ibm-spectrum-scale-operator
- Scale the controller manager deployment down to zero replicas:
oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager
- Switch the project to ibm-spectrum-scale:
oc project ibm-spectrum-scale
- Log in to compute-1-ru<x>:
oc rsh compute-1-ru<x>
- Fusion Data Foundation
- For Fusion Data Foundation nodes with status Running, stop the applications that consume storage from Fusion Data Foundation so that I/O stops. The shutdown sequence is important to avoid data corruption and pod failures: shut down the application components before you shut down the nodes. Scale down the application deployments so that the application pods do not start on other nodes if node selectors are not set; this also stops storage I/O from the applications.
- Place the Data Cataloging service in an idle state on the Red Hat OpenShift environment. For more information about the procedure, see Graceful shutdown.
- Shut down the Red Hat OpenShift Container Platform cluster.
- If the cluster-wide proxy is enabled, be sure to export the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables. To check whether the proxy is enabled, run:
oc get proxy cluster -o yaml
- Take an etcd backup:
oc debug node/<node_name> (use any one control node)
sh-4.15# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
- Copy the etcd backup to an external system.
The backup consists of the snapshot_.db and static_kuberesources_.tar.gz files. You can use the oc rsync command to copy them to an external system. You need two terminals for this operation.
- Open terminal one.
- Run the following commands for the etcd backup. In the oc debug node/<node_name> command, use any one control node:
oc debug node/<node_name>
sh-4.15# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
- Run the following command and record the new pod name. It is the source pod, and the backup files reside inside the pod. Do not close terminal one.
oc debug
- Open terminal two and run the following command to copy the file to a local folder:
oc -n <namespace_of_debug_pod> rsync <source_podname_in_above_step>:/home/core/assets/backup/snapshot_.db <local_folder_path>
If required, add the namespace of the debug node pod location.
- Repeat the previous step to copy the other backup file to the external system.
- Close the terminal windows after all the files are copied.
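Before closing the terminals, it is worth confirming that both files actually arrived intact. A minimal sketch; the glob patterns and the "2024" stand-in names are assumptions, since real backup file names carry a timestamp:

```shell
# Succeed only if the folder holds a non-empty snapshot database and a
# non-empty static_kuberesources archive.
check_etcd_backup_copy() {
  dir=$1
  set -- "$dir"/snapshot_*.db
  [ -s "$1" ] || return 1            # no match or empty file -> fail
  set -- "$dir"/static_kuberesources_*.tar.gz
  [ -s "$1" ] || return 1
}

# Demonstration against a scratch directory with stand-in files:
tmp=$(mktemp -d)
printf 'dummy' > "$tmp/snapshot_2024.db"
printf 'dummy' > "$tmp/static_kuberesources_2024.tar.gz"
check_etcd_backup_copy "$tmp" && echo "backup copy looks complete"
```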
- Ensure that you stop the workloads before you shut down the nodes.
- Run the following command to shut down the nodes:
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc debug node/${node} -- chroot /host shutdown -h 1; done
The node that hosts the OpenShift Container Platform user interface and IBM Storage Fusion user interface must be the last node that you power off. For Fusion Data Foundation, first shut down the application nodes, then the Fusion Data Foundation nodes, and finally the OpenShift control plane nodes.
After 3 to 5 minutes, the Red Hat OpenShift Container Platform becomes inaccessible.
This step brings down all the software on the rack. The rack is ready to be powered off.
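The Fusion Data Foundation ordering rule above (applications, then storage, then control plane) can be sketched as a small helper. The "node role" input format and the role names are placeholders; on a live cluster you would derive the roles from node labels:

```shell
# Read "node role" pairs and print the node names in shutdown order:
# application nodes first, storage nodes next, control plane last.
ordered_shutdown_list() {
  awk '
    $2 == "application" { app[++a] = $1 }
    $2 == "storage"     { sto[++s] = $1 }
    $2 == "control"     { ctl[++c] = $1 }
    END {
      for (i = 1; i <= a; i++) print app[i]
      for (i = 1; i <= s; i++) print sto[i]
      for (i = 1; i <= c; i++) print ctl[i]
    }'
}

# Hypothetical inventory:
printf '%s\n' 'master-0 control' 'worker-1 application' 'storage-1 storage' |
  ordered_shutdown_list
```

The resulting list could then feed the shutdown loop one node at a time instead of iterating over all nodes at once.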
- Physically press the power-off button on the nodes and the rack. Note: The switches do not have a shutdown option; they can only be rebooted. When you power off the entire rack (unplugged), the switches shut down automatically. Similarly, when power is restored to the rack, the switches come up automatically.
- Power on the rack.
- Go to the physical node and click the power button to power on all the nodes.
Power on all control nodes. After all control nodes are up, power on the compute nodes. For Fusion Data Foundation, power on the nodes in this sequence:
- Start the Control Plane nodes.
- Start the Fusion Data Foundation nodes.
- Verify that these pods are in a running state:
oc get pods -n openshift-storage
All Fusion Data Foundation pods might not be Running yet, but the rook-ceph-{mds-*,mon-*,mgr-*,osd-*} pods must be running to serve storage. If the Ceph cluster is not healthy, it results in FailedMount errors in infra and application pods.
- Start the infra nodes.
- Start the application nodes.
- After all the nodes are up and the cluster operators are up (except the image registry), run the following commands to ensure that the OpenShift cluster is up along with the IBM Storage Fusion operators:
oc get po -A | grep -v Running | grep -v Completed
oc get co
oc get nodes
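The node check after power-on can likewise be scripted. A sketch that lists any node whose STATUS column in `oc get nodes` output is not exactly "Ready"; the sample rows are illustrative only:

```shell
# Print nodes that are not in the Ready state (column 2), skipping the header.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Illustrative sample of `oc get nodes` output:
sample='NAME        STATUS     ROLES           AGE   VERSION
master-0    Ready      control-plane   90d   v1.28.6
compute-0   NotReady   worker          90d   v1.28.6'

printf '%s\n' "$sample" | not_ready_nodes
```

An empty result means all nodes rejoined the cluster; note that this simple column match also flags transitional states such as Ready,SchedulingDisabled.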
- For Global Data Platform, bring back Scale:
oc project ibm-spectrum-scale-operator
oc scale --replicas=1 deployment ibm-spectrum-scale-controller-manager
Wait a few minutes, and then check the cluster or storage dashboard.
For Fusion Data Foundation, scale up the application deployments.
- Run the following commands to ensure that the storage pods are up:
- Global Data Platform
-
- Switch the namespace to ibm-spectrum-scale:
oc project ibm-spectrum-scale
- Verify whether all pods are in a running state in the ibm-spectrum-scale project:
oc get pods
- To run commands on a node, run the following rsh command:
oc rsh compute-1-ru<x>
- Run the following command to get the state of the GPFS daemon on one or more nodes:
mmgetstate -a
- Switch the project to ibm-spectrum-scale-csi:
oc project ibm-spectrum-scale-csi
- Verify whether all pods are in a running state in the ibm-spectrum-scale-csi project. This might take some time.
oc get pods
- Fusion Data Foundation
- Run the following command to validate that all pods are up in the openshift-{storage,monitoring,logging,image-registry} namespaces and the application namespaces:
oc get pods -n <NAMESPACE>
- Bring back Data Cataloging to a running state.
For the procedure, see Returning Data Cataloging to a running state.