Use this information to replace an operational or failed storage device on VMware infrastructure.
Before you begin
- Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
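If you prefer the command line, the same health check can be made from the CephCluster resource (a sketch; a HEALTH value of HEALTH_OK indicates healthy, resilient data, but confirm the resource details in your own cluster):
oc get cephcluster -n openshift-storage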
About this task
Create a new Persistent Volume Claim (PVC) on a new volume and remove the old object storage device (OSD) when one or more virtual machine disks (VMDKs) need to be replaced in Fusion Data Foundation that is deployed dynamically on VMware infrastructure.
Procedure
- Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
Example output:
rook-ceph-osd-0-6d77d6c7c6-m8xj6 0/1 CrashLoopBackOff 0 24h 10.129.0.16 compute-2 <none> <none>
rook-ceph-osd-1-85d99fb95f-2svc7 1/1 Running 0 24h 10.128.2.24 compute-0 <none> <none>
rook-ceph-osd-2-6c66cdb977-jp542 1/1 Running 0 24h 10.130.0.18 compute-1 <none> <none>
In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.
Note: If the OSD to be replaced is healthy, the status of the pod will be Running.
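Before scaling it down, you can also check the status of a single OSD pod by filtering on the ceph-osd-id label, the same label used in the verification step later in this procedure (a sketch; replace 0 with the ID of the OSD you are replacing):
oc get -n openshift-storage pods -l ceph-osd-id=0 -o jsonpath='{.items[0].status.phase}'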
- Scale down the OSD deployment for the OSD to be replaced.
Each time you want to replace the OSD, update the osd_id_to_remove parameter with the OSD ID, and repeat this step.
$ osd_id_to_remove=0
oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
deployment.extensions/rook-ceph-osd-0 scaled
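If more than one OSD has failed, the same scale-down can be run in a loop over the failed IDs (a sketch; the IDs 0 and 3 are placeholders for your own failed OSD IDs):
for osd_id_to_remove in 0 3; do
  oc scale -n openshift-storage deployment "rook-ceph-osd-${osd_id_to_remove}" --replicas=0
done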
- Verify that the rook-ceph-osd pod is terminated.
oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
Example output:
No resources found.
Important: If the rook-ceph-osd pod is in the terminating state, use the force option to delete the pod.
oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0
Example output:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
- Remove the old OSD from the cluster so that you can add a new OSD.
- Delete any old ocs-osd-removal jobs.
oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
- Navigate to the openshift-storage project.
oc project openshift-storage
- Remove the old OSD from the cluster.
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed (see the example after the following warning).
Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
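For example, on a cluster with only three OSDs, the same template is processed with forced removal (this reuses the template and parameters shown above, so the same caution about osd_id_to_remove applies):
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -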
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
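Instead of polling, you can block until the job completes (a sketch using oc wait; the 10-minute timeout is an arbitrary choice):
oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=600s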
- Ensure that the OSD removal is completed.
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job pod fails and the pod is not in the expected Completed state, check the pod logs for further debugging.
For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
- If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective Fusion Data Foundation nodes.
- Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'
2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"
- For each of the previously identified nodes, do the following:
- Create a debug pod and chroot to the host on the storage node, where <node name> is the name of the node.
oc debug node/<node name>
$ chroot /host
- Find a relevant device name based on the PVC names identified in the previous step, where <pvc name> is the name of the PVC.
dmsetup ls | grep <pvc name>
Example output:
ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
- Remove the mapped device.
$ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt
Important: If the above command gets stuck due to insufficient privileges, run the following commands:
- Press CTRL+Z to exit the above command.
- Find the PID of the process which was stuck.
$ ps -ef | grep crypt
- Terminate the process using the kill command, where <PID> is the process ID.
kill -9 <PID>
- Verify that the device name is removed.
$ dmsetup ls
- Delete the ocs-osd-removal job.
oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Note: When using an external key management system (KMS) with data encryption, the old OSD
encryption key can be removed from the Vault server as it is now an orphan key.
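The exact removal command depends on how the Vault backend was configured at deployment time. A minimal sketch, assuming a KV secrets engine mounted at secret/ and a key named after the removed device set PVC (both assumptions; verify the backend path and key name in your Vault server first):
vault kv delete secret/<key name>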
What to do next
- Verify that there is a new OSD running.
oc get -n openshift-storage pods -l app=rook-ceph-osd
Example output:
rook-ceph-osd-0-5f7f4747d4-snshw 1/1 Running 0 4m47s
rook-ceph-osd-1-85d99fb95f-2svc7 1/1 Running 0 1d20h
rook-ceph-osd-2-6c66cdb977-jp542 1/1 Running 0 1d20h
- Verify that there is a new PVC created which is in the Bound state.
oc get -n openshift-storage pvc
Example output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ocs-deviceset-0-0-2s6w4 Bound pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc 512Gi RWO thin 5m
ocs-deviceset-1-0-q8fwh Bound pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f 512Gi RWO thin 1d20h
ocs-deviceset-2-0-9v8lq Bound pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291 512Gi RWO thin 1d20h
- If cluster-wide encryption is enabled on the cluster, verify that the new
OSD devices are encrypted.
- Identify the nodes where the new OSD pods are running, where <OSD-pod-name> is the name of the OSD pod.
oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>
For example:
oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm
Example output:
NODE
compute-1
- For each of the nodes identified in the previous step, do the following:
- Create a debug pod and open a chroot environment for the selected host(s), where <node name> is the name of the node.
oc debug node/<node name>
$ chroot /host
- Check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
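To narrow the listing to encrypted mappings only, the lsblk output can be filtered on the device type (a sketch):
$ lsblk -o NAME,TYPE | grep crypt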
- Log in to OpenShift Web Console and view the storage dashboard.