IBM Support

How to migrate from VSAN backed OSDs to LSO Passthrough backed OSDs in an ODF cluster?

How To


Summary

This document covers the steps to migrate OSDs and Monitors from VSAN to Passthrough devices.

Environment

Red Hat OpenShift Data Foundation 4.x

Steps

An ODF cluster on VMware supports multiple storage classes, such as vSAN or VMFS datastores via the vsphere-volume provisioner, and VMDK, RDM, or DirectPath storage devices via the Local Storage Operator. However, these storage backends for OSDs may not be suitable for throughput-sensitive workloads that write with the dsync flag. Based on feedback from our Performance Engineering team, we've concluded that adding and consuming Passthrough devices in such scenarios helps achieve the required throughput and latency.
Step 1 : Adding NVMe drives to the cluster 

Make sure your cluster is healthy. To migrate to passthrough devices, you must add NVMe/PCIe devices to your worker node VMs.

Note:

Ensure that worker node VMs are distributed across multiple ESXi hosts within the vCenter cluster for high availability and data durability.
Cordon and drain the first worker node, shut it down, then add the NVMe controller, PCIe device, and HDD (for the Monitor) as described below:
  • Identify the node
  • Mark the node as unschedulable
     
    $ oc adm cordon <node_name>
  • Drain the node
     
    $ oc adm drain <node_name> --force --delete-emptydir-data --ignore-daemonsets
  • NOTE: This activity might take 5 - 10 minutes. Ceph warnings during this period are temporary and resolve automatically when the node returns to service.
     
  • Add NVMe controller, PCIe device, and HDD (for Monitors) as shown below
     
  • Repeat these steps for all nodes that host or may in the future host OSD and Monitor pods.
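The per-node maintenance flow above can be sketched as a pair of shell helpers. This is a sketch, not part of the official procedure: the node name is a placeholder, `oc` must already be logged in to the cluster, and the functions are only defined here, not run.

```shell
#!/usr/bin/env bash
# Sketch of the per-node maintenance flow described above.
# NODE is a placeholder; substitute your worker node's name.
NODE="worker-0.example.com"

# Make the node unschedulable and evict its pods before shutting the VM down.
drain_node() {
  oc adm cordon "$NODE"
  oc adm drain "$NODE" --force --delete-emptydir-data --ignore-daemonsets
}

# After the VM is powered back on with the new devices, return it to service.
restore_node() {
  oc adm uncordon "$NODE"
}
```

After `restore_node` completes, wait for Ceph health to settle before repeating the sequence on the next node.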

Preparing the drives

The NVMe devices must not be used for any other purpose. The drives should appear as in Figure 1: Attached and Not Consumed.

Figure 1: The NVMe device is Attached and Not Consumed

Select the available NVMe device and note the multipath path. In this example it shows:

Path Selection Policy: Fixed (VMware) - Preferred Path (vmhba2:C0:T0:L0)

Ensure the SSH service is available. In the host configuration screen, go to System → Services, find the SSH service in the list, and ensure it is Running.



Figure 2: The SSH service must be running

Connect to the vSphere host via SSH as the root user and with the password you set during installation. Once connected, execute:

# lspci | grep NVMe

0000:af:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba1]

0000:b0:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba2]

0000:b1:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba3]

0000:b2:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba4]

0000:d8:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba5]

0000:d9:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba6]

Identify the device you noted earlier (in our case vmhba2) and note its PCI location (the first field on the line; in this case 0000:b0:00.0).

Go back to the vCenter UI and, still in the host configuration, scroll to “Hardware” → “PCI Devices”. In that view, click “Configure Passthrough”. In the list, find the disk you noted earlier by its PCI path.


Figure 3: Configure PCI passthrough for NVMe disk
 

Afterwards, the new device is listed as Available (pending) for passthrough, and you are prompted to restart the hypervisor. Reboot the hypervisor.


Figure 4: NVMe has been added to passthrough devices, but hypervisor has not yet been rebooted

Figure 5: NVMe device is Available after Hypervisor reboot
 

After the hypervisor has been rebooted, the NVMe device should be available, as shown in Figure 5.

Adding the device to the VM
 

Now that the NVMe device is prepared, we must attach it to the VM. For this, the VM must be powered down. Once the VM is off, open the VM settings and add these items:

  • Add an NVMe controller - This is optional, but speeds up storage requests in the VM
  • Add a PCIe device - the VM must be scheduled on the host where your PCI device is present
  • Add an HDD device. The default 50 GB capacity is acceptable, but 100 GB is suggested. This will be used for the Ceph Monitor filesystem
Afterwards, your VM settings should look similar to Figure 6.

Figure 6: VM settings after NVMe has been added
 

Expand the new PCIe / NVMe device as in Figure 6 and click the “Reserve all memory” button. Close the settings and power on the VM.

You can verify that the NVMe device has been successfully added by running lsblk on the VM.

Step 2 : Verify that the devices were added to the worker nodes:

Execute “lsblk” on all worker nodes to verify device paths for the new NVMe devices and HDDs

$ lsblk
NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0   50G  0 disk
nvme0n1                      259:0    0  1.5T  0 disk
Step 3 : Wipe existing NVMe devices if they are not in use. This is not required for new devices, as they are already clean.
 

To clean the NVMe drives, run sgdisk --zap-all /dev/nvmeXnX followed by wipefs -a /dev/nvmeXnX
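The two wipe commands can be combined into one helper. A sketch, assuming root access on the node (for example via oc debug node/<node_name>); the device path is a placeholder, and the function is only defined here, not executed.

```shell
# Sketch: wipe a previously used NVMe device so LSO can consume it.
# DEV is a placeholder; double-check the path with lsblk before wiping.
DEV="/dev/nvme0n1"

wipe_device() {
  sgdisk --zap-all "$DEV"   # destroy the GPT/MBR partition tables
  wipefs -a "$DEV"          # remove any remaining filesystem signatures
}
```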

Step 4: Create a new project with “oc new-project local-storage” and deploy the LSO operator into that namespace via the UI
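If you prefer the CLI over the UI, the Local Storage Operator can also be subscribed from the redhat-operators catalog. This is a sketch, not the documented path: the “stable” channel name is an assumption and should be verified against your OpenShift release; the function is defined only, not run.

```shell
# Sketch: install the Local Storage Operator via the CLI instead of the UI.
# The "stable" channel is an assumption; verify it for your OCP release.
install_lso() {
  oc new-project local-storage
  cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: local-storage
spec:
  targetNamespaces:
  - local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: local-storage
spec:
  channel: stable
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
}
```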

Step 5: Create local-block and local-fs storage classes using the devices from step 2

% cat <<EOF | oc create -n local-storage -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
        - key: cluster.ocs.openshift.io/openshift-storage
          operator: Exists
  storageClassDevices:
    - storageClassName: local-block
      volumeMode: Block
      devicePaths:
        - /dev/nvme0n1
EOF
% cat <<EOF | oc create -n local-storage -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-fs
  namespace: local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
        - key: cluster.ocs.openshift.io/openshift-storage
          operator: Exists
  storageClassDevices:
    - storageClassName: local-fs
      fsType: xfs
      volumeMode: Filesystem
      devicePaths:
        - /dev/sdb
EOF

The oc get pv command should show newly created persistent volumes using the “local-fs” and “local-block” storage classes

% oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                                            STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE
local-pv-13cf69d7                          50Gi       RWO            Delete           Available                                                                    local-fs                      <unset>                          91s
local-pv-4f6a02e6                          1490Gi     RWO            Delete           Available                                                                    local-block                   <unset>                          2m8s
local-pv-955b562b                          50Gi       RWO            Delete           Available                                                                    local-fs                      <unset>                          91s
local-pv-a63109ce                          50Gi       RWO            Delete           Available                                                                    local-fs                      <unset>                          91s
local-pv-dd0166a3                          1490Gi     RWO            Delete           Available                                                                    local-block                   <unset>                          2m9s
local-pv-e3f128be                          1490Gi     RWO            Delete           Available                                                                    local-block  
Step 6 : Adding OSDs that use the local-block storage class

Scale down the rook-ceph-operator deployment and edit the storagecluster definition to add three OSDs with local-block storage

# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage

Edit the storagecluster definition to create a new entry under storageDeviceSets:

# oc edit storagecluster -n openshift-storage
    - count: 1
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: "1"
          storageClassName: local-block
          volumeMode: Block
        status: {}
      name: ocs-deviceset-local
      placement: {}
      portable: false
      preparePlacement: {}
      replica: 3
      resources: {}
After updating the storagecluster definition, scale up the rook-ceph-operator pod. You will see new OSDs created that use the local-block storage class.
# oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage

Note: If OSD prepare jobs fail to create new OSDs with the local-block storage class, scale down the rook-ceph-operator pod, delete the PVCs of the local-block storage class, delete all jobs from the openshift-storage namespace, then wipe the devices with sgdisk --zap-all. This should resolve the issue when you scale the rook-ceph-operator pod back up.
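The recovery steps in the note above can be sketched as one shell helper. A hedged sketch: verify each PVC before deleting it, re-wipe the devices on every affected node, and treat the function as illustrative, not authoritative; it is only defined here, not run.

```shell
# Sketch of the recovery flow when OSD prepare jobs fail.
recover_failed_prepare() {
  # Stop the operator so it does not recreate resources mid-cleanup.
  oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage

  # Delete only the PVCs bound to the local-block storage class.
  oc get pvc -n openshift-storage -o name | while read -r pvc; do
    sc=$(oc get "$pvc" -n openshift-storage \
         -o jsonpath='{.spec.storageClassName}')
    [ "$sc" = "local-block" ] && oc delete "$pvc" -n openshift-storage
  done

  # Remove stale jobs; wipe devices on each node (sgdisk --zap-all
  # /dev/nvmeXnX) before scaling the operator back up.
  oc delete jobs --all -n openshift-storage
  oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
}
```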

Step 7: Removal of OSDs that use the default storage class, in this example thin-csi-odf

Once the new OSDs built on the local-block storage class are up and in, and rebalancing is complete, remove the old OSDs from thin-csi-odf one by one. Once the old OSDs are removed, remove the thin-csi-odf related content from the storagecluster CR. Note that after removing one OSD it is CRUCIAL to wait for rebalancing (backfill/recovery) to complete before proceeding to remove the next old OSD. Do not remove anything if the cluster shows any PGs backfilling, recovering, incomplete, undersized, or down, or if any of the new OSDs are not up and in.
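The wait-then-remove loop for one old OSD can be sketched as follows. Assumptions to note: the rook-ceph-tools toolbox deployment is present in openshift-storage, the ocs-osd-removal template shipped with ODF is used, and the OSD ID is a placeholder. The functions are only defined here, not run.

```shell
# Sketch: remove one old OSD only after the cluster is fully clean.
OSD_ID=0   # placeholder: the ID of one thin-csi-odf backed OSD

# Poll Ceph via the toolbox until health is HEALTH_OK
# (no PGs backfilling, recovering, incomplete, undersized, or down).
wait_for_clean() {
  until oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health \
      | grep -q HEALTH_OK; do
    sleep 30
  done
}

# Scale down the OSD deployment, then run the ODF removal job template.
remove_osd() {
  oc scale deployment "rook-ceph-osd-${OSD_ID}" --replicas=0 \
    -n openshift-storage
  oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS="$OSD_ID" | oc create -n openshift-storage -f -
}
```

Call `wait_for_clean` both before and after `remove_osd` for each old OSD, so rebalancing fully completes between removals.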

Step 8 : Remove the thin-csi-odf reference from the storagecluster CR

Once all thin-csi-odf OSDs are removed, modify the StorageCluster CR to remove thin-csi-odf

The StorageCluster after the removal of thin-csi-odf should look like the following:

# oc edit storagecluster -n openshift-storage
…
  storageDeviceSets:
  - config: {}
    count: 1 
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources: 
          requests:
            storage: "1"
        storageClassName: local-block
        volumeMode: Block
      status: {}
    name: ocs-deviceset-local
    placement: {}
    portable: true
    preparePlacement: {}
    replica: 3
    resources: {}
status:
 

Step 9 : Change the Monitor storage from the thin-csi-odf storage class to local-fs

Follow the procedure in this document: https://access.redhat.com/solutions/6409071


Step 10 : If old OSDs are still present in the output of ceph osd df, remove them manually using the procedure documented here: 

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB66","label":"Technology Lifecycle Services"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SSSEWFV","label":"Storage Fusion Data Foundation"},"ARM Category":[{"code":"a8m3p000000UoIPAA0","label":"Support Reference Guide"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
16 July 2025

UID

ibm17174113