IBM Support

OpenShift Data Foundation and Fusion Data Foundation (ODF/FDF) Operator Upgrade Pre-Checks

How To


Summary

This article guides you through validation checks to confirm that your ODF/FDF cluster is in a healthy state and ready for an upgrade.

Objective

ODF/FDF upgrades usually occur after OpenShift Container Platform (OCP) upgrades, as shown in the Red Hat OpenShift Data Foundation Supportability and Interoperability Guide. OCP upgrades can sometimes cause clock skews, pod disruption budget issues, or PVCs still in use; in some instances, the OCP upgrade did not finish before the user proceeded to upgrade the ODF/FDF operator. Performing these checks prior to the upgrade decreases the likelihood of the ODF/FDF Operator upgrade failing.

Environment

Red Hat OpenShift Container Platform (OCP) v4.x

Red Hat OpenShift Data Foundation (ODF) v4.x

IBM Fusion Data Foundation (FDF) v4.x
 

Steps

Prerequisites:
 
Once the OCP upgrade is complete, confirm that the cluster operators have been upgraded successfully. Capture the output of $ oc get co to confirm that there are no error messages reported on the cluster operators (also shown in the OCP UI under "Administration" -> "Cluster Settings"). Additionally, allow some time to pass after the masters/workers have rebooted/upgraded, then proceed to upgrade the ODF Operator.
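As a convenience, a filter such as the following can surface only the cluster operators that need attention. This is a sketch that assumes the default $ oc get co column order (AVAILABLE, PROGRESSING, DEGRADED in columns 3-5):
 
$ oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'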
If there are error messages indicating that cluster operators failed the OCP upgrade, identify which cluster operator is affected. Match the cluster operator to its project/namespace with: $ oc projects, then perform the following:
  1. Switch into the identified project with: $ oc project <project-name>
  2. Run $ oc get pods
  3. Identify pods that are having issues, e.g., Terminating, CrashLoopBackOff, Error, ContainerCreating (see the sketch after this list)
  4. Delete the pods with: $ oc delete pod <pod-name>
  5. That process usually fixes the issue; however, keep in mind what has just taken place: the issue that was preventing OCP from upgrading successfully, such as a Pod Disruption Budget (PDB) error or a PVC still in use, is now resolved, and OCP will continue its upgrade path. Please be patient and allow plenty of time post-OCP upgrade, then proceed to the Solution section of this article to perform the ODF Operator pre-checks. If OCP continues to fail the upgrade, please contact the Shift Install Upgrade team.
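As referenced in step 3, a quick way to list pods in a problem state across all namespaces is shown below. This is a sketch; the grep pattern simply excludes the healthy Running and Completed states:
 
$ oc get pods -A --no-headers | grep -Ev 'Running|Completed'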
Solution:

Prior to the ODF upgrade, checking Ceph's health is crucial. It can be done with the following:
  1. Go to the OCP console, navigate to Home -> Overview, and look for the green check mark next to Storage. That usually indicates HEALTH_OK in Ceph.
  2. To double-check Ceph's health (highly recommended), run the two following commands to ensure Ceph is in HEALTH_OK, all PGs are active+clean, and there are no significant clock skews:
Ceph Status
 
$ NAMESPACE=openshift-storage;ROOK_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}');oc exec -it ${ROOK_POD} -n ${NAMESPACE} -- ceph status --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
Ceph Time Sync Status
 
$ NAMESPACE=openshift-storage;ROOK_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}');oc exec -it ${ROOK_POD} -n ${NAMESPACE} -- ceph time-sync-status --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
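Alternatively, if the rook-ceph-tools (toolbox) pod is enabled in the cluster, the same checks can be run from it directly. This is a sketch that assumes the default app=rook-ceph-tools label and the openshift-storage namespace:
 
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
$ oc -n openshift-storage exec -it ${TOOLS_POD} -- ceph status
$ oc -n openshift-storage exec -it ${TOOLS_POD} -- ceph time-sync-status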
Example Outputs:
 
$ ceph status

    health: HEALTH_OK   <----------------------------------------- CONFIRM HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 6h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 27 osds: 27 up (since 2h), 27 in (since 111m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  data:
    pools:   10 pools, 1136 pgs
    objects: 5.50M objects, 3.3 TiB
    usage:   9.9 TiB used, 43 TiB / 53 TiB avail
    pgs:     1136 active+clean   <--------------------------------- CONFIRM ALL PGs ACTIVE+CLEAN

  io:
    client:   93 KiB/s rd, 2.0 MiB/s wr, 5 op/s rd, 29 op/s wr


$ ceph time-sync-status
{
    "time_skew_status": {
        "a": {
            "skew": 0,   <----------------------------------------- CONFIRM NO SKEW
            "latency": 0,
            "health": "HEALTH_OK"
        },
        "b": {
            "skew": 0,  <----------------------------------------- CONFIRM NO SKEW
            "latency": 0.0014192947479820786,
            "health": "HEALTH_OK"
        },
        "c": {
            "skew": 0,  <----------------------------------------- CONFIRM NO SKEW
            "latency": 0.00063846546083029998,
            "health": "HEALTH_OK"
        }
    },
    "timechecks": {
        "epoch": 104924,
        "round": 74,
        "round_status": "finished"
    }
}
If the above checks pass and everything appears healthy, then the likelihood of upgrading successfully with no issues is significantly higher.

If significant clock skews are present, one solution that can help resolve the issue is the How to resolve MON clock skew issue article. Otherwise, please engage IBM Support for assistance and quote this solution.

That solution involves deploying a debug pod on the ODF node(s) where the Ceph monitor pod(s) reporting the skew are scheduled. Run $ oc get pods -n openshift-storage -o wide to identify the monitor with the clock skew and the node it is scheduled on, as sketched below.
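A sketch of the typical sequence follows. The node name is a placeholder, and the chronyc commands assume the node uses chrony for time synchronization:
 
$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# chronyc tracking       # review the current time offset
sh-4.4# chronyc -a makestep    # force an immediate time correction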

Additionally, you may get a 501 Not authorized error message when running the chronyc -a makestep command. This is normal; the message can be ignored, as the forced makestep did in fact occur.

You can confirm by re-running the $ ceph time-sync-status command (step 2b) to validate that no significant clock skews are still present.
 
For vSphere environments - Please validate your current OCP and vSphere version compatibility and infrastructure requirements prior to the upgrade. In addition, it is also a good idea to check for other potential configuration issues with the "vSphere Problem Detector Operator". While these checks are between vSphere and OCP, improper configurations there can ultimately have a negative impact on an ODF upgrade as well as on other applications in this environment.
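One way to review what the vSphere Problem Detector has found is to check its operator logs. This is a sketch; the deployment name and namespace shown are the defaults on recent OCP releases and should be verified in your cluster:
 
$ oc logs deployment/vsphere-problem-detector-operator -n openshift-cluster-storage-operator | tail -n 50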
 
Lastly, check the ODF Operator "Conditions" prior to the upgrade.
 
$ oc get storagecluster -n openshift-storage
 
Validate PHASE:Ready
 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   94m   Ready <--- Verify  2023-04-24T14:30:10Z   4.10.0
or
 
Navigate to the OCP Console GUI. Click on "Installed Operators" -> "OpenShift Data Foundation". Click the "Subscriptions" tab. Scroll down and review the subscriptions to confirm AllCatalogSourcesHealthy.
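The same subscription and catalog source health can be checked from the CLI. This is a sketch; substitute the subscription name returned by the first command:
 
$ oc get subscription -n openshift-storage
$ oc describe subscription <subscription-name> -n openshift-storage | grep -A 10 "Conditions:"
$ oc get catalogsource -n openshift-marketplace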
 
NOTE: If the output of $ oc get storagecluster -n openshift-storage does not show PHASE: Ready, or the conditions presented under the ODF Operator "Subscriptions" tab are anything other than AllCatalogSourcesHealthy, please open a case with Red Hat Support to address these issues prior to upgrading ODF.
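To review the detailed StorageCluster conditions from the CLI, something like the following can be used. This is a sketch that assumes the default ocs-storagecluster name; substitute the name shown by $ oc get storagecluster if it differs:
 
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
$ oc describe storagecluster ocs-storagecluster -n openshift-storage | grep -A 30 "Conditions:"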

Additional Information

Diagnostic Steps Examples:
 
$ ceph time-sync-status
{
    "time_skew_status": {
        "z": {
            "skew": 0,
            "latency": 0,
            "health": "HEALTH_OK"
        },
        "ac": {
            "skew": 0,
            "latency": 0.00051476226577585975,
            "health": "HEALTH_OK"
        },
        "ad": {
            "skew": -5.8758086181640636e-05, <-------------- skew
            "latency": 0.00032825447724227754,
            "health": "HEALTH_OK"
        }
    },
    "timechecks": {
        "epoch": 2042,
        "round": 286,
        "round_status": "finished"
    }
}


$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.10   True        False         False      26h
baremetal                                  4.12.10   True        False         False      325d
cloud-controller-manager                   4.12.10   True        False         False      325d
cloud-credential                           4.12.10   True        False         False      325d
cluster-autoscaler                         4.12.10   True        False         False      325d
config-operator                            4.12.10   True        False         False      325d
console                                    4.12.10   True        False         False      40d
control-plane-machine-set                  4.12.10   True        False         False      2d1h
csi-snapshot-controller                    4.12.10   True        False         False      154d
dns                                        4.12.10   True        True          False      325d    DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9."
etcd                                       4.12.10   True        False         False      325d
image-registry                             4.12.10   False       True          True       29h     NodeCADaemonAvailable: The daemon set node-ca has available replicas...
ingress                                    4.12.10   True        False         False      26h
insights                                   4.12.10   True        False         False      2d1h
kube-apiserver                             4.12.10   True        False         False      325d
kube-controller-manager                    4.12.10   True        False         False      325d
kube-scheduler                             4.12.10   True        False         False      325d
kube-storage-version-migrator              4.12.10   True        False         False      29h
machine-api                                4.12.10   True        False         False      325d
machine-approver                           4.12.10   True        False         False      325d
machine-config                             4.12.10   True        False         True       23h     Failed to resync 4.12.10 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 6, ready 1, updated: 1, unavailable: 1)]
marketplace                                4.12.10   True        False         False      325d
monitoring                                 4.12.10   True        False         False      30h
network                                    4.12.10   True        True          True       325d    DaemonSet "/openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2023-04-13T23:37:18Z
node-tuning                                4.12.10   True        False         False      2d
openshift-apiserver                        4.12.10   True        False         False      60d
openshift-controller-manager               4.12.10   True        False         False      2d
openshift-samples                          4.12.10   True        False         False      2d
operator-lifecycle-manager                 4.12.10   True        False         False      325d
operator-lifecycle-manager-catalog         4.12.10   True        False         False      325d
operator-lifecycle-manager-packageserver   4.12.10   True        False         False      239d
service-ca                                 4.12.10   True        False         False      325d
storage                                    4.12.10   True        False         False      325d
 
 

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB66","label":"Technology Lifecycle Services"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SSSEWFV","label":"Storage Fusion Data Foundation"},"ARM Category":[{"code":"a8m3p000000UoIUAA0","label":"Documentation"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions"}]

Product Synonym

FDF;ODF

Document Information

Modified date:
22 December 2025

UID

ibm17039800