How To
Summary
This article guides you through validation checks to ensure your ODF/FDF cluster is in a healthy state and ready for an upgrade.
Objective
ODF/FDF upgrades usually occur after OpenShift Container Platform (OCP) upgrades, per the supported combinations shown in the Red Hat OpenShift Data Foundation Supportability and Interoperability Guide. OCP upgrades can sometimes leave behind clock skews, pod disruption budget issues, or PVCs still in use, or in some instances the OCP upgrade did not finish before the user proceeded to upgrade the ODF/FDF operator. Performing these checks prior to the upgrade decreases the likelihood of the ODF/FDF Operator upgrade failing.
Environment
Red Hat OpenShift Container Platform (OCP) v4.x
Red Hat OpenShift Data Foundation (ODF) v4.x
IBM Fusion Data Foundation (FDF) v4.x
Steps
Prerequisites:
Once the OCP upgrade is complete, confirm that the cluster operators have been upgraded successfully. Capture the output of:
$ oc get co
to confirm that there are no error messages reported on the cluster operators (also shown in the OCP UI under "Administration" -> "Cluster Settings"). Additionally, allow some time to pass after the masters/workers have rebooted/upgraded, then proceed to upgrade the ODF Operator.
If there are error messages related to cluster operators failing the OCP upgrade, identify which cluster operator it is (the filtering sketch after these steps can help). Match the cluster operator with its project/namespace with:
$ oc projects
then perform the following:
- Switch into the identified project with:
$ oc project <project-name>
- Run:
$ oc get pods
- Identify pods that are having issues, e.g. Terminating, CrashLoopBackOff, Error, ContainerCreating.
- Delete those pods with:
$ oc delete pod <pod-name>
- That process usually fixes the issue. However, keep in mind what has just taken place: the issue that was preventing OCP from upgrading successfully, such as a Pod Disruption Budget (PDB) error or a PVC still in use, is now resolved, and OCP will continue its upgrade path. Please be patient and allow plenty of time post-OCP upgrade, then proceed to the Solution section of this article to perform the ODF Operator pre-checks. If OCP continues to fail the upgrade, please engage Red Hat Support for OCP install/upgrade assistance.
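If it is not immediately obvious which cluster operator or pod is unhealthy, the following filters can help narrow the search. This is a minimal sketch using standard oc, awk, and grep behavior; adjust it to your environment as needed.
List cluster operators that are not Available, are still Progressing, or are Degraded:
$ oc get co | awk 'NR==1 || $3!="True" || $4!="False" || $5!="False"'
List pods in any namespace that are not Running or Completed:
$ oc get pods -A | grep -Ev 'Running|Completed'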
Solution:
Prior to the ODF upgrade, checking Ceph's health is crucial. This can be done as follows:
1. Go to the OCP console, navigate to Home -> Overview, and look for the green check mark next to Storage. That usually indicates HEALTH_OK in Ceph.
2. To double-check Ceph's health (highly recommended), run the following two commands to ensure Ceph is in HEALTH_OK, all PGs are active+clean, and there are no significant clock skews:
a. Ceph Status
$ NAMESPACE=openshift-storage;ROOK_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}');oc exec -it ${ROOK_POD} -n ${NAMESPACE} -- ceph status --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
b. Ceph Time Sync Status
$ NAMESPACE=openshift-storage;ROOK_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}');oc exec -it ${ROOK_POD} -n ${NAMESPACE} -- ceph time-sync-status --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
Example Outputs:
$ ceph status
health: HEALTH_OK <----------------------------------------- CONFIRM HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 3h)
mgr: a(active, since 6h)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
osd: 27 osds: 27 up (since 2h), 27 in (since 111m)
rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
data:
pools: 10 pools, 1136 pgs
objects: 5.50M objects, 3.3 TiB
usage: 9.9 TiB used, 43 TiB / 53 TiB avail
pgs: 1136 active+clean <--------------------------------- CONFIRM ALL PG'S ACTIVE+CLEAN
io:
client: 93 KiB/s rd, 2.0 MiB/s wr, 5 op/s rd, 29 op/s wr
$ ceph time-sync-status
{
"time_skew_status": {
"a": {
"skew": 0, <----------------------------------------- CONFIRM NO SKEW
"latency": 0,
"health": "HEALTH_OK"
},
"b": {
"skew": 0, <----------------------------------------- CONFIRM NO SKEW
"latency": 0.0014192947479820786,
"health": "HEALTH_OK"
},
"c": {
"skew": 0, <----------------------------------------- CONFIRM NO SKEW
"latency": 0.00063846546083029998,
"health": "HEALTH_OK"
}
},
"timechecks": {
"epoch": 104924,
"round": 74,
"round_status": "finished"
}
}
If the above checks are made and everything appears healthy, the likelihood of a successful, issue-free upgrade increases significantly.
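For convenience, the two long one-liners above can be wrapped in a small shell helper so that any read-only ceph command can be run the same way. This is only a sketch based on the exec pattern shown above; the rook_ceph function name is arbitrary and not part of the product:
$ rook_ceph() { NS=openshift-storage; POD=$(oc -n ${NS} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}'); oc exec -it ${POD} -n ${NS} -- ceph "$@" --cluster=${NS} --conf=/var/lib/rook/${NS}/${NS}.config --keyring=/var/lib/rook/${NS}/client.admin.keyring; }
$ rook_ceph status
$ rook_ceph time-sync-status
$ rook_ceph health detail          <--- shows additional detail if Ceph reports HEALTH_WARN or HEALTH_ERR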
If significant clock skews are present, one solution that can help is described in the article "How to resolve MON clock skew issue". Otherwise, please engage IBM Support for assistance and quote this solution.
That solution involves deploying a debug pod on the ODF node(s) where the Ceph monitor pod(s) reporting the skews are scheduled. Run:
$ oc get pods -n openshift-storage -o wide
to identify the monitor with the clock skew and the node it is scheduled on.
Additionally, you may get a "501 Not authorized" error message when running the chronyc -a makestep command. This is normal; just ignore the message. The forced makestep did in fact occur. You can confirm this by re-running the ceph time-sync-status command (step 2b) to validate that no significant clock skews are still present.
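As a rough sketch of that procedure (assuming chrony is the node's time service, which is the default on RHCOS), the debug pod and chronyc commands look like the following; replace <node-name> with the node hosting the affected monitor:
$ oc debug node/<node-name>
Then, inside the debug pod:
chroot /host
chronyc tracking          <--- show the current offset from the configured time source
chronyc sources -v        <--- list the configured time sources and their reachability
chronyc -a makestep       <--- force an immediate clock step (may return the "501 Not authorized" message noted above)
exit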
For vSphere environments: please validate your current OCP and vSphere version compatibilities and infrastructure requirements prior to the upgrade. It is also good to check for other potential configuration issues with the "vSphere Problem Detector Operator". While these checks are between vSphere and OCP, improper configurations there can negatively impact an ODF upgrade as well as other applications in the environment.
Lastly, check the ODF Operator "Conditions" prior to the upgrade.
$ oc get storagecluster -n openshift-storage
Validate PHASE: Ready
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 94m Ready <--- Verify 2023-04-24T14:30:10Z 4.10.0
or
Navigate to the OCP Console GUI. Click on "Installed Operators" -> "OpenShift Data Foundation". Click the "Subscriptions" tab. Scroll down and review the subscriptions to confirm AllCatalogSourcesHealthy.
NOTE: If the output of
$ oc get storagecluster -n openshift-storage
is not consistent with PHASE: Ready, or the conditions presented under the ODF Operator "Subscriptions" tab are anything other than AllCatalogSourcesHealthy, please open a case with Red Hat Support to address these issues prior to upgrading ODF.
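The same conditions can usually be verified from the CLI through the standard OLM and StorageCluster resources. The following commands are a sketch only; the exact subscription and CSV names vary by installation:
$ oc get storagecluster -n openshift-storage -o jsonpath='{.items[0].status.phase}{"\n"}'          <--- should print Ready
$ oc get csv -n openshift-storage          <--- installed ClusterServiceVersions should report PHASE: Succeeded
$ oc get subscription -n openshift-storage
$ oc get subscription <odf-subscription-name> -n openshift-storage -o jsonpath='{.status.conditions}{"\n"}'          <--- the CatalogSourcesUnhealthy condition should show status "False" with reason "AllCatalogSourcesHealthy"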
Additional Information
Diagnostic Steps Examples:
$ ceph time-sync-status
{
"time_skew_status": {
"z": {
"skew": 0,
"latency": 0,
"health": "HEALTH_OK"
},
"ac": {
"skew": 0,
"latency": 0.00051476226577585975,
"health": "HEALTH_OK"
},
"ad": {
"skew": -5.8758086181640636e-05, <-------------- skew
"latency": 0.00032825447724227754,
"health": "HEALTH_OK"
}
},
"timechecks": {
"epoch": 2042,
"round": 286,
"round_status": "finished"
}
}
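Whether a reported skew is significant depends on Ceph's configured threshold: Ceph raises the MON_CLOCK_SKEW health warning only when a monitor's skew exceeds mon_clock_drift_allowed (0.05 seconds by default), so the skew shown above (about -0.000059 seconds) is far below that threshold. As a quick sketch, the threshold and any active warnings can be checked with the same exec pattern (or the rook_ceph helper sketched earlier):
$ rook_ceph config get mon mon_clock_drift_allowed          <--- warning threshold in seconds (0.05 by default)
$ rook_ceph health detail                                   <--- lists MON_CLOCK_SKEW details if the threshold is exceeded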
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.10 True False False 26h
baremetal 4.12.10 True False False 325d
cloud-controller-manager 4.12.10 True False False 325d
cloud-credential 4.12.10 True False False 325d
cluster-autoscaler 4.12.10 True False False 325d
config-operator 4.12.10 True False False 325d
console 4.12.10 True False False 40d
control-plane-machine-set 4.12.10 True False False 2d1h
csi-snapshot-controller 4.12.10 True False False 154d
dns 4.12.10 True True False 325d DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9."
etcd 4.12.10 True False False 325d
image-registry 4.12.10 False True True 29h NodeCADaemonAvailable: The daemon set node-ca has available replicas...
ingress 4.12.10 True False False 26h
insights 4.12.10 True False False 2d1h
kube-apiserver 4.12.10 True False False 325d
kube-controller-manager 4.12.10 True False False 325d
kube-scheduler 4.12.10 True False False 325d
kube-storage-version-migrator 4.12.10 True False False 29h
machine-api 4.12.10 True False False 325d
machine-approver 4.12.10 True False False 325d
machine-config 4.12.10 True False True 23h Failed to resync 4.12.10 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 6, ready 1, updated: 1, unavailable: 1)]
marketplace 4.12.10 True False False 325d
monitoring 4.12.10 True False False 30h
network 4.12.10 True True True 325d DaemonSet "/openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2023-04-13T23:37:18Z
node-tuning 4.12.10 True False False 2d
openshift-apiserver 4.12.10 True False False 60d
openshift-controller-manager 4.12.10 True False False 2d
openshift-samples 4.12.10 True False False 2d
operator-lifecycle-manager 4.12.10 True False False 325d
operator-lifecycle-manager-catalog 4.12.10 True False False 325d
operator-lifecycle-manager-packageserver 4.12.10 True False False 239d
service-ca 4.12.10 True False False 325d
storage 4.12.10 True False False 325d
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB66","label":"Technology Lifecycle Services"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SSSEWFV","label":"Storage Fusion Data Foundation"},"ARM Category":[{"code":"a8m3p000000UoIUAA0","label":"Documentation"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions"}]
Product Synonym
FDF;ODF
Document Information
Modified date:
22 December 2025
UID
ibm17039800