Fusion Data Foundation service error scenarios
Use this troubleshooting information to identify problems and workarounds when you install or configure the Fusion Data Foundation service.
- Fusion Data Foundation is in a degraded state
- Red Hat OpenShift Data Foundation storage node failure
- Red Hat OpenShift Data Foundation Object Storage Device (OSD) failure
- Admission webhook warning
- Local storage operator unable to find candidate storage nodes
- Fusion Data Foundation capacity cannot be loaded
- Fusion Data Foundation cluster fails due to pending StorageClusterPreparing stage
- Data Foundation service configuration is incomplete
Fusion Data Foundation is in a degraded state
- Problem statement
- The Fusion Data Foundation storage cluster is in a degraded state.
- Cause
- This issue might occur for many reasons. Follow these steps to find the root cause:
- Log in to the OpenShift Container Platform console.
- Go to .
- Select the openshift-storage project from the Projects drop-down list.
- Click the CephCluster CR instance and go to the YAML tab.
- Check the status field to find the exact failure reason. If the issue occurs because of MDS trimming, follow the steps in the Resolution section to resolve the issue. A CLI alternative for reading the same status follows these steps.
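As an alternative to the console, the same status field can be read from the CLI. The following is a minimal sketch that assumes the default CephCluster name ocs-storagecluster-cephcluster and the openshift-storage namespace:
oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o yaml
To print only the overall Ceph health value, assuming the status.ceph.health field is populated:
oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.status.ceph.health}'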
- Resolution
- Log in to the OpenShift Container Platform console.
- Go to .
- Select the openshift-storage project from the Projects drop-down list.
- Select the rook-ceph-operator pod from the list.
- Go to the Terminal tab, and run the following commands:
export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
ceph -s
Example output:
  cluster:
    id:     0e7a1d51-ad98-4b33-ab71-4577de1b3e7d
    health: HEALTH_WARN
            2 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum e,f,g (age 5d)
    mgr: a(active, since 14h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 6d), 5 in (since 6d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 265 pgs
    objects: 1.54M objects, 1.3 TiB
    usage:   4.3 TiB used, 31 TiB / 35 TiB avail
    pgs:     265 active+clean

  io:
    client:   121 MiB/s rd, 119 MiB/s wr, 125 op/s rd, 165 op/s wr
ceph config get mds.{id} mds_log_max_segments
Example output:
128
ceph config set mds mds_log_max_segments 256
ceph config get mds.{id} mds_log_max_segments
Example output:
256
ceph -s
Example output:
  cluster:
    id:     0e7a1d51-ad98-4b33-ab71-4577de1b3e7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum e,f,g (age 5d)
    mgr: a(active, since 14h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 6d), 5 in (since 6d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 265 pgs
    objects: 1.54M objects, 1.3 TiB
    usage:   4.3 TiB used, 31 TiB / 35 TiB avail
    pgs:     265 active+clean

  io:
    client:   101 MiB/s rd, 109 MiB/s wr, 80 op/s rd, 128 op/s wr
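If you are unsure which value to substitute for {id} in the ceph config commands, you can list the MDS daemons from the same terminal. This sketch assumes the CEPH_ARGS export shown above is still in effect:
ceph mds stat
ceph fs status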
Admission webhook warning
After you create an Object Bucket Claim, the following error occurs:
Admission webhook warning :- ObjectBucketClaim my-obj-bucket violates policy 299 - "unknown field "spec.ss||"' after object bucket claim"*
For more details about this known issue, see Red Hat Known issues.
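To see which field in your claim triggered the warning, you can inspect the ObjectBucketClaim resource. This is a minimal sketch that assumes the claim name my-obj-bucket from the warning above and that the claim exists in your current project:
oc get objectbucketclaim my-obj-bucket -o yaml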
Local storage operator unable to find candidate storage nodes
- Problem statement
- When you configure a Fusion Data Foundation cluster, you do not find any candidate storage nodes.
- Cause
- When you configure a Fusion Data Foundation cluster, only compute nodes with available disks (SSD/NVMe or HDD) are displayed in the Data Foundation page of the IBM Fusion user interface. The following nodes are filtered out and do not display on the screen:
- Nodes that have SSD/NVMe or HDD disks, but the disks are not in an available state.
- Nodes that do not have the selected disk properties, such as disk size or disk type.
- Nodes where the total disk count (with the same disk size and disk type) is less than 3.
- Steps to verify whether you have the correct storage node candidates
- In the Red Hat OpenShift Container Platform console, go to .
- Verify whether the LocalStorage operator is installed successfully.
- Run the following command to get all the worker nodes:
oc get node -l node-role.kubernetes.io/worker=
- Run the following command to check if discovery results are created for all worker nodes:
oc get localvolumediscoveryresult -n openshift-local-storage
- Run the following command to confirm that none of the nodes have a Fusion Data Foundation storage label:
oc get node -l cluster.ocs.openshift.io/openshift-storage=
Note: On the Linux on IBM Z platform, disks might have to be formatted and partitioned first. For more information about this behavior, see sections 4.1.1 and 4.1.2 of Red Hat OpenShift Data Foundation on IBM Z and IBM LinuxONE - Reference Architecture.
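To understand why a particular node was filtered out, you can inspect the devices that were discovered on it. The following sketch assumes that the LocalVolumeDiscoveryResult resources follow the discovery-result-<node-name> naming pattern; adjust the name to match the output of the earlier oc get localvolumediscoveryresult command:
oc get localvolumediscoveryresult discovery-result-<node-name> -n openshift-local-storage -o yaml
In the output, review the status.discoveredDevices list; only devices that report an Available state are treated as candidates.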
Fusion Data Foundation capacity cannot be loaded
If you encounter this issue in the Data Foundation page of the IBM Fusion user interface, contact IBM Support.
Fusion Data Foundation cluster fails due to pending StorageClusterPreparing stage
- Problem statement
- The PVC is not created and the odfcluster status shows the following:
conditions:
- lastTransitionTime: "2022-12-01T15:09:47Z"
  message: storagecluster is not ready,install pending
  reason: StorageClusterPreparing
  status: "False"
  type: Ready
phase: InProgress
replica: 1
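Before diagnosing further, you can confirm that no persistent volume claims were created. This is a quick sketch that assumes the default openshift-storage namespace:
oc get pvc -n openshift-storage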
- Diagnosis and resolution
- To diagnose and fix the problem, do the following steps:
- Run the following command to open the storagecluster CR:
CR:oc get storageclusters.ocs.openshift.io -n openshift-storage ocs-storagecluster -o yaml
- Check whether the output of the command shows the following error message in the status:
ConfigMap "ocs-kms-connection-details" not found
Output example:
status:
  conditions:
  - lastHeartbeatTime: "2023-03-29T08:01:10Z"
    lastTransitionTime: "2023-03-29T07:49:47Z"
    message: 'Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
- If you notice the error message, check the rook-ceph-operator logs with the following command:
oc logs -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name)
Example output:
2023-03-29 07:55:41.297073 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to get backend version: failed to list vault system mounts: Error making API request.

URL: GET https://9.9.9.75:8200/v1/sys/mounts
Code: 403. Errors:

* permission denied
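If you see either the missing ocs-kms-connection-details ConfigMap message or the Vault permission denied error shown above, you can verify the KMS configuration from the CLI. The following is a hedged sketch that assumes KMS encryption is configured with token-based Vault authentication, the default secret name ocs-kms-token, and access to the vault CLI; adjust names to match your environment:
oc get configmap ocs-kms-connection-details -n openshift-storage -o yaml
oc get secret ocs-kms-token -n openshift-storage
vault token capabilities <token> sys/mounts
A 403 response typically means that the Vault token lacks a policy that permits the operation, so correct the token or its policy on the Vault side and then let the operator reconcile again.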
Data Foundation service configuration is incomplete
- Problem statement
- The Data Foundation storage cluster health is in a warning state.
- Diagnosis and resolution
- To diagnose and fix the problem, do the following steps:
- Run the following command to open a remote shell in the rook-ceph-operator pod and check whether any alerts exist in the Ceph cluster:
oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-operator)
- Run the following commands in the pod:
sh-5.1$ export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
sh-5.1$ ceph health detail
Example output:
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    osd.0 crashed on host rook-ceph-osd-prepare-ecf00ad4a684ca39c77ec263403cc29c-7vbn5 at 2024-07-29T09:04:58.992261Z
    osd.1 crashed on host rook-ceph-osd-prepare-f303dcfea8247471c14b744745e1b523-vl489 at 2024-07-29T09:04:59.283451Z
    osd.2 crashed on host rook-ceph-osd-prepare-a09bfae66e8278d3662bf8bb476a51fa-qzdln at 2024-07-29T09:04:59.932438Z
sh-5.1$ ceph crash ls
Example output:
ID                                                                ENTITY  NEW
2024-07-29T09:04:58.992261Z_0e656801-e235-44f5-b64f-939d81ba2dc1  osd.0   *
2024-07-29T09:04:59.283451Z_b35f02e5-305b-42c4-bcbe-85fd5c0012fc  osd.1   *
2024-07-29T09:04:59.932438Z_3618dc07-7756-405d-838e-7b627d73e64c  osd.2   *
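Optionally, you can inspect an individual crash before clearing it by passing a crash ID from the list above to ceph crash info. A sketch, run from the same pod shell:
sh-5.1$ ceph crash info <crash-id>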
- Collect the Data Foundation must-gather logs to get more details about the issue. For steps to collect logs, see Collecting logs in IBM Fusion.
- If you find any alerts in the Ceph cluster, run the following command to clear them:
sh-5.1$ ceph crash prune 0
- Run the ceph health detail command again to verify that the alerts are cleared.
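If the RECENT_CRASH warning persists after pruning, an alternative approach (assuming you have already reviewed the crash reports) is to archive them so that they no longer raise the health warning:
sh-5.1$ ceph crash archive-all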