Fusion Data Foundation service error scenarios

Use this troubleshooting information to identify problems and their workarounds when you install or configure the Fusion Data Foundation service.

Fusion Data Foundation is in a degraded state

Problem statement
The Fusion Data Foundation storage cluster is in a degraded state.
Cause
This issue might occur for many reasons. Follow these steps to find the root cause of the issue:
  1. Log in to the OpenShift Container Platform console.
  2. Go to Workloads > Pods.
  3. Select openshift-storage from the Projects drop-down list.
  4. Click the CephCluster CR instance and go to the YAML tab.
  5. Check the status field to find the exact failure reason. If the issue occurs because of MDS trimming, follow the steps in the Resolution section to resolve the issue.
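If you prefer the command line, you can read the same status information directly from the CephCluster CR. This is a minimal sketch that assumes the default CR name ocs-storagecluster-cephcluster in the openshift-storage namespace; verify the names in your cluster with the first command before you run the second:

    oc get cephcluster -n openshift-storage
    oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.status.ceph.health}{"\n"}'

The jsonpath expression prints only the cluster health value (for example, HEALTH_WARN); use -o yaml instead to see the full status, including the detailed failure reason.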
Resolution
  1. Log in to the OpenShift Container Platform console.
  2. Go to Workloads > Pods.
  3. Select openshift-storage from the Projects drop-down list.
  4. Select the rook-ceph-operator pod from the list.
  5. Go to the Terminal tab, and run the following commands:
    
    export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
    ceph -s
    Example output:
    cluster:
        id:     0e7a1d51-ad98-4b33-ab71-4577de1b3e7d
        health: HEALTH_WARN
                2 MDSs behind on trimming
     
      services:
        mon: 3 daemons, quorum e,f,g (age 5d)
        mgr: a(active, since 14h), standbys: b
        mds: 1/1 daemons up, 1 hot standby
        osd: 5 osds: 5 up (since 6d), 5 in (since 6d)
        rgw: 1 daemon active (1 hosts, 1 zones)
     
      data:
        volumes: 1/1 healthy
        pools:   12 pools, 265 pgs
        objects: 1.54M objects, 1.3 TiB
        usage:   4.3 TiB used, 31 TiB / 35 TiB avail
        pgs:     265 active+clean
     
      io:
        client:   121 MiB/s rd, 119 MiB/s wr, 125 op/s rd, 165 op/s wr
    ceph config get mds.{id} mds_log_max_segments
    128
    ceph config set mds mds_log_max_segments 256
    ceph config get mds.{id} mds_log_max_segments
    256
    ceph -s
    Example output:
    cluster:
        id:     0e7a1d51-ad98-4b33-ab71-4577de1b3e7d
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum e,f,g (age 5d)
        mgr: a(active, since 14h), standbys: b
        mds: 1/1 daemons up, 1 hot standby
        osd: 5 osds: 5 up (since 6d), 5 in (since 6d)
        rgw: 1 daemon active (1 hosts, 1 zones)
     
      data:
        volumes: 1/1 healthy
        pools:   12 pools, 265 pgs
        objects: 1.54M objects, 1.3 TiB
        usage:   4.3 TiB used, 31 TiB / 35 TiB avail
        pgs:     265 active+clean
     
      io:
        client:   101 MiB/s rd, 109 MiB/s wr, 80 op/s rd, 128 op/s wr
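In the ceph config get commands, {id} is a placeholder for the name of an MDS daemon, not a literal value. If you are not sure of the daemon names in your cluster, you can list them from the same terminal before you query or change the setting; a minimal sketch (daemon names such as ocs-storagecluster-cephfilesystem-a vary by deployment):

    ceph fs status

Run the get command once for each MDS daemon to confirm that the new mds_log_max_segments value of 256 is applied.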
For more information about the issue, see Red Hat Customer Portal.

Admission webhook warning

After you create an ObjectBucketClaim, the following warning occurs:
Admission webhook warning: ObjectBucketClaim my-obj-bucket violates policy 299 - "unknown field "spec.ss||"' after object bucket claim"

For more details about this known issue, see Red Hat Known issues.
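The message is a warning rather than a rejection, so the ObjectBucketClaim is typically still created. To confirm that the claim is bound despite the warning, you can inspect it directly; this sketch reuses the claim name my-obj-bucket from the warning and assumes a placeholder <namespace>:

    oc get objectbucketclaim my-obj-bucket -n <namespace>
    oc get objectbucketclaim my-obj-bucket -n <namespace> -o yaml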

Local storage operator unable to find candidate storage nodes

Problem statement
When you configure a Fusion Data Foundation cluster, you do not find any candidate storage nodes.
Cause
When you configure a Fusion Data Foundation cluster, only compute nodes with available disks (SSD/NVMe or HDD) are displayed in the Data Foundation page of the IBM Fusion user interface. The following nodes are filtered out and are not displayed:
  • Nodes that have SSD/NVMe or HDD disks, but the disks are not in an available state.
  • Nodes where the selected disk properties, such as disk size or disk type, are not present.
  • Nodes where the total disk count (with the same disk size and disk type) is less than 3.
Steps to verify whether you have the correct storage node candidates
  1. In Red Hat OpenShift Container Platform console, go to Operators > Installed Operators.
  2. Verify whether the LocalStorage operator is installed successfully.
  3. Run the following command to get all the worker nodes:
    oc get node -l node-role.kubernetes.io/worker=
  4. Run the following command to check whether discovery results are created for all worker nodes (to inspect the devices in an individual result, see the sketch after this list):
    oc get localvolumediscoveryresult -n openshift-local-storage
  5. Run the following command to confirm that none of the nodes have a Fusion Data Foundation storage label:
    oc get node -l cluster.ocs.openshift.io/openshift-storage=
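To understand why a specific node is filtered out, you can inspect its discovery result and check the state of each discovered device. This is a minimal sketch; <result-name> is one of the resources listed by the command in step 4, and the field names follow the Local Storage Operator discoveredDevices schema, so verify them in your cluster:

    oc get localvolumediscoveryresult <result-name> -n openshift-local-storage -o yaml

In the output, review status.discoveredDevices: each entry lists the device path, type, and size, and a status.state of Available or NotAvailable. Only devices in the Available state are counted as candidates.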
Note: On the Linux on IBM Z platform, disks might need to be formatted and partitioned first. For more information about this behavior, see Red Hat OpenShift Data Foundation on IBM Z and IBM LinuxONE - Reference Architecture, sections 4.1.1 and 4.1.2.
If all of the above checks pass but the node still does not appear in the IBM Fusion user interface, contact IBM Support.

Fusion Data Foundation capacity cannot be loaded

If you encounter this issue in the Data Foundation page of the IBM Fusion user interface, contact IBM Support.

Fusion Data Foundation cluster fails due to pending StorageClusterPreparing stage

Problem statement
In this scenario, the PVC is not created and the odfcluster status shows the following:

conditions:
  - lastTransitionTime: "2022-12-01T15:09:47Z"
    message: storagecluster is not ready,install pending
    reason: StorageClusterPreparing
    status: "False"
    type: Ready
  phase: InProgress
  replica: 1
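To view this status in your own cluster, you can query the odfcluster custom resource directly. This is a minimal sketch; replace <namespace> with the namespace of your IBM Fusion installation:

    oc get odfcluster --all-namespaces
    oc get odfcluster -n <namespace> -o yaml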
Diagnosis and resolution
To diagnose and fix the problem, do the following steps:
  1. Run the following command to open the storagecluster CR:
    oc get storageclusters.ocs.openshift.io -n openshift-storage ocs-storagecluster -o yaml
  2. Check whether the output of the command shows the following error message in the status:
    ConfigMap "ocs-kms-connection-details" not found'
    Output example:
    
    status:
      conditions:
      - lastHeartbeatTime: "2023-03-29T08:01:10Z"
        lastTransitionTime: "2023-03-29T07:49:47Z"
        message: 'Error while reconciling: some StorageClasses were skipped while waiting
          for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]'
        reason: ReconcileFailed
        status: "False"
        type: ReconcileComplete
  3. If you notice the error message, check the rook-ceph-operator logs with the following command:
    oc logs -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name)
    Example output:
    2023-03-29 07:55:41.297073 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile
    cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to get backend version: failed to list vault system mounts: Error making API
    request.
    URL: GET https://9.9.9.75:8200/v1/sys/mounts
    Code: 403. Errors: * permission denied
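The log shows that cluster creation is blocked while validating the KMS (Vault) connection details and that the Vault API call fails with permission denied. As a hedged next check based on the names in the messages above, confirm that the KMS connection ConfigMap exists and that its values, together with the Vault token secret for your deployment, point to a reachable Vault endpoint whose policy allows listing system mounts:

    oc get configmap ocs-kms-connection-details -n openshift-storage -o yaml
    oc get secret -n openshift-storage | grep -i kms

If the ConfigMap is missing or the Vault policy is too restrictive, correct the KMS configuration; the storage cluster should then reconcile on the next pass.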

Data Foundation service configuration is incomplete

Problem statement
The Data Foundation storage cluster health is in a warning state.
Diagnosis and resolution
To diagnose and fix the problem, do the following steps:
  1. Run the following command to connect to the rook-ceph-operator pod and check whether any alerts exist in the Ceph cluster:
    oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-operator)
  2. Run the following commands in the pod.
    sh-5.1$  export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'
    sh-5.1$ ceph health detail
    Example output:
    HEALTH_WARN 3 daemons have recently crashed
    [WRN] RECENT_CRASH: 3 daemons have recently crashed
        osd.0 crashed on host rook-ceph-osd-prepare-ecf00ad4a684ca39c77ec263403cc29c-7vbn5 at 2024-07-29T09:04:58.992261Z
        osd.1 crashed on host rook-ceph-osd-prepare-f303dcfea8247471c14b744745e1b523-vl489 at 2024-07-29T09:04:59.283451Z
        osd.2 crashed on host rook-ceph-osd-prepare-a09bfae66e8278d3662bf8bb476a51fa-qzdln at 2024-07-29T09:04:59.932438Z
    sh-5.1$ ceph crash ls
    Example output:
    ID                                                                ENTITY  NEW  
    2024-07-29T09:04:58.992261Z_0e656801-e235-44f5-b64f-939d81ba2dc1  osd.0    *   
    2024-07-29T09:04:59.283451Z_b35f02e5-305b-42c4-bcbe-85fd5c0012fc  osd.1    *   
    2024-07-29T09:04:59.932438Z_3618dc07-7756-405d-838e-7b627d73e64c  osd.2    *
  3. Collect the Data Foundation must-gather logs to get more details about the issue. For steps to collect logs, see Collecting logs in IBM Fusion.
  4. If you find any alerts in the Ceph cluster, run the following command to clear the crash alerts:
    sh-5.1$ ceph crash prune 0
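After you prune the crash records, you can rerun the checks from the same pod to confirm that the warning is cleared; a minimal follow-up:

    sh-5.1$ ceph crash ls
    sh-5.1$ ceph health detail

The crash list should now be empty, and the cluster health should return to HEALTH_OK unless other warnings are present.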