Troubleshooting installation and upgrade issues in IBM Storage Fusion services

Use this troubleshooting information to identify and resolve installation and upgrade problems that are related to IBM Storage Fusion services.

Important: If you face upgrade issues in the Backup & Restore service, Backup & Restore (Legacy) service, or Data Cataloging service, check your OADP and AMQ Streams operator versions. For more information about the version requirements, see Red Hat AMQ Streams and OpenShift API for Data Protection (OADP) version requirements. Also, if you have the Backup & Restore (Legacy) service, upgrade it first before you proceed with the upgrade of the other services.
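As a quick check, you can list the installed operator versions from the command line. The following command is only a sketch that assumes the operators are installed through OLM; the exact ClusterServiceVersion names vary by release.

oc get csv -A | grep -Ei 'oadp|amq'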

IBM Storage Fusion Global Data Platform service

Warning: Do NOT delete IBM Spectrum Scale pods. Deleting Scale pods can, in many circumstances, affect availability and data integrity.
CSI pods experience scheduling problems after a node drain
Whenever a compute node is drained, the CSI pods or sidecar pods on that node are evicted. The remaining compute nodes might not be able to host the evicted CSI pods or sidecar pods because of resource constraints at that point in time.

Resolution:

Ensure that a functional system exists with available compute nodes that have sufficient resources to accommodate the evicted CSI pods or sidecar pods.
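To confirm whether evicted CSI pods are pending because of resource constraints, you can check for unschedulable pods and review the node capacity. The following commands are a sketch that assumes the CSI pods run in the ibm-spectrum-scale-csi namespace; adjust the namespace and node name to match your deployment.

oc -n ibm-spectrum-scale-csi get pods --field-selector=status.phase=Pending
oc describe node <node-name> | grep -A8 "Allocated resources"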

IBM Storage Fusion Data Cataloging service

Data Cataloging service upgrade not available
Sometimes, the Data Cataloging service upgrade might not be available after you upgrade IBM Storage Fusion from 2.5.2 to 2.6.1.
Resolution:
  1. Confirm that the isd-catalog provides the new 212.0.0 version.
    oc get packagemanifest ibm-spectrum-discover-operator -o jsonpath='{.status.channels[0].currentCSVDesc.version}'
    
    Example output:
    212.0.0-1692058947
    
  2. Run the following command to back up the service subscription.
    oc -n ibm-data-cataloging get sub ibm-spectrum-discover-operator -o yaml > dcs-sub.yaml
    
  3. Run the following command to clean up the ClusterServiceVersions and Subscriptions:
    oc -n ibm-data-cataloging delete $(oc -n ibm-data-cataloging get csv,sub -o name | grep "amqstreams\|db2u-operator\|ibm-spectrum-discover-operator\|amq-streams")
    
  4. Create service subscription:
    oc -n ibm-data-cataloging apply -f dcs-sub.yaml
    
  5. Wait until the InstallPlan is available:
    oc -n ibm-data-cataloging get sub ibm-spectrum-discover-operator -o jsonpath='{.status.installplan.name}' -w
    
    Example output:
    install-jhpc2
    
  6. Approve the InstallPlan:
    oc -n ibm-data-cataloging patch ip $(oc -n ibm-data-cataloging get sub ibm-spectrum-discover-operator -o jsonpath='{.status.installplan.name}') --type merge --patch '{"spec":{"approved":true}}'
    
  7. Run the following command to check whether the service reports 2.1.2 as an upgrade version, and wait until the output 2.1.2 is displayed.
    oc -n ibm-data-cataloging get isd -o jsonpath='{.status.upgradeVersionAvailable}' -w
    
    Example output:
    2.1.2
    
  8. Start the service upgrade.
    oc -n ibm-data-cataloging patch $(oc -n ibm-data-cataloging get isd -o name) --type merge --patch '{"spec":{"triggerUpgrade":true}}'
    
  9. Monitor the progress of the upgrade from IBM Storage Fusion user interface.
Data Cataloging is stuck at 80% and pods go into a CrashLoopBackOff error
Symptoms:

The isd-db2whrest or isd-db-schema pods report a non-ready or error state.

Run the following command to view the common logs:

oc -n ibm-data-cataloging logs -l 'role in (db2whrest, db-schema)' --tail=200

Go through the logs to check whether the following error exists:

Waiting on c-isd-db2u-engn-svc port 50001...

db2whconn - ERROR - [FAILED]: [IBM][CLI Driver] SQL1224N The database manager is not able to accept new requests, has terminated all requests in progress, or has terminated the specified request because of an error or a forced interrupt. SQLSTATE=55032

Connection refused
Resolution:
  1. Restart Db2:
    
    oc -n ibm-data-cataloging rsh c-isd-db2u-0
    sudo wvcli system disable -m "Disable HA before Db2 maintenance"
    su - ${DB2INSTANCE}
    db2stop
    db2start
    db2 activate db BLUDB
    exit
    sudo wvcli system enable -m "Enable HA after Db2 maintenance"
  2. Confirm that Db2 HA-monitoring is active:
    
    sudo wvcli system status
    exit
    
  3. Restart db2whrest:
    oc -n ibm-data-cataloging delete pod -l role=db2whrest
    
  4. Verify that at least one db-schema pod succeeds.
    oc -n ibm-data-cataloging get pod | grep isd-db-schema | grep Completed
  5. If the previous step returned empty output, then recreate the job.
    
    SCHEMA_OLD="isd-db-schema-old.json"
    SCHEMA_NEW="isd-db-schema-new.json"
    oc -n ibm-data-cataloging get job isd-db-schema -o json > $SCHEMA_OLD
    jq 'del(.spec.template.metadata.labels."controller-uid") | del(.spec.selector) | del (.status)' $SCHEMA_OLD > $SCHEMA_NEW
    oc -n ibm-data-cataloging delete job isd-db-schema
    oc -n ibm-data-cataloging apply -f $SCHEMA_NEW
    
Data Cataloging upgrade from 2.5.2 to 2.6.1 shows many pod errors and CrashLoopBackOff states
As a resolution, recover from the db-schema error during the Data Cataloging upgrade:

Ensure that the oc and jq commands are installed.

Note: This procedure also applies to recovering Db2 when it becomes unavailable after the service installation.
  1. Stop Db2.
    
    oc -n ibm-data-cataloging rsh c-isd-db2u-0
    sudo wvcli system disable -m "Disable HA before Db2 maintenance"
    su - ${DB2INSTANCE}
    db2stop
  2. Start Db2 and activate the database.
    
    db2start
    db2 activate db BLUDB
    exit
    sudo wvcli system enable -m "Enable HA after Db2 maintenance"
  3. Confirm that the built-in HA monitoring is active.
    
    sudo wvcli system status
    exit
  4. Recreate db-schema job.
    
    SCHEMA_OLD="isd-db-schema-old.json"
    SCHEMA_NEW="isd-db-schema-new.json"
    oc -n ibm-data-cataloging get job isd-db-schema -o json > $SCHEMA_OLD
    jq 'del(.spec.template.metadata.labels."controller-uid") | del(.spec.selector) | del (.status)' $SCHEMA_OLD > $SCHEMA_NEW
    oc -n ibm-data-cataloging delete job isd-db-schema
    oc -n ibm-data-cataloging apply -f $SCHEMA_NEW
    
  5. Remove pods in CrashLoop state.
    oc -n ibm-data-cataloging delete pod $(oc -n ibm-data-cataloging get pod | grep CrashLoop | awk '{print $1}')
    
Data Cataloging service is not installed successfully
The Data Cataloging service remains in the installing state for hours. To resolve the problem, do the following steps:
  1. Label the GPU nodes with isf.ibm.com/nodeType=gpu. The node names depend on your GPU node names; in this example, the GPU nodes are in ru25 and ru27:
    
    oc label node compute-1-ru25.mydomain.com isf.ibm.com/nodeType=gpu
    oc label node compute-1-ru27.mydomain.com isf.ibm.com/nodeType=gpu
  2. Patch the FSD with a new affinity so that the isd workload is not scheduled on those nodes:
    oc -n ibm-spectrum-fusion-ns patch fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition --patch "$(cat fsd_dcs_patch.yaml)"
    Create the fsd_dcs_patch.yaml file as follows before you run the patch command:
    
    cat > fsd_dcs_patch.yaml << EOF
    
    apiVersion: service.isf.ibm.com/v1
    kind: FusionServiceDefinition
    metadata:
      name: data-cataloging-service-definition
      namespace: ibm-spectrum-fusion-ns
    spec:
      onboarding:
        parameters:
          - dataType: string
            defaultValue: ibm-data-cataloging
            descriptionCode: BMYSRV00003
            displayNameCode: BMYSRV00004
            name: namespace
            required: true
            userInterface: false
          - dataType: storageClass
            defaultValue: ''
            descriptionCode: BMYDC0300
            displayNameCode: BMYDC0301
            name: rwx_storage_class
            required: true
            userInterface: true
          - dataType: bool
            defaultValue: 'true'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: doInstall
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"accept": true}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: license
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"isf.ibm.com/nodeType","operator":"NotIn","values":["gpu"]}]}]}}}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: affinity
            required: true
            userInterface: false
        
    EOF
    
  3. Display the patched FSD:
    
    oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition -o yaml
  4. Install from the user interface.
  5. Delete Data Cataloging namespace:
    oc delete ns ibm-data-cataloging
  6. Delete FSD instance:
    oc -n ibm-spectrum-fusion-ns delete fusionserviceinstances.service.isf.ibm.com data-cataloging-service-instance
Data Cataloging service not available
The Data Cataloging service is not available whenever a single node is down or maintenance mode is enabled. If the Data Cataloging service is in a degraded state, check the node status and the Scale pod status to ensure that everything is up and running.
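For example, the following commands show the node status and the Scale pod status. The sketch assumes that the Scale pods run in the ibm-spectrum-scale namespace; adjust the namespace if your deployment differs.

oc get nodes
oc -n ibm-spectrum-scale get pods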
Data Cataloging installation is stuck at 35% for more than 1 hour
Cause:

Either the instdb or the restore-morph job is stuck in the sync step. Use oc -n ibm-data-cataloging logs -f jobs/c-isd-instdb and oc -n ibm-data-cataloging logs -f jobs/c-isd-restore-morph to check whether the last line of the logs is sync and the job is not progressing.

A kubelet error can also cause the Data Cataloging installation to get stuck at 35% when a host that is not currently in use receives a port request.

Resolution:
  1. Identify which job is stuck at sync (c-isd-instdb or c-isd-restore-morph) based on the logs, by using the commands that are provided in the Cause section.
  2. Run the following command to determine the node where the job is scheduled:
    oc -n ibm-data-cataloging get pod -l job-name=<JOB_NAME> -o jsonpath='{.items[].spec.nodeName}'
    Here, <JOB_NAME> is either c-isd-instdb or c-isd-restore-morph.
  3. Do a graceful reboot of the affected node. For the procedure, see the Red Hat OpenShift documentation.
  4. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  5. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  6. Wait until the ibm-data-cataloging project is free of pods starting with c-isd.
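    For example, run the following command repeatedly until it returns no output:
    oc -n ibm-data-cataloging get pod -o name | grep c-isd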
  7. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator
Data Cataloging installation is stuck at 80% for more than 4 hours
Cause:
If the issue is related to the IBM Db2 resource, then manually restart the Db2 deployment.
Note: This procedure recreates the database and also resolves connection authorization failures that are reported from the isd-db-schema pod, as well as similar database setup problems. It also applies to cases where the installation is stuck at 60%.
Resolution:
  1. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  2. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  3. Run the following command to scale down workloads.
    oc -n ibm-data-cataloging scale --replicas=0 deployment,statefulset -l component=discover
    
  4. Run the following command to remove DB schema job.
    oc -n ibm-data-cataloging delete job isd-db-schema
    
  5. Wait until the Db2 pods and persistent volume claims get removed.
    oc -n ibm-data-cataloging get pod,pvc -o name | grep c-isd
  6. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator

IBM Storage Fusion Backup & Restore service

Backup & Restore stuck at 5% during upgrade
Diagnosis:
  1. In the OpenShift® web console, go to Installed Operators and filter on the ibm-backup-restore namespace.
  2. Click IBM Storage Fusion Backup and Restore Server.
  3. Go to Subscriptions.
  4. If you see the following error, then do the workaround steps to resolve the issue:
    "error validating existing CRs against new CRD's schema for "guardiancopyrestores.guardian.isf.ibm.com": error validating custom resource against new schema for GuardianCopyRestore"
Resolution:
  1. From the command line, log in to the cluster and run the following commands to delete the guardiancopyrestore CRs and the 2.6.0 CSV:
    oc -n ibm-backup-restore delete guardiancopyrestore.guardian.isf.ibm.com --all
    oc -n ibm-backup-restore delete csv guardian-dm-operator.v2.6.0 
    
  2. From OpenShift Container Platform console, go to Installed Operator and filter on ibm-backup-restore namespace.
  3. Click IBM Storage Fusion Backup and Restore Server.
  4. Go to Subscriptions.
  5. Find the failing InstallPlan and delete it. (You can also find and delete it from the command line; see the sketch after this procedure.)
  6. Go to Installed Operators and go to the ibm-backup-restore namespace.
  7. Find IBM Storage Fusion Backup and Restore Server, click Upgrade available, and approve the install plan for IBM Storage Fusion Backup and Restore Server.
  8. Wait for the service upgrade to resume.
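If you prefer the command line for step 5, the following sketch lists the InstallPlans so that you can identify and delete the failing one; the InstallPlan name is environment specific.

oc -n ibm-backup-restore get installplan
oc -n ibm-backup-restore delete installplan <failing-installplan-name>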
MongoDB pod crashes with CrashLoopBackOff status
The MongoDB pod crashes due to an out-of-memory (OOM) error. To resolve the error, increase the memory limit from 256Mi to 512Mi. Do the following steps to change the memory limit from the web console; a command-line alternative is sketched after these steps:
  1. Log in to the Red Hat OpenShift web console as an administrator.
  2. Go to Workloads > StatefulSet.
  3. Select the project ibm-backup-restore.
  4. Select the MongoDB StatefulSet, and go to the YAML tab.
  5. In the YAML, change the memory limit for the MongoDB container from 256Mi to 512Mi.
  6. Click Save.
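Alternatively, you can increase the limit from the command line. The following command is a sketch only; the StatefulSet and container names are placeholders because they can differ between releases, so replace them with the actual names in your cluster.

oc -n ibm-backup-restore set resources statefulset/<mongodb-statefulset-name> -c <mongodb-container-name> --limits=memory=512Mi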
storage-operator pod crashes with CrashLoopBackOff status
The storage-operator pod crashes due to an out-of-memory (OOM) error. To resolve the error, increase the memory limit from 300Mi to 500Mi:
  1. Log in to the Red Hat OpenShift web console as an administrator.
  2. Select the project ibm-spectrum-fusion-ns.
  3. Select the isf-storage-operator-controller-manager-xxxx pod, and go to the YAML tab.
  4. In the YAML, change the memory limit for isf-storage-operator-controller-manager-xxxx container from 300Mi to 500Mi.
    
    containers:
        - resources:
            limits:
              cpu: 100m
              memory: 500Mi
  5. Click Save.
Pods in CrashLoopBackOff state after upgrade
The Backup & Restore service health changes to unknown and two pods go into a CrashLoopBackOff state.

Resolution:

In the resource settings of the guardian-dp-operator pod that resides in the ibm-backup-restore namespace, set the value of the IBM Storage Fusion operator memory limit to 1000Mi.

Example:

resources:
  limits:
    cpu: 1000m
    memory: 1000Mi
  requests:
    cpu: 500m
    memory: 250Mi
Backup & Restore service installation stuck at 80%
The Backup & Restore service installation is stuck at 80% because the mongodb-0 pod is not ready.

The metrics exporter container in the mongodb pods sometimes takes a while to start, which causes the container to fail the liveness probe check and the pod to restart. If the restart happens while the mongodb replica set is being configured, the replica set configuration can be left in an incomplete state. Consequently, the readiness probes fail because the pods are not configured as primary or secondary, which leaves the Backup & Restore installation stuck.

Resolution:

Whenever this issue is encountered, the replica set configuration is in an incomplete state and needs to be cleaned up. Contact IBM Support so that they can confirm the root cause of the problem and help with the cleanup.
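Before you open a case, you can capture the pod status and recent logs to help IBM Support with the diagnosis, for example:

oc -n ibm-backup-restore get pod mongodb-0 -o wide
oc -n ibm-backup-restore logs mongodb-0 --all-containers --tail=100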

IBM Storage Fusion Backup & Restore (Legacy) service

The IBM Spectrum Protect Plus server could not be contacted with 'no route available' message after the IBM Spectrum Protect Plus server reinstall
After the reinstallation of the IBM Spectrum Protect Plus server, the transaction-manager pods go into a 1/3 crash loop state and the transaction-manager-worker pods indicate that the IBM Spectrum Protect Plus server cannot be contacted, with a no route available message.
To resolve the error, reinstall the IBM Spectrum Protect Plus agent.
ImagePull failure during Backup & Restore (Legacy) installation or upgrade
If an ImagePull failure occurs on the Virgo pod during the installation or upgrade of Backup & Restore (Legacy), then as a resolution restart the Backup & Restore (Legacy) Virgo pod in the ibm-spectrum-protect-plus-ns namespace:
  1. Go to OpenShift Container Platform web management console.
  2. Go to Workloads > Pods.
  3. Select ibm-spectrum-protect-plus-ns project.
  4. Search for the sppvirgo pod.
  5. From the Actions menu, click Delete pod to re-spin it.
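Alternatively, you can delete the pod from the command line. The following command is a sketch that assumes the pod name contains sppvirgo, as described in the previous steps:

oc -n ibm-spectrum-protect-plus-ns delete pod $(oc -n ibm-spectrum-protect-plus-ns get pod -o name | grep sppvirgo)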

Common issues

ImagePull failure during installation or upgrade of any service
If an ImagePull failure occurs during the installation or upgrade of any service, then restart the affected pod and retry. If the issue persists, contact IBM Support.
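To find pods with image pull problems across the cluster, you can run, for example:

oc get pods -A | grep -E 'ImagePullBackOff|ErrImagePull'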
Rook-cephfs pods are in CrashLoopBackOff state
The installed Data Cataloging and Backup & Restore services go into a degraded state, and the rook-ceph-mds-ocs-storagecluster pods are in a CrashLoopBackOff state.
Follow these steps to resolve the issue:
  1. If IBM Storage Fusion is installed with OpenShift Container Platform or Data Foundation v4.10.x, then upgrade it to v4.11.x.
  2. The upgrade of OpenShift Container Platform or Data Foundation from v4.10.x to v4.11.x resolves the CrashLoopBackOff error in the rook-ceph-mds-ocs-storagecluster pods and returns them to a running state. Eventually, the services also return to a healthy state.
    Note: You can also upgrade OpenShift Container Platform or Data Foundation to later versions that are supported by IBM Storage Fusion.
Whenever the upgrade button is unavailable for any service
During an offline upgrade, if you do not see the upgrade button for any of the services after you upgrade the IBM Storage Fusion operator, check the CatalogSource pod, which should be in a running state. For any ImagePullBackOff error, ensure that you completed the mirroring and updated the ImageContentSourcePolicy.
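For example, the following commands show the catalog source pods and the configured ImageContentSourcePolicy resources. The sketch assumes that the IBM Storage Fusion catalog source runs in the openshift-marketplace namespace.

oc -n openshift-marketplace get pods
oc get catalogsource -A
oc get imagecontentsourcepolicy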