Troubleshooting upgrade issues

Troubleshooting IBM Fusion HCI upgrade issues.

Install strategy fails for Fusion Operator after cluster upgrade to Red Hat OpenShift Container Platform 4.15.3

Problem statement
This problem applies to clusters where IBM Fusion HCI was upgraded from 2.7.2 to 2.8.0 on Red Hat® OpenShift® Container Platform 4.14.x. If you then upgrade Red Hat OpenShift Container Platform to 4.15.2 or later, the Fusion operator status in OperatorHub shows a failure with the following error:
install strategy failed: rolebindings.rbac.authorization.k8s.io "isf-update-operator-controller-manager-service-auth-reader" already exists 
Cause
The error occurs because of a known Red Hat OpenShift Container Platform issue. For more information about the issue, see https://issues.redhat.com/projects/OCPBUGS/issues/OCPBUGS-32311?filter=allopenissues.
Resolution
  1. Run the following command to list the CSV condition messages, which name the role bindings that already exist.
    oc get csv isf-operator.v2.8.0 -n ibm-spectrum-fusion-ns -o json | jq '.status.conditions[].message'
    Sample output of the command:
    "all requirements found, attempting install"
    "install strategy failed: rolebindings.rbac.authorization.k8s.io \"isf-application-operator-controller-manager-service-auth-reader\" already exists"
    "webhooks not installed"
    "all requirements found, attempting install"
    "install strategy failed: rolebindings.rbac.authorization.k8s.io \"isf-application-operator-controller-manager-service-auth-reader\" already exists"
    "webhooks not installed"
    "all requirements found, attempting install"
    "install strategy failed: rolebindings.rbac.authorization.k8s.io \"isf-application-operator-controller-manager-service-auth-reader\" already exists"
    
  2. Back up the YAML of each reported role binding.
    oc get rolebinding isf-application-operator-controller-manager-service-auth-reader -n kube-system -o yaml > isf-application-operator-controller-manager-service-auth-reader_rb.yaml
    
  3. Run the following command to delete each reported role binding:
    oc delete rolebinding isf-application-operator-controller-manager-service-auth-reader -n kube-system
  4. Repeat steps 1 through 3 until the IBM Fusion HCI operator CSV reports Healthy and the Fusion operator status shows Succeeded.
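The steps above can be scripted. The following is a minimal sketch, not an IBM-provided script. It assumes that oc, jq, sed, and sort are available, and the helper names extract_rolebindings and backup_and_delete are hypothetical. It pulls the role binding names out of the CSV condition messages, then backs up and deletes each one in kube-system.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of IBM Fusion or oc.

# Read CSV condition messages on stdin and print, once each, every role
# binding name reported as "already exists". Accepts both raw messages and
# JSON-escaped messages (\" around the name), as printed by jq without -r.
extract_rolebindings() {
  sed -n -E 's/.*rolebindings\.rbac\.authorization\.k8s\.io \\?"([^"\\]+)\\?" already exists.*/\1/p' | sort -u
}

# Read role binding names on stdin. Back up each one to <name>_rb.yaml and
# delete it only if the backup command succeeded.
backup_and_delete() {
  while IFS= read -r rb; do
    oc get rolebinding "$rb" -n kube-system -o yaml > "${rb}_rb.yaml" &&
      oc delete rolebinding "$rb" -n kube-system
  done
}

# One pass over steps 1 through 3 might look like this:
#   oc get csv isf-operator.v2.8.0 -n ibm-spectrum-fusion-ns -o json |
#     jq '.status.conditions[].message' | extract_rolebindings | backup_and_delete
```

Review the extracted names before you pipe them into backup_and_delete, and rerun the pass until no role binding is reported.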

catalogsource isf-catalog does not get updated with status

Problem statement
If you upgrade Red Hat OpenShift Container Platform from 4.14 to 4.15, the catalog source to use is fusion-catalog and not isf-catalog.
Resolution
  1. Log in to the OpenShift Container Platform console as a cluster administrator.
  2. Create a new CatalogSource by using the YAML editor.
    Sample catalogsource YAML for online upgrade:
    
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: fusion-catalog
      namespace: openshift-marketplace
    spec:
      displayName: IBM Fusion Catalog
      image: 'icr.io/cpopen/isf-operator-catalog:2.8.0-linux.amd64'
      publisher: IBM
      sourceType: grpc
    Sample catalogsource YAML for offline upgrade:
    
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: fusion-catalog
      namespace: openshift-marketplace
    spec:
      displayName: IBM Fusion Catalog
      image: '$TARGET_PATH/isf-operator-catalog:2.8.0-linux.amd64'
      publisher: IBM
      sourceType: grpc
  3. Save the YAML.
  4. Confirm that the CatalogSource 'fusion-catalog' is in Ready state:
    1. Go to Home > Search.
    2. Change namespace to openshift-marketplace.
    3. In Resources, find CatalogSource.
    4. From the list, select fusion-catalog. The Details tab opens by default.
    5. Confirm that the status is Ready in the Details page.
  5. Go to Operators > Installed Operators and make sure that you select the ibm-spectrum-fusion-ns project.
  6. From the Installed Operators list, select the IBM Fusion operator that is at version 2.7.2. The Details tab opens by default.
  7. Go to the Subscription tab and check whether the Update approval is Manual or Automatic. If it is Automatic, change the Update approval to Manual.
  8. In the Update approval section, click the edit icon and change the channel value to v2.0.
  9. Go to Actions and select Edit Subscription.
  10. In the YAML tab, update the value of the source in the Spec section to fusion-catalog.
  11. Save the YAML.
  12. Proceed with step 8 of the Upgrading IBM Fusion HCI management software topic.
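The console steps above have a command-line counterpart. The following is a minimal sketch under stated assumptions, not the documented procedure. The function names and the SUBSCRIPTION variable are hypothetical, and the subscription name must be looked up on your cluster. It reads the CatalogSource connection state and builds a merge patch that sets manual approval, the v2.0 channel, and the fusion-catalog source in one change.

```shell
#!/bin/sh
# Sketch only: function names and SUBSCRIPTION are hypothetical placeholders.

# Print the last observed connection state of a CatalogSource in
# openshift-marketplace. A healthy catalog reports READY.
catalog_state() {
  oc get catalogsource "$1" -n openshift-marketplace \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

# Build a JSON merge patch for the Subscription spec.
# $1 = channel, $2 = catalog source name.
subscription_patch() {
  printf '{"spec":{"installPlanApproval":"Manual","channel":"%s","source":"%s"}}' "$1" "$2"
}

# Example. Find the real subscription name first:
#   oc get subscription.operators.coreos.com -n ibm-spectrum-fusion-ns
# Then, only when the catalog is ready:
#   [ "$(catalog_state fusion-catalog)" = "READY" ] &&
#     oc patch subscription.operators.coreos.com "$SUBSCRIPTION" \
#       -n ibm-spectrum-fusion-ns --type merge \
#       -p "$(subscription_patch v2.0 fusion-catalog)"
```

The fully qualified resource name subscription.operators.coreos.com is used because other products can register their own resource that is also called subscription.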

Restore failures post upgrade

Problem statement
After upgrade, you might encounter some restore failures. Check whether the job logs of the failed restore jobs contain the following error:
Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added, provider "nonroot":
Resolution
Contact IBM Support to resolve this known issue.

Community operator catalog is shown as missing

Resolution
If the Community operator catalog is shown as missing, then create it before you attempt the upgrade.

Machine config roll out error

Resolution
If an operation triggers a machine config rollout that stays stuck for a long time, check whether the node that is being updated responds to ping and has an IP address after restart. If DHCP or network issues prevent the node from getting a hostname, fix them and restart the node.
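To find the node that is holding up the rollout, you can filter the node list for nodes that are not ready. This is a minimal sketch. The helper name nodes_not_ready is hypothetical, and it assumes the default column layout of oc get nodes, where the second column is STATUS.

```shell
#!/bin/sh
# Sketch only: nodes_not_ready is a hypothetical helper name.

# Read "oc get nodes --no-headers" output on stdin and print the name of
# every node whose STATUS does not start with Ready (for example NotReady).
nodes_not_ready() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage:
#   oc get mcp                                   # machine config pool progress
#   oc get nodes --no-headers | nodes_not_ready  # candidate stuck nodes
#   oc get nodes -o wide                         # INTERNAL-IP column per node
#   ping -c 3 <node-ip>                          # check basic reachability
```

A node in Ready,SchedulingDisabled state is treated as ready here, because cordoning is expected while a machine config is applied.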

On-demand backup failures post upgrade

Problem statement
Post upgrade, on-demand backup failures might happen for existing applications.
Resolution
Do the following manual steps after you upgrade to avoid this problem:
  1. Run the following command to display the phase status of the backup policies associated with all your applications:
    oc get fpa -A
    Example output:
    oc get fpa -A
    
    NAMESPACE                NAME                                  PROVIDER     APPLICATION           BACKUPPOLICY      DATACONSISTENCY   PHASE             LASTBACKUPTIMESTAMP   CAPACITY
    ibm-spectrum-fusion-ns   deptest2-azure-hourly-30              isf-ibmspp   deptest2              azure-hourly-30                     Assigned          66m                   <no value>
    ibm-spectrum-fusion-ns   new-generic-1-azure-hourly-45         isf-ibmspp   new-generic-1         azure-hourly-45                     Assigned          21m                   <no value>
    ibm-spectrum-fusion-ns   new-mongo-project-1-azure-hourly-15   isf-ibmspp   new-mongo-project-1   azure-hourly-15                     InitializeError   81m                   <no value>
    ibm-spectrum-fusion-ns   new-mongo-project-azure-hourly-30     isf-ibmspp   new-mongo-project     azure-hourly-30                     Assigned          66m                   <no value>
  2. Verify whether your policyassignment CR corresponds to any application in InitializeError phase. In this example, the new-mongo-project-1 application is in InitializeError phase.
  3. Log in to the IBM Fusion HCI user interface.
  4. Go to Applications > Backups tab.
  5. Unassign the backup policy that is assigned to the application in InitializeError phase and wait until the unassignment completes. In this example, unassign the azure-hourly-15 policy from the new-mongo-project-1 application.
  6. Reassign the backup policy.
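On a cluster with many applications, the rows in InitializeError phase can be picked out of the oc get fpa -A output with a small filter. This is a minimal sketch. The helper name fpa_in_error is hypothetical, and it assumes the column order shown in the example output, with NAMESPACE, NAME, and PROVIDER all populated so that APPLICATION and BACKUPPOLICY are the fourth and fifth fields.

```shell
#!/bin/sh
# Sketch only: fpa_in_error is a hypothetical helper name.

# Read "oc get fpa -A" output on stdin. For every row that has a field equal
# to InitializeError, print the application and its backup policy.
# Scanning all fields keeps the match correct whether or not the
# DATACONSISTENCY column is empty.
fpa_in_error() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == "InitializeError") { print $4, $5; break } }'
}

# Usage:
#   oc get fpa -A | fpa_in_error
```

Each line of output names one application and the policy to unassign and reassign for it in the user interface.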

ImagePull failure

Resolution
If an ImagePull failure occurs because of an intermittent network or registry issue during an upgrade, then restart the pod and retry. If the issue persists, contact IBM Support.
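The affected pods can be listed from the command line before you restart them. This is a minimal sketch. The helper name image_pull_failures and the NAMESPACE variable are hypothetical. It assumes the single-namespace layout of oc get pods, where the third column is STATUS, and it treats deleting a pod as the restart, which holds only for pods that a controller such as a Deployment recreates.

```shell
#!/bin/sh
# Sketch only: image_pull_failures and NAMESPACE are hypothetical.

# Read "oc get pods -n <namespace> --no-headers" output on stdin and print
# the name of every pod whose STATUS mentions an image pull failure.
image_pull_failures() {
  awk '$3 ~ /ImagePullBackOff|ErrImagePull/ { print $1 }'
}

# Usage (assumes the pods are managed by a controller that recreates them):
#   oc get pods -n "$NAMESPACE" --no-headers | image_pull_failures |
#     while IFS= read -r pod; do
#       oc delete pod "$pod" -n "$NAMESPACE"
#     done
```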

DeadlineExceeded error

Problem statement
The ClusterServiceVersion (CSV) of the IBM Cloud Paks foundational services operator shows the status Failed, and its InstallPlan also shows Failed, after the subscription is created.
Resolution
If you notice that the operator installation or upgrade fails with DeadlineExceeded error, see Operator installation or upgrade fails with DeadlineExceeded error.

Operator OOMKilled error in IBM Fusion namespace

Problem statement
The pods go into crash loop state with the OOMKilled error after OpenShift Container Platform upgrade.
Resolution
Follow steps 1 through 4 to address the OOMKilled error related to the isf-update-operator:
  1. Go to the IBM Fusion ClusterServiceVersion object (Operators > Installed Operators > IBM Fusion operator > YAML tab).
  2. Under spec.install.spec.deployments in the ClusterServiceVersion object, find the deployment of the isf-update-operator, which is named isf-update-operator-controller-manager.
  3. In that deployment, find the container named manager under spec.template.spec.containers and increase its memory limit in resources.limits.memory.
  4. Save the change to the IBM Fusion ClusterServiceVersion. The update operator pod restarts with the new limits.
  5. If the OOMKilled issue persists, repeat steps 1 through 4.
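Before and after you edit the limit, it can help to confirm which pods were OOMKilled and what limit is currently set. This is a minimal sketch. The function names are hypothetical, the IBM Fusion namespace is taken to be ibm-spectrum-fusion-ns, and the CSV name isf-operator.v2.8.0 in the usage lines is an assumption that depends on your installed version.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Read lines of "<pod name><TAB><termination reasons>" on stdin and print
# the name of every pod that has OOMKilled among its last termination reasons.
oomkilled_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Print the current memory limit of the manager container of the
# isf-update-operator deployment as recorded in the CSV. $1 = CSV name.
show_update_operator_memory_limit() {
  oc get csv "$1" -n ibm-spectrum-fusion-ns \
    -o jsonpath='{.spec.install.spec.deployments[?(@.name=="isf-update-operator-controller-manager")].spec.template.spec.containers[?(@.name=="manager")].resources.limits.memory}'
}

# Usage:
#   oc get pods -n ibm-spectrum-fusion-ns \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
#     oomkilled_pods
#   show_update_operator_memory_limit isf-operator.v2.8.0
```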