Operator upgrade fails - OLM known issue

Operator upgrade fails because the service account is deleted.

This is an Operator Lifecycle Manager (OLM) known issue. For more information about the issue, see Red Hat Bugzilla.

Symptom

You usually see this issue during an upgrade of IBM Cloud Pak foundational services in your cluster.

The operator status shows as Installing, but the installation never completes. The operator pods show the CrashLoopBackOff status.

Verify the operator status by using these commands:

  1. Check the status of the operator ClusterServiceVersion (CSV).

     oc get csv -n <your-foundational-services-namespace>
    

    Following is a sample output:

     ...
     ibm-platform-api-operator.v3.8.1                IBM Platform API                       3.8.1     ibm-platform-api-operator.v3.7.2                Installing
     ...
    
  2. Check the status of the operator pod.

     oc get pod -n <your-foundational-services-namespace>
    

    Following is a sample output:

     ...
     ibm-platform-api-operator-65f89cd85b-vz6t6                0/1     CrashLoopBackOff   8          18m
     ...
    
  3. Check the events of the operator pod.

     oc describe pod <pod-name> -n <your-foundational-services-namespace>
    

    Following is an example command and output:

     oc describe pod ibm-platform-api-operator-65f89cd85b-vz6t6 -n <your-foundational-services-namespace>
    
     Events:
     Type     Reason          Age                    From               Message
     ----     ------          ----                   ----               -------
     ...
     Normal   Started         33m (x3 over 34m)      kubelet            Started container ibm-platform-api-operator
     Warning  FailedMount     33m (x7 over 34m)      kubelet            MountVolume.SetUp failed for volume "ibm-platform-api-operator-token-tgvqx" : secret "ibm-platform-api-operator-token-tgvqx" not found
     ...
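
The three checks above can be partly scripted. The following is a minimal sketch, not a supported tool: `crashloop_pods` and `has_missing_token_secret` are hypothetical helper names, and the sketch assumes the default column layout of `oc get pod` (NAME, READY, STATUS, ...) and the event wording shown in the sample output.

```shell
# Hypothetical helper: print the names of pods whose STATUS column
# is CrashLoopBackOff. Reads `oc get pod` output on stdin and assumes
# the default column order: NAME READY STATUS RESTARTS AGE.
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Hypothetical helper: succeed if `oc describe pod` output on stdin
# contains a FailedMount event for a service account token secret
# that cannot be found.
has_missing_token_secret() {
  grep -Eq 'FailedMount.*secret "[^"]*-token-[^"]*" not found'
}

# Example usage (set NS to your foundational services namespace first):
#   oc get pod -n "$NS" | crashloop_pods
#   oc describe pod "$POD" -n "$NS" | has_missing_token_secret && echo "token secret missing"
```

If a pod is listed by the first helper and the second helper succeeds for it, the symptom matches this known issue.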
    

Cause

The issue is caused by an intermittent race condition that occurs during the operator upgrade.

As a result of this race condition, OLM mistakenly deletes the service account of the operator.

Operator upgrade is managed by two operators: the OLM operator and the Catalog operator.

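During the upgrade, the OLM operator erroneously concludes that the upgrade finished successfully. However, the Catalog operator does not update the owner reference of the service account, which causes the service account to be deleted by the OLM operator. When the service account is deleted, its token secret no longer exists, so the operator pod cannot mount the token volume, as the FailedMount event shows.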
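
To see which objects own the service account, you can list its owner references. The following is a minimal sketch: `sa_owners` is a hypothetical helper name, and the service account name is assumed to match the operator name (for example, `ibm-platform-api-operator`, inferred from the token secret name in the event output). Verify the name in your cluster.

```shell
# Hypothetical helper: print one "Kind/name" line for each owner
# reference on a service account.
#   $1 = service account name (assumed to match the operator name)
#   $2 = namespace
# If the service account was deleted, `oc` reports NotFound and
# the function returns a nonzero exit code.
sa_owners() {
  oc get serviceaccount "$1" -n "$2" \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
}

# Example usage (set NS to your foundational services namespace first):
#   sa_owners ibm-platform-api-operator "$NS"
```

An owner reference that still names the previous operator version would be consistent with the cause described above.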

Resolving the problem

Delete the pod that is in the CrashLoopBackOff status.

     oc delete pod <pod-name> -n <your-foundational-services-namespace>

Following is an example command:

     oc delete pod ibm-platform-api-operator-65f89cd85b-vz6t6 -n <your-foundational-services-namespace>

After you delete the pod, it is re-created, and the operator upgrade then completes successfully.
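
To confirm that the upgrade completed, you can check the CSV phase again. The following is a minimal sketch: `csv_phase` is a hypothetical helper name, and it assumes that the phase is the last column of the `oc get csv` output, as in the sample output shown earlier. A phase of `Succeeded` indicates that the installation finished.

```shell
# Hypothetical helper: print the phase (last column) of one CSV.
# Reads `oc get csv` output on stdin; $1 is the CSV name to look up.
# The last field is used because the display name can contain spaces.
csv_phase() {
  awk -v name="$1" '$1 == name { print $NF }'
}

# Example usage (set NS to your foundational services namespace first):
#   oc get csv -n "$NS" | csv_phase ibm-platform-api-operator.v3.8.1
```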