Operator installation or upgrade fails with DeadlineExceeded error

After the subscription is created, the IBM Cloud Pak foundational services operator ClusterServiceVersion (CSV) status shows Failed and its InstallPlan status also shows Failed.

Symptom

The foundational services operator CSV status shows as Failed. On the Red Hat® OpenShift® Container Platform cluster console, you see an error message similar to the following:

Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline

The InstallPlan status also shows as Failed. Complete the following steps to verify the failure:

  1. Get the InstallPlan reference from the subscription of the failed operator.

     oc get subscription <failed-operator-subscription> -n <operator-namespace> -o jsonpath='{.status.installPlanRef}'
    

    See the following sample output:

     {"apiVersion":"operators.coreos.com/v1alpha1","kind":"InstallPlan","name":"install-cpqk2","namespace":"cloudpak-control","resourceVersion":"98650091","uid":"9f0210dd-a44f-454a-8b66-722b7520c838"}
    
  2. Confirm the status of the InstallPlan that you got in the previous command output.

     oc get installplan install-cpqk2 -n cloudpak-control -o jsonpath='{.status.phase}'
    

    See the following sample output:

     Failed
    
  3. Inspect the InstallPlan.

     oc get installplan install-cpqk2 -n cloudpak-control -o yaml | grep "bundle unpacking failed"
    

    See the following sample output:

     bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
    
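Steps 1 and 2 can also be chained by parsing the installPlanRef JSON with jq instead of copying the name by hand. The following is a minimal sketch that reuses the sample output from step 1; on a live cluster the JSON would come from the oc get subscription command shown above:

```shell
# installPlanRef JSON as returned by:
#   oc get subscription <failed-operator-subscription> -n <operator-namespace> \
#     -o jsonpath='{.status.installPlanRef}'
# Inlined here with the sample values from above for illustration.
ref='{"apiVersion":"operators.coreos.com/v1alpha1","kind":"InstallPlan","name":"install-cpqk2","namespace":"cloudpak-control"}'

# Extract the InstallPlan name and namespace.
name=$(echo "$ref" | jq -r '.name')
ns=$(echo "$ref" | jq -r '.namespace')
echo "$name $ns"

# With a live cluster, you would then check the phase:
#   oc get installplan "$name" -n "$ns" -o jsonpath='{.status.phase}'
```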

Cause

Operator Lifecycle Manager (OLM) fails to unpack the operator bundle when the extract job fails, for example because the remote bundle image cannot be pulled. The failed job leaves behind a corrupted configmap, so the operator manifest that is stored in it is most likely corrupted as well. Because OLM reuses the same job and configmap, any repeated installation attempt also fails.

In an air-gapped environment, the issue most often occurs because the bundle image is not available in the private registry.
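For the air-gapped case, you can check which image the unpack job tried to pull and verify that it exists in your mirror registry. The following is a sketch with placeholder names; skopeo is one way to query a registry from any host with registry access:

```shell
# List the images that the failed unpack job references; the bundle image
# appears among the images of the job's containers and init containers.
oc get job <job-name> -n <catalog-namespace> -o yaml | grep "image:"

# Verify that the bundle image is actually served by the private registry;
# a failure here confirms that the image was not mirrored.
# (<private-registry>/<bundle-image> is a placeholder.)
skopeo inspect docker://<private-registry>/<bundle-image>
```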

Resolution

For more information about how to resolve the issue, see Operator installation or upgrade fails with DeadlineExceeded in Red Hat OpenShift Container Platform 4.

Complete the following steps to resolve the issue:

  1. Find the corresponding job and configmap in the namespace where the CatalogSource is deployed. Narrow down your search by using the operator name or any other keyword.

     oc get job -n <catalog-namespace> -o json | jq -r '.items[] | select(.spec.template.spec.containers[].env[]? | (.value // "") | contains("<failed-operator-name>")) | .metadata.name'
    
  2. Delete the job and the corresponding configmap that you got in the previous command. In most cases, the job and configmap have the same name.

     oc delete job <job-name> -n <catalog-namespace>
    
     oc delete configmap <job-name> -n <catalog-namespace>
    
  3. Delete the Failed InstallPlan.

     oc delete installplan <operator-installplan-name> -n <operator-namespace>
    
  4. Delete the subscription and CSV of the Failed operator.

     oc delete subscription <name-of-the-operator-subscription> -n <operator-namespace>
    
     oc delete csv <name-of-the-corresponding-CSV> -n <operator-namespace>
    
  5. Delete the Operand Deployment Lifecycle Manager pod. A new pod is created automatically, and the installation is retried.

     oc delete pod -l name=operand-deployment-lifecycle-manager -n <your-foundational-services-namespace>
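The resolution steps above can be collected into a small helper. The following sketch only prints each oc command (a dry run) so that you can review them before executing; it assumes that the job and its configmap share a name, as noted in step 2, and every name in the example call except install-cpqk2 and cloudpak-control is a hypothetical placeholder:

```shell
# Dry-run helper for the cleanup steps above. It echoes each oc command
# instead of executing it; pipe the output to sh to actually run them.
cleanup_failed_install() {
  job=$1 catalog_ns=$2 installplan=$3 sub=$4 csv=$5 op_ns=$6 fs_ns=$7
  echo "oc delete job $job -n $catalog_ns"
  echo "oc delete configmap $job -n $catalog_ns"          # same name as the job
  echo "oc delete installplan $installplan -n $op_ns"
  echo "oc delete subscription $sub -n $op_ns"
  echo "oc delete csv $csv -n $op_ns"
  echo "oc delete pod -l name=operand-deployment-lifecycle-manager -n $fs_ns"
}

# Example call. install-cpqk2 and cloudpak-control are the sample values
# from this page; the other arguments are hypothetical placeholders.
cleanup_failed_install my-unpack-job openshift-marketplace install-cpqk2 \
  my-operator-sub my-operator-csv cloudpak-control cloudpak-control
```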