Troubleshooting the apply-olm command during installation or upgrade

When you run the cpd-cli manage apply-olm command to install or upgrade operators, the command might fail for various reasons.

If the apply-olm command fails, it is usually because Red Hat® OpenShift® Operator Lifecycle Manager (OLM) encountered a problem. A slow Kubernetes API server, sequencing issues, or timing issues can cause problems with OLM. Some of these problems might be caused by inconsistent operator metadata or by defects in older versions of OLM.

The following sections provide guidance to help you diagnose and resolve problems with OLM so you can successfully install or upgrade Cloud Pak for Data operators. Follow the recommended order of tasks. If you complete these tasks out of order, you might create more problems with your cluster or deployments. The following diagram shows the process you should follow when you troubleshoot errors with the apply-olm command:

Figure: The flow to follow when you troubleshoot issues with the apply-olm command.

The apply-olm command fails

When the apply-olm command fails, check what type of error is returned:
  • If the error is a problem with a catalog source, complete the steps in Check the catalog sources.
  • If the error is a problem with a subscription, complete the steps in Check the OLM operator logs.

Check the catalog sources

  1. Inspect the catalog sources:
    for catsrc in $(oc get catalogsource -n ${PROJECT_CPD_INST_OPERATORS} \
    --sort-by=.metadata.creationTimestamp -o name); \
    do \
    oc get $catsrc -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.metadata.name},{.status.connectionState.lastObservedState}{"\n"}'; \
    done
  2. Complete the appropriate step based on the output of the preceding step:

    • If all of the catalog sources report READY as the last observed state, complete the steps in Check the OLM operator logs.
    • If any catalog source does not report READY, complete the steps in Inspect failed pod logs in the operators project for the instance.
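The loop in step 1 prints one name,state pair per catalog source. As a quick sanity check, you can filter that output for anything that is not READY. The following sketch runs against sample output; the catalog source names and states are illustrative, not taken from a live cluster:

```shell
# Sample output from the catalog source loop (name,lastObservedState per line).
# These catalog source names and states are illustrative.
sample_output='ibm-cpd-platform-operator-catalog,READY
ibm-cloud-databases-redis-operator-catalog,READY
cloud-native-postgresql-catalog,CONNECTING'

# Print only the catalog sources that are not in the READY state.
echo "$sample_output" | awk -F, '$2 != "READY" { print $1 " is in state: " $2 }'
```

Any catalog source that this filter reports is a candidate for the failed-pod checks later in this topic.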

Check the OLM operator logs

Inspect the logs for the OLM operator:

oc logs -n openshift-operator-lifecycle-manager \
$(oc get pods -n openshift-operator-lifecycle-manager -lapp=catalog-operator -o name) | grep ${PROJECT_CPD_INST_OPERATORS}

After you run this command, review messages that contain ResolutionFailed or constraints not satisfiable. Focus on the most recent messages in the logs. These messages can help you identify problems with your OLM configuration.

If you're unable to identify and correct the problem based on the information in the logs, complete the steps in Check for orphaned CSVs and unbound subscriptions.
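To see what a resolution failure looks like when you grep the catalog-operator logs, the following sketch filters sample log lines. The lines are illustrative, not verbatim OLM output:

```shell
# Illustrative catalog-operator log lines (not verbatim OLM output).
sample_log='level=info msg="syncing catalog source cpd-operators/ibm-cpd-platform-operator-catalog"
level=warn msg="constraints not satisfiable: subscription ibm-cpd-platform-operator requires a candidate that is missing from the catalog"'

# Surface only the resolution failures, as described above.
echo "$sample_log" | grep "constraints not satisfiable"
```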

Inspect failed pod logs in the operators project for the instance

  1. Check whether pods are failing in the project where the Cloud Pak for Data operators are installed (PROJECT_CPD_INST_OPERATORS):
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | egrep -v -e "(.+)/\1" -e Completed

    Pods are considered failing if they are in any state other than Running or Completed.

  2. Inspect the events and status of each failed pod that was returned in the previous step. Replace <pod-name> with the name of the pod to inspect:
    oc describe pod <pod-name> -n ${PROJECT_CPD_INST_OPERATORS}
  3. Complete the appropriate step based on the output of the preceding step:

    • If the error contains Bundle unpacking failed, complete the steps in Follow Red Hat guidance for resolving bundle unpacking errors.
    • If the error is a problem with pulling the image:
      1. Fix the error and delete the failed pods:
        oc delete pod <pod-name> -n ${PROJECT_CPD_INST_OPERATORS}
      2. Rerun the apply-olm command.
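The egrep filter in step 1 works because the pattern (.+)/\1 matches READY values such as 1/1 or 2/2, where the number of ready containers equals the total, so healthy pods are excluded along with Completed pods. The following sketch demonstrates the filter against sample oc get pods output; the pod names are hypothetical:

```shell
# Illustrative `oc get pods` output (pod names are hypothetical).
sample_pods='NAME                              READY   STATUS             RESTARTS   AGE
ibm-common-service-operator-abc   1/1     Running            0          5d
ibm-cpd-bundle-unpack-xyz         0/1     Completed          0          2d
ibm-cpd-ws-operator-def           0/1     ImagePullBackOff   0          1h'

# "(.+)/\1" matches READY values where all containers are ready (1/1, 2/2, ...),
# so only failing pods survive the filter.
echo "$sample_pods" | egrep -v -e "(.+)/\1" -e Completed
```

Only the header line and the pod in ImagePullBackOff remain.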

Check for orphaned CSVs and unbound subscriptions

  1. Determine which subscriptions do not have a corresponding CSV installed:
    for sub in $(oc get sub -n ${PROJECT_CPD_INST_OPERATORS} \
    --sort-by=.metadata.creationTimestamp -o name); \
    do \
    echo $sub = \
    $(oc get $sub -n ${PROJECT_CPD_INST_OPERATORS} \
    -o jsonpath='{.metadata.creationTimestamp}{"\t"}{.status.installedCSV}{"\n"}'); \
    done
  2. Review the output of the preceding command:
    • Subscriptions that are bound to a CSV have the following format:
      subscription.operators.coreos.com/<operator-name> = <timestamp> <csv-name>
    • Subscriptions that are not bound to a CSV have the following format:
      subscription.operators.coreos.com/<operator-name> = <timestamp>
  3. For each unbound subscription returned in the preceding step, check whether there are any unbound CSVs and delete them:
    1. Check for a CSV:
      oc get -n ${PROJECT_CPD_INST_OPERATORS} \
      --ignore-not-found -o name csv $(oc get -n ${PROJECT_CPD_INST_OPERATORS} packagemanifest $(oc get subscription <subscription-name> -n ${PROJECT_CPD_INST_OPERATORS} \
      -o jsonpath='{.spec.name}') -o jsonpath='{.status.channels[*].currentCSV}')
    2. Delete the CSV if it exists:
      oc delete csv <csv-name> -n ${PROJECT_CPD_INST_OPERATORS}
  4. For each unbound subscription returned in step 2, run the following command to delete the subscription:
    oc delete subscription <subscription-name> -n ${PROJECT_CPD_INST_OPERATORS}
  5. Restart the following pods in the openshift-operator-lifecycle-manager project:
    1. Restart the catalog-operator pods:
      oc delete pods -n openshift-operator-lifecycle-manager -l app=catalog-operator
    2. Restart the olm-operator pods:
      oc delete pods -n openshift-operator-lifecycle-manager -l app=olm-operator
  6. Confirm that the olm-operator pods are Running:
    oc get pods -n openshift-operator-lifecycle-manager -l app=olm-operator
  7. Complete the steps in Find and inspect remaining failed CSVs.
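The bound and unbound formats shown in step 2 differ only in whether a CSV name follows the timestamp, so you can isolate the unbound subscriptions mechanically. The following sketch runs against sample loop output; the operator names, timestamp, and CSV name are hypothetical:

```shell
# Illustrative output from the subscription loop in step 1.
sample_subs='subscription.operators.coreos.com/ibm-cpd-platform = 2023-03-28T16:48:58Z ibm-cpd-platform.v4.6.0
subscription.operators.coreos.com/ibm-cpd-ws = 2023-03-28T16:50:12Z'

# A bound subscription line has four fields (name = timestamp csv);
# an unbound subscription line has only three.
echo "$sample_subs" | awk 'NF == 3 { print $1 }'
```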

Find and inspect remaining failed CSVs

  1. Determine which CSVs are failing:
    for csv in $(oc get csv -n ${PROJECT_CPD_INST_OPERATORS}  \
    --sort-by=.metadata.creationTimestamp -o name); \
    do \
    echo -ne '.'; \
    csv_status=$(oc get $csv -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.status.phase}'); \
    if [ "X${csv_status}" != "XSucceeded" ]; \
    then \
    echo; \
    echo "CSV did not succeed:  ${csv}  status: ${csv_status}"; \
    fi; \
    done
  2. Inspect any CSV that is not in the Succeeded phase:
    oc get csv <csv-name> -n ${PROJECT_CPD_INST_OPERATORS} -o yaml
  3. In the status section of the CSV YAML file, review the most recent messages for any obvious errors.

    For example, the following status condition shows that the CSV did not reach the Succeeded phase:

    lastTransitionTime: "2023-03-28T16:48:58Z"
    lastUpdateTime: "2023-03-28T16:48:59Z"
    message: 'installing: waiting for deployment ibm-cpd-ws-runtimes-operator to become ready: 
    deployment "ibm-cpd-ws-runtimes-operator" not available: Deployment does not have minimum availability.'

    In this case, the deployment did not come up. This can occur if the cluster does not have sufficient resources. You must investigate why the problem occurred.

  4. If you don't find any errors in the preceding step, find the InstallPlan that introduced the failed CSV:
    for ip in $(oc get ip -n ${PROJECT_CPD_INST_OPERATORS} -o name); \
    do \
    echo $ip:  $(oc get $ip -n ${PROJECT_CPD_INST_OPERATORS} -o yaml | grep <csv-name>); \
    done
  5. In the InstallPlan, search for messages that contain reason: InstallComponentFailed or phase: Failed.

    These messages might contain information that can help you identify the reason that the apply-olm command failed. Some common error messages you might see are:
    Missing required status field

    If you see a status...Required value message, a custom resource is missing a required status field.

    For example, you might see a message similar to the following example:
    message: 'error validating existing CRs against new CRD''s schema 
    for "fdbclusters.foundationdb.opencontent.ibm.com": 
    error validating custom resource against new schema 
    for FdbCluster zen/mdm-foundationdb-1655706402438170: [].status.stage_mirror: Required value'

    This message indicates that the status.stage_mirror field is missing from the mdm-foundationdb-1655706402438170 custom resource in the zen namespace. To resolve this problem, add the appropriate value to the status.stage_mirror field in the indicated custom resource. Then, retry the apply-olm command.

    Missing required spec field

    If you see a spec...Required value message, a custom resource is missing a required spec field.

    For example, you might see a message similar to the following example:
    message: 'error validating existing CRs against new CRD''s schema 
    for "paservices.pa.cpd.ibm.com": 
    error validating custom resource against new schema 
    for PAService zen/ibm-planning-analytics-service: [].spec.version: Required value'

    This message indicates that the spec.version field is missing from the ibm-planning-analytics-service custom resource in the zen namespace. To resolve this problem, update the version field to include the release version that you are upgrading from. Then, retry the apply-olm command.

  6. Complete the steps in Recheck the OLM operator logs.
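Both Required value examples in step 5 follow the same pattern, so you can pull the affected resource and the missing field out of the message mechanically. The following sketch parses a sample message modeled on the examples above:

```shell
# A validation message modeled on the examples above (sample input, not live output).
msg='error validating custom resource against new schema for PAService zen/ibm-planning-analytics-service: [].spec.version: Required value'

# Extract the namespaced resource name and the missing field path.
echo "$msg" | sed -n 's/.* \([^ ]*\): \[\]\.\([^:]*\): Required value.*/resource=\1 missing_field=\2/p'
```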

Recheck the OLM operator logs

  1. Inspect the logs for the OLM operator:
    oc logs -n openshift-operator-lifecycle-manager \
    $(oc get pods -n openshift-operator-lifecycle-manager -lapp=catalog-operator -o name) | grep ${PROJECT_CPD_INST_OPERATORS}
  2. Complete the appropriate step based on the output of the preceding step:

    • If there are no recent errors, rerun the apply-olm command.
    • If the error contains Bundle unpacking failed, complete the steps in Follow Red Hat guidance for resolving bundle unpacking errors.
    • If any other error was returned, complete the steps in Contact IBM Support to clean up OLM artifacts.

Follow Red Hat guidance for resolving bundle unpacking errors

  1. Complete the steps in the Red Hat OpenShift known issue documentation.
  2. Rerun the apply-olm command.

Contact IBM Support to clean up OLM artifacts

You can clean up Cloud Pak for Data OLM artifacts to ensure that all stale subscriptions and CSVs are removed. The apply-olm command performs these cleanup actions automatically, but some OLM artifacts might be removed only by an explicit cleanup.

IBM Support can assist you with cleaning up Cloud Pak for Data OLM artifacts. Before you contact IBM Support, run the following command to collect information about the state of the pods:
cpd-cli manage collect-state \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}

After you run the command, send the output information to IBM Support. If necessary, your IBM Support team will help you clean up Cloud Pak for Data OLM artifacts.