Troubleshooting upgrade on OpenShift

Review the following troubleshooting tips if you encounter a problem while installing or upgrading API Connect on OpenShift, including as a component of IBM Cloud Pak for Integration (CP4I).

Note: On the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, you can click the Product information tile to see your product versions, as well as Git information about the package versions in use. Note that the API Designer product information is based on its associated management server, but the Git information is based on where it was downloaded from.

Incorrect productVersion of gateway pods after upgrade to 10.0.5.5

When upgrading to 10.0.5.5 on OpenShift, the rolling update might fail to start on the gateway operand pods because of a gateway-peering issue, so that the productVersion of the gateway pods remains incorrect even though the reconciled version on the gateway CR displays as 10.0.5.5. This problem happens when gateway peering has multiple primary pods, with some of the peering pods assigned to each primary. There can be only one primary pod for gateway peering; the existence of multiple primaries prevents the upgrade from completing.
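
To confirm the reconciled version that the gateway CR reports, you can check the CR directly. The following command is a sketch that assumes a GatewayCluster CR in the apic_namespace namespace; the RECONCILED VERSION column should show 10.0.5.5 even though the pod productVersion is wrong:

oc get gatewaycluster -n apic_namespace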

Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:

  1. Determine whether multiple primary pods are causing the problem:
    1. Check the productVersion of each gateway pod to verify that it is 10.0.5.7 (the version of DataPower Gateway that was released with API Connect 10.0.5.5) by running one of the following commands:
      oc get po -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
      or
      oc get po -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
      
      where:
      • apic_namespace is the namespace where API Connect is installed
      • <gateway_pods> is a space-delimited list of the names of your gateway peering pods
    2. If any pod returns an incorrect value for the version, check the DataPower operator logs for peering issues; look for messages similar to the following example:
      "level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"}
      {"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
      {"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
      
  2. If you see an indication of peering issues, resolve the problem by manually switching primaries so that all gateway-peering objects have the same primary pod, as explained in the gateway-peering-switch-primary topic in the DataPower documentation.

    Which gateway and gateway-peering objects need to be updated depends on your own deployment; a sketch of the CLI sequence follows.
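
    The following is a sketch of the CLI sequence, assuming a gateway pod named production-gw-0, an API Connect application domain named apiconnect, and a placeholder peering object <peering_object>; the actual pod, domain, and object names depend on your deployment, so treat the gateway-peering-switch-primary topic in the DataPower documentation as the authoritative reference:

    # Attach to the DataPower CLI of the pod that should be (or remain) the primary
    oc attach -it production-gw-0 -n apic_namespace

    # In the DataPower CLI, switch to the API Connect application domain and
    # check which pod each gateway-peering object currently reports as primary
    configure terminal
    switch domain apiconnect
    show gateway-peering-status

    # For each gateway-peering object that reports a different primary, switch it to this pod
    gateway-peering-switch-primary <peering_object>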

Upgrade stuck in Pending state

If your upgrade displays the Pending status and seems to be stuck, check the postgres-operator pod for an error like the following example:
time="2023-07-17T06:41:09Z" level=error msg="refusing to upgrade due to unsuccessful resource removal" func="internal/operator/cluster.AddUpgrade()" file="internal/operator/cluster/upgrade.go:134" version=4.7.10

You can resolve the problem by deleting the clustercreate and upgrade pgtasks, which triggers a new attempt at the upgrade.

First, check the logs to verify the error:

  1. Run the following command to get the name of the pod:
    oc get po -n <APIC_namespace> | grep postgres-operator
  2. Run the following command to get the log itself, using the pod name from the previous step:
    oc logs <pod-name> -c operator -n <APIC_namespace>
  3. Once you confirm that the problem was caused by the resource-removal issue, delete the clustercreate and upgrade pgtasks.

    The following example shows the commands to get the pgtask names and then delete the clustercreate and upgrade pgtasks:

    oc get pgtasks
    NAME                                           AGE
    backrest-backup-large-mgmt-bce926ab-postgres   36h
    large-mgmt-bce926ab-postgres-createcluster     6d5h
    large-mgmt-bce926ab-postgres-upgrade           36h
    
    oc delete pgtasks large-mgmt-bce926ab-postgres-createcluster
    pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-createcluster" deleted
    
    oc delete pgtasks large-mgmt-bce926ab-postgres-upgrade      
    pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-upgrade" deleted
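
    Deleting the pgtasks triggers a new upgrade attempt. You can watch the progress with commands like the following, which assume a ManagementCluster CR in the same namespace; the upgrade should eventually move out of the Pending state:

    oc get pgtasks -n <APIC_namespace>
    oc get managementcluster -n <APIC_namespace>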

One or more pods in CrashLoopBackoff or Error state, reporting a certificate error in the logs

In rare cases, cert-manager might detect a certificate in a bad state right after it has been issued, and then re-issue the certificate. If a CA certificate is issued twice, any certificate that was signed by the previously issued CA is left stale and cannot be validated by the newly issued CA. In this scenario, one of the following messages displays in the log:
  • javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
  • Error: unable to verify the first certificate
  • ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert
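
If you want to manually confirm that a certificate no longer chains to its CA before running apicops, you can extract the certificates from the affected secret and check them with openssl. This is a sketch; <secret-name> is a placeholder for the secret used by the failing pod:
  oc get secret <secret-name> -n <namespace> -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
  oc get secret <secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
  openssl verify -CAfile ca.crt tls.crt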
    
Resolve the problem by completing the following steps:
  1. Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
    apicops upgrade:stale-certs -n <namespace>
  2. If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
    oc delete secret <stale-secret> -n <namespace>

    Cert-manager automatically generates a new certificate to replace the one you deleted.

  3. Use apicops to make sure all certificates can be verified successfully:
    apicops upgrade:stale-certs -n <namespace>
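
    Optionally, you can also list the cert-manager Certificate resources to confirm that the re-issued certificates report Ready; this assumes you can view Certificate resources in the namespace:
    oc get certificates -n <namespace>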

Pgbouncer pod enters a CrashLoopBackoff state when upgrading to API Connect 10.0.5.3

This issue might be caused by a corrupt compliance entry in the pgbouncer.ini file. When the operator is upgraded to v10.0.5.3 but the ManagementCluster CR is not yet upgraded, the operator might update the pgbouncer.ini file in the PGBouncer secret with the older ManagementCluster CR's profile file, which does not contain any value for the compliance pool_size. As a result, the value gets incorrectly set to the string <no value>.

In most cases this bad configuration update is temporary: the problem is resolved when the ManagementCluster CR's version is updated as part of the upgrade process. The issue becomes evident only if the pgbouncer pod restarts before the ManagementCluster CR's version field is updated.

After the operator updates, if other pods (specifically the pgbouncer pod) are restarted before the ManagementCluster CR's version is updated, the pgbouncer pod gets stuck in a CrashLoopBackOff state because of the missing value for the compliance pool_size setting.

The following example shows what the error looks like in the pgbouncer logs:
Wed May 17 09:00:02 UTC 2023 INFO: Starting pgBouncer..
2023-05-17 09:00:02.897 UTC [24] ERROR syntax error in connection string
2023-05-17 09:00:02.897 UTC [24] ERROR invalid value "host=mgmt-a516d013-postgres port=5432 dbname=compliance pool_size=<no value>" for parameter compliance in configuration (/pgconf/pgbouncer.ini:7)
2023-05-17 09:00:02.897 UTC [24] FATAL cannot load config file
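
Before applying the fix, you can confirm that the secret contains the bad entry. The following check is a sketch that reuses the secret name and jsonpath from the script below; it should print a compliance line that contains <no value>:
oc get secret -n <APIC-namespace> <management-prefix>-postgres-pgbouncer-secret -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d | grep compliance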

Resolve the issue by running the commands in the following script, where the NAMESPACE variable is set to the API Connect instance's namespace if you deployed using the top-level APIConnectCluster CR, or to the management subsystem's namespace if you deployed using individual subsystem CRs.

NAMESPACE=<APIC-namespace>
BOUNCER_SECRET=<management-prefix>-postgres-pgbouncer-secret
TEMP_FILE=/tmp/pgbouncer.ini

**Step 1 - Common: Get the existing pgbouncer.ini file**
oc get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d > $TEMP_FILE

**Step 2 - Linux version: Update the file and use it to patch the Secret on the cluster**
sed 's/<no value>/20/' $TEMP_FILE | base64 -w0 | xargs -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

**Step 2 - Mac version:  Update the file and use it to patch the Secret on the cluster**
sed 's/<no value>/20/' $TEMP_FILE | base64 -b0 | xargs -S2000 -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

**Step 3 - Common: Restart pgbouncer to pick up the updated Secret configuration**
oc delete pod <bouncer_pod_name> -n $NAMESPACE
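
Optionally, after the pod is re-created, confirm that pgbouncer reaches the Running state and no longer logs the configuration error; <bouncer_pod_name> here is the name of the new pod:
oc get po -n $NAMESPACE | grep pgbouncer
oc logs <bouncer_pod_name> -n $NAMESPACE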

You see the denied: insufficient scope error during an air-gapped deployment

Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped upgrade.

Reason: This error occurs when a problem is encountered with the entitlement key used for obtaining images.

Solution: Obtain a new entitlement key by completing the following steps:

  1. Log in to the IBM Container Library.
  2. In the Container software library, select Get entitlement key.
  3. Under the Access your container software heading, click Copy key.
  4. Copy the key to a safe location.
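
Optionally, before you re-run the image mirroring, you can confirm that the new key works by logging in to the IBM Entitled Registry with it. This sketch assumes podman is available (docker login works the same way); cp is the username used with entitlement keys:

podman login cp.icr.io --username cp --password <entitlement-key>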

Apiconnect operator crashes

Problem: During upgrade, the Apiconnect operator crashes with the following message:

panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request

goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
	operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
	ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
Additional symptoms:
  • The Apiconnect operator pod is in CrashLoopBackOff status
  • Kube apiserver pods log the following information:
    E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with:
     failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1:
     bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
  • The IP address logged here belongs to the package server pod in the openshift-operator-lifecycle-manager namespace
  • The package server pods log that the /apis/packages.operators.coreos.com/v1 API call is rejected with a 401 error:
    E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: 
    certificate signed by unknown authority I1122 18:10:25.614224 1 httplog.go:90] 
    verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 
    UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370":
  • The problem is intermittent
Solution:
  • If you see exactly the symptoms described, delete the package server pods in the openshift-operator-lifecycle-manager namespace (an example command is shown after this list).
  • New package server pods will log the 200 Success message for the same API call.
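
The following commands are a sketch of that deletion; they assume the package server pods carry the app=packageserver label, so first list the pods and fall back to deleting them by name if the label selector matches nothing:
  oc get pods -n openshift-operator-lifecycle-manager
  oc delete pod -n openshift-operator-lifecycle-manager -l app=packageserver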