Troubleshooting upgrade on OpenShift

Review the following troubleshooting tips if you encounter a problem while installing or upgrading API Connect on OpenShift, including as a component of IBM Cloud Pak for Integration (CP4I).

On the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, you can click the Product information tile to view your product versions, as well as Git information about the package versions in use. The API Designer product information is based on its associated management server, but the Git information is based on where API Designer was downloaded from.

Incorrect productVersion of gateway pods after upgrade

When upgrading on OpenShift, the rolling update might fail to start on gateway operand pods due to a gateway peering issue, so that the productVersion of the gateway pods is incorrect even though the reconciled version on the gateway CR displays correctly. This problem happens when there are multiple primary pods for gateway peering, and some of the peering pods are assigned to each primary. There can only be one primary pod for gateway-peering; the existence of multiple primaries prevents the upgrade from completing.

Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:

  1. Determine whether multiple primary pods are causing the problem:
    1. Run one of the following commands to check the productVersion of each gateway pod.
      oc get po -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
      or
      oc get po -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
      
      where:
      • apic_namespace is the namespace where API Connect is installed
      • <gateway_pods> is a space-delimited list of the names of your gateway peering pods
    2. Verify that the productVersion is supported with the newer version of API Connect.

      For information on supported versions of DataPower Gateway, see Supported DataPower Gateway versions.

    3. If any pod returns an incorrect value for the productVersion, check the DataPower operator logs for peering issues; look for messages similar to the following example:
      {"level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"}
      {"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
      {"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
      

    If you confirmed that gateway-peering (multiple primaries) is the issue, complete the following steps to resolve the issue.

  2. Attach to a gateway pod:
    1. Run the following command to get a list of gateway pods:
      oc get pods -l app.kubernetes.io/managed-by=datapower-operator

      The response looks like the following example:

      NAME                                                     READY   STATUS    RESTARTS   AGE
      datapower-operator-conversion-webhook-585d95cf87-mktrv   1/1     Running   0          2d22h
      datapower-operator-f4d8fbd85-5xt47                       1/1     Running   0          2d22h
      production-gw-0                                          1/1     Running   0          2d22h
      production-gw-1                                          1/1     Running   0          2d22h
      production-gw-2                                          1/1     Running   0          2d22h
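    2. Select a pod from the list and run the following command to attach to it, where <namespace> is the namespace where API Connect is installed and <pod_NAME> is the value from the NAME column:
      oc -n <namespace> attach -it <pod_NAME> -c datapower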
  3. Select the new primary:
    1. Run the following commands to switch to the API Connect configuration:
      config; switch apiconnect
    2. Run the following command to show the current peering status:
      show gateway-peering-status
      Example response:
       Address      Name             Pending Offset     Link Primary Service port Monitor port Priority 
       ------------ ---------------- ------- ---------- ---- ------- ------------ ------------ -------- 
       10.254.12.45 api-probe        0       19114431   ok   no      16382        26382        2        
       10.254.12.45 gwd              0       1589848370 ok   no      16380        26380        2        
       10.254.12.45 rate-limit       0       19148496   ok   no      16383        26383        2        
       10.254.12.45 ratelimit-module 0       19117538   ok   no      16386        26386        2        
       10.254.12.45 subs             0       26840742   ok   no      16384        26384        2        
       10.254.12.45 tms              0       19115634   ok   no      16381        26381        2        
       10.254.12.45 tms-external     0       19116159   ok   no      16385        26385        2        
       10.254.28.61 api-probe        0       19114431   ok   no      16382        26382        1        
       10.254.28.61 gwd              0       1589849802 ok   yes     16380        26380        1        
       10.254.28.61 rate-limit       0       19148496   ok   yes     16383        26383        1        
       10.254.28.61 ratelimit-module 0       19117538   ok   yes     16386        26386        1        
       10.254.28.61 subs             0       26842695   ok   yes     16384        26384        1        
       10.254.28.61 tms              0       19117734   ok   yes     16381        26381        1        
       10.254.28.61 tms-external     0       19117965   ok   yes     16385        26385        1        
       10.254.36.32 api-probe        0       19114739   ok   yes     16382        26382        3        
       10.254.36.32 gwd              0       1589848370 ok   no      16380        26380        3        
       10.254.36.32 rate-limit       0       19148496   ok   no      16383        26383        3        
       10.254.36.32 ratelimit-module 0       19117538   ok   no      16386        26386        3        
       10.254.36.32 subs             0       26840742   ok   no      16384        26384        3        
       10.254.36.32 tms              0       19115634   ok   no      16381        26381        3        
       10.254.36.32 tms-external     0       19116159   ok   no      16385        26385        3
    3. Note the IP address that is marked yes in the Primary column for the most functions; this pod will be the new primary for all functions involved in gateway peering.

      In this example, 10.254.28.61 is the primary for the largest number of functions; however, api-probe is not primary on this pod. In the following steps, set api-probe as primary on this pod.

  4. For every function that needs the primary changed, attach to the primary pod and set the function to primary:
    1. Log out of the current pod by pressing Ctrl+P, and then pressing Ctrl+Q.
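    2. Run the following command to attach to the pod that you selected as the new primary, where <namespace> is the namespace where API Connect is installed and <pod_NAME> is the name of the pod that has the IP address you noted (oc get pods -o wide lists the IP address of each pod):
      oc -n <namespace> attach -it <pod_NAME> -c datapower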
    3. Run the following command to switch to the API Connect configuration:
      config; switch apiconnect
    4. Run the following command to show the current peering status:
      show gateway-peering-status

      Verify that the function is set to the wrong primary IP and requires updating (in the example, api-probe requires updating).

    5. Run the following command to change the primary for the function:
      gateway-peering-switch-primary <function>
      For example:
      gateway-peering-switch-primary api-probe
    6. Repeat this process for every function that needs its primary updated.
  5. Verify that all functions are now primary on the same IP address by running the following command:
    show gateway-peering-status
  6. Log out of the gateway pod (Ctrl+P, then Ctrl+Q).
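
The manual inspection in the steps above can be sketched as a small script. This is only an illustration, not part of the product: it assumes that you copied the output of show gateway-peering-status into a local text file, and that the columns appear in the order shown in the example (Address, Name, Pending, Offset, Link, Primary). The function names target_primary and functions_to_switch are hypothetical.

```shell
#!/bin/sh
# Sketch: summarize saved "show gateway-peering-status" output.
# Assumed column order: Address Name Pending Offset Link Primary ...

# Print the address that is primary for the most functions.
# If two addresses tie, which one is printed is not defined.
target_primary() {
  awk '$6 == "yes" { count[$1]++ }
       END {
         best = ""; max = 0
         for (ip in count) if (count[ip] > max) { max = count[ip]; best = ip }
         if (best != "") print best
       }' "$1"
}

# Print the functions whose current primary is NOT the given address.
# These are the functions to pass to gateway-peering-switch-primary
# after attaching to the pod that owns the target address.
functions_to_switch() {
  awk -v target="$2" '$6 == "yes" && $1 != target { print $2 }' "$1" | sort
}

# Example usage (status.txt is a hypothetical file holding the saved output):
#   target=$(target_primary status.txt)
#   functions_to_switch status.txt "$target"
```

With the example output shown earlier, this approach would report 10.254.28.61 as the target and api-probe as the only function to switch. The header and separator rows are ignored because their sixth column is never the literal value yes.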

Upgrade stuck in Pending state

Your upgrade displays the Pending status and seems to be stuck, and the postgres-operator pod shows an error like the following example:
time="2023-07-17T06:41:09Z" level=error msg="refusing to upgrade due to unsuccessful resource removal" func="internal/operator/cluster.AddUpgrade()" file="internal/operator/cluster/upgrade.go:134" version=4.7.10

You can resolve the problem by deleting the clustercreate and upgrade pgtasks, which triggers a new attempt at the upgrade.

First, check the logs to verify the error:

  1. Run the following command to get the name of the pod:
    oc get po -n <APIC_namespace> | grep postgres-operator
  2. Run the following command to get the log itself, using the pod name from the previous step:
    oc logs <pod-name> -c operator -n <APIC_namespace>
  3. Once you confirm that the problem was caused by the resource-removal issue, delete the clustercreate and upgrade pgtasks.

    The following example shows the commands to get the pgtask names and then delete the clustercreate and upgrade pgtasks:

    oc get pgtasks
    NAME                                           AGE
    backrest-backup-large-mgmt-bce926ab-postgres   36h
    large-mgmt-bce926ab-postgres-createcluster     6d5h
    large-mgmt-bce926ab-postgres-upgrade           36h
    
    oc delete pgtasks large-mgmt-bce926ab-postgres-createcluster
    pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-createcluster" deleted
    
    oc delete pgtasks large-mgmt-bce926ab-postgres-upgrade      
    pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-upgrade" deleted
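
The check-then-delete sequence above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the pgtask names end in -createcluster and -upgrade as in the example, and the helper names has_resource_removal_error and select_pgtasks_to_delete are hypothetical. The helpers only filter text on standard input; the oc commands that feed them appear as comments so that nothing is deleted by accident.

```shell
#!/bin/sh
# Sketch: confirm the error in the operator log, then pick the pgtasks to delete.

# Succeed if the log on stdin contains the resource-removal error.
has_resource_removal_error() {
  grep -q 'refusing to upgrade due to unsuccessful resource removal'
}

# From pgtask names on stdin (one per line), keep only the createcluster
# and upgrade tasks; backup tasks and anything else are dropped.
select_pgtasks_to_delete() {
  grep -E -- '-(createcluster|upgrade)$'
}

# Example usage (replace the placeholders with your own values):
#   oc logs <pod-name> -c operator -n <APIC_namespace> | has_resource_removal_error \
#     && oc get pgtasks -n <APIC_namespace> -o name | select_pgtasks_to_delete
# Review the printed names before passing them to "oc delete".
```

Printing the selected names first, instead of piping them straight into a delete command, keeps a manual review step in the loop.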

One or more pods are in CrashLoopBackOff or Error state and report a certificate error in the logs

In rare cases, cert-manager might detect a certificate in a bad state right after it has been issued, and then re-issue the certificate. If a CA certificate has been issued twice, any certificate that was signed by the previously issued CA is left stale and can't be validated by the newly issued CA. In this scenario, one of the following messages displays in the log:
  • javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
  • Error: unable to verify the first certificate
  • ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert
    
Resolve the problem by completing the following steps:
  1. Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
    apicops upgrade:stale-certs -n <namespace>
  2. If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
    oc delete secret <stale-secret> -n <namespace>

    Cert-manager automatically generates a new certificate to replace the one you deleted.

  3. Use apicops to make sure all certificates can be verified successfully:
    apicops upgrade:stale-certs -n <namespace>
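
Before running apicops, it can help to confirm that a failing pod actually reports one of the certificate messages listed above. The following sketch filters a pod log for those three messages; the function name find_cert_errors is hypothetical, and the patterns are taken directly from the message list in this section.

```shell
#!/bin/sh
# Sketch: print only the log lines (from stdin) that match one of the
# certificate errors described in this section. Exits non-zero if none match.
find_cert_errors() {
  grep -E 'SSLHandshakeException: Received fatal alert: certificate_unknown|unable to verify the first certificate|openssl verify failed to verify'
}

# Example usage (replace the placeholders with your own values):
#   oc logs <pod-name> -n <namespace> | find_cert_errors
```

If the function prints nothing, the pod failure probably has a different cause and the stale-certificate procedure might not apply.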

Pgbouncer pod enters a CrashLoopBackOff state when upgrading to API Connect 10.0.5.3

This issue might be caused by a corrupt compliance entry in the pgbouncer.ini file. When the operator is upgraded to v10.0.5.3 but the ManagementCluster CR is not yet upgraded, the operator might update the pgbouncer.ini file in the PGBouncer secret with the older ManagementCluster CR's profile file, which does not contain any value for the compliance pool_size. As a result, the value gets incorrectly set to the string <no value>.

In the majority of cases this bad configuration update is temporary, until the ManagementCluster CR's version is also updated as part of the upgrade process and the problem is resolved. The issue will only become evident if the pgbouncer pod restarts prior to the ManagementCluster CR's version field being updated.

After the operator updates, if the other pods (specifically the pgbouncer pod) are restarted before the ManagementCluster CR's version is updated, the pgbouncer pod gets stuck in a CrashLoopBackOff state due to the missing value for the compliance pool_size setting.

The following example shows what the error looks like in the pgbouncer logs:
Wed May 17 09:00:02 UTC 2023 INFO: Starting pgBouncer..
2023-05-17 09:00:02.897 UTC [24] ERROR syntax error in connection string
2023-05-17 09:00:02.897 UTC [24] ERROR invalid value "host=mgmt-a516d013-postgres port=5432 dbname=compliance pool_size=<no value>" for parameter compliance in configuration (/pgconf/pgbouncer.ini:7)
2023-05-17 09:00:02.897 UTC [24] FATAL cannot load config file

Resolve the issue by running the commands in the following script, where <APIC-namespace> is the API Connect instance's namespace if you deployed using the top-level APIConnectCluster CR, or the management subsystem's namespace if you deployed using individual subsystem CRs.

NAMESPACE=<APIC-namespace>
BOUNCER_SECRET=<management-prefix>-postgres-pgbouncer-secret
TEMP_FILE=/tmp/pgbouncer.ini

# Step 1 - Common: Get the existing pgbouncer.ini file
oc get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d > $TEMP_FILE

# Step 2 - Linux version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -w0 | xargs -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

# Step 2 - Mac version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -b0 | xargs -S2000 -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

# Step 3 - Common: Restart pgbouncer to pick up the updated Secret configuration
oc delete pod <bouncer_pod_name> -n $NAMESPACE
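
As an optional safeguard between Step 1 and Step 2, you could check the downloaded file for the bad placeholder and preview the repaired content before patching the Secret. The following is a minimal sketch; the function names has_bad_pool_size and fix_pool_size are hypothetical, and the replacement value 20 is the one used in the script above.

```shell
#!/bin/sh
# Sketch: inspect a local copy of pgbouncer.ini before patching the Secret.

# Succeed if the file still contains the "<no value>" placeholder.
has_bad_pool_size() {
  grep -q 'pool_size=<no value>' "$1"
}

# Print the repaired contents to stdout; the file itself is not modified.
fix_pool_size() {
  sed 's/<no value>/20/' "$1"
}

# Example usage, after Step 1 has written $TEMP_FILE:
#   if has_bad_pool_size "$TEMP_FILE"; then
#     fix_pool_size "$TEMP_FILE"      # review the output, then run Step 2
#   else
#     echo "pgbouncer.ini has no <no value> entry; no patch is needed"
#   fi
```

Checking first avoids patching a Secret that the operator has already corrected as part of the upgrade.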

You see the denied: insufficient scope error during an air-gapped deployment

Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped upgrade.

Reason: This error occurs when a problem is encountered with the entitlement key used for obtaining images.

Solution: Obtain a new entitlement key by completing the following steps:

  1. Log in to the IBM Container Library.
  2. In the Container software library, select Get entitlement key.
  3. After the Access your container software heading, click Copy key.
  4. Copy the key to a safe location.

Apiconnect operator crashes

Problem: During upgrade, the Apiconnect operator crashes with the following message:

panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request

goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
	operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
	ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
Additional symptoms:
  • Apiconnect operator is in CrashLoopBackOff status
  • Kube apiserver pods log the following information:
    E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with:
     failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1:
     bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
  • The IP logged here belongs to the package server pod present in the openshift-operator-lifecycle-manager namespace
  • Package server pods log that the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 error:
    E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
    I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370":
  • Problem is intermittent
Solution:
  • If you find the exact symptoms as described, the solution is to delete package server pods in the openshift-operator-lifecycle-manager namespace.
  • New package server pods will log the 200 Success message for the same API call.
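
Because the solution applies only when the exact symptoms are present, a quick log check can help before you delete anything. The following sketch succeeds only if a package server log shows both the x509 authentication failure and a 401 response for the packages API; the function name packageserver_rejecting_requests is hypothetical, and the patterns are taken from the log excerpt above.

```shell
#!/bin/sh
# Sketch: succeed only if the package server log on stdin contains BOTH
#   - the x509 "unknown authority" authentication error, and
#   - a 401 response for the /apis/packages.operators.coreos.com/v1 call.
packageserver_rejecting_requests() {
  awk '/x509: certificate signed by unknown authority/ { x509 = 1 }
       /URI="\/apis\/packages.operators.coreos.com\/v1"/ && /resp=401/ { denied = 1 }
       END { exit !(x509 && denied) }'
}

# Example usage (replace <packageserver-pod> with a real pod name):
#   oc logs <packageserver-pod> -n openshift-operator-lifecycle-manager \
#     | packageserver_rejecting_requests \
#     && oc delete pod <packageserver-pod> -n openshift-operator-lifecycle-manager
```

Requiring both patterns keeps the check conservative, so an unrelated 401 or an unrelated certificate warning alone does not trigger the deletion.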