Troubleshooting upgrade on OpenShift
Review the following troubleshooting tips if you encounter a problem while installing or upgrading API Connect on OpenShift, including as a component of IBM Cloud Pak for Integration (CP4I).
- Incorrect productVersion of gateway pods after upgrade to 10.0.5.5
- Upgrade stuck in Pending state
- One or more pods in CrashLoopBackoff or Error state, and report a certificate error in the logs
- Pgbouncer pod enters a CrashLoopBackoff state when upgrading to API Connect 10.0.5.3
- You see the denied: insufficient scope error during an air-gapped deployment
- Apiconnect operator crashes

Incorrect productVersion of gateway pods after upgrade to 10.0.5.5
When upgrading to 10.0.5.5 on OpenShift, the rolling update might fail to start on gateway operand pods due to a gateway peering issue, so that the productVersion of the gateway pods is incorrect even though the reconciled version on the gateway CR displays as 10.0.5.5.
This problem happens when there are multiple primary pods for gateway peering, and some of the peering pods are assigned to each primary. There can only be one primary pod for gateway-peering; the existence of multiple primaries prevents the upgrade from completing.
Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:
- Determine whether multiple primary pods are causing the problem:
- Check the productVersion of each gateway pod to verify that it is 10.0.5.7 (the version of DataPower Gateway that was released with API Connect 10.0.5.5) by running one of the following commands:
oc get po -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
or
oc get po -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
where:
- apic_namespace is the namespace where API Connect is installed
- <gateway_pods> is a space-delimited list of the names of your gateway peering pods
A looped version of this check is sketched after these steps.
- If any pod returns an incorrect value for the version, check the DataPower operator logs for peering issues; look for messages similar to the following example:
"level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"} {"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"} {"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 
10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
- If you see an indication of peering issues, resolve the problem by running the gateway-peering-switch-primary command to manually switch primaries so that all gateway-peering objects have the same primary, as explained in gateway-peering-switch-primary in the DataPower documentation. Which gateway and gateway-peering objects need to be updated depends on your own deployment.
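For convenience, the following sketch runs the same productVersion check across all gateway pods in a loop. The apic namespace and production-gw pod names are examples taken from the log output above, so substitute your own values; the comments about accessing the DataPower CLI are a hint only, not a prescribed procedure.
# A minimal sketch, assuming gateway pods production-gw-0/1/2 in the apic namespace
for pod in production-gw-0 production-gw-1 production-gw-2; do
  echo -n "$pod "
  oc get po -n apic "$pod" -o jsonpath='{.metadata.annotations.productVersion}{"\n"}'
done
# To review peering before switching primaries, attach to a gateway pod's DataPower CLI
# (for example: oc attach -it production-gw-0 -n apic) and run "show gateway-peering-status",
# then use gateway-peering-switch-primary as described in the DataPower documentation.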
Upgrade stuck in Pending state
The upgrade displays a Pending status and seems to be stuck, and the postgres-operator pod shows an error like the following example:
time="2023-07-17T06:41:09Z" level=error msg="refusing to upgrade due to unsuccessful resource removal" func="internal/operator/cluster.AddUpgrade()" file="internal/operator/cluster/upgrade.go:134" version=4.7.10
You can resolve the problem by deleting the clustercreate and upgrade pgtasks, which triggers a new attempt at the upgrade.
First, check the logs to verify the error:
- Run the following command to get the name of the pod:
oc get po -n <APIC_namespace> | grep postgres-operator
- Run the following command to get the log itself, using the pod name from the previous step:
oc logs <pod-name> -c operator -n <APIC_namespace>
- Once you confirm that the problem was caused by the resource-removal issue, delete the clustercreate and upgrade pgtasks. The following example shows the commands to get the pgtask names and then delete the clustercreate and upgrade pgtasks:
oc get pgtasks

NAME                                           AGE
backrest-backup-large-mgmt-bce926ab-postgres   36h
large-mgmt-bce926ab-postgres-createcluster     6d5h
large-mgmt-bce926ab-postgres-upgrade           36h

oc delete pgtasks large-mgmt-bce926ab-postgres-createcluster
pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-createcluster" deleted

oc delete pgtasks large-mgmt-bce926ab-postgres-upgrade
pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-upgrade" deleted
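As a quick sanity check around these steps, you can grep the operator log for the error and confirm that the pgtasks were removed before the retry; this is only a sketch, and the pod and namespace placeholders are the same ones used above.
# Confirm the resource-removal error in the postgres-operator log (pod name from step 1)
oc logs <pod-name> -c operator -n <APIC_namespace> | grep "refusing to upgrade"
# After deleting the clustercreate and upgrade pgtasks, confirm that they are gone
# and follow the operator log to watch the new upgrade attempt
oc get pgtasks -n <APIC_namespace>
oc logs -f <pod-name> -c operator -n <APIC_namespace>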
One or more pods in CrashLoopBackoff or Error state, and report a certificate error in the logs
The pod logs report certificate errors similar to the following examples:
- javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
- Error: unable to verify the first certificate
- ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert
Resolve the issue by completing the following steps:
- Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
apicops upgrade:stale-certs -n <namespace>
- If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
oc delete secret <stale-secret> -n <namespace>
Cert-manager automatically generates a new certificate to replace the one you deleted.
- Use apicops to make sure all certificates can be verified successfully:
apicops upgrade:stale-certs -n <namespace>
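For illustration only, a cleanup of a single stale secret might look like the following sketch; the portal-server-ca name is a placeholder, and it assumes the cert-manager Certificate and its secret share the same name, which you should confirm in your deployment.
# Hypothetical example: delete a stale secret and wait for cert-manager to reissue it
oc delete secret portal-server-ca -n <namespace>
oc wait --for=condition=Ready certificate/portal-server-ca -n <namespace> --timeout=120s
# Re-run the validation to confirm that no stale certificates remain
apicops upgrade:stale-certs -n <namespace>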
Pgbouncer pod enters a CrashLoopBackoff state when upgrading to API Connect 10.0.5.3
This issue might be caused by a corrupt compliance entry in the pgbouncer.ini file. When the operator is upgraded to v10.0.5.3 but the ManagementCluster CR is not yet upgraded, the operator might update the pgbouncer.ini file in the PGBouncer secret with the older ManagementCluster CR's profile file, which does not contain any value for the compliance pool_size. As a result, the value gets incorrectly set to the string <no value>.
In the majority of cases this bad configuration update is temporary, until the ManagementCluster CR's version is also updated as part of the upgrade process and the problem is resolved. The issue will only become evident if the pgbouncer pod restarts prior to the ManagementCluster CR's version field being updated.
After the operator updates, if the other pods are restarted (specifically the pgbouncer pod) before the ManagementCluster CR's version is updated, then the pgbouncer pod gets stuck in a CrashLoopBackOff state due to the missing value for the compliance pool_size setting, with pod logs similar to the following example:
Wed May 17 09:00:02 UTC 2023 INFO: Starting pgBouncer..
2023-05-17 09:00:02.897 UTC [24] ERROR syntax error in connection string
2023-05-17 09:00:02.897 UTC [24] ERROR invalid value "host=mgmt-a516d013-postgres port=5432 dbname=compliance pool_size=<no value>" for parameter compliance in configuration (/pgconf/pgbouncer.ini:7)
2023-05-17 09:00:02.897 UTC [24] FATAL cannot load config file
Resolve the issue by running the commands in the following script, where <APIC-namespace> is the API Connect instance's namespace if you deployed using the top-level APIConnectCluster CR, or the management subsystem's namespace if you deployed using individual subsystem CRs.
NAMESPACE=<APIC-namespace>
BOUNCER_SECRET=<management-prefix>-postgres-pgbouncer-secret
TEMP_FILE=/tmp/pgbouncer.ini
# Step 1 - Common: Get the existing pgbouncer.ini file
oc get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d > $TEMP_FILE
# Step 2 - Linux version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -w0 | xargs -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"
# Step 2 - Mac version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -b0 | xargs -S2000 -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"
# Step 3 - Common: Restart pgbouncer to pick up the updated Secret configuration
oc delete pod <bouncer_pod_name> -n $NAMESPACE
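After the pgbouncer pod restarts, you can verify the fix with a short check; this is a sketch that reuses the variables defined in the script above.
# Confirm that the compliance entry no longer contains "<no value>"
oc get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d | grep compliance
# Confirm that the new pgbouncer pod is Running rather than in CrashLoopBackOff
oc get po -n $NAMESPACE | grep pgbouncer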
You see the denied: insufficient scope error during an air-gapped deployment
Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped upgrade.
Reason: This error occurs when a problem is encountered with the entitlement key used for obtaining images.
Solution: Obtain a new entitlement key by completing the following steps:
- Log in to the IBM Container Library.
- In the Container software library, select Get entitlement key.
- After the Access your container software heading, click Copy key.
- Copy the key to a safe location.
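With the new key copied, refresh the registry credentials that your mirroring process uses before retrying the image mirroring. The following sketch assumes you pull from the IBM entitled registry at cp.icr.io with the standard cp username; adapt it to your own mirroring tooling.
# Log in to the entitled registry with the new entitlement key (podman shown;
# docker login takes the same arguments)
podman login cp.icr.io --username cp --password <entitlement-key>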
Apiconnect operator crashes
Problem: During upgrade, the Apiconnect operator crashes with the following message:
panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request
goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
- Apiconnect operator is in CrashLoopBackOff status
- Kube apiserver pods log the following information:
E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
- The IP logged here belongs to the package server pod present in the openshift-operator-lifecycle-manager namespace.
- Package server pods log the following: the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 issue:
E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370"
- Problem is intermittent
- If you find the exact symptoms as described, the solution is to delete the package server pods in the openshift-operator-lifecycle-manager namespace.
- New package server pods will log the 200 Success message for the same API call.
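A minimal sketch of that fix follows; list the pods first and substitute the real pod names, because they vary by cluster.
# List the package server pods, then delete them so that OLM recreates them
oc get pods -n openshift-operator-lifecycle-manager | grep packageserver
oc delete pod <packageserver-pod-name> -n openshift-operator-lifecycle-manager
# The replacement pods should log 200 responses for /apis/packages.operators.coreos.com/v1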