Troubleshooting upgrade on OpenShift
Review the following troubleshooting tips if you encounter a problem while installing or upgrading API Connect on OpenShift, including as a component of IBM Cloud Pak for Integration (CP4I).
In the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, there is a Product information tile that you can click to find information about your product versions, as well as Git information about the package versions being used. Note that the API Designer product information is based on its associated management server, but the Git information is based on where it was downloaded from.
Incorrect productVersion of gateway pods after upgrade
When upgrading on OpenShift, the rolling update might fail to start on gateway operand pods due to a gateway-peering issue, so that the productVersion of the gateway pods is incorrect even though the reconciled version on the gateway CR displays correctly. This problem happens when there are multiple primary pods for gateway peering, and some of the peering pods are assigned to each primary. There can be only one primary pod for gateway peering; the existence of multiple primaries prevents the upgrade from completing.
Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:
- Determine whether multiple primary pods are causing the problem:
- Run one of the following commands to check the productVersion of each gateway pod:
oc get po -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
or
oc get po -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
where:
- apic_namespace is the namespace where API Connect is installed
- <gateway_pods> is a space-delimited list of the names of your gateway peering pods
A combined check that loops over all gateway pods is sketched after these diagnostic steps.
- Verify that the productVersion is supported with the newer version of API Connect. For information on supported versions of DataPower Gateway, see Supported DataPower Gateway versions.
- If any pod returns an incorrect value for the productVersion, check the DataPower operator logs for peering issues; look for messages similar to the following example:
"level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"}
{"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
{"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 
10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
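The following is a minimal sketch of that combined check, assuming the example namespace (apic) and the example gateway pod names used elsewhere in this section; substitute your own values.
# Print the productVersion annotation for each gateway pod (namespace and pod names are examples)
NAMESPACE=apic
for POD in production-gw-0 production-gw-1 production-gw-2; do
  echo -n "$POD: "
  oc get pod $POD -n $NAMESPACE -o jsonpath='{.metadata.annotations.productVersion}{"\n"}'
done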
If you confirmed that gateway peering (multiple primaries) is the issue, complete the following steps to resolve it:
- Attach to a gateway pod:
- Run the following command to get a list of gateway pods:
oc get pods -l app.kubernetes.io/managed-by=datapower-operator
The response looks like the following example:
NAME                                                      READY   STATUS    RESTARTS   AGE
datapower-operator-conversion-webhook-585d95cf87-mktrv   1/1     Running   0          2d22h
datapower-operator-f4d8fbd85-5xt47                        1/1     Running   0          2d22h
production-gw-0                                           1/1     Running   0          2d22h
production-gw-1                                           1/1     Running   0          2d22h
production-gw-2                                           1/1     Running   0          2d22h
- Select a pod from the list and run the following command to attach to it, using its NAME:
oc -n ns attach -it <pod_NAME> -c datapower
- Select the new primary:
- Run the following commands to switch to the API Connect configuration:
config; switch apiconnect
- Run the following command to show the current peering status:
show gateway-peering-status
Example response:
Address       Name              Pending  Offset      Link  Primary  Service port  Monitor port  Priority
------------  ----------------  -------  ----------  ----  -------  ------------  ------------  --------
10.254.12.45  api-probe         0        19114431    ok    no       16382         26382         2
10.254.12.45  gwd               0        1589848370  ok    no       16380         26380         2
10.254.12.45  rate-limit        0        19148496    ok    no       16383         26383         2
10.254.12.45  ratelimit-module  0        19117538    ok    no       16386         26386         2
10.254.12.45  subs              0        26840742    ok    no       16384         26384         2
10.254.12.45  tms               0        19115634    ok    no       16381         26381         2
10.254.12.45  tms-external      0        19116159    ok    no       16385         26385         2
10.254.28.61  api-probe         0        19114431    ok    no       16382         26382         1
10.254.28.61  gwd               0        1589849802  ok    yes      16380         26380         1
10.254.28.61  rate-limit        0        19148496    ok    yes      16383         26383         1
10.254.28.61  ratelimit-module  0        19117538    ok    yes      16386         26386         1
10.254.28.61  subs              0        26842695    ok    yes      16384         26384         1
10.254.28.61  tms               0        19117734    ok    yes      16381         26381         1
10.254.28.61  tms-external      0        19117965    ok    yes      16385         26385         1
10.254.36.32  api-probe         0        19114739    ok    yes      16382         26382         3
10.254.36.32  gwd               0        1589848370  ok    no       16380         26380         3
10.254.36.32  rate-limit        0        19148496    ok    no       16383         26383         3
10.254.36.32  ratelimit-module  0        19117538    ok    no       16386         26386         3
10.254.36.32  subs              0        26840742    ok    no       16384         26384         3
10.254.36.32  tms               0        19115634    ok    no       16381         26381         3
10.254.36.32  tms-external      0        19116159    ok    no       16385         26385         3
- Take note of the IP address that has the most Primary entries set to yes; this will be the new primary for all functions involved in gateway peering. In this example, 10.254.28.61 is the primary for the largest number of functions; however, api-probe is not primary on this pod. In the following steps, set api-probe as primary on this pod.
- For every function that needs the primary changed, attach to the primary pod and set the function to primary:
- Log out of the current pod by pressing Ctrl+P, and then pressing Ctrl+Q.
- Run the following command to attach to the pod that needs its primary updated:
oc -n ns attach -it <pod_NAME> -c datapower
- Run the following command to switch to the API Connect configuration:
config; switch apiconnect
- Run the following command to show the current peering status:
show gateway-peering-status
Verify that the function is set to the wrong primary IP and requires updating (in the example, api-probe requires updating).
- Run the following command to change the primary for the function:
gateway-peering-switch-primary <function>
For example:
gateway-peering-switch-primary api-probe
- Repeat this process for every function that needs its primary updated.
- Verify that all functions are now primary on the same IP address by running the following command:
show gateway-peering-status
- Log out of the gateway pod (Ctrl+P, then Ctrl+Q).
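Once a single pod is primary for all gateway-peering functions, the stalled rolling update should be able to proceed. After it completes, you can re-run the productVersion check from the diagnostic steps to confirm that every gateway pod now reports the expected version. A minimal sketch, reusing the custom-columns form of the command with the example namespace and pod names from this section:
# Re-check the productVersion annotation on the gateway pods after the rollout completes
oc get po -n apic production-gw-0 production-gw-1 production-gw-2 \
  -o custom-columns="NAME:.metadata.name,productVersion:.metadata.annotations.productVersion"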
Upgrade stuck in Pending state
The upgrade displays the Pending status and seems to be stuck, and the postgres-operator pod shows an error like the following example:
time="2023-07-17T06:41:09Z" level=error msg="refusing to upgrade due to unsuccessful resource removal" func="internal/operator/cluster.AddUpgrade()" file="internal/operator/cluster/upgrade.go:134" version=4.7.10
You can resolve the problem by deleting the clustercreate and upgrade pgtasks, which triggers a new attempt at the upgrade. A combined sketch of the commands follows the steps below.
First, check the logs to verify the error:
- Run the following command to get the name of the pod:
oc get po -n <APIC_namespace> | grep postgres-operator
- Run the following command to get the log itself, using the pod name from the previous step:
oc logs <pod-name> -c operator -n <APIC_namespace>
- Once you confirm that the problem was caused by the resource-removal issue, delete the clustercreate and upgrade pgtasks. The following example shows the commands to get the pgtask names and then delete the clustercreate and upgrade pgtasks:
oc get pgtasks

NAME                                           AGE
backrest-backup-large-mgmt-bce926ab-postgres   36h
large-mgmt-bce926ab-postgres-createcluster     6d5h
large-mgmt-bce926ab-postgres-upgrade           36h

oc delete pgtasks large-mgmt-bce926ab-postgres-createcluster
pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-createcluster" deleted

oc delete pgtasks large-mgmt-bce926ab-postgres-upgrade
pgtask.crunchydata.com "large-mgmt-bce926ab-postgres-upgrade" deleted
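As referenced above, the confirmation and clean-up commands can be combined into one short sketch. The grep pattern matches the operator error shown earlier; the pgtask names below are the example names from this section, so substitute the createcluster and upgrade pgtask names that oc get pgtasks returns in your environment.
# Confirm the "refusing to upgrade" error in the postgres-operator log
NAMESPACE=<APIC_namespace>
PGO_POD=$(oc get po -n $NAMESPACE | grep postgres-operator | awk '{print $1}')
oc logs $PGO_POD -c operator -n $NAMESPACE | grep "refusing to upgrade due to unsuccessful resource removal"

# Delete the createcluster and upgrade pgtasks (example names) to trigger a new upgrade attempt
oc delete pgtasks -n $NAMESPACE large-mgmt-bce926ab-postgres-createcluster large-mgmt-bce926ab-postgres-upgrade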
One or more pods are in CrashLoopBackoff or Error state and report a certificate error in the logs
The pod logs report certificate errors similar to the following examples:
- javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
- Error: unable to verify the first certificate
- ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert
Resolve the issue by completing the following steps:
- Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
apicops upgrade:stale-certs -n <namespace>
- If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
oc delete secret <stale-secret> -n <namespace>
Cert-manager automatically generates a new certificate to replace the one you deleted.
- Use apicops to make sure that all certificates can be verified successfully:
apicops upgrade:stale-certs -n <namespace>
A sketch for confirming that cert-manager reissued the replacement certificates follows these steps.
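The following is a minimal sketch of that confirmation, assuming cert-manager Certificate resources exist in the API Connect namespace; the resource and secret names depend on your deployment.
# Confirm that cert-manager recreated the deleted secret and that the Certificate reports Ready
oc get secret <stale-secret> -n <namespace>
oc get certificate -n <namespace>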
Pgbouncer pod enters a CrashLoopBackoff state when upgrading to API Connect 10.0.5.3
This issue might be caused by a corrupt compliance entry in the pgbouncer.ini file. When the operator is upgraded to v10.0.5.3 but the ManagementCluster CR is not yet upgraded, the operator might update the pgbouncer.ini file in the PGBouncer secret with the older ManagementCluster CR's profile file, which does not contain any value for the compliance pool_size. As a result, the value gets incorrectly set to the string <no value>.
In most cases this bad configuration update is temporary: once the ManagementCluster CR's version is also updated as part of the upgrade process, the problem is resolved. The issue becomes evident only if the pgbouncer pod restarts before the ManagementCluster CR's version field is updated.
After the operator updates, if the other pods (specifically the pgbouncer pod) are restarted before the ManagementCluster CR's version is updated, then the pgbouncer pod gets stuck in a CrashLoopBackOff state due to the missing value for the compliance pool_size setting. The pgbouncer pod log shows errors like the following example:
Wed May 17 09:00:02 UTC 2023 INFO: Starting pgBouncer..
2023-05-17 09:00:02.897 UTC [24] ERROR syntax error in connection string
2023-05-17 09:00:02.897 UTC [24] ERROR invalid value "host=mgmt-a516d013-postgres port=5432 dbname=compliance pool_size=<no value>" for parameter compliance in configuration (/pgconf/pgbouncer.ini:7)
2023-05-17 09:00:02.897 UTC [24] FATAL cannot load config file
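Before applying the fix, you can confirm that the secret really contains the bad compliance entry. This sketch reads pgbouncer.ini from the pgbouncer secret (the same secret that the script below patches) and searches for the <no value> string; the secret name prefix depends on your management subsystem name.
# Check whether the pgbouncer.ini stored in the secret contains the bad compliance entry
oc get secret -n <namespace> <management-prefix>-postgres-pgbouncer-secret \
  -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d | grep "no value"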
Resolve the issue by running the commands in the following script, where <namespace> is the API Connect instance's namespace if you deployed using the top-level APIConnectCluster CR, or the management subsystem's namespace if you deployed using individual subsystem CRs.
NAMESPACE=<APIC-namespace>
BOUNCER_SECRET=<management-prefix>-postgres-pgbouncer-secret
TEMP_FILE=/tmp/pgbouncer.ini
# Step 1 - Common: Get the existing pgbouncer.ini file
oc get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d > $TEMP_FILE
# Step 2 - Linux version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -w0 | xargs -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"
# Step 2 - Mac version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -b0 | xargs -S2000 -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"
# Step 3 - Common: Restart pgbouncer to pick up the updated Secret configuration
oc delete pod <bouncer_pod_name> -n $NAMESPACE
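After deleting the pod, you can confirm that the replacement pgbouncer pod starts cleanly with the patched configuration. A minimal sketch, reusing the NAMESPACE variable from the script above; the pod name placeholder is an example.
# Confirm the replacement pgbouncer pod reaches Running state and starts without the config error
oc get pods -n $NAMESPACE | grep pgbouncer
oc logs <new_bouncer_pod_name> -n $NAMESPACE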
You see the denied: insufficient scope error during an air-gapped deployment
Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped upgrade.
Reason: This error occurs when a problem is encountered with the entitlement key used for obtaining images.
Solution: Obtain a new entitlement key by completing the following steps:
- Log in to the IBM Container Library.
- In the Container software library, select Get entitlement key.
- After the Access your container software heading, click Copy key.
- Copy the key to a safe location.
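If your mirroring process authenticates to the IBM entitled registry with a pull secret, update that secret with the new key before retrying the mirroring step. The following is one possible sketch, not taken from this procedure: the secret name ibm-entitlement-key and the target namespace are assumptions, while cp.icr.io and the cp username are the standard values for entitlement keys.
# Refresh the entitled-registry pull secret with the new entitlement key
# (secret name and namespace are examples; adjust for your environment)
oc create secret docker-registry ibm-entitlement-key \
  --docker-server=cp.icr.io \
  --docker-username=cp \
  --docker-password=<new_entitlement_key> \
  -n <APIC_namespace> \
  --dry-run=client -o yaml | oc apply -f -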
Apiconnect operator crashes
Problem: During upgrade, the Apiconnect operator crashes with the following message:
panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request
goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
Additional symptoms include:
- The Apiconnect operator is in CrashLoopBackOff status.
- Kube apiserver pods log the following information:
E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
- The IP address logged here belongs to the package server pod in the openshift-operator-lifecycle-manager namespace.
- Package server pods log the following; the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 error:
E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370"
- The problem is intermittent.
- If you find the exact symptoms as described, the solution is to delete the package server pods in the openshift-operator-lifecycle-manager namespace (a sketch follows this list).
- The new package server pods will log the 200 Success message for the same API call.
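A minimal sketch of that recovery follows. The app=packageserver label selector is an assumption about how the OLM package server pods are labeled in your cluster; verify it (or use the pod names directly) before deleting.
# Delete the package server pods so that OLM recreates them
oc delete pods -n openshift-operator-lifecycle-manager -l app=packageserver

# Confirm the replacement pods are running; their logs should show 200 responses for the same API call
oc get pods -n openshift-operator-lifecycle-manager -l app=packageserver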