Troubleshooting upgrade on OpenShift
Review the following troubleshooting tips if you encounter a problem during an API Connect upgrade on OpenShift, including when API Connect is deployed as a component of IBM Cloud Pak for Integration (CP4I).
Subsystem is stuck in Pending state with a reason of PreUpgradeCheckInProgress
Before the subsystem microservices are upgraded, the operator triggers a set of pre-upgrade checks that must pass for the upgrade to proceed. If one or more of the checks fail, the subsystem status remains in Pending state with a reason of PreUpgradeCheckInProgress. Check the status conditions of the subsystem CR to confirm that a pre-upgrade check failed. The status.PreUpgradeCheck property contains a summary of the failed checks, and the full logs for the checks that are carried out can be viewed in the ConfigMap that is referenced in the status.PreUpgradeCheck property. The pre-upgrade checks automatically retry until they pass. If you are unable to rectify the problem that causes a check to fail, open an IBM support case.
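For example, for the management subsystem (CR kind ManagementCluster in the API Connect operator), and assuming a CR named management in the apic namespace (adjust the kind, name, and namespace for your subsystem and deployment), you can view the check summary and the referenced ConfigMap with commands like the following:
oc get managementcluster management -n apic -o jsonpath='{.status.PreUpgradeCheck}'
oc get configmap <configmap_referenced_in_status> -n apic -o yaml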
Incorrect productVersion of gateway pods after upgrade
When you upgrade on OpenShift, the rolling update might fail to start on gateway operand pods due
to a gateway peering issue, so that the productVersion
of the gateway pods is
incorrect even though the reconciled version on the gateway CR displays correctly. This problem
happens when there are multiple primary pods for gateway peering, and some of the peering pods are
assigned to each primary. There can be only one primary pod for gateway-peering; the existence of
multiple primaries prevents the upgrade from completing.
Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:
- Determine whether multiple primary pods are causing the problem:
- Run one of the following commands to check the productVersion of each gateway pod:
oc get pods -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
or
oc get pods -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
where:
apic_namespace is the namespace where API Connect is installed
<gateway_pods> is a space-delimited list of the names of your gateway peering pods
- Verify that the target version of API Connect supports the productVersion. For information about supported versions of DataPower Gateway, see tapic_upgrade_OpenShift_consider.html#tapic_upgrade_OpenShift_consider__gwy_versions.
- If any pod returns an incorrect value for the productVersion, check the DataPower operator logs for peering issues (an example command for retrieving the operator logs follows the excerpt below); look for messages similar to the following example:
{"level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"}
{"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
{"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
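You can retrieve the DataPower operator logs with a command like the following example; the deployment name datapower-operator matches the operator pod names shown later in this procedure, but verify the name and namespace in your environment:
oc logs -n apic_namespace deployment/datapower-operator | grep -i "Multiple primaries"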
If you confirmed that gateway peering (multiple primaries) is the issue, complete the following steps to resolve it.
- Attach to a gateway pod:
- Run the following command to get a list of gateway pods:
oc get pods -l app.kubernetes.io/managed-by=datapower-operator
The response looks like the following example:
NAME                                                      READY   STATUS    RESTARTS   AGE
datapower-operator-conversion-webhook-585d95cf87-mktrv   1/1     Running   0          2d22h
datapower-operator-f4d8fbd85-5xt47                        1/1     Running   0          2d22h
production-gw-0                                           1/1     Running   0          2d22h
production-gw-1                                           1/1     Running   0          2d22h
production-gw-2                                           1/1     Running   0          2d22h
- Select a pod from the list and run the following command to attach to it using the NAME:
oc -n ns attach -it <pod_NAME> -c datapower
- Select the new primary:
- Run the following commands to switch to the API Connect configuration:
config; switch apiconnect
- Run the following command to show the current peering status:
show gateway-peering-status
Example response:
Address      Name             Pending Offset     Link Primary Service port Monitor port Priority
------------ ---------------- ------- ---------- ---- ------- ------------ ------------ --------
10.254.12.45 api-probe        0       19114431   ok   no      16382        26382        2
10.254.12.45 gwd              0       1589848370 ok   no      16380        26380        2
10.254.12.45 rate-limit       0       19148496   ok   no      16383        26383        2
10.254.12.45 ratelimit-module 0       19117538   ok   no      16386        26386        2
10.254.12.45 subs             0       26840742   ok   no      16384        26384        2
10.254.12.45 tms              0       19115634   ok   no      16381        26381        2
10.254.12.45 tms-external     0       19116159   ok   no      16385        26385        2
10.254.28.61 api-probe        0       19114431   ok   no      16382        26382        1
10.254.28.61 gwd              0       1589849802 ok   yes     16380        26380        1
10.254.28.61 rate-limit       0       19148496   ok   yes     16383        26383        1
10.254.28.61 ratelimit-module 0       19117538   ok   yes     16386        26386        1
10.254.28.61 subs             0       26842695   ok   yes     16384        26384        1
10.254.28.61 tms              0       19117734   ok   yes     16381        26381        1
10.254.28.61 tms-external     0       19117965   ok   yes     16385        26385        1
10.254.36.32 api-probe        0       19114739   ok   yes     16382        26382        3
10.254.36.32 gwd              0       1589848370 ok   no      16380        26380        3
10.254.36.32 rate-limit       0       19148496   ok   no      16383        26383        3
10.254.36.32 ratelimit-module 0       19117538   ok   no      16386        26386        3
10.254.36.32 subs             0       26840742   ok   no      16384        26384        3
10.254.36.32 tms              0       19115634   ok   no      16381        26381        3
10.254.36.32 tms-external     0       19116159   ok   no      16385        26385        3
- Take note of the IP address that has the most yes values in the Primary column; this IP address will be the new primary for all functions involved in gateway peering. In this example, 10.254.28.61 is the primary for the largest number of functions; however, api-probe is not primary on this pod. In the following steps, set api-probe as primary on this pod.
- For every function that needs the primary changed, attach to the primary pod and set the function to primary:
- Log out of the current pod by pressing Ctrl+P, and then pressing Ctrl+Q.
- Run the following command to attach to the pod that needs its primary updated (use the name of the pod whose IP address you identified in the peering status):
oc -n ns attach -it <pod_name> -c datapower
- Run the following command to switch to the API Connect configuration:
config; switch apiconnect
- Run the following command to show the current peering status:
show gateway-peering-status
Verify that the function is set to the wrong primary IP and requires updating (in the example, api-probe requires updating).
- Run the following command to change the primary for the function:
gateway-peering-switch-primary <function>
For example:
gateway-peering-switch-primary api-probe
- Repeat this process for every function that needs its primary updated.
- Verify that all functions are now primary on the same IP address by running the following
command:
show gateway-peering-status
- Log out of the gateway pod (Ctrl+P, then Ctrl+Q).
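After all functions report the same primary IP address, the DataPower operator can proceed with the rolling update. As a follow-up check (the apic_namespace placeholder and the label selector are taken from the earlier examples and might differ in your deployment), you can watch the gateway pods restart and then re-check the productVersion annotation:
oc get pods -n apic_namespace -l app.kubernetes.io/managed-by=datapower-operator -w
oc get pods -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"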
Upgrade fails with an error message about a missing cert-manager
Problem: The upgrade fails, and an error message similar to the following example is reported by the ibm-apiconnect pod:
cert-manager is required but not installed. Please install a cert-manager operator such as [cert-manager Operator for Red Hat OpenShift]. APIC instance being upgraded
Resolve the issue by completing the following steps:
- Install the cert-manager operator for Red Hat OpenShift:
- Log in to the OpenShift Container Platform web console.
- Click Operators > OperatorHub.
- In the filter box, type: cert-manager Operator for Red Hat OpenShift
- Select cert-manager Operator for Red Hat OpenShift and click Install.
- On the Install Operator page, complete the following steps:
- Update the Update channel if needed. The channel defaults to stable-v1, which installs the latest stable release of the cert-manager Operator for Red Hat OpenShift.
- Select the Installed Namespace for the operator. The default operator namespace is cert-manager-operator; if that namespace doesn't exist, it is created for you.
- Select an Update approval strategy:
- Automatic: allow Operator Lifecycle Manager (OLM) to automatically update the operator when a new version is available.
- Manual: require a user with the appropriate credentials to approve all operator updates.
- Click Install.
- Verify the new cert-manager installation by completing the following steps:
- Click Operators > Installed Operators.
- Verify that cert-manager Operator for Red Hat OpenShift is listed with a Status of Succeeded in the cert-manager-operator namespace.
- Verify that the cert-manager pods are up and running with the following command:
oc get pods -n cert-manager
For a successful installation, the response looks like the following example:
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-bd7fbb9fc-wvbbt               1/1     Running   0          3m39s
cert-manager-cainjector-56cc5f9868-7g9z7   1/1     Running   0          4m5s
cert-manager-webhook-d4f79d7f7-9dg9w       1/1     Running   0          4m9s
- Remove the obsolete IBM cert-manager operator by running the following command:
oc delete certmanagers.operator.ibm.com default
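To confirm that the obsolete resource is gone, you can run the following check; after the deletion completes it should no longer list the default resource:
oc get certmanagers.operator.ibm.com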
DataPower operator pod stuck waiting for lock removal
Problem: You see messages in the datapower-operator pod log that indicate that the pod is waiting for a lock to be removed:
{"level":"info","ts":"2021-03-08T19:29:53.432Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2021-03-08T19:29:57.971Z","logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection to remove the lock."}
The DataPower operator cannot be upgraded until this problem is resolved. For the resolution steps, see the DataPower operator documentation.
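As a starting point for investigation (the exact name of the lock ConfigMap depends on the operator release, so confirm the resolution steps in the DataPower operator documentation before you delete anything), you can list the ConfigMaps in the DataPower operator's namespace and look for the leader lock:
oc get configmaps -n <datapower_operator_namespace> | grep -i lock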
You see the denied: insufficient scope error during an air-gapped deployment
Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped installation or upgrade.
Reason: This error occurs when a problem is encountered with the entitlement key that is used for obtaining images.
Solution: Obtain a new entitlement key by completing the following steps:
- Log in to the IBM Container Library.
- In the Container software library, select Get entitlement key.
- Under the Access your container software heading, click Copy key.
- Copy the key to a safe location.
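How you apply the new entitlement key depends on your mirroring tooling. As one example, if your process authenticates to the IBM entitled registry directly (the registry host cp.icr.io and the username cp are the standard values for the entitled registry; adjust them for your environment), refresh the stored credentials with the new key:
podman login cp.icr.io --username cp --password <entitlement_key>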
Apiconnect operator pod fails
Problem: During installation (or upgrade), the apiconnect operator fails with the following message:
panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request
goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
Additional symptoms:
- The apiconnect operator pod is in CrashLoopBackOff status
- Kube apiserver pods log the following information:
E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
- The IP address logged here belongs to the package server pod present in the openshift-operator-lifecycle-manager namespace.
- Package server pods log the following error message, which shows that the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 error:
E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370"
- The problem is intermittent
- If you find the exact symptoms as described, the solution is to delete the package server pods in the openshift-operator-lifecycle-manager namespace (an example follows this list).
namespace. - New package server pods log the
200 Success
message for the same API call.
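As an example of that deletion (the grep pattern assumes that the pod names contain packageserver; verify the pod names before you delete them):
oc get pods -n openshift-operator-lifecycle-manager | grep packageserver
oc delete pod <packageserver_pod_name> -n openshift-operator-lifecycle-manager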
Unexpected behavior in Cloud Manager and API Manager UIs after upgrade
Stale browser cache issues can cause problems after an upgrade. To remedy this problem, clear your browser cache, and open a new browser window.